🇹🇷 Turkish Gibberish Sentence Detection (Fine-Tuned)

This model detects whether a given Turkish text is clean or gibberish.

How to Get Started with the Model

from transformers import pipeline

# Load the fine-tuned Turkish gibberish classifier.
pipe = pipeline("text-classification", model="yeniguno/turkish-gibberish-detection-ft")

examples = [
    "bugün hava çok güzel, dışarı çıkalım mı?",  # clean Turkish
    "asdfghjk qwe!!! 🙃🙃🙃",                     # keyboard mashing
    "bgn asdqwe güzel qqqqqqqqqq"                 # malformed mix
]

for text in examples:
    print(text, "->", pipe(text)[0])

Model Details

Model Description

  • Base model: TURKCELL/gibberish-sentence-detection-model-tr
  • Model size: ~0.2B parameters (F32, safetensors)
  • Language: Turkish
  • Task: Binary Text Classification (Gibberish Detection)
  • Labels (the sketch after this list shows how to map pipeline output onto them):
    • 0 → ok — meaningful Turkish text
    • 1 → gibberish — meaningless or noisy text (nonsense, random keyboard input, malformed words)
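
Depending on how the checkpoint's id2label mapping is configured, the pipeline may return these names directly or generic LABEL_0 / LABEL_1 identifiers. The helper below is a minimal sketch (the classify function and the LABEL_TO_CLASS mapping are illustrative, not part of the released model) that normalizes both cases onto the two classes above.

from transformers import pipeline

pipe = pipeline("text-classification", model="yeniguno/turkish-gibberish-detection-ft")

# Assumption: index 0 = ok and index 1 = gibberish, as documented above.
LABEL_TO_CLASS = {
    "ok": "ok", "LABEL_0": "ok",
    "gibberish": "gibberish", "LABEL_1": "gibberish",
}

def classify(text: str) -> dict:
    result = pipe(text)[0]  # e.g. {"label": "...", "score": 0.98}
    return {
        "class": LABEL_TO_CLASS.get(result["label"], result["label"]),
        "score": result["score"],
    }

print(classify("bugün hava çok güzel"))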

Uses

This model is designed to be used in LLM guardrail systems as an input quality scanner.
Since LLM inference is computationally and financially expensive, it is inefficient to process meaningless or malformed text.

By running this model before sending user input to an LLM, you can automatically detect and filter gibberish or nonsensical text — preventing unnecessary API calls and improving overall system efficiency.

Typical use cases include:

  • Pre-filtering user messages in chatbots or virtual assistants
  • Guardrail modules in enterprise LLM applications
  • Quality control for large-scale text ingestion pipelines
  • Spam and noise detection in user-generated content

If the input is classified as gibberish, it can be safely discarded or handled separately without invoking the LLM.
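
A minimal sketch of this guardrail pattern follows. The confidence threshold, the set of gibberish label names, and the call_llm placeholder are illustrative assumptions rather than part of the model:

from transformers import pipeline

gibberish_filter = pipeline(
    "text-classification",
    model="yeniguno/turkish-gibberish-detection-ft",
)

GIBBERISH_LABELS = {"gibberish", "LABEL_1"}  # assumed names for class 1
THRESHOLD = 0.80                             # illustrative confidence cutoff

def call_llm(prompt: str) -> str:
    # Placeholder for the actual (expensive) LLM request.
    raise NotImplementedError

def guarded_llm_call(user_input: str) -> str:
    pred = gibberish_filter(user_input)[0]
    if pred["label"] in GIBBERISH_LABELS and pred["score"] >= THRESHOLD:
        # Gibberish input: skip the LLM call entirely and ask the user to rephrase.
        # ("Your input could not be understood, could you please rewrite it?")
        return "Girdiniz anlaşılamadı, lütfen tekrar yazar mısınız?"
    return call_llm(user_input)

Inputs rejected here never reach the LLM, which is the cost-saving behaviour described above.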

Training Details

Training Data

Dataset: yeniguno/turkish-gibberish-detection

Label         | Count   | Description
0 (ok)        | 651,431 | valid, meaningful Turkish text
1 (gibberish) | 699,999 | random keyboard strings, misspelled or malformed text

All samples are lowercased and cleaned, with no newline or tab characters.
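
Since the training samples were lowercased and stripped of newline and tab characters, it is reasonable (though not stated as a requirement) to normalize inputs the same way before classification. A minimal sketch:

import re

def normalize(text: str) -> str:
    # Mirror the training preprocessing: lowercase and collapse
    # newlines, tabs and repeated whitespace into single spaces.
    # Note: Python's str.lower() does not apply Turkish-specific
    # casing rules (e.g. "I" becomes "i" rather than "ı").
    return re.sub(r"\s+", " ", text.lower()).strip()

print(normalize("Bugün\thava çok güzel\n"))  # -> "bugün hava çok güzel"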

Evaluation

Model            | Accuracy | Macro-F1 | F1 (ok) | F1 (gibberish)
Base model       | 0.6257   | 0.6254   | 0.61    | 0.64
Fine-tuned model | 0.7369   | 0.7340   | 0.76    | 0.71

Test set size: 202,669 sentences
Evaluation metrics: Accuracy, Macro-F1, per-class Precision/Recall/F1
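
For reference, these metrics can be reproduced with scikit-learn once gold labels and predictions are available as 0/1 lists; the arrays below are illustrative placeholders, not the actual test data:

from sklearn.metrics import accuracy_score, classification_report, f1_score

y_true = [0, 1, 1, 0, 1]  # illustrative gold labels (0 = ok, 1 = gibberish)
y_pred = [0, 1, 0, 0, 1]  # illustrative model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred, target_names=["ok", "gibberish"]))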
