# 🇹🇷 Turkish Gibberish Sentence Detection (Fine-Tuned)
This model detects whether a given Turkish text is clean or gibberish.
## How to Get Started with the Model

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="yeniguno/turkish-gibberish-detection-ft")

examples = [
    "bugün hava çok güzel, dışarı çıkalım mı?",
    "asdfghjk qwe!!! 🙃🙃🙃",
    "bgn asdqwe güzel qqqqqqqqqq",
]

for text in examples:
    print(text, "->", pipe(text)[0])
```
## Model Details

### Model Description

- Base model: TURKCELL/gibberish-sentence-detection-model-tr
- Language: Turkish
- Task: Binary Text Classification (Gibberish Detection)
- Labels (see the sketch below for how they map onto raw model outputs):
  - `0` → `ok`: meaningful Turkish text
  - `1` → `gibberish`: meaningless or noisy text (nonsense, random keyboard input, malformed words)
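For a minimal sketch of how these labels map onto raw model outputs, the classifier can also be called without the pipeline wrapper. This assumes the checkpoint's `id2label` config follows the table above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "yeniguno/turkish-gibberish-detection-ft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "bugün hava çok güzel, dışarı çıkalım mı?"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
pred_id = int(probs.argmax())
# id2label is read from the model config; expected to map 0 -> ok, 1 -> gibberish
print(model.config.id2label[pred_id], float(probs[pred_id]))
```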
## Uses
This model is designed to be used in LLM guardrail systems as an input quality scanner.
Since LLM inference is computationally and financially expensive, it is inefficient to process meaningless or malformed text.
By running this model before sending user input to an LLM, you can automatically detect and filter gibberish or nonsensical text — preventing unnecessary API calls and improving overall system efficiency.
Typical use cases include:
- Pre-filtering user messages in chatbots or virtual assistants
- Guardrail modules in enterprise LLM applications
- Quality control for large-scale text ingestion pipelines
- Spam and noise detection in user-generated content
If the input is classified as gibberish, it can be safely discarded or handled separately without invoking the LLM.
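A minimal sketch of such a pre-filter is shown below. Here `call_llm` is a placeholder for your own LLM client, the 0.5 confidence threshold is illustrative, and the human-readable `ok`/`gibberish` labels are assumed to be what the pipeline returns:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="yeniguno/turkish-gibberish-detection-ft")

def is_gibberish(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier labels the text as gibberish with enough confidence."""
    result = pipe(text)[0]
    return result["label"] == "gibberish" and result["score"] >= threshold

def guarded_llm_call(user_input: str) -> str:
    # Skip the expensive LLM call entirely for gibberish input.
    if is_gibberish(user_input):
        return "Input rejected: text looks like gibberish."
    return call_llm(user_input)  # call_llm is a placeholder for your LLM client
```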
## Training Details

### Training Data

Dataset: `yeniguno/turkish-gibberish-detection`
| Label | Count | Description |
|---|---|---|
| 0 (ok) | 651,431 | valid, meaningful Turkish text |
| 1 (gibberish) | 699,999 | random keyboard strings, misspelled or malformed text |
All samples are lowercased and cleaned, with no newline or tab characters.
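Inference inputs can be normalized the same way before classification. The sketch below is an approximation, since the exact cleaning procedure is not documented beyond lowercasing and removing newlines and tabs:

```python
import re

def normalize(text: str) -> str:
    """Approximate the dataset normalization: lowercase and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[\n\t]+", " ", text)      # drop newline and tab characters
    return re.sub(r"\s+", " ", text).strip()  # collapse remaining whitespace

print(normalize("Bugün\tHava çok güzel,\ndışarı çıkalım mı?"))
# -> "bugün hava çok güzel, dışarı çıkalım mı?"
```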
## Evaluation

| Model | Accuracy | Macro-F1 | F1 (ok) | F1 (gibberish) |
|---|---|---|---|---|
| Base model | 0.6257 | 0.6254 | 0.61 | 0.64 |
| Fine-tuned model | 0.7369 | 0.7340 | 0.76 | 0.71 |
Test set size: 202,669 sentences
Evaluation metrics: Accuracy, Macro-F1, per-class Precision/Recall/F1
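The table above can be reproduced with a loop like the following sketch; the `test` split name, the `text`/`label` column names, and the `ok`/`gibberish` label strings are assumptions:

```python
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report
from transformers import pipeline

ds = load_dataset("yeniguno/turkish-gibberish-detection", split="test")  # split name is an assumption
pipe = pipeline("text-classification", model="yeniguno/turkish-gibberish-detection-ft")

label2id = {"ok": 0, "gibberish": 1}
preds = [label2id[out["label"]] for out in pipe(ds["text"], batch_size=64, truncation=True)]

print("Accuracy:", accuracy_score(ds["label"], preds))
print(classification_report(ds["label"], preds, target_names=["ok", "gibberish"], digits=4))
```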