---
library_name: transformers
tags:
- text-classification
- nlp
- turkish
- gibberish
- gibberish-classification
license: apache-2.0
datasets:
- yeniguno/turkish-gibberish-detection
language:
- tr
base_model:
- TURKCELL/gibberish-sentence-detection-model-tr
pipeline_tag: text-classification
---

# 🇹🇷 Turkish Gibberish Sentence Detection (Fine-Tuned)

This model classifies Turkish text as **clean or gibberish**. It is fine-tuned from `TURKCELL/gibberish-sentence-detection-model-tr`.

## How to Get Started with the Model

```python
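from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
pipe = pipeline("text-classification", model="yeniguno/turkish-gibberish-detection-ft")

examples = [
    "bugün hava çok güzel, dışarı çıkalım mı?",  # meaningful Turkish
    "asdfghjk qwe!!! 🙃🙃🙃",  # random keyboard input
    "bgn asdqwe güzel qqqqqqqqqq",  # real words mixed with noise
]

# For a single string, the pipeline returns a one-element list
# holding a dict with "label" and "score" keys.
for text in examples:
    print(text, "->", pipe(text)[0])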
```

## Model Details

### Model Description

- **Base model:** `TURKCELL/gibberish-sentence-detection-model-tr`  
- **Language:** Turkish  
- **Task:** Binary Text Classification (Gibberish Detection)  
- **Labels:**  
  - `0 → ok` — meaningful Turkish text  
  - `1 → gibberish` — meaningless or noisy text (nonsense, random keyboard input, malformed words)  

## Uses

This model is designed to be used in **LLM guardrail systems** as an **input quality scanner**.  
Since LLM inference is computationally and financially expensive, it is inefficient to process meaningless or malformed text.  

Running this model **before** sending user input to an LLM lets you detect and filter **gibberish or nonsensical text**, which avoids unnecessary API calls and improves overall system efficiency.

Typical use cases include:
- **Pre-filtering** user messages in chatbots or virtual assistants  
- **Guardrail modules** in enterprise LLM applications  
- **Quality control** for large-scale text ingestion pipelines  
- **Spam and noise detection** in user-generated content

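If the input is classified as **gibberish**, it can be discarded or handled separately without invoking the LLM. Since the fine-tuned model reaches roughly 0.74 accuracy on the test set, some valid messages will be flagged, so a fallback such as asking the user to rephrase is preferable to silently dropping input.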
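
A minimal sketch of such a pre-filter follows. It assumes the pipeline reports the gibberish class under the label string `gibberish` or `LABEL_1`; the exact string depends on the model's config and is not confirmed here. The helper name `should_forward`, the 0.9 threshold, and the stub classifier are illustrative and not part of the model's API.

```python
from typing import Callable, Dict, List

# Assumption: the gibberish class is reported under one of these label strings.
GIBBERISH_LABELS = {"gibberish", "LABEL_1"}

Classifier = Callable[[str], List[Dict[str, object]]]


def should_forward(text: str, classify: Classifier, threshold: float = 0.9) -> bool:
    """Return False only when the top prediction is gibberish with score >= threshold."""
    top = classify(text)[0]
    is_gibberish = top["label"] in GIBBERISH_LABELS and float(top["score"]) >= threshold
    return not is_gibberish


def stub_classify(text: str) -> List[Dict[str, object]]:
    """Hypothetical stand-in for the real pipeline so the sketch runs offline."""
    if "asdf" in text:
        return [{"label": "gibberish", "score": 0.98}]
    return [{"label": "ok", "score": 0.95}]


for message in ["bugün hava çok güzel", "asdfghjk qwe"]:
    action = "send to LLM" if should_forward(message, stub_classify) else "skip LLM"
    print(f"{message!r}: {action}")
```

With the real model, pass the `pipe` object from the getting-started snippet as `classify`. A high threshold errs toward forwarding borderline inputs, which trades some wasted LLM calls for fewer wrongly rejected messages.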

## Training Details

### Training Data

**Dataset:** [yeniguno/turkish-gibberish-detection](https://huggingface.co/datasets/yeniguno/turkish-gibberish-detection)

| Label | Count | Description |
|:------|------:|:-------------|
| 0 (ok) | 651,431 | valid, meaningful Turkish text |
| 1 (gibberish) | 699,999 | random keyboard strings, misspelled or malformed text |

All samples are **lowercased and cleaned**, with no newline or tab characters.
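
Because the training samples are lowercased and free of newlines and tabs, normalizing inputs the same way before classification may help. The sketch below is an assumption about what that cleaning could look like, not the dataset's actual preprocessing, and `normalize` is a hypothetical helper. It maps the Turkish capitals `I` and `İ` explicitly, since Python's default `str.lower()` turns `I` into `i` rather than `ı`.

```python
def normalize(text: str) -> str:
    """Lowercase with Turkish I/İ handling and collapse whitespace to single spaces."""
    # Handle the two Turkish-specific capitals before the generic lowercasing.
    text = text.replace("I", "ı").replace("İ", "i")
    text = text.lower()
    # str.split() with no arguments splits on any whitespace, including \n and \t.
    return " ".join(text.split())


print(normalize("Bugün\tHAVA\nIŞIK İyi"))  # bugün hava ışık iyi
```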

## Evaluation

| Model | Accuracy | Macro-F1 | F1(ok) | F1(gibberish) |
|:------|:---------:|:--------:|:------:|:--------------:|
| **Base model** | 0.6257 | 0.6254 | 0.61 | 0.64 |
| **Fine-tuned model** | **0.7369** | **0.7340** | 0.76 | 0.71 |

**Test set size:** 202,669 sentences  
**Evaluation metrics:** Accuracy, Macro-F1, per-class Precision/Recall/F1
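
Macro-F1 is the unweighted mean of the per-class F1 scores, so the table can be sanity-checked from its own columns. The sketch below uses the rounded per-class values shown above, so the results agree with the reported Macro-F1 only to about two decimal places.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Per-class F1 values (ok, gibberish) copied from the evaluation table.
base = macro_f1([0.61, 0.64])
fine_tuned = macro_f1([0.76, 0.71])

print(f"base: {base:.3f}, fine-tuned: {fine_tuned:.3f}")
```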