---
library_name: transformers
tags:
- text-classification
- nlp
- turkish
- gibberish
- gibberish-classification
license: apache-2.0
datasets:
- yeniguno/turkish-gibberish-detection
language:
- tr
base_model:
- TURKCELL/gibberish-sentence-detection-model-tr
pipeline_tag: text-classification
---

# 🇹🇷 Turkish Gibberish Sentence Detection (Fine-Tuned)

This model detects whether a given **Turkish text is clean or gibberish**.

## How to Get Started with the Model

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
pipe = pipeline("text-classification", model="yeniguno/turkish-gibberish-detection-ft")

examples = [
    "bugün hava çok güzel, dışarı çıkalım mı?",
    "asdfghjk qwe!!! 🙃🙃🙃",
    "bgn asdqwe güzel qqqqqqqqqq",
]

# The pipeline returns a list with one dict (label and score) per input.
for text in examples:
    print(text, "->", pipe(text)[0])
```

## Model Details

### Model Description

- **Base model:** `TURKCELL/gibberish-sentence-detection-model-tr`
- **Language:** Turkish
- **Task:** Binary Text Classification (Gibberish Detection)
- **Labels:**
  - `0 → ok`: meaningful Turkish text
  - `1 → gibberish`: meaningless or noisy text (nonsense, random keyboard input, malformed words)

## Uses

This model is designed to be used in **LLM guardrail systems** as an **input quality scanner**.

Since LLM inference is computationally and financially expensive, it is inefficient to process meaningless or malformed text. By running this model **before** sending user input to an LLM, you can automatically detect and filter **gibberish or nonsensical text**, which prevents unnecessary API calls and improves overall system efficiency.

Typical use cases include:

- **Pre-filtering** user messages in chatbots or virtual assistants
- **Guardrail modules** in enterprise LLM applications
- **Quality control** for large-scale text ingestion pipelines
- **Spam and noise detection** in user-generated content

If the input is classified as **gibberish**, it can be discarded or handled separately without invoking the LLM.
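
The pre-filtering flow described above can be sketched as a small gate placed in front of the LLM call. This is a minimal illustration, not part of the model's API: `is_gibberish`, `guarded_call`, `GIBBERISH_LABELS`, `FALLBACK_MESSAGE`, and the `0.5` threshold are hypothetical names and values chosen for the example. It also assumes the classifier returns the usual `{"label": ..., "score": ...}` dicts; the exact label strings depend on the model configuration, so both `gibberish` and the generic `LABEL_1` are accepted.

```python
from typing import Any, Callable, Dict, List

# Assumption: label strings as they might appear in the pipeline output.
# Check the model's config (id2label) and adjust this set if needed.
GIBBERISH_LABELS = {"gibberish", "label_1"}

# Hypothetical canned reply returned when the input is rejected.
FALLBACK_MESSAGE = "Sorry, I could not understand that. Could you rephrase?"


def is_gibberish(prediction: Dict[str, Any], threshold: float = 0.5) -> bool:
    """Return True if the prediction is the gibberish class with enough confidence."""
    label = str(prediction["label"]).lower()
    score = float(prediction["score"])
    return label in GIBBERISH_LABELS and score >= threshold


def guarded_call(
    text: str,
    classify: Callable[[str], List[Dict[str, Any]]],
    call_llm: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Run the classifier first and only invoke the LLM for non-gibberish input."""
    prediction = classify(text)[0]
    if is_gibberish(prediction, threshold):
        # Skip the expensive LLM call entirely.
        return FALLBACK_MESSAGE
    return call_llm(text)
```

With the `pipe` object from the getting-started snippet, the gate would be called as `guarded_call(user_text, pipe, my_llm)`, where `my_llm` stands for whatever function wraps your LLM client. Raising the threshold trades fewer wrongly rejected messages for more gibberish reaching the LLM, so it is worth tuning on your own traffic.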

## Training Details

### Training Data

**Dataset:** [yeniguno/turkish-gibberish-detection](https://huggingface.co/datasets/yeniguno/turkish-gibberish-detection)

| Label | Count | Description |
|:------|------:|:-------------|
| 0 (ok) | 651,431 | valid, meaningful Turkish text |
| 1 (gibberish) | 699,999 | random keyboard strings, misspelled or malformed text |

All samples are **lowercased and cleaned**, with no newline or tab characters.

## Evaluation

| Model | Accuracy | Macro-F1 | F1(ok) | F1(gibberish) |
|:------|:---------:|:--------:|:------:|:--------------:|
| **Base model** | 0.6257 | 0.6254 | 0.61 | 0.64 |
| **Fine-tuned model** | **0.7369** | **0.7340** | 0.76 | 0.71 |

**Test set size:** 202,669 sentences

**Evaluation metrics:** Accuracy, Macro-F1, per-class Precision/Recall/F1
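
For readers who want to reproduce these metrics on their own labeled data, the following dependency-free sketch computes accuracy, per-class F1, and Macro-F1 from integer labels (`0` for ok, `1` for gibberish). The card does not say which tooling produced the reported numbers, so `per_class_f1` and `evaluate` are hypothetical helpers that follow the standard definitions. Macro-F1 here is the unweighted mean of the two per-class F1 scores, which agrees with the table up to rounding (the mean of 0.76 and 0.71 is 0.735).

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # By convention, return 0.0 when the class never appears or is never predicted.
    return 2 * tp / denom if denom else 0.0


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, per-class F1, and macro-averaged F1 for binary labels 0/1."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    f1_ok = per_class_f1(y_true, y_pred, 0)
    f1_gibberish = per_class_f1(y_true, y_pred, 1)
    return {
        "accuracy": accuracy,
        "f1_ok": f1_ok,
        "f1_gibberish": f1_gibberish,
        "macro_f1": (f1_ok + f1_gibberish) / 2,
    }
```

Calling `evaluate(gold_labels, predicted_labels)` returns a dict with the same four quantities as the table columns, which makes it easy to compare a custom test set against the reported results.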