ettin-encoder-150M-TR-HD
Model Description
ettin-encoder-150M-TR-HD is a highly efficient 150M parameter model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. This model offers an optimal balance between performance and efficiency, achieving strong results across all task types while maintaining computational efficiency suitable for production deployment.
This model is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications. It performs token-level binary classification to identify unsupported claims in Turkish text generated by RAG systems.
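Token-level binary classification here means every token of the generated answer receives a 0 (supported by the context) or 1 (hallucinated) label, and consecutive 1-labels form hallucinated spans. A minimal illustrative sketch (not the model's actual inference code, and with a toy whitespace tokenization):

```python
# Illustrative sketch of token-level binary hallucination labeling:
# each answer token gets 0 (supported) or 1 (hallucinated), and runs of
# consecutive 1-labels are merged into hallucinated spans.

def labels_to_spans(tokens, labels):
    """Merge consecutive hallucinated tokens (label 1) into text spans."""
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == 1:
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["Ankara", "is", "the", "capital", "and", "largest", "city", "of", "Turkey"]
labels = [0, 0, 0, 0, 0, 1, 1, 0, 0]  # "largest city" is unsupported by the context
print(labels_to_spans(tokens, labels))  # -> ['largest city']
```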
Model Details
- Model Type: Encoder-based transformer for token classification
- Parameters: 150M
- Language: Turkish
- Task: Hallucination Detection (Token-Level Binary Classification)
- Framework: LettuceDetect
- Base Model: ettin-encoder-150M-TR
- Fine-tuned on: RAGTruth-TR dataset
Performance Highlights
Example-Level Performance (Whole Dataset)
- F1-Score: 62.64% (outperforms GPT-4.1's 53.97%)
- Precision: 73.43% (nearly double GPT-4.1's 37.09%)
- Recall: 54.62% (a deliberate trade-off that favors precision over exhaustive detection)
- AUROC: 82.66% (significantly higher than LLM baselines)
Task-Specific Performance
Data2txt Task (Exceptional Performance):
- F1-Score: 75.61%
- Precision: 87.33% (highest precision among all evaluated models)
- Recall: 66.67%
- AUROC: 82.70%
QA Task:
- F1-Score: 53.18%
- Precision: 49.46%
- Recall: 57.50%
- AUROC: 83.10%
Summary Task:
- F1-Score: 26.28%
- Precision: 50.00%
- Recall: 17.82%
- AUROC: 68.42%
Token-Level Performance (Whole Dataset)
- F1-Score: 38.40%
- Precision: 51.04%
- Recall: 30.78%
- AUROC: 64.76%
Token-Level Task Performance:
- QA: F1 46.05%, AUROC 70.45%
- Data2txt: F1 39.07%, AUROC 64.47%
- Summary: F1 16.19%, AUROC 54.84%
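Both the example-level and token-level scores above are standard binary-classification metrics over gold labels and model predictions (1 = hallucinated). A self-contained sketch of how they are computed, using toy data rather than the actual RAGTruth-TR evaluation:

```python
# Minimal sketch of precision / recall / F1 over binary gold labels and
# binary predictions (1 = hallucinated). Toy data for illustration only.

def precision_recall_f1(gold, pred):
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 0, 1, 0]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # -> precision=0.75 recall=0.75 f1=0.75
```

AUROC is computed the same way conceptually, but from the model's continuous confidence scores rather than thresholded binary predictions.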
Key Advantages
- Optimal Efficiency-Performance Balance: 150M parameters provide enhanced performance while maintaining efficiency
- Exceptional Precision: 73.43% precision (whole dataset) and 87.33% in Data2txt tasks
- Strong Discriminative Power: 82.66% AUROC demonstrates superior ability to distinguish hallucinations
- Production-Ready: Fast inference suitable for real-time RAG pipelines
- Superior to LLMs: Outperforms GPT-4.1 and Mistral Small on example-level F1, precision, and AUROC
Intended Use
This model is designed for:
- Turkish RAG Systems: Detecting hallucinations in generated Turkish text
- Production Deployment: Real-time hallucination detection with emphasis on precision
- Data2txt Applications: Exceptional performance in data-to-text generation scenarios
- Balanced Performance Requirements: Applications requiring both precision and recall
Training Data
The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:
- Training Samples: 17,790 examples
- Test Samples: 2,700 examples
- Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
- Annotation: Token-level hallucination labels preserved during translation
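Preserving token-level labels means the character-offset hallucination annotations from RAGTruth must be re-aligned to tokens after translation. A hedged sketch of the alignment step, using a whitespace tokenizer for illustration (the real pipeline would use the model's subword tokenizer and its offset mappings):

```python
# Hedged sketch: carry character-level hallucination annotations over to
# token-level labels. A token is labeled 1 if it overlaps any annotated span.
# Whitespace tokenization is used here purely for illustration.
import re

def char_spans_to_token_labels(text, spans):
    """spans: list of (start, end) character offsets marked as hallucinated."""
    labels = []
    for m in re.finditer(r"\S+", text):
        overlaps = any(m.start() < end and start < m.end() for start, end in spans)
        labels.append(1 if overlaps else 0)
    return labels

answer = "Ankara is the largest city of Turkey"
hallucinated = [(14, 26)]  # characters spanning "largest city"
print(char_spans_to_token_labels(answer, hallucinated))  # -> [0, 0, 0, 1, 1, 0, 0]
```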
Evaluation Data
The model was evaluated on the RAGTruth-TR test set across three task types:
- Summary: 900 examples
- Data2txt: 900 examples
- QA: 900 examples
- Whole Dataset: 2,700 examples
Limitations
- Token-Level Performance: Token-level F1-scores (38.40%) are lower than larger specialized models (71-78%)
- Summary Task: Lower performance in summarization tasks (26.28% F1) with low recall (17.82%)
- Language Specificity: Trained specifically for Turkish; performance on other languages not evaluated
- Domain Specificity: Optimized for RAG scenarios; may not generalize to other hallucination detection contexts
Recommendations
Use this model when:
- Enhanced precision is required (73% vs. 61% for 32M model)
- Data2txt tasks are primary use case (87.33% precision, 75.61% F1)
- Slightly higher computational resources are available
- Balanced performance with efficiency is the goal
- Precision is more critical than recall
Consider alternatives when:
- Maximum accuracy is required (use ModernBERT: 78% F1)
- Summary tasks are primary use case (consider larger models)
- Maximum efficiency is critical (use 32M model)
How to Use
```python
from lettucedetect.models.inference import HallucinationDetector

# Load the model
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/ettin-encoder-150M-TR-HD",
)

# Detect hallucinations (context is a list of source passages)
context = ["Your source document text..."]
question = "Your question..."
answer = "Generated answer text..."

# Returns hallucinated spans with character offsets and confidence scores
predictions = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans",
)
print(predictions)
```
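The span predictions can be post-processed into a marked-up answer for review. A hedged sketch, assuming each prediction is a dict with `"start"`/`"end"` character offsets into the answer (as in LettuceDetect's span output format; verify against your installed version):

```python
# Hedged sketch: wrap predicted hallucinated spans in [HAL]...[/HAL] markers.
# Assumes spans carry "start"/"end" character offsets into the answer string.

def mark_spans(answer, spans):
    out, last = [], 0
    for sp in sorted(spans, key=lambda s: s["start"]):
        out.append(answer[last:sp["start"]])
        out.append("[HAL]" + answer[sp["start"]:sp["end"]] + "[/HAL]")
        last = sp["end"]
    out.append(answer[last:])
    return "".join(out)

answer = "Ankara is the largest city of Turkey"
spans = [{"start": 14, "end": 26}]
print(mark_spans(answer, spans))  # -> Ankara is the [HAL]largest city[/HAL] of Turkey
```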
Citation
If you use this model, please cite:
@misc{taş2025turklettucedetecthallucinationdetectionmodels,
title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
year={2025},
eprint={2509.17671},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.17671},
}
Model Card Contact
For questions or issues, please open an issue on the project repository.