# ettin-encoder-32M-TR-HD

## Model Description
ettin-encoder-32M-TR-HD is an ultra-efficient 32M-parameter model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. Despite its small size, it outperforms large language models such as GPT-4.1 and Mistral Small on balanced metrics while offering substantial computational efficiency advantages.
This model is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications. It performs token-level binary classification to identify unsupported claims in Turkish text generated by RAG systems.
## Model Details
- Model Type: Encoder-based transformer for token classification
- Parameters: 32M
- Language: Turkish
- Task: Hallucination Detection (Token-Level Binary Classification)
- Framework: LettuceDetect
- Base Model: ettin-encoder-32M-TR
- Fine-tuned on: RAGTruth-TR dataset
## Performance Highlights

### Example-Level Performance (Whole Dataset)
- F1-Score: 61.35% (outperforms GPT-4.1's 53.97%)
- Precision: 61.38% (vs. GPT-4.1's 37.09%)
- Recall: 61.32% (balanced with precision, avoiding a flood of false positives)
- AUROC: 78.24% (vs. GPT-4.1's 54.45%)
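The example-level scores above are the standard binary precision, recall, and F1 over whether each response is flagged as containing a hallucination. A self-contained toy sketch of how these metrics are computed (illustrative only, not the paper's evaluation code):

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision/recall/F1 where 1 = hallucinated, 0 = supported."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 4 responses, 1 = contains a hallucination
p, r, f1 = precision_recall_f1([1, 0, 1, 0], [1, 1, 1, 0])
```

A high-precision/low-recall detector (like the GPT-4.1 baseline's inverse problem) would score well on one component but poorly on F1, which is why the balanced F1 comparison favors this model.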
### Task-Specific Performance

**Data2txt Task:**
- F1-Score: 75.75%
- Precision: 85.96%
- Recall: 67.70%
- AUROC: 82.11%
**QA Task:**
- F1-Score: 52.51%
- Precision: 41.37%
- Recall: 71.88%
- AUROC: 81.98%
**Summary Task:**
- F1-Score: 34.31%
- Precision: 33.98%
- Recall: 34.65%
- AUROC: 64.06%
### Token-Level Performance (Whole Dataset)
- F1-Score: 38.33%
- Precision: 40.65%
- Recall: 36.27%
- AUROC: 67.00%
**Token-Level Task Performance:**
- QA: AUROC 76.32% (strongest token-level performance)
- Data2txt: F1 37.87%, AUROC 64.10%
- Summary: F1 16.47%, AUROC 55.99%
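Token-level AUROC treats each token's predicted hallucination probability as a ranking score, measuring how reliably hallucinated tokens are ranked above supported ones. A minimal rank-based sketch (quadratic pairwise version, fine for illustration; a standard metrics library would be used in practice):

```python
def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy token scores: labels 1 = hallucinated token, scores = model probabilities
score = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Because AUROC is threshold-free, it shows the model's token ranking is informative (e.g. 76.32% on QA) even where the thresholded F1 is modest.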
## Key Advantages
- Ultra-Efficient: 32M parameters enable deployment on resource-constrained devices
- Balanced Performance: Avoids the extreme precision-recall imbalance of LLMs
- Production-Ready: Fast inference (30-60 examples/second) suitable for real-time RAG pipelines
- Cost-Effective: Minimal computational requirements reduce operational costs
- Superior to LLMs: Outperforms GPT-4.1 and Mistral Small on balanced metrics despite using fewer than 0.01% of their parameters
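Throughput figures like the 30-60 examples/second above can be checked with a simple wall-clock measurement. A generic sketch, where `detect_fn` is a hypothetical stand-in for any detector callable:

```python
import time

def measure_throughput(detect_fn, examples):
    """Return examples processed per second for a detection callable."""
    start = time.perf_counter()
    for ex in examples:
        detect_fn(ex)
    elapsed = time.perf_counter() - start
    return len(examples) / elapsed
```

On real hardware, a warm-up pass and several repetitions would be needed for a stable estimate.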
## Intended Use
This model is designed for:
- Turkish RAG Systems: Detecting hallucinations in generated Turkish text
- Production Deployment: Real-time hallucination detection in high-throughput pipelines
- Resource-Constrained Environments: Edge devices and cost-sensitive applications
- Token-Level Analysis: Fine-grained identification of unsupported claims
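Token-level analysis works by labeling each token 0 (supported) or 1 (unsupported) and merging consecutive positives into contiguous spans. A hypothetical sketch of that post-processing step (the actual LettuceDetect span logic may differ):

```python
def merge_token_spans(token_labels):
    """Merge runs of tokens labeled 1 into (start, end) index spans, end exclusive."""
    spans, start = [], None
    for i, label in enumerate(token_labels):
        if label == 1 and start is None:
            start = i                      # open a new hallucinated span
        elif label == 0 and start is not None:
            spans.append((start, i))       # close the current span
            start = None
    if start is not None:                  # span runs to the end of the sequence
        spans.append((start, len(token_labels)))
    return spans

# Tokens 2-4 and 6 flagged as unsupported
spans = merge_token_spans([0, 0, 1, 1, 1, 0, 1])
```

The resulting index spans can then be mapped back to character offsets in the generated answer for highlighting.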
## Training Data
The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:
- Training Samples: 17,790 examples
- Test Samples: 2,700 examples
- Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
- Annotation: Token-level hallucination labels preserved during translation
## Evaluation Data
The model was evaluated on the RAGTruth-TR test set across three task types:
- Summary: 900 examples
- Data2txt: 900 examples
- QA: 900 examples
- Whole Dataset: 2,700 examples
## Limitations
- Token-Level Performance: While competitive at example-level, token-level F1-scores (38.33%) are lower than larger specialized models (71-78%)
- Summary Task: Lower performance in summarization tasks (34.31% F1) compared to Data2txt and QA
- Language Specificity: Trained specifically for Turkish; performance on other languages not evaluated
- Domain Specificity: Optimized for RAG scenarios; may not generalize to other hallucination detection contexts
## Recommendations
Use this model when:
- Maximum efficiency and minimal resource usage are priorities
- Moderate performance (61% F1) is acceptable for your use case
- Edge deployment or mobile applications are required
- Cost optimization is critical
- Data2txt or QA tasks are the primary use cases
Consider larger models (ettin-encoder-150M-TR, ModernBERT) when:
- Maximum accuracy is required
- Token-level precision is critical
- Summary tasks are the primary use case
- Computational resources are abundant
## How to Use
```python
from lettucedetect import TransformerDetector

# Load the model
detector = TransformerDetector.from_pretrained(
    "newmindai/ettin-encoder-32M-TR-HD"
)

# Detect hallucinations
context = "Your source document text..."
question = "Your question..."
answer = "Generated answer text..."

result = detector.detect(
    context=context,
    question=question,
    answer=answer,
)

# Access token-level predictions
hallucinated_tokens = result.get_hallucinated_spans()
```
## Citation
If you use this model, please cite:
```bibtex
@misc{taş2025turklettucedetecthallucinationdetectionmodels,
  title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
  author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
  year={2025},
  eprint={2509.17671},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.17671},
}
```
## Model Card Contact
For questions or issues, please open an issue on the project repository.