ettin-encoder-150M-TR-HD

Model Description

ettin-encoder-150M-TR-HD is an efficient 150M-parameter encoder model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. It balances accuracy against compute, achieving strong results across all task types while remaining light enough for production deployment.

This model is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications. It performs token-level binary classification to identify unsupported claims in Turkish text generated by RAG systems.
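In practice this means the context, question, and answer are packed into a single input sequence and every answer token receives a binary supported/hallucinated label. The sketch below illustrates the idea with the plain Hugging Face transformers API; the exact sequence packing and the label mapping (1 = hallucinated) are assumptions made for illustration, since the LettuceDetect wrapper shown under "How to Use" handles these details.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Minimal sketch, assuming the checkpoint loads as a standard token-classification
# head and that label id 1 marks a hallucinated (unsupported) token.
model_name = "newmindai/ettin-encoder-150M-TR-HD"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

context = "Kaynak belge metni..."   # retrieved passage(s)
question = "Soru..."                # user question
answer = "Üretilen cevap..."        # RAG-generated answer to be checked

# Encode the prompt (context + question) as segment A and the answer as segment B.
inputs = tokenizer(context + " " + question, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)

label_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
flagged = [tok for tok, lab in zip(tokens, label_ids) if lab == 1]
print(flagged)  # answer tokens the model marks as unsupported by the context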

Model Details

  • Model Type: Encoder-based transformer for token classification
  • Parameters: 150M
  • Language: Turkish
  • Task: Hallucination Detection (Token-Level Binary Classification)
  • Framework: LettuceDetect
  • Base Model: ettin-encoder-150M-TR
  • Fine-tuned on: RAGTruth-TR dataset

Performance Highlights

Example-Level Performance (Whole Dataset)

  • F1-Score: 62.64% (outperforms GPT-4.1's 53.97%)
  • Precision: 73.43% (nearly double GPT-4.1's 37.09%)
  • Recall: 54.62% (a balanced operating point that avoids flooding outputs with false positives)
  • AUROC: 82.66% (significantly higher than LLM baselines)

Task-Specific Performance

Data2txt Task (Exceptional Performance):

  • F1-Score: 75.61%
  • Precision: 87.33% (highest precision among all evaluated models)
  • Recall: 66.67%
  • AUROC: 82.70%

QA Task:

  • F1-Score: 53.18%
  • Precision: 49.46%
  • Recall: 57.50%
  • AUROC: 83.10%

Summary Task:

  • F1-Score: 26.28%
  • Precision: 50.00%
  • Recall: 17.82%
  • AUROC: 68.42%

Token-Level Performance (Whole Dataset)

  • F1-Score: 38.40%
  • Precision: 51.04%
  • Recall: 30.78%
  • AUROC: 64.76%

Token-Level Task Performance:

  • QA: F1 46.05%, AUROC 70.45%
  • Data2txt: F1 39.07%, AUROC 64.47%
  • Summary: F1 16.19%, AUROC 54.84%
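The example-level and token-level scores above are computed at different granularities: token-level metrics compare per-token predictions with per-token labels, while example-level metrics treat a whole response as positive if it contains at least one hallucinated token. Below is a minimal sketch of that aggregation, assuming this convention and a 0.5 decision threshold, with scikit-learn for the metrics:

from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Hypothetical per-token outputs for three responses: each inner list holds
# (predicted probability of hallucination, gold label) for one token.
responses = [
    [(0.1, 0), (0.2, 0), (0.9, 1)],   # contains a hallucinated token -> positive example
    [(0.1, 0), (0.1, 0), (0.2, 0)],   # fully supported -> negative example
    [(0.7, 0), (0.3, 0), (0.4, 1)],
]

# Example-level aggregation: an example is positive if any of its tokens is
# positive; its score is the maximum token probability.
y_true = [int(any(lab == 1 for _, lab in r)) for r in responses]
y_score = [max(p for p, _ in r) for r in responses]
y_pred = [int(s >= 0.5) for s in y_score]

prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
auroc = roc_auc_score(y_true, y_score)
print(f"example-level P={prec:.2f} R={rec:.2f} F1={f1:.2f} AUROC={auroc:.2f}")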

Key Advantages

  1. Efficiency-Performance Balance: 150M parameters deliver stronger results than the 32M variant while keeping inference lightweight
  2. Exceptional Precision: 73.43% precision (whole dataset) and 87.33% in Data2txt tasks
  3. Strong Discriminative Power: 82.66% AUROC demonstrates superior ability to distinguish hallucinations
  4. Production-Ready: Fast inference suitable for real-time RAG pipelines
  5. Superior to LLM Baselines: Outperforms GPT-4.1 and Mistral Small on example-level metrics

Intended Use

This model is designed for:

  • Turkish RAG Systems: Detecting hallucinations in generated Turkish text
  • Production Deployment: Real-time hallucination detection with emphasis on precision
  • Data2txt Applications: Exceptional performance in data-to-text generation scenarios
  • Balanced Performance Requirements: Applications requiring both precision and recall

Training Data

The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:

  • Training Samples: 17,790 examples
  • Test Samples: 2,700 examples
  • Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
  • Annotation: Token-level hallucination labels preserved during translation
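The token-level labels are derived from character-level hallucination span annotations on the answers. A minimal sketch of how such spans can be projected onto token labels via tokenizer offsets (assuming a fast tokenizer that exposes offset mappings; the example answer, the span positions, and the overlap rule below are hypothetical, purely to illustrate the alignment):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("newmindai/ettin-encoder-150M-TR-HD")

answer = "Paris, Fransa'nın başkentidir ve nüfusu 90 milyondur."
# Hypothetical annotation: the unsupported claim covers characters 33..53.
hallucinated_spans = [(33, 53)]

enc = tokenizer(answer, return_offsets_mapping=True, add_special_tokens=False)
labels = []
for start, end in enc["offset_mapping"]:
    # A token is labeled 1 if it overlaps any annotated hallucination span.
    overlaps = any(start < s_end and end > s_start for s_start, s_end in hallucinated_spans)
    labels.append(int(overlaps))

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), labels)))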

Evaluation Data

The model was evaluated on the RAGTruth-TR test set across three task types:

  • Summary: 900 examples
  • Data2txt: 900 examples
  • QA: 900 examples
  • Whole Dataset: 2,700 examples

Limitations

  1. Token-Level Performance: Token-level F1-scores (38.40%) are lower than those of larger specialized models (71-78%)
  2. Summary Task: Lower performance in summarization tasks (26.28% F1) with low recall (17.82%)
  3. Language Specificity: Trained specifically for Turkish; performance on other languages not evaluated
  4. Domain Specificity: Optimized for RAG scenarios; may not generalize to other hallucination detection contexts

Recommendations

Use this model when:

  • Enhanced precision is required (73% vs. 61% for the 32M model)
  • Data2txt tasks are primary use case (87.33% precision, 75.61% F1)
  • Slightly higher computational resources are available
  • Balanced performance with efficiency is the goal
  • Precision is more critical than recall

Consider alternatives when:

  • Maximum accuracy is required (use ModernBERT: 78% F1)
  • Summary tasks are primary use case (consider larger models)
  • Maximum efficiency is critical (use 32M model)

How to Use

The model can be used through the LettuceDetect inference API:

from lettucedetect.models.inference import HallucinationDetector

# Load the model as a transformer-based detector
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/ettin-encoder-150M-TR-HD",
)

# Detect hallucinations
context = ["Your source document text..."]   # one or more retrieved passages
question = "Your question..."
answer = "Generated answer text..."

# Span-level predictions: parts of the answer not supported by the context
predictions = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans",
)
print(predictions)
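With output_format="spans", predict returns a list of span dictionaries (character start/end offsets into the answer, the flagged text, and a confidence score), which can be mapped back onto the generated answer to highlight the unsupported parts.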

Citation

If you use this model, please cite:

@misc{taş2025turklettucedetecthallucinationdetectionmodels,
      title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications}, 
      author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
      year={2025},
      eprint={2509.17671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.17671}, 
}

Model Card Contact

For questions or issues, please open an issue on the project repository.
