ettin-encoder-150M-TR-HD

Model Description

ettin-encoder-150M-TR-HD is an efficient 150M-parameter encoder model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. It balances accuracy against compute, achieving strong results across all task types while remaining light enough for production deployment.

This model is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications. It performs token-level binary classification to identify unsupported claims in Turkish text generated by RAG systems.
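In practice this means the context, question, and answer are packed into a single input sequence and every answer token receives a binary supported/hallucinated label. The sketch below illustrates the idea with the plain Hugging Face transformers API; the exact sequence packing and the label mapping (1 = hallucinated) are assumptions made for illustration, since the LettuceDetect wrapper shown under "How to Use" handles these details.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Minimal sketch, assuming the checkpoint loads as a standard token-classification
# head and that label id 1 marks a hallucinated (unsupported) token.
model_name = "newmindai/ettin-encoder-150M-TR-HD"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

context = "Kaynak belge metni..."   # retrieved passage(s)
question = "Soru..."                # user question
answer = "Üretilen cevap..."        # RAG-generated answer to be checked

# Encode the prompt (context + question) as segment A and the answer as segment B.
inputs = tokenizer(context + " " + question, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)

label_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
flagged = [tok for tok, lab in zip(tokens, label_ids) if lab == 1]
print(flagged)  # answer tokens the model marks as unsupported by the context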

Model Details

  • Model Type: Encoder-based transformer for token classification
  • Parameters: 150M
  • Language: Turkish
  • Task: Hallucination Detection (Token-Level Binary Classification)
  • Framework: LettuceDetect
  • Base Model: ettin-encoder-150M-TR
  • Fine-tuned on: RAGTruth-TR dataset

Performance Highlights

Example-Level Performance (Whole Dataset)

  • F1-Score: 62.64% (outperforms GPT-4.1's 53.97%)
  • Precision: 73.43% (nearly double GPT-4.1's 37.09%)
  • Recall: 54.62% (a balanced operating point that avoids flooding outputs with false positives)
  • AUROC: 82.66% (significantly higher than LLM baselines)

Task-Specific Performance

Data2txt Task (Exceptional Performance):

  • F1-Score: 75.61%
  • Precision: 87.33% (highest precision among all evaluated models)
  • Recall: 66.67%
  • AUROC: 82.70%

QA Task:

  • F1-Score: 53.18%
  • Precision: 49.46%
  • Recall: 57.50%
  • AUROC: 83.10%

Summary Task:

  • F1-Score: 26.28%
  • Precision: 50.00%
  • Recall: 17.82%
  • AUROC: 68.42%

Token-Level Performance (Whole Dataset)

  • F1-Score: 38.40%
  • Precision: 51.04%
  • Recall: 30.78%
  • AUROC: 64.76%

Token-Level Task Performance:

  • QA: F1 46.05%, AUROC 70.45%
  • Data2txt: F1 39.07%, AUROC 64.47%
  • Summary: F1 16.19%, AUROC 54.84%
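The example-level and token-level scores above are computed at different granularities: token-level metrics compare per-token predictions with per-token labels, while example-level metrics treat a whole response as positive if it contains at least one hallucinated token. Below is a minimal sketch of that aggregation, assuming this convention and a 0.5 decision threshold, with scikit-learn for the metrics:

from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Hypothetical per-token outputs for three responses: each inner list holds
# (predicted probability of hallucination, gold label) for one token.
responses = [
    [(0.1, 0), (0.2, 0), (0.9, 1)],   # contains a hallucinated token -> positive example
    [(0.1, 0), (0.1, 0), (0.2, 0)],   # fully supported -> negative example
    [(0.7, 0), (0.3, 0), (0.4, 1)],
]

# Example-level aggregation: an example is positive if any of its tokens is
# positive; its score is the maximum token probability.
y_true = [int(any(lab == 1 for _, lab in r)) for r in responses]
y_score = [max(p for p, _ in r) for r in responses]
y_pred = [int(s >= 0.5) for s in y_score]

prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
auroc = roc_auc_score(y_true, y_score)
print(f"example-level P={prec:.2f} R={rec:.2f} F1={f1:.2f} AUROC={auroc:.2f}")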

Key Advantages

  1. Efficiency-Performance Balance: 150M parameters deliver stronger results than the 32M variant while keeping inference lightweight
  2. Exceptional Precision: 73.43% precision (whole dataset) and 87.33% in Data2txt tasks
  3. Strong Discriminative Power: 82.66% AUROC demonstrates superior ability to distinguish hallucinations
  4. Production-Ready: Fast inference suitable for real-time RAG pipelines
  5. Superior to LLM Baselines: Outperforms GPT-4.1 and Mistral Small on example-level metrics

Intended Use

This model is designed for:

  • Turkish RAG Systems: Detecting hallucinations in generated Turkish text
  • Production Deployment: Real-time hallucination detection with emphasis on precision
  • Data2txt Applications: Exceptional performance in data-to-text generation scenarios
  • Balanced Performance Requirements: Applications requiring both precision and recall

Training Data

The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:

  • Training Samples: 17,790 examples
  • Test Samples: 2,700 examples
  • Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
  • Annotation: Token-level hallucination labels preserved during translation
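The token-level labels are derived from character-level hallucination span annotations on the answers. A minimal sketch of how such spans can be projected onto token labels via tokenizer offsets (assuming a fast tokenizer that exposes offset mappings; the example answer, the span positions, and the overlap rule below are hypothetical, purely to illustrate the alignment):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("newmindai/ettin-encoder-150M-TR-HD")

answer = "Paris, Fransa'nın başkentidir ve nüfusu 90 milyondur."
# Hypothetical annotation: the unsupported claim covers characters 33..53.
hallucinated_spans = [(33, 53)]

enc = tokenizer(answer, return_offsets_mapping=True, add_special_tokens=False)
labels = []
for start, end in enc["offset_mapping"]:
    # A token is labeled 1 if it overlaps any annotated hallucination span.
    overlaps = any(start < s_end and end > s_start for s_start, s_end in hallucinated_spans)
    labels.append(int(overlaps))

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), labels)))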

Evaluation Data

The model was evaluated on the RAGTruth-TR test set across three task types:

  • Summary: 900 examples
  • Data2txt: 900 examples
  • QA: 900 examples
  • Whole Dataset: 2,700 examples

Limitations

  1. Token-Level Performance: Token-level F1-scores (38.40%) are lower than those of larger specialized models (71-78%)
  2. Summary Task: Lower performance in summarization tasks (26.28% F1) with low recall (17.82%)
  3. Language Specificity: Trained specifically for Turkish; performance on other languages not evaluated
  4. Domain Specificity: Optimized for RAG scenarios; may not generalize to other hallucination detection contexts

Recommendations

Use this model when:

  • Enhanced precision is required (73% vs. 61% for the 32M model)
  • Data2txt tasks are primary use case (87.33% precision, 75.61% F1)
  • Slightly higher computational resources are available
  • Balanced performance with efficiency is the goal
  • Precision is more critical than recall

Consider alternatives when:

  • Maximum accuracy is required (use ModernBERT: 78% F1)
  • Summary tasks are primary use case (consider larger models)
  • Maximum efficiency is critical (use 32M model)

How to Use

The model can be used through the LettuceDetect inference API:

from lettucedetect.models.inference import HallucinationDetector

# Load the model as a transformer-based detector
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/ettin-encoder-150M-TR-HD",
)

# Detect hallucinations
context = ["Your source document text..."]   # one or more retrieved passages
question = "Your question..."
answer = "Generated answer text..."

# Span-level predictions: parts of the answer not supported by the context
predictions = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans",
)
print(predictions)
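With output_format="spans", predict returns a list of span dictionaries (character start/end offsets into the answer, the flagged text, and a confidence score), which can be mapped back onto the generated answer to highlight the unsupported parts.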

Citation

If you use this model, please cite:

@misc{taş2025turklettucedetecthallucinationdetectionmodels,
      title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications}, 
      author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
      year={2025},
      eprint={2509.17671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.17671}, 
}

Model Card Contact

For questions or issues, please open an issue on the project repository.
