ettin-encoder-32M-TR-HD

Model Description

ettin-encoder-32M-TR-HD is an ultra-efficient 32M-parameter model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. Despite its small size, it achieves competitive performance, outperforming much larger language models such as GPT-4.1 and Mistral Small on balanced example-level metrics while offering substantial computational efficiency advantages.

This model is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications. It performs token-level binary classification to identify unsupported claims in Turkish text generated by RAG systems.
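The token-level formulation can be pictured as follows: the model emits a binary supported/unsupported label per token, and consecutive unsupported tokens are merged into character-level spans. This is a minimal illustrative sketch, not the LettuceDetect implementation; the token triples and helper name are hypothetical.

```python
# Sketch: merging per-token binary hallucination labels into character spans.
# `tokens` are (text, start_char, end_char) triples; `labels` are 0/1 per token.

def labels_to_spans(tokens, labels):
    spans = []
    current = None
    for (text, start, end), label in zip(tokens, labels):
        if label == 1:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

tokens = [("Paris", 0, 5), ("has", 6, 9), ("99", 10, 12), ("million", 13, 20), ("people", 21, 27)]
labels = [0, 0, 1, 1, 0]
print(labels_to_spans(tokens, labels))  # [(10, 20)]
```

The span output makes the detector's verdict directly highlightable in the generated answer.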

Model Details

  • Model Type: Encoder-based transformer for token classification
  • Parameters: 32M
  • Language: Turkish
  • Task: Hallucination Detection (Token-Level Binary Classification)
  • Framework: LettuceDetect
  • Base Model: ettin-encoder-32M-TR
  • Fine-tuned on: RAGTruth-TR dataset

Performance Highlights

Example-Level Performance (Whole Dataset)

  • F1-Score: 61.35% (outperforms GPT-4.1's 53.97%)
  • Precision: 61.38% (vs. GPT-4.1's 37.09%)
  • Recall: 61.32% (balanced with precision, avoiding the false-positive-heavy behavior of LLM baselines)
  • AUROC: 78.24% (vs. GPT-4.1's 54.45%)
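The precision/recall trade-off behind these numbers can be made concrete with a small calculation. The confusion counts below are illustrative only (chosen to roughly mirror the reported percentages), not the actual evaluation data.

```python
# Sketch: why a balanced detector beats a recall-heavy one on F1.

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A balanced detector: similar precision and recall
p, r, f1 = prf(tp=613, fp=386, fn=387)
print(f"balanced:     P={p:.3f} R={r:.3f} F1={f1:.3f}")

# A recall-heavy detector: high recall bought with many false positives
p, r, f1 = prf(tp=900, fp=1527, fn=100)
print(f"recall-heavy: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean, flooding the output with false positives drags the score down even when recall is near perfect.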

Task-Specific Performance

Data2txt Task:

  • F1-Score: 75.75%
  • Precision: 85.96%
  • Recall: 67.70%
  • AUROC: 82.11%

QA Task:

  • F1-Score: 52.51%
  • Precision: 41.37%
  • Recall: 71.88%
  • AUROC: 81.98%

Summary Task:

  • F1-Score: 34.31%
  • Precision: 33.98%
  • Recall: 34.65%
  • AUROC: 64.06%

Token-Level Performance (Whole Dataset)

  • F1-Score: 38.33%
  • Precision: 40.65%
  • Recall: 36.27%
  • AUROC: 67.00%

Token-Level Task Performance:

  • QA: AUROC 76.32% (strongest token-level performance)
  • Data2txt: F1 37.87%, AUROC 64.10%
  • Summary: F1 16.47%, AUROC 55.99%
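The AUROC figures above are threshold-independent: they measure how well the model's scores rank hallucinated items above supported ones. A minimal sketch of this interpretation, using toy scores and labels (not evaluation data):

```python
# Sketch: AUROC as the probability that a random positive outranks
# a random negative (ties count as half a win).

def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # model confidence of hallucination
labels = [1, 1, 1, 0, 0, 0]                # ground truth
print(auroc(scores, labels))  # 0.888...
```

An AUROC of 0.5 corresponds to random ranking, which is why the token-level Summary figure (55.99%) signals the weakest task.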

Key Advantages

  1. Ultra-Efficient: 32M parameters enable deployment on resource-constrained devices
  2. Balanced Performance: Avoids the extreme precision-recall imbalance of LLMs
  3. Production-Ready: Fast inference (30-60 examples/second) suitable for real-time RAG pipelines
  4. Cost-Effective: Minimal computational requirements reduce operational costs
  5. Superior to LLMs: Outperforms GPT-4.1 and Mistral Small in balanced metrics despite using <0.01% of their parameters
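Throughput claims like the 30-60 examples/second above are easy to verify for your own hardware. A hedged sketch of such a measurement, where `detect` is a stand-in placeholder to be replaced by the real detector call:

```python
import time

def detect(example):
    # Placeholder for model inference; swap in the real detector here.
    return []

def throughput(examples, detect_fn):
    """Return processed examples per second for a sequential run."""
    start = time.perf_counter()
    for ex in examples:
        detect_fn(ex)
    elapsed = time.perf_counter() - start
    return len(examples) / elapsed

rate = throughput([{"context": "...", "answer": "..."}] * 100, detect)
print(f"{rate:.0f} examples/s")
```

Measuring on a representative batch of your own RAG outputs gives a more reliable number than any quoted benchmark.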

Intended Use

This model is designed for:

  • Turkish RAG Systems: Detecting hallucinations in generated Turkish text
  • Production Deployment: Real-time hallucination detection in high-throughput pipelines
  • Resource-Constrained Environments: Edge devices and cost-sensitive applications
  • Token-Level Analysis: Fine-grained identification of unsupported claims

Training Data

The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:

  • Training Samples: 17,790 examples
  • Test Samples: 2,700 examples
  • Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
  • Annotation: Token-level hallucination labels preserved during translation

Evaluation Data

The model was evaluated on the RAGTruth-TR test set across three task types:

  • Summary: 900 examples
  • Data2txt: 900 examples
  • QA: 900 examples
  • Whole Dataset: 2,700 examples

Limitations

  1. Token-Level Performance: While competitive at the example level, token-level F1 (38.33%) trails that of larger specialized models (71-78%)
  2. Summary Task: Lower performance in summarization tasks (34.31% F1) compared to Data2txt and QA
  3. Language Specificity: Trained specifically for Turkish; performance on other languages not evaluated
  4. Domain Specificity: Optimized for RAG scenarios; may not generalize to other hallucination detection contexts

Recommendations

Use this model when:

  • Maximum efficiency and minimal resource usage are priorities
  • Moderate performance (61% F1) is acceptable for your use case
  • Edge deployment or mobile applications are required
  • Cost optimization is critical
  • Data2txt or QA tasks are primary use cases

Consider larger models (ettin-encoder-150M-TR, ModernBERT) when:

  • Maximum accuracy is required
  • Token-level precision is critical
  • Summarization is the primary use case
  • Computational resources are abundant

How to Use

from lettucedetect.models.inference import HallucinationDetector

# Load the model (API per the LettuceDetect library; adjust if your version differs)
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/ettin-encoder-32M-TR-HD",
)

# Detect hallucinations
context = ["Your source document text..."]  # list of context passages
question = "Your question..."
answer = "Generated answer text..."

# Returns character-level spans of the answer flagged as unsupported
predictions = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans",
)
print(predictions)

Citation

If you use this model, please cite:

@misc{taş2025turklettucedetecthallucinationdetectionmodels,
      title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications}, 
      author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
      year={2025},
      eprint={2509.17671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.17671}, 
}

Model Card Contact

For questions or issues, please open an issue on the project repository.
