ettin-encoder-32M-TR-HD

Model Description

ettin-encoder-32M-TR-HD is an ultra-efficient 32M-parameter model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. Despite its small size, it achieves competitive performance, outperforming much larger language models such as GPT-4.1 and Mistral Small on balanced example-level metrics while offering substantial computational efficiency advantages.

This model is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework for Turkish language applications. It performs token-level binary classification to identify unsupported claims in Turkish text generated by RAG systems.
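The token-level formulation can be pictured as follows: the model emits a binary supported/unsupported label per token, and consecutive unsupported tokens are merged into character-level spans. This is a minimal illustrative sketch, not the LettuceDetect implementation; the token triples and helper name are hypothetical.

```python
# Sketch: merging per-token binary hallucination labels into character spans.
# `tokens` are (text, start_char, end_char) triples; `labels` are 0/1 per token.

def labels_to_spans(tokens, labels):
    spans = []
    current = None
    for (text, start, end), label in zip(tokens, labels):
        if label == 1:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

tokens = [("Paris", 0, 5), ("has", 6, 9), ("99", 10, 12), ("million", 13, 20), ("people", 21, 27)]
labels = [0, 0, 1, 1, 0]
print(labels_to_spans(tokens, labels))  # [(10, 20)]
```

The span output makes the detector's verdict directly highlightable in the generated answer.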

Model Details

  • Model Type: Encoder-based transformer for token classification
  • Parameters: 32M
  • Language: Turkish
  • Task: Hallucination Detection (Token-Level Binary Classification)
  • Framework: LettuceDetect
  • Base Model: ettin-encoder-32M-TR
  • Fine-tuned on: RAGTruth-TR dataset

Performance Highlights

Example-Level Performance (Whole Dataset)

  • F1-Score: 61.35% (outperforms GPT-4.1's 53.97%)
  • Precision: 61.38% (vs. GPT-4.1's 37.09%)
  • Recall: 61.32% (balanced with precision, avoiding the false-positive-heavy behavior of LLM baselines)
  • AUROC: 78.24% (vs. GPT-4.1's 54.45%)
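The precision/recall trade-off behind these numbers can be made concrete with a small calculation. The confusion counts below are illustrative only (chosen to roughly mirror the reported percentages), not the actual evaluation data.

```python
# Sketch: why a balanced detector beats a recall-heavy one on F1.

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A balanced detector: similar precision and recall
p, r, f1 = prf(tp=613, fp=386, fn=387)
print(f"balanced:     P={p:.3f} R={r:.3f} F1={f1:.3f}")

# A recall-heavy detector: high recall bought with many false positives
p, r, f1 = prf(tp=900, fp=1527, fn=100)
print(f"recall-heavy: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean, flooding the output with false positives drags the score down even when recall is near perfect.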

Task-Specific Performance

Data2txt Task:

  • F1-Score: 75.75%
  • Precision: 85.96%
  • Recall: 67.70%
  • AUROC: 82.11%

QA Task:

  • F1-Score: 52.51%
  • Precision: 41.37%
  • Recall: 71.88%
  • AUROC: 81.98%

Summary Task:

  • F1-Score: 34.31%
  • Precision: 33.98%
  • Recall: 34.65%
  • AUROC: 64.06%

Token-Level Performance (Whole Dataset)

  • F1-Score: 38.33%
  • Precision: 40.65%
  • Recall: 36.27%
  • AUROC: 67.00%

Token-Level Task Performance:

  • QA: AUROC 76.32% (strongest token-level performance)
  • Data2txt: F1 37.87%, AUROC 64.10%
  • Summary: F1 16.47%, AUROC 55.99%
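The AUROC figures above are threshold-independent: they measure how well the model's scores rank hallucinated items above supported ones. A minimal sketch of this interpretation, using toy scores and labels (not evaluation data):

```python
# Sketch: AUROC as the probability that a random positive outranks
# a random negative (ties count as half a win).

def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # model confidence of hallucination
labels = [1, 1, 1, 0, 0, 0]                # ground truth
print(auroc(scores, labels))  # 0.888...
```

An AUROC of 0.5 corresponds to random ranking, which is why the token-level Summary figure (55.99%) signals the weakest task.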

Key Advantages

  1. Ultra-Efficient: 32M parameters enable deployment on resource-constrained devices
  2. Balanced Performance: Avoids the extreme precision-recall imbalance of LLMs
  3. Production-Ready: Fast inference (30-60 examples/second) suitable for real-time RAG pipelines
  4. Cost-Effective: Minimal computational requirements reduce operational costs
  5. Superior to LLMs: Outperforms GPT-4.1 and Mistral Small in balanced metrics despite using <0.01% of their parameters
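Throughput claims like the 30-60 examples/second above are easy to verify for your own hardware. A hedged sketch of such a measurement, where `detect` is a stand-in placeholder to be replaced by the real detector call:

```python
import time

def detect(example):
    # Placeholder for model inference; swap in the real detector here.
    return []

def throughput(examples, detect_fn):
    """Return processed examples per second for a sequential run."""
    start = time.perf_counter()
    for ex in examples:
        detect_fn(ex)
    elapsed = time.perf_counter() - start
    return len(examples) / elapsed

rate = throughput([{"context": "...", "answer": "..."}] * 100, detect)
print(f"{rate:.0f} examples/s")
```

Measuring on a representative batch of your own RAG outputs gives a more reliable number than any quoted benchmark.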

Intended Use

This model is designed for:

  • Turkish RAG Systems: Detecting hallucinations in generated Turkish text
  • Production Deployment: Real-time hallucination detection in high-throughput pipelines
  • Resource-Constrained Environments: Edge devices and cost-sensitive applications
  • Token-Level Analysis: Fine-grained identification of unsupported claims

Training Data

The model was fine-tuned on RAGTruth-TR, a Turkish translation of the RAGTruth benchmark dataset:

  • Training Samples: 17,790 examples
  • Test Samples: 2,700 examples
  • Task Types: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
  • Annotation: Token-level hallucination labels preserved during translation

Evaluation Data

The model was evaluated on the RAGTruth-TR test set across three task types:

  • Summary: 900 examples
  • Data2txt: 900 examples
  • QA: 900 examples
  • Whole Dataset: 2,700 examples

Limitations

  1. Token-Level Performance: While competitive at the example level, token-level F1 (38.33%) trails that of larger specialized models (71-78%)
  2. Summary Task: Lower performance in summarization tasks (34.31% F1) compared to Data2txt and QA
  3. Language Specificity: Trained specifically for Turkish; performance on other languages not evaluated
  4. Domain Specificity: Optimized for RAG scenarios; may not generalize to other hallucination detection contexts

Recommendations

Use this model when:

  • Maximum efficiency and minimal resource usage are priorities
  • Moderate performance (61% F1) is acceptable for your use case
  • Edge deployment or mobile applications are required
  • Cost optimization is critical
  • Data2txt or QA tasks are primary use cases

Consider larger models (ettin-encoder-150M-TR, ModernBERT) when:

  • Maximum accuracy is required
  • Token-level precision is critical
  • Summarization is the primary use case
  • Computational resources are abundant

How to Use

from lettucedetect.models.inference import HallucinationDetector

# Load the model (API per the LettuceDetect library; adjust if your version differs)
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/ettin-encoder-32M-TR-HD",
)

# Detect hallucinations
context = ["Your source document text..."]  # list of context passages
question = "Your question..."
answer = "Generated answer text..."

# Returns character-level spans of the answer flagged as unsupported
predictions = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans",
)
print(predictions)

Citation

If you use this model, please cite:

@misc{taş2025turklettucedetecthallucinationdetectionmodels,
      title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications}, 
      author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
      year={2025},
      eprint={2509.17671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.17671}, 
}

Model Card Contact

For questions or issues, please open an issue on the project repository.
