NeoDictaBERT Role Classifier for Hebrew Manuscripts
Model: alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts
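Role classification model for persons named in Hebrew manuscripts. It reaches 90.64% accuracy on a 545-sample test set using a two-input format that marks the target person explicitly, ahead of a HalleluBERT two-input baseline (89.36%).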
Model Description
Fine-tuned from dicta-il/neodictabert-bilingual on 4,339 Hebrew manuscript role samples.
Key Innovation: Two-input format with explicit person marking:
Input: [PERSON: person_name] + text
This reduces ambiguity when multiple persons appear in text.
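As a minimal sketch of this format, the helper below builds one model input per person mentioned in a passage. The function name `format_role_input`, the sample sentence, and the second person name are illustrative assumptions, not part of the released code; the single space between the marker and the text follows the Usage example.

```python
def format_role_input(person_name: str, text: str) -> str:
    """Build the two-input string: '[PERSON: name] text'."""
    person_name = person_name.strip()
    if not person_name:
        raise ValueError("person_name must not be empty")
    return f"[PERSON: {person_name}] {text.strip()}"


# One passage mentioning two persons yields two separate inputs,
# each marking a different target person.
# ASSUMPTION: the sentence and the second name are made up for illustration.
text = "נכתב על ידי משה בן יעקב עבור דוד בן שלמה"
persons = ["משה בן יעקב", "דוד בן שלמה"]

inputs = [format_role_input(name, text) for name in persons]
for line in inputs:
    print(line)
```

The text is identical in both inputs; only the marker changes, which is what tells the classifier whose role is being asked about.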
Architecture: NeoBERT (363M parameters, 612B training tokens)
Categories: 6 manuscript roles
- TRANSCRIBER (מעתיק) - 34.2%
- OWNER (בעלים) - 26.8%
- AUTHOR (מחבר) - 25.1%
- CENSOR - 8.0%
- TRANSLATOR (מתרגם) - 4.5%
- COMMENTATOR (מעיר) - 1.3%
Performance
Test Set: 545 samples across 6 categories (class sizes are imbalanced; per-category counts are listed below)
| Metric | Score |
|---|---|
| Accuracy | 90.64% ⭐ |
| F1 (weighted) | 90.64% |
| Precision (weighted) | 91% |
Per-Category F1 Scores:
- TRANSLATOR: 100% (32 samples) - Perfect!
- CENSOR: 96% (49 samples)
- AUTHOR: 92% (162 samples)
- TRANSCRIBER: 87% (169 samples)
- OWNER: 85% (126 samples)
- COMMENTATOR: 71% (7 samples)
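To show how per-category scores combine into a single figure, the sketch below computes a support-weighted average and an unweighted (macro) average from the values listed above. Because those per-category scores are rounded to whole percentages, the result is only approximate and may differ from the reported weighted F1.

```python
# Per-category F1 (rounded, in percent) and test-set support, as listed above.
per_category = {
    "TRANSLATOR": (100, 32),
    "CENSOR": (96, 49),
    "AUTHOR": (92, 162),
    "TRANSCRIBER": (87, 169),
    "OWNER": (85, 126),
    "COMMENTATOR": (71, 7),
}

total_support = sum(n for _, n in per_category.values())

# Weighted average: each class counts in proportion to its sample count.
weighted_f1 = sum(f1 * n for f1, n in per_category.values()) / total_support

# Macro average: every class counts equally, so small classes matter more.
macro_f1 = sum(f1 for f1, _ in per_category.values()) / len(per_category)

print(f"support: {total_support}")
print(f"weighted F1 (from rounded scores): {weighted_f1:.2f}%")
print(f"macro F1 (from rounded scores): {macro_f1:.2f}%")
```

The macro average is pulled down by COMMENTATOR, which has only 7 test samples, while the weighted average is dominated by the three large classes.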
Comparison:
- HalleluBERT two-input: 89.36% accuracy → this model is +1.28 percentage points higher
- Single-input format: 87.52% accuracy → the two-input format adds +3.12 percentage points
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts"
tokenizer = AutoTokenizer.from_pretrained("dicta-il/neodictabert-bilingual")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Two-input format: [PERSON: name] + text
person_name = "משה בן יעקב"
text = "נכתב על ידי משה בן יעקב"
# Format input (critical for performance!)
formatted_input = f"[PERSON: {person_name}] {text}"
# Tokenize
inputs = tokenizer(
    formatted_input,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,
)
# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits[0]
    probabilities = torch.softmax(logits, dim=0)
    predicted_class = torch.argmax(probabilities).item()
    confidence = probabilities[predicted_class].item()
role = model.config.id2label[predicted_class]
print(f"Person: {person_name}")
print(f"Role: {role}")
print(f"Confidence: {confidence:.2%}")
# Example output (the exact confidence value will vary):
# Person: משה בן יעקב
# Role: TRANSCRIBER
# Confidence: 99.00%
Important: Always use the two-input format for best results!
Training Data
Source: 4,339 person-role pairs from Hebrew manuscript catalogs
Method: Extracted from MARC records with validated roles
- Person names from 700$a fields
- Roles from 700$e fields
- Consolidated from 45 raw terms to 6 clean categories using AI
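The extraction step can be illustrated with a small sketch that pulls (name, role) pairs from 700 fields. The record layout, the function name `person_role_pairs`, and the `ROLE_MAP` lookup are assumptions made for illustration: the card states that consolidation was done with AI assistance over 45 raw terms, which a fixed dictionary does not reproduce.

```python
# ASSUMPTION: a simplified record layout (list of fields, each with a tag and
# (code, value) subfield pairs). Real MARC parsers expose richer objects.
# ASSUMPTION: illustrative mapping only. The Hebrew terms are the category
# labels from this card; the actual 45 raw terms are not listed there.
ROLE_MAP = {
    "מעתיק": "TRANSCRIBER",
    "בעלים": "OWNER",
    "מחבר": "AUTHOR",
    "מתרגם": "TRANSLATOR",
    "מעיר": "COMMENTATOR",
}


def person_role_pairs(record):
    """Return (person_name, category) pairs from 700$a / 700$e subfields."""
    pairs = []
    for field in record:
        if field["tag"] != "700":
            continue
        names = [value for code, value in field["subfields"] if code == "a"]
        roles = [value for code, value in field["subfields"] if code == "e"]
        if not names:
            continue
        name = names[0].strip(" ,.")
        for raw_role in roles:
            # Strip trailing cataloguing punctuation before the lookup.
            category = ROLE_MAP.get(raw_role.strip(" ,."))
            if category is not None:
                pairs.append((name, category))
    return pairs


record = [
    {"tag": "245", "subfields": [("a", "Example title")]},
    {"tag": "700", "subfields": [("a", "משה בן יעקב,"), ("e", "מעתיק.")]},
    {"tag": "700", "subfields": [("a", "Example person"), ("e", "unmapped term")]},
]

pairs = person_role_pairs(record)
print(pairs)
```

Entries whose role term is not in the mapping are skipped here rather than guessed, which keeps unvalidated roles out of the training data.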
Format: Two-input with explicit person marking
[PERSON: person_name] + full_text
Role Distribution (training set, 4,339 samples):
- TRANSCRIBER: 1,485 samples (34.2%)
- OWNER: 1,164 samples (26.8%)
- AUTHOR: 1,087 samples (25.1%)
- CENSOR: 348 samples (8.0%)
- TRANSLATOR: 197 samples (4.5%)
- COMMENTATOR: 58 samples (1.3%)
Data Splits:
- Train: 4,339 samples
- Validation: 552 samples
- Test: 545 samples
Training Procedure
Hyperparameters:
- Base model: dicta-il/neodictabert-bilingual
- Training samples: 4,339
- Epochs: 5
- Batch size: 8 (with gradient accumulation = 4, effective: 32)
- Learning rate: 2e-5
- LR scheduler: Linear with warmup (10%)
- Weight decay: 0.01
- Max sequence length: 128
- Optimizer: AdamW
- Early stopping patience: 2
- Random seed: 42
Hardware: Apple M1 Mac with MPS acceleration
Training time: ~20 minutes
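The hyperparameters above imply a concrete step budget and learning-rate curve. The sketch below works through that arithmetic and a linear warmup and decay schedule; the step counts are approximate, since frameworks round the last partial accumulation step differently and early stopping may end training before epoch 5. The function name `lr_at` is an illustrative assumption.

```python
import math

# Values taken from the hyperparameter list above.
train_samples = 4339
per_device_batch = 8
grad_accum = 4
epochs = 5
peak_lr = 2e-5
warmup_fraction = 0.10

effective_batch = per_device_batch * grad_accum
# Approximate optimizer steps per epoch (rounding up the last partial batch).
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_fraction)


def lr_at(step: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, total: {total_steps}")
print(f"warmup steps: {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```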
Intended Use
Primary: Role classification for persons extracted from Hebrew manuscripts
Workflow:
- Extract person names using NER model
- For each person: Format as [PERSON: name] + text
- Classify role using this model
- Output: Person + Role + Confidence
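The workflow above can be sketched as a small pipeline that takes the NER step and the classifier as pluggable callables. All names here (`RolePrediction`, `classify_roles`, `fake_ner`, `fake_classifier`) are hypothetical, and the stand-in functions return fixed placeholder values rather than real model output; in practice `classify` would wrap the tokenizer and model calls from the Usage section.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RolePrediction:
    person: str
    role: str
    confidence: float


def classify_roles(
    text: str,
    extract_persons: Callable[[str], list[str]],
    classify: Callable[[str], tuple[str, float]],
) -> list[RolePrediction]:
    """Run NER, format one input per person, and classify each input."""
    results = []
    for person in extract_persons(text):
        formatted = f"[PERSON: {person}] {text}"
        role, confidence = classify(formatted)
        results.append(RolePrediction(person, role, confidence))
    return results


# Stand-ins so the sketch runs without downloading any model.
def fake_ner(text: str) -> list[str]:
    return ["משה בן יעקב"] if "משה בן יעקב" in text else []


def fake_classifier(formatted_input: str) -> tuple[str, float]:
    return ("TRANSCRIBER", 0.99)  # placeholder, not a real prediction


predictions = classify_roles("נכתב על ידי משה בן יעקב", fake_ner, fake_classifier)
for p in predictions:
    print(f"{p.person}: {p.role} ({p.confidence:.2%})")
```

Keeping the two models behind plain callables makes it easy to swap the NER component or to batch the classifier without touching the formatting logic.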
Best practices:
- Always use two-input format
- Provide surrounding context (sentence or paragraph)
- Keep input under 128 tokens for best results
- Use confidence scores to flag uncertain predictions
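One way to act on the last point is a simple review rule, sketched below. The 0.80 threshold is an illustrative assumption rather than a tuned value, and always routing COMMENTATOR to review is a design choice motivated by its small sample count; the example predictions are made up.

```python
REVIEW_THRESHOLD = 0.80  # ASSUMPTION: illustrative cutoff, not a tuned value
LOW_SUPPORT_ROLES = frozenset({"COMMENTATOR"})  # ASSUMPTION: always review these


def needs_review(role: str, confidence: float,
                 threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag a prediction for human review if it is uncertain or low-support."""
    return confidence < threshold or role in LOW_SUPPORT_ROLES


# Made-up (role, confidence) pairs for illustration.
predictions = [("AUTHOR", 0.97), ("OWNER", 0.62), ("COMMENTATOR", 0.91)]
flags = [needs_review(role, conf) for role, conf in predictions]

for (role, conf), flag in zip(predictions, flags):
    print(f"{role:<12} {conf:.2f} {'REVIEW' if flag else 'ok'}")
```

A threshold for real use would be better chosen on the validation set, for example by checking how accuracy changes among predictions above and below each candidate cutoff.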
Limitations
Category limitations:
- Only 6 role categories (may not capture all nuances)
- COMMENTATOR underperforms (only 58 training samples)
- No "OTHER" category (forces classification into 6 roles)
Domain specificity:
- Optimized for manuscript-specific role terminology
- May not generalize to other document types
- Hebrew manuscript context assumed
Input format dependency:
- Requires two-input format for stated performance
- Without [PERSON: name] marking: accuracy drops by ~3 percentage points (87.52% single-input vs. 90.64% two-input)
- Person name must be correctly identified
Context requirements:
- Needs surrounding text with role clues
- Person name alone insufficient
- Works best with 20-100 tokens of context
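When the source passage is long, one option is to trim it to a window around the person name before formatting. The sketch below does this on whitespace-separated words, which is only a rough proxy for model tokens: Hebrew words often split into several subword tokens, so the default of 60 words is an assumption meant to leave headroom under the 128-token limit. The function name `context_window` is hypothetical.

```python
def context_window(text: str, person_name: str, max_words: int = 60) -> str:
    """Return up to max_words whitespace-separated words centred on person_name."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)

    name_words = person_name.split()
    n = len(name_words)
    # First position where the name occurs as an exact word sequence.
    # Falls back to 0 (start of text) if not found, e.g. when punctuation
    # is attached to the name in the text.
    start = next(
        (i for i in range(len(words) - n + 1) if words[i:i + n] == name_words),
        0,
    )
    centre = start + n // 2
    lo = max(0, centre - max_words // 2)
    hi = min(len(words), lo + max_words)
    lo = max(0, hi - max_words)  # re-align if the window hit the end of the text
    return " ".join(words[lo:hi])


# Synthetic 200-word text with a two-word "name" near the end.
long_text = " ".join(f"w{i}" for i in range(200))
window = context_window(long_text, "w150 w151", max_words=20)
print(window)
```

The tokenizer's `truncation=True` setting remains the hard limit; the window only makes sure that what survives truncation is the text nearest the person.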
Multi-person challenges:
- When multiple persons appear in the text, each must be classified separately
- Context ambiguity can occur
- Tested primarily on 1-2 person scenarios
Bias and Fairness
Training data bias:
- Historical manuscript collections over-represent certain periods
- Gender imbalance (more male authors/transcribers in historical records)
- Geographic bias toward Middle Eastern and European manuscripts
Role distribution bias:
- TRANSCRIBER, OWNER, AUTHOR well-represented (>1,000 samples each)
- COMMENTATOR, TRANSLATOR under-represented (<200 samples)
- This affects per-category performance
Language bias:
- Primarily Hebrew terminology
- Latin names (censors) represented but other languages minimal
Mitigation strategies:
- Report per-category performance transparently
- Provide confidence scores
- Recommend human review for low-confidence predictions
- Acknowledge limitations in model card
Environmental Impact
Carbon footprint: Minimal
- Training on local M1 Mac (consumer hardware)
- No cloud GPU usage
- ~20 minutes training time
- Estimated: <0.1 kWh total energy
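The energy bound can be sanity-checked with simple arithmetic. The sketch below derives the average power draw that would be needed to reach 0.1 kWh in 20 minutes and compares it with an assumed draw; the 40 W figure is an assumption for a laptop-class machine under load, not a measurement from this training run.

```python
training_minutes = 20      # from this card
energy_bound_kwh = 0.1     # from this card
assumed_avg_power_w = 40   # ASSUMPTION: not measured

hours = training_minutes / 60

# Average power that would be needed to actually reach the stated bound.
max_avg_power_w = energy_bound_kwh / hours * 1000

# Energy under the assumed average power draw.
estimated_kwh = assumed_avg_power_w * hours / 1000

print(f"bound holds for average draw below {max_avg_power_w:.0f} W")
print(f"at {assumed_avg_power_w} W: {estimated_kwh:.4f} kWh")
```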
Compute efficiency:
- Gradient accumulation reduces memory needs
- Short sequences (128 tokens) enable fast training
- Consumer hardware accessible to all researchers
Citation
@article{goldberg2025classifier,
title={NeoDictaBERT Role Classifier for Hebrew Manuscripts},
author={Goldberg, Alexander},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2025},
url={https://huggingface.co/alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts}
}
Base model:
@misc{shmidman2025neodictabert,
title={NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew},
author={Shmidman, Shaltiel and Shmidman, Avi and Koppel, Moshe},
year={2025},
eprint={2510.20386},
archivePrefix={arXiv}
}
License
MIT License (compatible with NeoDictaBERT's CC-BY-4.0)
Model Card Contact
- Author: Alexander Goldberg
- Institution: [Your Institution]
- Paper: [arXiv link]
- Code: [GitHub repository]
Last updated: November 2025