YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

NeoDictaBERT Role Classifier for Hebrew Manuscripts

Model: alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts

State-of-the-art role classification model for Hebrew manuscript persons. Achieves 90.64% accuracy using novel two-input format.

Model Description

Fine-tuned from dicta-il/neodictabert-bilingual on 4,339 Hebrew manuscript role samples.

Key Innovation: Two-input format with explicit person marking:

Input: [PERSON: person_name] + text

This reduces ambiguity when multiple persons appear in text.

Architecture: NeoBERT (363M parameters, 612B training tokens)

Categories: 6 manuscript roles

TRANSCRIBER (מעתיק) - 34.2%
OWNER (בעלים) - 26.8%
AUTHOR (מחבר) - 25.1%
CENSOR - 8.0%
TRANSLATOR (מתרגם) - 4.5%
COMMENTATOR (מעיר) - 1.3%

Performance

Test Set: 545 samples (6 balanced categories)

Metric	Score
Accuracy	90.64% ⭐
F1 (weighted)	90.64%
Precision (weighted)	91%

Per-Category F1 Scores:

TRANSLATOR: 100% (32 samples) - Perfect!
CENSOR: 96% (49 samples)
AUTHOR: 92% (162 samples)
TRANSCRIBER: 87% (169 samples)
OWNER: 85% (126 samples)
COMMENTATOR: 71% (7 samples)

Comparison:

HalleluBERT two-input: 89.36% → +1.28% improvement
Single-input format: 87.52% → +3.12% with two-input innovation

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts"
tokenizer = AutoTokenizer.from_pretrained("dicta-il/neodictabert-bilingual")
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Two-input format: [PERSON: name] + text
person_name = "משה בן יעקב"
text = "נכתב על ידי משה בן יעקב"

# Format input (critical for performance!)
formatted_input = f"[PERSON: {person_name}] {text}"

# Tokenize
inputs = tokenizer(
    formatted_input,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits[0]
    probabilities = torch.softmax(logits, dim=0)
    predicted_class = torch.argmax(probabilities).item()
    confidence = probabilities[predicted_class].item()

role = model.config.id2label[predicted_class]

print(f"Person: {person_name}")
print(f"Role: {role}")
print(f"Confidence: {confidence:.2%}")

# Output:
# Person: משה בן יעקב
# Role: TRANSCRIBER
# Confidence: 99%

Important: Always use the two-input format for best results!

Training Data

Source: 4,339 person-role pairs from Hebrew manuscript catalogs

Method: Extracted from MARC records with validated roles

Person names from 700$a fields
Roles from 700$e fields
Consolidated from 45 raw terms to 6 clean categories using AI

Format: Two-input with explicit person marking

[PERSON: person_name] + full_text

Role Distribution:

TRANSCRIBER: 1,485 samples (34.2%)
OWNER: 1,164 samples (26.8%)
AUTHOR: 1,087 samples (25.1%)
CENSOR: 348 samples (8.0%)
TRANSLATOR: 197 samples (4.5%)
COMMENTATOR: 58 samples (1.3%)

Data Splits:

Train: 4,339 samples
Validation: 552 samples
Test: 545 samples

Training Procedure

Hyperparameters:

Base model: dicta-il/neodictabert-bilingual
Training samples: 4,339
Epochs: 5
Batch size: 8 (with gradient accumulation = 4, effective: 32)
Learning rate: 2e-5
LR scheduler: Linear with warmup (10%)
Weight decay: 0.01
Max sequence length: 128
Optimizer: AdamW
Early stopping patience: 2
Random seed: 42

Hardware: Apple M1 Mac with MPS acceleration

Training time: ~20 minutes

Intended Use

Primary: Role classification for persons extracted from Hebrew manuscripts

Workflow:

Extract person names using NER model
For each person: Format as [PERSON: name] + text
Classify role using this model
Output: Person + Role + Confidence

Best practices:

Always use two-input format
Provide surrounding context (sentence or paragraph)
Keep input under 128 tokens for best results
Use confidence scores to flag uncertain predictions

Limitations

Category limitations:

Only 6 role categories (may not capture all nuances)
COMMENTATOR underperforms (only 58 training samples)
No "OTHER" category (forces classification into 6 roles)

Domain specificity:

Optimized for manuscript-specific role terminology
May not generalize to other document types
Hebrew manuscript context assumed

Input format dependency:

Requires two-input format for stated performance
Without [PERSON: name] marking: performance drops ~3%
Person name must be correctly identified

Context requirements:

Needs surrounding text with role clues
Person name alone insufficient
Works best with 20-100 tokens of context

Multi-person challenges:

When multiple persons in text, must classify each separately
Context ambiguity can occur
Tested primarily on 1-2 person scenarios

Bias and Fairness

Training data bias:

Historical manuscript collections over-represent certain periods
Gender imbalance (more male authors/transcribers in historical records)
Geographic bias toward Middle Eastern and European manuscripts

Role distribution bias:

TRANSCRIBER, OWNER, AUTHOR well-represented (>1,000 samples each)
COMMENTATOR, TRANSLATOR under-represented (<200 samples)
This affects per-category performance

Language bias:

Primarily Hebrew terminology
Latin names (censors) represented but other languages minimal

Mitigation strategies:

Report per-category performance transparently
Provide confidence scores
Recommend human review for low-confidence predictions
Acknowledge limitations in model card

Environmental Impact

Carbon footprint: Minimal

Training on local M1 Mac (consumer hardware)
No cloud GPU usage
~20 minutes training time
Estimated: <0.1 kWh total energy

Compute efficiency:

Gradient accumulation reduces memory needs
Short sequences (128 tokens) enable fast training
Consumer hardware accessible to all researchers

Citation

@article{goldberg2025classifier,
  title={NeoDictaBERT Role Classifier for Hebrew Manuscripts},
  author={Goldberg, Alexander},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025},
  url={https://huggingface.co/alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts}
}

Base model:

@misc{shmidman2025neodictabert,
  title={NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew},
  author={Shmidman, Shaltiel and Shmidman, Avi and Koppel, Moshe},
  year={2025},
  eprint={2510.20386},
  archivePrefix={arXiv}
}

License

MIT License (compatible with NeoDictaBERT's CC-BY-4.0)

Model Card Contact

Author: Alexander Goldberg
Institution: [Your Institution]
Paper: [arXiv link]
Code: [GitHub repository]

Last updated: November 2025

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support