NeoDictaBERT Role Classifier for Hebrew Manuscripts

Model: alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts

Role classification model for persons mentioned in Hebrew manuscript catalog text. It achieves 90.64% accuracy on a 545-sample test set using a two-input format that explicitly marks the target person.


Model Description

Fine-tuned from dicta-il/neodictabert-bilingual on 4,339 Hebrew manuscript role samples.

Key Innovation: Two-input format with explicit person marking:

Input: [PERSON: person_name] + text

This reduces ambiguity when multiple persons appear in text.
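
As a minimal sketch of the format above, the helper below builds one input string per person, so the same text yields a distinct input for each name. The function name `format_input` and the example sentence are illustrative assumptions; the `[PERSON: name] text` layout with a single separating space follows the Usage section of this card.

```python
def format_input(person_name: str, text: str) -> str:
    """Prefix the text with an explicit marker for the person to classify."""
    return f"[PERSON: {person_name.strip()}] {text.strip()}"


# Illustrative sentence mentioning two persons (not taken from the training data).
text = "נכתב על ידי משה בן יעקב עבור דוד בן אברהם"
persons = ["משה בן יעקב", "דוד בן אברהם"]

# Same text, two persons: each person gets its own input string.
inputs = [format_input(name, text) for name in persons]
for line in inputs:
    print(line)
```

Each string is then classified independently, which is what lets the model assign different roles to persons who share a sentence.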

Architecture: NeoBERT (363M parameters, 612B training tokens)

Categories: 6 manuscript roles

  1. TRANSCRIBER (מעתיק) - 34.2%
  2. OWNER (בעלים) - 26.8%
  3. AUTHOR (מחבר) - 25.1%
  4. CENSOR - 8.0%
  5. TRANSLATOR (מתרגם) - 4.5%
  6. COMMENTATOR (מעיר) - 1.3%

Performance

Test Set: 545 samples across 6 categories (class sizes range from 7 to 169; see the per-category counts below)

Metric                  Score
Accuracy                90.64%
F1 (weighted)           90.64%
Precision (weighted)    91%

Per-Category F1 Scores:

  • TRANSLATOR: 100% (32 samples) - Perfect!
  • CENSOR: 96% (49 samples)
  • AUTHOR: 92% (162 samples)
  • TRANSCRIBER: 87% (169 samples)
  • OWNER: 85% (126 samples)
  • COMMENTATOR: 71% (7 samples)
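
Because the class sizes differ widely, an unweighted (macro) average can be a useful complement to the weighted scores. The sketch below computes it from the rounded per-category values listed above, so the result is approximate; the macro figure is not reported by the author and is derived here only for illustration.

```python
# Per-category F1 and test-set support, copied from the list above (F1 values are rounded).
per_category = {
    "TRANSLATOR": (1.00, 32),
    "CENSOR": (0.96, 49),
    "AUTHOR": (0.92, 162),
    "TRANSCRIBER": (0.87, 169),
    "OWNER": (0.85, 126),
    "COMMENTATOR": (0.71, 7),
}

# Macro F1 gives every category the same weight, regardless of its sample count.
macro_f1 = sum(f1 for f1, _ in per_category.values()) / len(per_category)
total_support = sum(n for _, n in per_category.values())

print(f"Macro F1: {macro_f1:.1%} over {total_support} samples")
# Macro F1: 88.5% over 545 samples
```

The macro value is pulled down mainly by COMMENTATOR, whose score rests on only 7 test samples.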

Comparison:

  • HalleluBERT two-input: 89.36% → this model is 1.28 percentage points higher
  • Single-input format: 87.52% → the two-input format adds 3.12 percentage points
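
The gains above are differences in percentage points. A short worked calculation can also express them as a relative reduction in errors. This sketch assumes all three figures are accuracies measured on the same 545-sample test set; the helper name `compare` is hypothetical.

```python
from typing import Tuple

THIS_MODEL = 90.64  # reported score of this model, in percent
BASELINES = {
    "HalleluBERT two-input": 89.36,
    "Single-input format": 87.52,
}


def compare(ours_pct: float, baseline_pct: float) -> Tuple[float, float]:
    """Return (gain in percentage points, relative reduction of the baseline error rate)."""
    gain_pp = ours_pct - baseline_pct
    relative_error_reduction = gain_pp / (100.0 - baseline_pct)
    return gain_pp, relative_error_reduction


for name, score in BASELINES.items():
    gain, reduction = compare(THIS_MODEL, score)
    print(f"{name}: +{gain:.2f} pp, {reduction:.1%} fewer errors")
# HalleluBERT two-input: +1.28 pp, 12.0% fewer errors
# Single-input format: +3.12 pp, 25.0% fewer errors
```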

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts"
tokenizer = AutoTokenizer.from_pretrained("dicta-il/neodictabert-bilingual")
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Two-input format: [PERSON: name] + text
person_name = "משה בן יעקב"
text = "נכתב על ידי משה בן יעקב"

# Format input (critical for performance!)
formatted_input = f"[PERSON: {person_name}] {text}"

# Tokenize
inputs = tokenizer(
    formatted_input,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits[0]
    probabilities = torch.softmax(logits, dim=0)
    predicted_class = torch.argmax(probabilities).item()
    confidence = probabilities[predicted_class].item()

role = model.config.id2label[predicted_class]

print(f"Person: {person_name}")
print(f"Role: {role}")
print(f"Confidence: {confidence:.0%}")

# Example output:
# Person: משה בן יעקב
# Role: TRANSCRIBER
# Confidence: 99%

Important: Always use the two-input format for best results!


Training Data

Source: 4,339 person-role pairs from Hebrew manuscript catalogs

Method: Extracted from MARC records with validated roles

  • Person names from 700$a fields
  • Roles from 700$e fields
  • Consolidated from 45 raw terms to 6 clean categories using AI
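
Once a consolidation has been decided, applying it can be as simple as a lookup table. The sketch below is an assumption about how that step might look, not the author's pipeline: the raw terms in `RAW_TO_CATEGORY` are hypothetical examples (the 45 actual terms are not listed in this card), and each 700 field is represented as a plain dict with `a` and `e` keys rather than a real MARC object.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical raw-term -> category table; the card's actual raw terms are not listed.
RAW_TO_CATEGORY: Dict[str, str] = {
    "מעתיק": "TRANSCRIBER",
    "סופר": "TRANSCRIBER",
    "בעלים": "OWNER",
    "מחבר": "AUTHOR",
    "מתרגם": "TRANSLATOR",
    "מעיר": "COMMENTATOR",
    "censor": "CENSOR",
}


def person_role_pairs(fields_700: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (name, category) for each 700 field whose $e term is in the table."""
    for field in fields_700:
        name = field.get("a", "").strip()
        category = RAW_TO_CATEGORY.get(field.get("e", "").strip())
        if name and category:  # skip fields with no name or an unmapped role term
            yield name, category


# Simplified stand-in for the 700 fields of one catalog record.
record_700 = [
    {"a": "משה בן יעקב", "e": "סופר"},
    {"a": "דוד בן אברהם", "e": "בעלים"},
    {"a": "Unknown", "e": "binder"},  # unmapped term, dropped
]
pairs = list(person_role_pairs(record_700))
```

Dropping unmapped terms instead of guessing a category keeps the labels clean, at the cost of discarding some records.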

Format: Two-input with explicit person marking

[PERSON: person_name] + full_text

Role Distribution (training set, 4,339 samples):

  • TRANSCRIBER: 1,485 samples (34.2%)
  • OWNER: 1,164 samples (26.8%)
  • AUTHOR: 1,087 samples (25.1%)
  • CENSOR: 348 samples (8.0%)
  • TRANSLATOR: 197 samples (4.5%)
  • COMMENTATOR: 58 samples (1.3%)

Data Splits:

  • Train: 4,339 samples
  • Validation: 552 samples
  • Test: 545 samples

Training Procedure

Hyperparameters:

  • Base model: dicta-il/neodictabert-bilingual
  • Training samples: 4,339
  • Epochs: 5
  • Batch size: 8 (with gradient accumulation = 4, effective: 32)
  • Learning rate: 2e-5
  • LR scheduler: Linear with warmup (10%)
  • Weight decay: 0.01
  • Max sequence length: 128
  • Optimizer: AdamW
  • Early stopping patience: 2
  • Random seed: 42
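
The hyperparameters above imply a rough training schedule, worked out below. The numbers are an approximation: the exact step count depends on how the training loop handles the last partial batch, and early stopping may end training before epoch 5. The variable names are illustrative.

```python
import math

# Values taken from the hyperparameter list above.
train_samples = 4339
per_device_batch_size = 8
gradient_accumulation_steps = 4
epochs = 5
warmup_percent = 10

# One optimizer update is made per accumulated group of batches.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
updates_per_epoch = math.ceil(train_samples / effective_batch_size)
total_updates = updates_per_epoch * epochs
warmup_updates = total_updates * warmup_percent // 100

print(effective_batch_size, updates_per_epoch, total_updates, warmup_updates)
# 32 136 680 68
```

With these settings the learning rate would rise linearly for roughly the first 68 updates and then decay linearly over the remainder.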

Hardware: Apple M1 Mac with MPS acceleration

Training time: ~20 minutes


Intended Use

Primary: Role classification for persons extracted from Hebrew manuscripts

Workflow:

  1. Extract person names using NER model
  2. For each person: Format as [PERSON: name] + text
  3. Classify role using this model
  4. Output: Person + Role + Confidence
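
The workflow above can be sketched as a small function that takes the classifier as a callable. This is a sketch under stated assumptions: `classify_persons` and `dummy_classifier` are hypothetical names, the NER step is assumed to have already produced the list of names, and the dummy classifier only stands in for a real call to the model as shown in the Usage section.

```python
from typing import Callable, Dict, List, Tuple, Union

Prediction = Dict[str, Union[str, float]]


def classify_persons(
    text: str,
    persons: List[str],
    classify_fn: Callable[[str], Tuple[str, float]],
) -> List[Prediction]:
    """Steps 2-4: format one input per person, classify it, collect the results."""
    results: List[Prediction] = []
    for name in persons:
        formatted = f"[PERSON: {name}] {text}"
        role, confidence = classify_fn(formatted)
        results.append({"person": name, "role": role, "confidence": confidence})
    return results


def dummy_classifier(formatted_input: str) -> Tuple[str, float]:
    """Placeholder for the model call; always returns the same fixed answer."""
    return "TRANSCRIBER", 0.5


results = classify_persons(
    "נכתב על ידי משה בן יעקב",
    ["משה בן יעקב"],  # step 1 output, assumed to come from an NER model
    dummy_classifier,
)
```

Passing the classifier in as a function keeps the loop easy to test without loading the model.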

Best practices:

  • Always use two-input format
  • Provide surrounding context (sentence or paragraph)
  • Keep input under 128 tokens for best results
  • Use confidence scores to flag uncertain predictions
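
One way to act on the last point is to route low-confidence predictions to a reviewer. The sketch below assumes predictions are dicts with a `confidence` key; the function name `split_by_confidence` and the 0.80 threshold are hypothetical and should be tuned on validation data.

```python
from typing import Dict, List, Tuple

REVIEW_THRESHOLD = 0.80  # hypothetical cut-off, not a value recommended by the author


def split_by_confidence(
    predictions: List[Dict], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[Dict], List[Dict]]:
    """Return (accepted, needs_review) based on each prediction's confidence."""
    accepted = [p for p in predictions if p["confidence"] >= threshold]
    needs_review = [p for p in predictions if p["confidence"] < threshold]
    return accepted, needs_review


# Made-up predictions for demonstration only.
predictions = [
    {"person": "A", "role": "AUTHOR", "confidence": 0.97},
    {"person": "B", "role": "COMMENTATOR", "confidence": 0.55},
]
accepted, needs_review = split_by_confidence(predictions)
```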

Limitations

Category limitations:

  • Only 6 role categories (may not capture all nuances)
  • COMMENTATOR underperforms (only 58 training samples)
  • No "OTHER" category (forces classification into 6 roles)

Domain specificity:

  • Optimized for manuscript-specific role terminology
  • May not generalize to other document types
  • Hebrew manuscript context assumed

Input format dependency:

  • Requires two-input format for stated performance
  • Without [PERSON: name] marking: performance drops by about 3 percentage points (87.52% vs. 90.64%)
  • Person name must be correctly identified

Context requirements:

  • Needs surrounding text with role clues
  • Person name alone insufficient
  • Works best with 20-100 tokens of context

Multi-person challenges:

  • When multiple persons appear in the text, each must be classified separately
  • The role can remain ambiguous when the context does not tie it to one specific person
  • Tested primarily on 1-2 person scenarios

Bias and Fairness

Training data bias:

  • Historical manuscript collections over-represent certain periods
  • Gender imbalance (more male authors/transcribers in historical records)
  • Geographic bias toward Middle Eastern and European manuscripts

Role distribution bias:

  • TRANSCRIBER, OWNER, AUTHOR well-represented (>1,000 samples each)
  • COMMENTATOR, TRANSLATOR under-represented (<200 samples)
  • This affects per-category performance

Language bias:

  • Primarily Hebrew terminology
  • Latin names (censors) are represented, but other languages appear only minimally

Mitigation strategies:

  • Report per-category performance transparently
  • Provide confidence scores
  • Recommend human review for low-confidence predictions
  • Acknowledge limitations in model card

Environmental Impact

Carbon footprint: Minimal

  • Training on local M1 Mac (consumer hardware)
  • No cloud GPU usage
  • ~20 minutes training time
  • Estimated: <0.1 kWh total energy
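
The energy estimate can be sanity-checked with simple arithmetic. The average power draw below is an assumed, unmeasured figure chosen only for illustration; the 20-minute duration comes from this card.

```python
def energy_kwh(avg_power_watts: float, minutes: float) -> float:
    """Energy in kWh for a given average power draw and duration."""
    return avg_power_watts * (minutes / 60.0) / 1000.0


ASSUMED_AVG_POWER_W = 40.0  # assumption: not measured, illustrative only
TRAINING_MINUTES = 20.0     # from the card (~20 minutes)

estimate = energy_kwh(ASSUMED_AVG_POWER_W, TRAINING_MINUTES)

# Average draw that would be needed to reach the 0.1 kWh bound in 20 minutes.
power_at_bound_w = 0.1 * 1000.0 / (TRAINING_MINUTES / 60.0)

print(f"Estimated energy: {estimate:.3f} kWh")
print(f"Power needed to reach 0.1 kWh: {power_at_bound_w:.0f} W")
```

Under this assumption the run stays well below the stated bound, since reaching 0.1 kWh in 20 minutes would require a sustained draw of about 300 W.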

Compute efficiency:

  • Gradient accumulation reduces memory needs
  • Short sequences (128 tokens) enable fast training
  • Runs on consumer hardware, which lowers the barrier for researchers without GPU access

Citation

@article{goldberg2025classifier,
  title={NeoDictaBERT Role Classifier for Hebrew Manuscripts},
  author={Goldberg, Alexander},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025},
  url={https://huggingface.co/alexgoldberg/neodictabert-role-classifier-hebrew-manuscripts}
}

Base model:

@misc{shmidman2025neodictabert,
  title={NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew},
  author={Shmidman, Shaltiel and Shmidman, Avi and Koppel, Moshe},
  year={2025},
  eprint={2510.20386},
  archivePrefix={arXiv}
}

License

MIT License (compatible with NeoDictaBERT's CC-BY-4.0)


Model Card Contact

  • Author: Alexander Goldberg
  • Institution: [Your Institution]
  • Paper: [arXiv link]
  • Code: [GitHub repository]

Last updated: November 2025
