Udmurt Morphological Tagger (BERT-based, two-stage fine-tuning)
This model performs morphological tagging for the Udmurt language.
It was trained with a two-stage fine-tuning procedure that combines automatically (and ambiguously) annotated data with manually disambiguated data (see the paper for details).
The model reached 93.25% token accuracy (85.70% on ambiguous tokens) on the test set and is released together with the training code.
Model Overview
- Task: Morphological tagging (token classification)
- Architecture: AutoModelForTokenClassification
- Base model: cis-lmu/glot500-base
- Languages: Udmurt
- Label format: concatenation of POS + morphological features (e.g., `N,nom,sg`, `V,pst,3pl`), inherited from the training data
- Special handling: labels are assigned only to the first subtoken of each word; the remaining subtokens are masked out of the training loss (see the sketch below).
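The first-subtoken labeling can be reproduced with the standard label-alignment recipe for token classification: each word's tag goes to its first subtoken, and every other position gets `-100` so the cross-entropy loss ignores it. The sketch below is only an illustration of that convention, not the released training code; it assumes the base tokenizer loads as a fast tokenizer (needed for `word_ids()`), and the words and tags are placeholders.

```python
from transformers import AutoTokenizer

# assumes cis-lmu/glot500-base resolves to a fast tokenizer (required for word_ids())
tokenizer = AutoTokenizer.from_pretrained("cis-lmu/glot500-base")

def align_labels(words, word_tags, label2id, max_length=128):
    """Tokenize pre-split words; keep a label only on the first subtoken of each word."""
    enc = tokenizer(words, is_split_into_words=True,
                    truncation=True, max_length=max_length)
    labels, prev_word_id = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev_word_id:
            labels.append(-100)   # special tokens and non-first subtokens: ignored by the loss
        else:
            labels.append(label2id[word_tags[word_id]])
        prev_word_id = word_id
    enc["labels"] = labels
    return enc

# placeholder words and tags, only to show the shapes; not real corpus entries
label2id = {"N,nom,sg": 0, "V,pst,3pl": 1}
features = align_labels(["токен1", "токен2"], ["N,nom,sg", "V,pst,3pl"], label2id)
```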
Methodology
Two-Stage Fine-Tuning Pipeline
Pre-Fine-Tuning (PFT):
- Uses automatically ambiguously labeled data (AML): texts annotated by a rule-based analyzer without contextual disambiguation.
- Introduces a modified multi-label cross-entropy (MLCE) loss that allows multiple pseudo-correct labels for ambiguous tokens (a sketch follows this list).

Task Fine-Tuning (FT):
- Uses manually disambiguated labeled data (MDL) to teach precise selection within homonymous tag groups.

Vocabulary Adaptation (VA):
- Applies Vocabulary Initialization with Partial Inheritance (Samenko et al., 2021) to better match Udmurt subword segmentation.
- Optimal vocabulary size: 32K WordPiece tokens (fertility ≈ 1.18). VA was not used in this model, but it improved accuracy in other training setups whose backbone models had no Udmurt pre-training.
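The exact MLCE formulation is given in the paper; the sketch below is only one plausible reading of it: every ambiguous token gets a multi-hot target over the tags the analyzer allows, and the loss is a cross-entropy against that (uniformly normalized) target, restricted to first-subtoken positions. Function and tensor names are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def multilabel_ce_loss(logits, candidate_mask, loss_mask):
    """Cross-entropy against a uniform distribution over the candidate tags of each token.

    logits:         (batch, seq, num_labels) scores from the tagging head
    candidate_mask: (batch, seq, num_labels) 1.0 for every tag the rule-based
                    analyzer allows for the token, 0.0 otherwise
    loss_mask:      (batch, seq) 1.0 for first subtokens, 0.0 elsewhere
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # uniform pseudo-target over all candidate tags of a token
    target = candidate_mask / candidate_mask.sum(-1, keepdim=True).clamp(min=1.0)
    token_loss = -(target * log_probs).sum(-1)                  # (batch, seq)
    return (token_loss * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)

# toy shapes only: 1 sentence, 3 positions, 4 labels
loss = multilabel_ce_loss(
    torch.randn(1, 3, 4),
    torch.tensor([[[1., 1., 0., 0.],    # ambiguous token: two candidate tags
                   [0., 0., 1., 0.],    # unambiguous token
                   [0., 0., 0., 0.]]]), # padding / non-first subtoken
    torch.tensor([[1., 1., 0.]]),
)
```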
Data
| Dataset | Type | Description | Volume |
|---|---|---|---|
| Train-AML | Automatically labeled | Udmurt corpus (Arkhangelskiy 2019) annotated with uniparser-grammar-udm; includes ambiguous labels | 558 K tokens / 64 K sentences |
| Train-MDL / Valid-MDL / Test-MDL | Manually disambiguated | The Udmurt corpus of LingvoDoc (Normanskaja et al. 2022), 80/10/10 split | 100 K tokens / 12 K sentences |
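For intuition, an AML sentence pairs each token with the full set of analyses returned by uniparser-grammar-udm, while an MDL sentence keeps exactly one tag per token. The snippet below is a hypothetical illustration of the two formats; the words and tags are placeholders, not real corpus entries.

```python
# AML: ambiguous, possibly several candidate tags per token
aml_example = [
    ("токен1", ["N,nom,sg", "V,pst,3pl"]),   # analyzer could not disambiguate
    ("токен2", ["ADV"]),                     # unambiguous token
]

# MDL: manually disambiguated, exactly one tag per token
mdl_example = [
    ("токен1", "N,nom,sg"),
    ("токен2", "ADV"),
]
```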
Evaluation Results
| Backbone | Setup | Token Accuracy | Ambiguous Token Accuracy |
|---|---|---|---|
| mBERT | FT | 86.28 % | 77.04 % |
| mBERT | VA + PFT + FT | 91.38 % | 81.54 % |
| ruBERT | VA + PFT + FT | 91.24 % | 81.00 % |
| Glot500-m | PFT + FT (this model) | 93.25 % | 85.70 % |
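Both metrics are plain token-level accuracies; the ambiguous-token figure presumably restricts the average to tokens for which the rule-based analyzer proposed more than one tag. A minimal sketch of that computation (array names are assumptions):

```python
import numpy as np

def token_accuracies(pred_ids, gold_ids, is_ambiguous):
    """pred_ids, gold_ids: predicted / gold label ids per evaluated token;
    is_ambiguous: True where the analyzer proposed more than one tag."""
    correct = np.asarray(pred_ids) == np.asarray(gold_ids)
    is_ambiguous = np.asarray(is_ambiguous, dtype=bool)
    overall = correct.mean()
    ambiguous = correct[is_ambiguous].mean() if is_ambiguous.any() else float("nan")
    return overall, ambiguous
```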
How to Use
Example usage:
```python
from transformers import pipeline

tagger = pipeline(
    task="token-classification",
    model="ulyanaisaeva/bert-morph-tagger-udmurt",
    aggregation_strategy="first",
)

text = "Example sentence."  # replace with an Udmurt sentence
preds = tagger(text)
```
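With `aggregation_strategy="first"` the pipeline groups subtokens back into words, so each prediction is a dictionary whose `entity_group` field holds the morphological tag. A quick way to inspect the output (the tag shown in the comment is only an illustrative placeholder):

```python
for p in preds:
    print(p["word"], p["entity_group"], round(p["score"], 3))
# each item carries the usual pipeline keys, roughly:
# {"entity_group": "N,nom,sg", "score": 0.99, "word": "...", "start": 0, "end": 3}
```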
Batch inference with clean predictions:
```python
from tqdm import tqdm

def morph_tag_batch(sentences, batch_size=16):
    """Tag a list of sentences and return (word, tag) pairs per sentence."""
    all_results = []
    for i in tqdm(range(0, len(sentences), batch_size)):
        batch = sentences[i:i + batch_size]
        preds = tagger(batch)  # the pipeline defined above; returns one list per sentence
        all_results.extend([
            [(p["word"], p["entity_group"]) for p in sent_preds]
            for sent_preds in preds
        ])
    return all_results

sentences = ["..."]  # a list of Udmurt sentences
result = morph_tag_batch(sentences)
```
Intended Use & Limitations
Intended Use
- Morphological tagging for Udmurt texts.
- Research and educational applications in computational morphology.
- Transferable methodology for other low-resource languages.
Limitations
- Performance validated on one language (Udmurt) only.
- Accuracy may drop on noisy, code-switched, or non-standard orthography.
Disclaimer. For model selection and validation, a subset of the LingvoDoc Udmurt corpus was used. Although this dataset provides high-quality manual disambiguation, it was not independently verified within this study, so the released model may implicitly inherit inconsistencies or annotation artifacts present in that corpus. The goal of this research was to explore fine-tuning methodology, not to audit or benchmark the dataset itself. Users applying this model for linguistic analysis should be aware that occasional systematic biases may reflect corpus-specific labeling patterns rather than model deficiencies.
Citation
If you use this model or the training methodology, please cite:

```bibtex
@inproceedings{isaeva-etal-2025-combining,
title = "Combining Automated and Manual Data for Effective Downstream Fine-Tuning of Transformers for Low-Resource Language Applications",
author = "Isaeva, Ulyana and
Astafurov, Danil and
Martynov, Nikita",
editor = "Fei, Hao and
Tu, Kewei and
Zhang, Yuhui and
Hu, Xiang and
Han, Wenjuan and
Jia, Zixia and
Zheng, Zilong and
Cao, Yixin and
Zhang, Meishan and
Lu, Wei and
Siddharth, N. and
{\O}vrelid, Lilja and
Xue, Nianwen and
Zhang, Yue",
booktitle = "Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)",
month = aug,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.xllm-1.9/",
doi = "10.18653/v1/2025.xllm-1.9",
pages = "86--90",
ISBN = "979-8-89176-286-2",
abstract = "This paper addresses the constraints of down-stream applications of pre-trained language models (PLMs) for low-resource languages. These constraints are pre-train data deficiency preventing a low-resource language from being well represented in a PLM and inaccessibility of high-quality task-specific data annotation that limits task learning. We propose to use automatically labeled texts combined with manually annotated data in a two-stage task fine-tuning approach. The experiments revealed that utilizing such methodology combined with vocabulary adaptation may compensate for the absence of a targeted PLM or the deficiency of manually annotated data. The methodology is validated on the morphological tagging task for the Udmurt language. We publish our best model that achieved 93.25{\%} token accuracy on HuggingFace Hub along with the training code1."
}