Udmurt Morphological Tagger (BERT-based, two-stage fine-tuning)

This model performs morphological tagging for the Udmurt language.
It was trained with a two-stage fine-tuning procedure that combines automatically (and ambiguously) annotated data with manually disambiguated data (see the paper for details).

The model reached 93.25 % token accuracy (85.7 % on ambiguous tokens) and is released together with training code.


Model Overview

  • Task: Morphological tagging (token classification)
  • Architecture: AutoModelForTokenClassification
  • Base model: cis-lmu/glot500-base
  • Languages: Udmurt
  • Label format: concatenation of POS and morphological features (e.g., N,nom,sg or V,pst,3pl), inherited from the training data
  • Special handling: labels are assigned only to the first subtoken of each word; the remaining subtokens are masked when computing the training loss (see the sketch below).
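
A minimal sketch of this subtoken alignment, assuming the word_ids() API of a fast Hugging Face tokenizer (-100 is the index ignored by PyTorch's cross-entropy loss); the word list and label ids below are placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cis-lmu/glot500-base")

words = ["word1", "word2", "word3"]   # placeholder word-level input
word_tags = [3, 17, 42]               # placeholder label ids from the tag vocabulary

encoding = tokenizer(words, is_split_into_words=True)

labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:                  # special tokens get no label
        labels.append(-100)
    elif word_id != previous_word_id:    # first subtoken of a word keeps the label
        labels.append(word_tags[word_id])
    else:                                # remaining subtokens are masked out
        labels.append(-100)
    previous_word_id = word_id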

Methodology

Two-Stage Fine-Tuning Pipeline

  1. Pre-Fine-Tuning (PFT):
    Uses automatically ambiguously labeled data (AML): texts annotated by a rule-based analyzer without contextual disambiguation.
    Introduces a modified multi-label cross-entropy (MLCE) loss that allows multiple pseudo-correct labels for ambiguous tokens (see the sketch after this list).

  2. Task Fine-Tuning (FT):
    Uses manually disambiguated labeled data (MDL) to teach precise selection within homonymous tag groups.

  3. Vocabulary Adaptation (VA):
    Applies Vocabulary Initialization with Partial Inheritance (Samenko et al., 2021) to better match Udmurt subword segmentation.
    Optimal vocabulary size: 32 K WordPiece tokens (fertility ≈ 1.18). [Not used in this model, but it improved accuracy in other training setups whose backbone models had no Udmurt pre-training.]
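
The exact MLCE formulation is given in the paper. The sketch below only illustrates the general idea behind stage 1 (admitting several pseudo-correct labels per ambiguous token), implemented here as the negative log of the total probability mass on the candidate set; this is an assumed formulation, not necessarily the paper's exact loss:

import torch
import torch.nn.functional as F

def multilabel_ce_loss(logits, candidate_mask):
    # logits: (num_tokens, num_labels) raw scores from the classification head
    # candidate_mask: (num_tokens, num_labels) 0/1 mask of the labels that the
    #   rule-based analyzer considers possible for each token
    log_probs = F.log_softmax(logits, dim=-1)
    # Log of the summed probability over all admissible labels: the model is not
    # penalized as long as it concentrates mass on any pseudo-correct candidate.
    candidate_log_prob = torch.logsumexp(
        log_probs.masked_fill(candidate_mask == 0, float("-inf")), dim=-1
    )
    return -candidate_log_prob.mean()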


Data

| Dataset | Type | Description | Volume |
|---------|------|-------------|--------|
| Train-AML | Automatically labeled | Udmurt corpus (Arkhangelskiy 2019) annotated with uniparser-grammar-udm; includes ambiguous labels | 558 K tokens / 64 K sentences |
| Train-MDL / Valid-MDL / Test-MDL | Manually disambiguated | Udmurt corpus of LingvoDoc (Normanskaja et al. 2022), 80/10/10 split | 100 K tokens / 12 K sentences |

Evaluation Results

| Backbone | Setup | Token Accuracy | Ambiguous Token Accuracy |
|----------|-------|----------------|--------------------------|
| mBERT | FT | 86.28 % | 77.04 % |
| mBERT | VA + PFT + FT | 91.38 % | 81.54 % |
| ruBERT | VA + PFT + FT | 91.24 % | 81.00 % |
| Glot500-m | PFT + FT (this model) | 93.25 % | 85.70 % |

How to Use

Example usage:

from transformers import pipeline

# Load the tagger as a token-classification pipeline.
# aggregation_strategy="first" merges subtokens back into words,
# keeping the tag predicted for the first subtoken.
tagger = pipeline(
    task="token-classification",
    model="ulyanaisaeva/bert-morph-tagger-udmurt",
    aggregation_strategy="first"
)

text = "Example sentence."
preds = tagger(text)
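
Each prediction's entity_group field holds a tag in the POS-plus-features format described above; a minimal way to split it, assuming the comma-separated convention shown in the overview:

for pred in preds:
    tag = pred["entity_group"]        # e.g. "N,nom,sg"
    pos, *features = tag.split(",")   # POS first, then morphological features
    print(pred["word"], pos, features)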

Batch inference with clean predictions:

from tqdm import tqdm

def morph_tag_batch(sentences, batch_size=16):
    """Tag a list of sentences and return (word, tag) pairs per sentence."""
    all_results = []
    for i in tqdm(range(0, len(sentences), batch_size)):
        batch = sentences[i:i+batch_size]
        preds = tagger(batch)  # one list of predictions per input sentence
        all_results.extend([
            [(p["word"], p["entity_group"]) for p in sent_preds]
            for sent_preds in preds
        ])
    return all_results

result = morph_tag_batch(sentences)  # sentences: list of raw text strings

Intended Use & Limitations

Intended Use

  • Morphological tagging for Udmurt texts.
  • Research and educational applications in computational morphology.
  • Transferable methodology for other low-resource languages.

Limitations

  • Performance validated on one language (Udmurt) only.
  • Accuracy may drop on noisy, code-switched, or non-standard orthography.

Disclaimer. For model selection and validation, a subset of the LingvoDoc Udmurt corpus was used. Although this dataset provides high-quality manual disambiguation, it was not independently verified within this study, so the released model may implicitly inherit inconsistencies or annotation artifacts present in that corpus. The goal of this research was to explore fine-tuning methodology, not to audit or benchmark the dataset itself. Users applying this model for linguistic analysis should be aware that occasional systematic biases may reflect corpus-specific labeling patterns rather than model deficiencies.


Citation

If you use this model or training methodology, please cite:

@inproceedings{isaeva-etal-2025-combining,
    title = "Combining Automated and Manual Data for Effective Downstream Fine-Tuning of Transformers for Low-Resource Language Applications",
    author = "Isaeva, Ulyana  and
      Astafurov, Danil  and
      Martynov, Nikita",
    editor = "Fei, Hao  and
      Tu, Kewei  and
      Zhang, Yuhui  and
      Hu, Xiang  and
      Han, Wenjuan  and
      Jia, Zixia  and
      Zheng, Zilong  and
      Cao, Yixin  and
      Zhang, Meishan  and
      Lu, Wei  and
      Siddharth, N.  and
      {\O}vrelid, Lilja  and
      Xue, Nianwen  and
      Zhang, Yue",
    booktitle = "Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)",
    month = aug,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.xllm-1.9/",
    doi = "10.18653/v1/2025.xllm-1.9",
    pages = "86--90",
    ISBN = "979-8-89176-286-2",
    abstract = "This paper addresses the constraints of down-stream applications of pre-trained language models (PLMs) for low-resource languages. These constraints are pre-train data deficiency preventing a low-resource language from being well represented in a PLM and inaccessibility of high-quality task-specific data annotation that limits task learning. We propose to use automatically labeled texts combined with manually annotated data in a two-stage task fine-tuning approach. The experiments revealed that utilizing such methodology combined with vocabulary adaptation may compensate for the absence of a targeted PLM or the deficiency of manually annotated data. The methodology is validated on the morphological tagging task for the Udmurt language. We publish our best model that achieved 93.25{\%} token accuracy on HuggingFace Hub along with the training code1."
}