Udmurt Morphological Tagger (BERT-based, two-stage fine-tuning)
This model performs morphological tagging for the Udmurt language.
It was trained with a two-stage fine-tuning procedure that combines automatically (and ambiguously) annotated data with manually disambiguated data (see the paper for details).
The model reached 93.25% token accuracy (85.70% on ambiguous tokens) on the test set and is released together with the training code.
Model Overview
- Task: Morphological tagging (token classification)
- Architecture: AutoModelForTokenClassification
- Base model: cis-lmu/glot500-base
- Languages: Udmurt
- Label format: concatenation of POS + morphological features (e.g., `N,nom,sg`, `V,pst,3pl`), inherited from the training data
- Special handling: labels are assigned only to the first subtoken of each word; the remaining subtokens are masked out of the training loss (see the sketch below).
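The first-subtoken labeling can be reproduced with the standard label-alignment recipe for token classification: each word's tag goes to its first subtoken, and every other position gets `-100` so the cross-entropy loss ignores it. The sketch below is only an illustration of that convention, not the released training code; it assumes the base tokenizer loads as a fast tokenizer (needed for `word_ids()`), and the words and tags are placeholders.

```python
from transformers import AutoTokenizer

# assumes cis-lmu/glot500-base resolves to a fast tokenizer (required for word_ids())
tokenizer = AutoTokenizer.from_pretrained("cis-lmu/glot500-base")

def align_labels(words, word_tags, label2id, max_length=128):
    """Tokenize pre-split words; keep a label only on the first subtoken of each word."""
    enc = tokenizer(words, is_split_into_words=True,
                    truncation=True, max_length=max_length)
    labels, prev_word_id = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev_word_id:
            labels.append(-100)   # special tokens and non-first subtokens: ignored by the loss
        else:
            labels.append(label2id[word_tags[word_id]])
        prev_word_id = word_id
    enc["labels"] = labels
    return enc

# placeholder words and tags, only to show the shapes; not real corpus entries
label2id = {"N,nom,sg": 0, "V,pst,3pl": 1}
features = align_labels(["токен1", "токен2"], ["N,nom,sg", "V,pst,3pl"], label2id)
```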
Methodology
Two-Stage Fine-Tuning Pipeline
Pre-Fine-Tuning (PFT):
- Uses automatically ambiguously labeled data (AML): texts annotated by a rule-based analyzer without contextual disambiguation.
- Introduces a modified multi-label cross-entropy (MLCE) loss that allows multiple pseudo-correct labels for ambiguous tokens (a sketch follows this list).

Task Fine-Tuning (FT):
- Uses manually disambiguated labeled data (MDL) to teach precise selection within homonymous tag groups.

Vocabulary Adaptation (VA):
- Applies Vocabulary Initialization with Partial Inheritance (Samenko et al., 2021) to better match Udmurt subword segmentation.
- Optimal vocabulary size: 32K WordPiece tokens (fertility ≈ 1.18). VA was not used in this model, but it improved accuracy in other training setups whose backbone models had no Udmurt pre-training.
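The exact MLCE formulation is given in the paper; the sketch below is only one plausible reading of it: every ambiguous token gets a multi-hot target over the tags the analyzer allows, and the loss is a cross-entropy against that (uniformly normalized) target, restricted to first-subtoken positions. Function and tensor names are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def multilabel_ce_loss(logits, candidate_mask, loss_mask):
    """Cross-entropy against a uniform distribution over the candidate tags of each token.

    logits:         (batch, seq, num_labels) scores from the tagging head
    candidate_mask: (batch, seq, num_labels) 1.0 for every tag the rule-based
                    analyzer allows for the token, 0.0 otherwise
    loss_mask:      (batch, seq) 1.0 for first subtokens, 0.0 elsewhere
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # uniform pseudo-target over all candidate tags of a token
    target = candidate_mask / candidate_mask.sum(-1, keepdim=True).clamp(min=1.0)
    token_loss = -(target * log_probs).sum(-1)                  # (batch, seq)
    return (token_loss * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)

# toy shapes only: 1 sentence, 3 positions, 4 labels
loss = multilabel_ce_loss(
    torch.randn(1, 3, 4),
    torch.tensor([[[1., 1., 0., 0.],    # ambiguous token: two candidate tags
                   [0., 0., 1., 0.],    # unambiguous token
                   [0., 0., 0., 0.]]]), # padding / non-first subtoken
    torch.tensor([[1., 1., 0.]]),
)
```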
Data
| Dataset | Type | Description | Volume |
|---|---|---|---|
| Train-AML | Automatically labeled | Udmurt corpus (Arkhangelskiy 2019) annotated with uniparser-grammar-udm; includes ambiguous labels | 558 K tokens / 64 K sentences |
| Train-MDL / Valid-MDL / Test-MDL | Manually disambiguated | The Udmurt corpus of LingvoDoc (Normanskaja et al. 2022), 80/10/10 split | 100 K tokens / 12 K sentences |
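For intuition, an AML sentence pairs each token with the full set of analyses returned by uniparser-grammar-udm, while an MDL sentence keeps exactly one tag per token. The snippet below is a hypothetical illustration of the two formats; the words and tags are placeholders, not real corpus entries.

```python
# AML: ambiguous, possibly several candidate tags per token
aml_example = [
    ("токен1", ["N,nom,sg", "V,pst,3pl"]),   # analyzer could not disambiguate
    ("токен2", ["ADV"]),                     # unambiguous token
]

# MDL: manually disambiguated, exactly one tag per token
mdl_example = [
    ("токен1", "N,nom,sg"),
    ("токен2", "ADV"),
]
```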
Evaluation Results
| Backbone | Setup | Token Accuracy | Ambiguous Token Accuracy |
|---|---|---|---|
| mBERT | FT | 86.28 % | 77.04 % |
| mBERT | VA + PFT + FT | 91.38 % | 81.54 % |
| ruBERT | VA + PFT + FT | 91.24 % | 81.00 % |
| Glot500-m | PFT + FT (this model) | 93.25 % | 85.70 % |
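Both metrics are plain token-level accuracies; the ambiguous-token figure presumably restricts the average to tokens for which the rule-based analyzer proposed more than one tag. A minimal sketch of that computation (array names are assumptions):

```python
import numpy as np

def token_accuracies(pred_ids, gold_ids, is_ambiguous):
    """pred_ids, gold_ids: predicted / gold label ids per evaluated token;
    is_ambiguous: True where the analyzer proposed more than one tag."""
    correct = np.asarray(pred_ids) == np.asarray(gold_ids)
    is_ambiguous = np.asarray(is_ambiguous, dtype=bool)
    overall = correct.mean()
    ambiguous = correct[is_ambiguous].mean() if is_ambiguous.any() else float("nan")
    return overall, ambiguous
```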
How to Use
Example usage:
```python
from transformers import pipeline

tagger = pipeline(
    task="token-classification",
    model="ulyanaisaeva/bert-morph-tagger-udmurt",
    aggregation_strategy="first",
)

text = "Example sentence."  # replace with an Udmurt sentence
preds = tagger(text)
```
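With `aggregation_strategy="first"` the pipeline groups subtokens back into words, so each prediction is a dictionary whose `entity_group` field holds the morphological tag. A quick way to inspect the output (the tag shown in the comment is only an illustrative placeholder):

```python
for p in preds:
    print(p["word"], p["entity_group"], round(p["score"], 3))
# each item carries the usual pipeline keys, roughly:
# {"entity_group": "N,nom,sg", "score": 0.99, "word": "...", "start": 0, "end": 3}
```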
Batch inference with clean predictions:
```python
from tqdm import tqdm

def morph_tag_batch(sentences, batch_size=16):
    """Tag a list of sentences and return (word, tag) pairs per sentence."""
    all_results = []
    for i in tqdm(range(0, len(sentences), batch_size)):
        batch = sentences[i:i + batch_size]
        preds = tagger(batch)  # the pipeline defined above; returns one list per sentence
        all_results.extend([
            [(p["word"], p["entity_group"]) for p in sent_preds]
            for sent_preds in preds
        ])
    return all_results

sentences = ["..."]  # a list of Udmurt sentences
result = morph_tag_batch(sentences)
```
Intended Use & Limitations
Intended Use
- Morphological tagging for Udmurt texts.
- Research and educational applications in computational morphology.
- Transferable methodology for other low-resource languages.
Limitations
- Performance validated on one language (Udmurt) only.
- Accuracy may drop on noisy, code-switched, or non-standard orthography.
Disclaimer. For model selection and validation, a subset of the LingvoDoc Udmurt corpus was used. Although this dataset provides high-quality manual disambiguation, it was not independently verified within this study, so the released model may implicitly inherit inconsistencies or annotation artifacts present in that corpus. The goal of this research was to explore fine-tuning methodology, not to audit or benchmark the dataset itself. Users applying this model for linguistic analysis should be aware that occasional systematic biases may reflect corpus-specific labeling patterns rather than model deficiencies.
Citation
If you use this model or the training methodology, please cite:

```bibtex
@inproceedings{isaeva-etal-2025-combining,
title = "Combining Automated and Manual Data for Effective Downstream Fine-Tuning of Transformers for Low-Resource Language Applications",
author = "Isaeva, Ulyana and
Astafurov, Danil and
Martynov, Nikita",
editor = "Fei, Hao and
Tu, Kewei and
Zhang, Yuhui and
Hu, Xiang and
Han, Wenjuan and
Jia, Zixia and
Zheng, Zilong and
Cao, Yixin and
Zhang, Meishan and
Lu, Wei and
Siddharth, N. and
{\O}vrelid, Lilja and
Xue, Nianwen and
Zhang, Yue",
booktitle = "Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)",
month = aug,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.xllm-1.9/",
doi = "10.18653/v1/2025.xllm-1.9",
pages = "86--90",
ISBN = "979-8-89176-286-2",
abstract = "This paper addresses the constraints of down-stream applications of pre-trained language models (PLMs) for low-resource languages. These constraints are pre-train data deficiency preventing a low-resource language from being well represented in a PLM and inaccessibility of high-quality task-specific data annotation that limits task learning. We propose to use automatically labeled texts combined with manually annotated data in a two-stage task fine-tuning approach. The experiments revealed that utilizing such methodology combined with vocabulary adaptation may compensate for the absence of a targeted PLM or the deficiency of manually annotated data. The methodology is validated on the morphological tagging task for the Udmurt language. We publish our best model that achieved 93.25{\%} token accuracy on HuggingFace Hub along with the training code1."
}