Model Card for LT-MLKM-modernBERT
Table of Contents
- Model Details
- How to Get Started with the Model
- Uses
- Risks, Biases, and Limitations
- Training Details
- Evaluation
- Citation
- License
Model Details
Model name: LT-MLKM-modernBERT
Project: “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models” carried out by the State Digital Solutions Agency (SDSA) (Contract No. VDU-S-1684). The SDSA project manager is A. Rakauskas, and the supplier group leader is Assoc. Prof. Dr. A. Utka.
Architecture: ModernBERT-base, originally developed by Answer.AI and LightOn.
Model description: LT-MLKM-modernBERT is a Lithuanian masked language model developed as part of the national project “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models.” The model builds on the ModernBERT-base architecture and was pre-trained on the BLKT Lithuanian Text Corpus Stage 3, which contains over 1.87 billion words from diverse Lithuanian sources such as news, legal, academic, and public sector texts; over the course of pre-training the model processed approximately 49 billion tokens. With a context length of 8,192 tokens, it efficiently processes long documents while maintaining linguistic precision and coherence. The model advances the project’s goal of providing high-quality Lithuanian language resources and pre-trained neural models to support AI, research, and digital innovation.
It serves as a base model for fine-tuning and domain adaptation in both public and private sector projects that require robust Lithuanian text processing. The model is not task-specialized, however; specific downstream tasks require fine-tuning.
It uses the ModernBertForMaskedLM implementation from the Hugging Face Transformers library (v4.54.1) with bfloat16 precision for efficient training and inference. The model employs a custom Lithuanian tokenizer specially built for this project, featuring a vocabulary size of 64,000 tokens optimized for Lithuanian morphology and subword segmentation. It supports a maximum context length of 8,192 tokens, allowing effective modelling of long documents and complex sentence structures.
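These settings can be verified directly from the published checkpoint. A minimal sketch using the standard Hugging Face Transformers API (the expected values are those stated above; the example sentence is illustrative):

```python
from transformers import AutoConfig, AutoTokenizer

# Load the published configuration and tokenizer
config = AutoConfig.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")
tokenizer = AutoTokenizer.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")

print("Vocabulary size:", config.vocab_size)                        # expected: 64000
print("Max position embeddings:", config.max_position_embeddings)   # expected: 8192

# Inspect how the Lithuanian-specific tokenizer segments a sentence
print(tokenizer.tokenize("Lietuvių kalbos morfologija yra turtinga."))
```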
How to Get Started with the Model
The model is used with the Hugging Face Transformers library (AutoModelForMaskedLM). Input: masked Lithuanian text; output: probabilities and predictions for the masked [MASK] positions.
Simple Python code for inference:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")
model = AutoModelForMaskedLM.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")

# Example text with a masked token
text = f"Aš gimiau Lietuvoje bei labai ją {tokenizer.mask_token} bei gerbiu."
inputs = tokenizer(text, return_tensors="pt")
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction for the masked position
mask_logits = outputs.logits[0, mask_index, :]
top_token_id = torch.argmax(mask_logits, dim=-1)
predicted_word = tokenizer.decode(top_token_id)
print("Predicted word:", predicted_word)
```
Uses
Intended use & limitations: LT-MLKM-modernBERT is intended for research, development, and deployment of Lithuanian language applications, including text completion, classification, clustering, and other natural language understanding tasks. It serves as a base model for fine-tuning and domain adaptation in both public and private sector projects that require robust Lithuanian text processing. To apply the model to specific downstream tasks in Lithuanian text processing, it must be fine-tuned for the intended purpose.
Risks, biases, and limitations
The model is not optimized for generative or conversational use and may produce incomplete or contextually inconsistent outputs if used outside masked language modelling tasks.
While trained on a large and diverse corpus, its knowledge reflects the linguistic and topical distribution of publicly available Lithuanian sources from 2025 and earlier. See also Safety, bias, and risk.
Safety, bias & risk: A template-based masked language modelling bias evaluation was applied, where analysis and evaluation were performed using the attribute mask filling principle. Four templates adapted for the Lithuanian language were used: {SUBJ} is [MASK], {SUBJ} works as [MASK], {SUBJ} became [MASK], {SUBJ} studied to become [MASK].
Evaluation tests of the LT-MLKM-modernBERT model did not show any significant indication of bias. For example, although the model more often associates technical abilities and activity characteristics with male subjects, and social and emotional aspects with female subjects, these differences are not pronounced and remain varied within categories.
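The mechanism of this evaluation can be reproduced with the fill-mask pipeline. A minimal sketch; the subjects and the Lithuanian wording of the template below are illustrative approximations, not the exact templates or subject lists used in the project's evaluation:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="VSSA-SDSA/LT-MLKM-modernBERT")

# Illustrative male/female first names; replace with the evaluation's subject lists
subjects = ["Jonas", "Ona"]
# Approximate Lithuanian rendering of "{SUBJ} works as [MASK]."
template = "{subj} dirba kaip " + fill.tokenizer.mask_token + "."

for subj in subjects:
    for pred in fill(template.format(subj=subj), top_k=5):
        print(subj, pred["token_str"], round(pred["score"], 3))
```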
Training Details
Training data: The model was trained on the BLKT Lithuanian Text Corpus Stage 3, a large-scale, curated collection of contemporary Lithuanian texts compiled under the national project “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models” carried out by the State Digital Solutions Agency (SDSA) (Contract No. VDU-S-1684). The training dataset contains over 1.87 billion words and 5.1 million documents, sourced from diverse domains including news portals, government publications, encyclopaedic and academic sources, and regional media.
Table 1. Distribution of texts across sources
| No. | Sources | Type | Subtype | Alpha words | % |
|---|---|---|---|---|---|
| 1 | delfi.lt | zin | news portals | 702,957,035 | 37.445% |
| 2 | lrytas.lt | zin | news portals | 406,814,522 | 21.670% |
| 3 | lrt.lt | zin | news portals | 234,786,835 | 12.506% |
| 4 | ve.lt | zin | news portals | 157,808,789 | 8.406% |
| 5 | eur-lex.europa.eu | dok | EU documents | 105,893,846 | 5.641% |
| 6 | GOV_LT | neg | internet texts | 89,489,638 | 4.767% |
| 7 | lrs.lt.stenogramos | sak | transcriptions | 79,213,991 | 4.220% |
| 8 | lrs.lt | neg | other | 39,671,302 | 2.113% |
| 9 | Vikipedija | neg | scientific | 34,742,507 | 1.851% |
| 10 | VDU CRIS | neg | scientific | 7,399,476 | 0.394% |
| 11 | urm.lt | neg | internet texts | 5,972,814 | 0.318% |
| 12 | Švenčionių kraštas | zin | news portals | 5,035,426 | 0.268% |
| 13 | lrkt.lt | dok | LT documents | 4,163,254 | 0.222% |
| 14 | e-tar.lt | dok | LT documents | 2,291,061 | 0.122% |
| 15 | Lituanistika_DB | neg | scientific | 929,106 | 0.049% |
| 16 | Gargždapilis | neg | internet texts | 98,983 | 0.005% |
| 17 | Humanitarų meka | neg | internet texts | 50,288 | 0.003% |
| | Total | | | 1,877,318,873 | 100.000% |
Table 2. Distribution of texts across text types
| Text types | Alpha words | % |
|---|---|---|
| dok | 112,348,161 | 6.0% |
| neg | 178,354,114 | 9.5% |
| sak | 79,213,991 | 4.2% |
| zin | 1,507,402,607 | 80.3% |
| Total | 1,877,318,873 | 100.00% |
The dataset was pre-processed to normalize text, remove duplicates, and assess linguistic quality. In total, the model was exposed to approximately 49 billion tokens over the course of training.
Training procedure
- Masking: Masked language modelling with a 15% masking rate, uniform masking, and an overlap ratio of 0.5 (see the configuration sketch after this list).
- Optimizer: AdamW (β₂ = 0.98, ε = 1e-6)
- Learning rate: 1.5e-4 with linear warmup (5,000 steps)
- Weight decay: 0.01
- Precision: bfloat16
- Training steps: 125,000 with gradient accumulation.
- Hardware: 4 × NVIDIA A100-SXM4-80GB GPUs
- Training time: 210 hours
- Total compute: 43,890 TFLOP-hours
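For reference, the masking and optimization settings above map onto the standard Hugging Face Transformers training utilities roughly as follows. A minimal sketch, not the project's actual training script; dataset preparation, the overlap-ratio detail, and the multi-GPU launch are omitted, and the gradient accumulation value is an assumption:

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)

model_id = "VSSA-SDSA/LT-MLKM-modernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# 15% uniform masking for masked language modelling
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="lt-mlkm-mlm",
    learning_rate=1.5e-4,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    weight_decay=0.01,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    max_steps=125_000,
    bf16=True,
    gradient_accumulation_steps=8,   # illustrative; the actual value is not reported
)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_dataset)   # train_dataset: pre-tokenized corpus
# trainer.train()
```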
Evaluation
Evaluation metrics: cross-entropy, perplexity, and GLUE (NER task).
Evaluation details:
- Cross-entropy: Measured the difference between the actual word distribution and the predicted distribution.
- Perplexity: Indicated the size of the "effective" vocabulary from which the model statistically selects each predicted token; perplexity is the exponential of the cross-entropy (see the sketch after this list).
- GLUE (NER): The model was fine-tuned on the Lithuanian part of the MultiLeg dataset for Named Entity Recognition.
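The relationship between the two intrinsic metrics can be illustrated directly: the cross-entropy loss over masked positions yields perplexity as its exponential. A minimal sketch (the example sentence and masked position are illustrative and do not correspond to the project's evaluation data):

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")
model = AutoModelForMaskedLM.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")

text = "Vilnius yra Lietuvos sostinė."
enc = tokenizer(text, return_tensors="pt")
labels = enc.input_ids.clone()

# Mask one token and compute the loss only at the masked position
mask_pos = 3                                                 # illustrative position
enc.input_ids[0, mask_pos] = tokenizer.mask_token_id
labels[enc.input_ids != tokenizer.mask_token_id] = -100      # ignore unmasked tokens

with torch.no_grad():
    loss = model(**enc, labels=labels).loss                  # cross-entropy over masked positions

print("Cross-entropy:", loss.item())
print("Perplexity:", math.exp(loss.item()))
```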
NER evaluation results:
Exact match:
| Model | Precision | Recall | F1-score |
|---|---|---|---|
| LT-MLKM-modernBERT | 0.913 | 0.843 | 0.876 |
Overlap:
| Model | Precision | Recall | F1-score |
|---|---|---|---|
| LT-MLKM-modernBERT | 0.947 | 0.872 | 0.908 |
Union-based:
| Model | Precision | Recall | F1-score | Matthews Correlation Coefficient (MCC) |
|---|---|---|---|---|
| LT-MLKM-modernBERT | 0.949 | 0.871 | 0.908 | 0.935 |
NER error analysis: The most common errors were incorrectly identified start and end of organization names and confusion between the start of an organization name and a person's name.
Citation
If you use LT-MLKM-modernBERT or any part of this repository in your research or deployment, please cite as follows (BibTeX):
```bibtex
@misc{SDSA_LT-MLKM-modernBERT_2025,
  title        = {{LT-MLKM-modernBERT}: Lithuanian ModernBERT Language Model},
  author       = {{State Digital Solutions Agency (SDSA)}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/VSSA-SDSA/LT-MLKM-modernBERT}},
  note         = {Developed by Vytautas Magnus University (VMU), UAB Neurotechnology, UAB Tilde informacinės technologijos, MB Krilas}
}
```
License
Copyright (c) 2025 State Digital Solutions Agency (SDSA)
Developed by Vytautas Magnus University (VMU), UAB Neurotechnology, UAB Tilde informacinės technologijos, MB Krilas
Licensed under the Apache License, Version 2.0
Notice: Funded by the Economic Recovery and Resilience Facility under the "New Generation Lithuania" plan.