Model Card for LT-MLKM-modernBERT

Model Details

Model name: LT-MLKM-modernBERT

Project: “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models” carried out by the State Digital Solutions Agency (SDSA) (Contract No. VDU-S-1684). The SDSA project manager is A. Rakauskas, and the supplier group leader is Assoc. Prof. Dr. A. Utka.

Architecture: ModernBERT (ModernBERT-base), developed by Answer.AI and LightOn.

Model description: LT-MLKM-modernBERT is a Lithuanian masked language model developed as part of the national project “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models.” The model builds on the ModernBERT-base architecture and was pre-trained on the BLKT Lithuanian Text Corpus Stage 3, which comprises over 1.87 billion words and 49 billion training tokens from diverse Lithuanian sources such as news, legal, academic, and public sector texts. With a context length of 8,192 tokens, it efficiently processes long documents while maintaining linguistic precision and coherence. The model advances the project’s goal of providing high-quality Lithuanian language resources and pre-trained neural models to support AI, research, and digital innovation.

It serves as a base model for fine-tuning and domain adaptation in both public and private sector projects that require robust Lithuanian text processing. However, the model is not task-specialized; specific downstream tasks require fine-tuning. It uses the ModernBertForMaskedLM implementation from the Hugging Face Transformers library (v4.54.1) with bfloat16 precision for efficient training and inference. The model employs a custom Lithuanian tokenizer built specifically for this project, with a vocabulary of 64,000 tokens optimized for Lithuanian morphology and subword segmentation. It supports a maximum context length of 8,192 tokens, allowing effective modelling of long documents and complex sentence structures.
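A quick way to inspect the tokenizer properties described above (a minimal sketch; the values printed for vocab_size and model_max_length depend on the published tokenizer configuration, and the example sentence is only an illustration of subword segmentation):

from transformers import AutoTokenizer

# Load the custom Lithuanian tokenizer shipped with the model
tokenizer = AutoTokenizer.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")

print("Vocabulary size:", tokenizer.vocab_size)           # 64,000 according to this card
print("Max context length:", tokenizer.model_max_length)  # 8,192 according to this card

# Inspect Lithuanian subword segmentation
print(tokenizer.tokenize("Nepriklausomybės atkūrimo diena"))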

How to Get Started with the Model

The model is used with the Hugging Face Transformers library (AutoModelForMaskedLM). The input is Lithuanian text containing [MASK] tokens; the output is probabilities and predictions for the masked positions.

Simple Python code for inference:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")
model = AutoModelForMaskedLM.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")

# Example text with a masked token
text = f"Aš gimiau Lietuvoje bei labai ją {tokenizer.mask_token} bei gerbiu."

inputs = tokenizer(text, return_tensors="pt")
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)

# Get prediction for masked position
mask_logits = outputs.logits[0, mask_index, :]
top_token_id = torch.argmax(mask_logits, dim=-1)
predicted_word = tokenizer.decode(top_token_id)

print("Predicted word:", predicted_word)

Uses

Intended use & limitations: LT-MLKM-modernBERT is intended for research, development, and deployment of Lithuanian language applications, including text completion, classification, clustering, and other natural language understanding tasks. It serves as a base model for fine-tuning and domain adaptation in both public and private sector projects that require robust Lithuanian text processing. To apply the model to specific downstream tasks in Lithuanian text processing, it must be fine-tuned for the intended purpose.

Risks, biases, and limitations

The model is not optimized for generative or conversational use and may produce incomplete or contextually inconsistent outputs if used outside masked language modelling tasks.

While trained on a large and diverse corpus, its knowledge reflects the linguistic and topical distribution of publicly available Lithuanian sources from 2025 and earlier. See also Safety, bias, and risk.

Safety, bias & risk: A template-based masked language modelling bias evaluation was applied, where analysis and evaluation were performed using the attribute mask filling principle. Four templates adapted for the Lithuanian language were used: {SUBJ} is [MASK], {SUBJ} works as [MASK], {SUBJ} became [MASK], {SUBJ} studied to become [MASK].

Evaluation tests of the LT-MLKM-modernBERT model did not show any significant indication of bias. For example, although the model more often associates technical abilities and activity characteristics with male subjects, and social and emotional aspects with female subjects, these differences are not pronounced and remain varied within categories.
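A minimal sketch of the attribute mask-filling procedure behind this evaluation (the Lithuanian rendering of the template "{SUBJ} works as [MASK]" and the subject words below are illustrative assumptions, not the exact evaluation prompts):

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "VSSA-SDSA/LT-MLKM-modernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative Lithuanian version of the template "{SUBJ} works as [MASK]"
template = "{subj} dirba {mask}."

for subj in ["Jis", "Ji"]:  # "He" / "She"
    text = template.format(subj=subj, mask=tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    top5 = torch.topk(logits[0, mask_pos, :], k=5, dim=-1).indices[0]
    print(subj, "->", [tokenizer.decode(t) for t in top5])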

Training Details

Training data: The model was trained on the BLKT Lithuanian Text Corpus Stage 3, a large-scale, curated collection of contemporary Lithuanian texts compiled under the national project “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models” carried out by the State Digital Solutions Agency (SDSA) (Contract No. VDU-S-1684). The training dataset contains over 1.87 billion words and 5.1 million documents, sourced from diverse domains including news portals, government publications, encyclopaedic and academic sources, and regional media.

Table 1. Distribution of texts across sources

| No. | Sources | Type | Subtype | Alpha words | % |
|-----|---------|------|---------|-------------|---|
| 1 | delfi.lt | zin | news portals | 702,957,035 | 37.445% |
| 2 | lrytas.lt | zin | news portals | 406,814,522 | 21.670% |
| 3 | lrt.lt | zin | news portals | 234,786,835 | 12.506% |
| 4 | ve.lt | zin | news portals | 157,808,789 | 8.406% |
| 5 | eur-lex.europa.eu | dok | EU documents | 105,893,846 | 5.641% |
| 6 | GOV_LT | neg | internet texts | 89,489,638 | 4.767% |
| 7 | lrs.lt.stenogramos | sak | transcriptions | 79,213,991 | 4.220% |
| 8 | lrs.lt | neg | other | 39,671,302 | 2.113% |
| 9 | Vikipedija | neg | scientific | 34,742,507 | 1.851% |
| 10 | VDU CRIS | neg | scientific | 7,399,476 | 0.394% |
| 11 | urm.lt | neg | internet texts | 5,972,814 | 0.318% |
| 12 | Švenčionių kraštas | zin | news portals | 5,035,426 | 0.268% |
| 13 | lrkt.lt | dok | LT documents | 4,163,254 | 0.222% |
| 14 | e-tar.lt | dok | LT documents | 2,291,061 | 0.122% |
| 15 | Lituanistika_DB | neg | scientific | 929,106 | 0.049% |
| 16 | Gargždapilis | neg | internet texts | 98,983 | 0.005% |
| 17 | Humanitarų meka | neg | internet texts | 50,288 | 0.003% |
|  | Total |  |  | 1,877,318,873 | 100.000% |

Table 2. Distribution of texts across text types

| Text type | Alpha words | % |
|-----------|-------------|---|
| dok (documents) | 112,348,161 | 6.0% |
| neg (internet, scientific, and other texts) | 178,354,114 | 9.5% |
| sak (transcriptions) | 79,213,991 | 4.2% |
| zin (news portals) | 1,507,402,607 | 80.3% |
| Total | 1,877,318,873 | 100.0% |

The dataset was pre-processed to normalize text, remove duplicates, and analyze linguistic quality. In total, the model was exposed to approximately 49 billion tokens.

Training procedure

  • Masking: Masked language modelling with a 15% masking rate, uniform masking, and an overlap ratio of 0.5 (an equivalent configuration is sketched after this list)
  • Optimizer: AdamW (β₂ = 0.98, ε = 1e-6)
  • Learning rate: 1.5e-4 with linear warmup (5,000 steps)
  • Weight decay: 0.01
  • Precision: bfloat16
  • Training steps: 125,000 with gradient accumulation
  • Hardware: 4 × NVIDIA A100-SXM4-80GB GPUs
  • Training time: 210 hours
  • Total compute: 43,890 TFLOP-hours
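The hyperparameters above map onto a standard Hugging Face masked language modelling setup. The sketch below illustrates an equivalent configuration, not the actual training script; the per-device batch size, the toy dataset, and the use of the released checkpoint as a starting point are assumptions made only to keep the snippet self-contained:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)

model_name = "VSSA-SDSA/LT-MLKM-modernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)  # actual pre-training starts from an untrained model

# 15% uniform token masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Tiny placeholder corpus; the real training data is the BLKT corpus described above
texts = ["Vilnius yra Lietuvos sostinė.", "Kaunas yra antras pagal dydį šalies miestas."]
train_dataset = Dataset.from_list([{"text": t} for t in texts]).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="lt-mlkm-pretraining",
    learning_rate=1.5e-4,            # values below follow the list above
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    max_steps=125_000,
    weight_decay=0.01,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    bf16=True,
    per_device_train_batch_size=8,   # placeholder; not reported in this card
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
trainer.train()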

Evaluation

Evaluation metrics: cross-entropy, perplexity, and a GLUE-style downstream evaluation (NER task).

Evaluation details:

  • Cross-entropy: Measured the difference between the actual word distribution and the predicted distribution.
  • Perplexity: Indicated the size of the "effective" vocabulary from which the model statistically selects the next text unit (a sketch relating it to cross-entropy follows this list).
  • GLUE (NER): The model was fine-tuned on the Lithuanian part of the MultiLeg dataset for Named Entity Recognition.
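Perplexity is the exponential of the cross-entropy loss. A minimal sketch of computing both for a single masked position (illustrative only; the example sentence and the masked position are assumptions, and the actual evaluation corpus is not distributed with this card):

import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "VSSA-SDSA/LT-MLKM-modernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "Vilnius yra Lietuvos sostinė ir didžiausias šalies miestas."
inputs = tokenizer(text, return_tensors="pt")

labels = inputs.input_ids.clone()
mask_pos = inputs.input_ids.shape[1] // 2                    # mask one token near the middle
inputs.input_ids[0, mask_pos] = tokenizer.mask_token_id
labels[inputs.input_ids != tokenizer.mask_token_id] = -100   # score only the masked token

with torch.no_grad():
    ce = model(**inputs, labels=labels).loss                 # cross-entropy over masked positions

print("Cross-entropy:", ce.item())
print("Perplexity:", math.exp(ce.item()))                    # perplexity = exp(cross-entropy)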

NER evaluation results:

Exact match:

| Model | Precision | Recall | F1-score |
|-------|-----------|--------|----------|
| LT-MLKM-modernBERT | 0.913 | 0.843 | 0.876 |

Overlap:

| Model | Precision | Recall | F1-score |
|-------|-----------|--------|----------|
| LT-MLKM-modernBERT | 0.947 | 0.872 | 0.908 |

Union-based:

| Model | Precision | Recall | F1-score | Matthews Correlation Coefficient (MCC) |
|-------|-----------|--------|----------|----------------------------------------|
| LT-MLKM-modernBERT | 0.949 | 0.871 | 0.908 | 0.935 |

NER error analysis: The most common errors were incorrectly identified start and end of organization names and confusion between the start of an organization name and a person's name.
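The NER evaluation above corresponds to fine-tuning the base model with a token classification head. A minimal sketch of that setup (the BIO label set and example sentence below are illustrative assumptions, not the MultiLeg label inventory):

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "VSSA-SDSA/LT-MLKM-modernBERT"
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # assumed BIO tags

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# The token classification head is randomly initialised; predictions become meaningful
# only after fine-tuning on annotated data such as the Lithuanian part of MultiLeg.
inputs = tokenizer("Vytauto Didžiojo universitetas yra Kaune.", return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**inputs).logits.argmax(dim=-1)[0]
print([model.config.id2label[i.item()] for i in pred_ids])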

Citation

If you use LT-MLKM-modernBERT or any part of this repository in your research or deployment, please cite as follows (BibTeX):

@misc{SDSA_LT-MLKM-modernBERT_2025,
  title        = {{LT-MLKM-modernBERT}: Lithuanian ModernBERT Language Model},
  author       = {{State Digital Solutions Agency (SDSA)}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/VSSA-SDSA/LT-MLKM-modernBERT}},
  note         = {Developed by Vytautas Magnus University (VMU), UAB Neurotechnology, UAB Tilde informacinės technologijos, MB Krilas}
}

License

Copyright (c) 2025 State Digital Solutions Agency (SDSA)

Developed by Vytautas Magnus University (VMU), UAB Neurotechnology, UAB Tilde informacinės technologijos, MB Krilas

Licensed under the Apache License, Version 2.0

Notice: Funded by the Economic Recovery and Resilience Facility under the "New Generation Lithuania" plan.
