Model Card for LT-MLKM-modernBERT

Model Details

Model name: LT-MLKM-modernBERT

Project: “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models” carried out by the State Digital Solutions Agency (SDSA) (Contract No. VDU-S-1684). The SDSA project manager is A. Rakauskas, and the supplier group leader is Assoc. Prof. Dr. A. Utka.

Architecture: ModernBERT (ModernBERT-base), developed by Answer.AI and LightOn.

Model description: LT-MLKM-modernBERT is a Lithuanian masked language model developed as part of the national project “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models.” The model builds on the ModernBERT-base architecture and was pre-trained on the BLKT Lithuanian Text Corpus Stage 3, which comprises over 1.87 billion words and 49 billion training tokens from diverse Lithuanian sources such as news, legal, academic, and public sector texts. With a context length of 8,192 tokens, it efficiently processes long documents while maintaining linguistic precision and coherence. The model advances the project’s goal of providing high-quality Lithuanian language resources and pre-trained neural models to support AI, research, and digital innovation.

It serves as a base model for fine-tuning and domain adaptation in both public and private sector projects that require robust Lithuanian text processing. However, the model is not task-specialized; specific downstream tasks require fine-tuning. It uses the ModernBertForMaskedLM implementation from the Hugging Face Transformers library (v4.54.1) with bfloat16 precision for efficient training and inference. The model employs a custom Lithuanian tokenizer built specifically for this project, with a vocabulary of 64,000 tokens optimized for Lithuanian morphology and subword segmentation. It supports a maximum context length of 8,192 tokens, allowing effective modelling of long documents and complex sentence structures.
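A quick way to inspect the tokenizer properties described above (a minimal sketch; the values printed for vocab_size and model_max_length depend on the published tokenizer configuration, and the example sentence is only an illustration of subword segmentation):

from transformers import AutoTokenizer

# Load the custom Lithuanian tokenizer shipped with the model
tokenizer = AutoTokenizer.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")

print("Vocabulary size:", tokenizer.vocab_size)           # 64,000 according to this card
print("Max context length:", tokenizer.model_max_length)  # 8,192 according to this card

# Inspect Lithuanian subword segmentation
print(tokenizer.tokenize("Nepriklausomybės atkūrimo diena"))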

How to Get Started with the Model

The model is used with the Hugging Face Transformers library (AutoModelForMaskedLM). The input is Lithuanian text containing [MASK] tokens; the output is probabilities and predictions for the masked positions.

Simple Python code for inference:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")
model = AutoModelForMaskedLM.from_pretrained("VSSA-SDSA/LT-MLKM-modernBERT")

# Example text with a masked token
text = f"Aš gimiau Lietuvoje bei labai ją {tokenizer.mask_token} bei gerbiu."

inputs = tokenizer(text, return_tensors="pt")
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)

# Get prediction for masked position
mask_logits = outputs.logits[0, mask_index, :]
top_token_id = torch.argmax(mask_logits, dim=-1)
predicted_word = tokenizer.decode(top_token_id)

print("Predicted word:", predicted_word)

Uses

Intended use & limitations: LT-MLKM-modernBERT is intended for research, development, and deployment of Lithuanian language applications, including text completion, classification, clustering, and other natural language understanding tasks. It serves as a base model for fine-tuning and domain adaptation in both public and private sector projects that require robust Lithuanian text processing. To apply the model to specific downstream tasks in Lithuanian text processing, it must be fine-tuned for the intended purpose.

Risks, biases, and limitations

The model is not optimized for generative or conversational use and may produce incomplete or contextually inconsistent outputs if used outside masked language modelling tasks.

While trained on a large and diverse corpus, its knowledge reflects the linguistic and topical distribution of publicly available Lithuanian sources from 2025 and earlier. See also Safety, bias, and risk.

Safety, bias & risk: A template-based masked language modelling bias evaluation was applied, where analysis and evaluation were performed using the attribute mask filling principle. Four templates adapted for the Lithuanian language were used: {SUBJ} is [MASK], {SUBJ} works as [MASK], {SUBJ} became [MASK], {SUBJ} studied to become [MASK].

Evaluation tests of the LT-MLKM-modernBERT model did not show any significant indication of bias. For example, although the model more often associates technical abilities and activity characteristics with male subjects, and social and emotional aspects with female subjects, these differences are not pronounced and remain varied within categories.
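A minimal sketch of the attribute mask-filling procedure behind this evaluation (the Lithuanian rendering of the template "{SUBJ} works as [MASK]" and the subject words below are illustrative assumptions, not the exact evaluation prompts):

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "VSSA-SDSA/LT-MLKM-modernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative Lithuanian version of the template "{SUBJ} works as [MASK]"
template = "{subj} dirba {mask}."

for subj in ["Jis", "Ji"]:  # "He" / "She"
    text = template.format(subj=subj, mask=tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    top5 = torch.topk(logits[0, mask_pos, :], k=5, dim=-1).indices[0]
    print(subj, "->", [tokenizer.decode(t) for t in top5])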

Training Details

Training data: The model was trained on the BLKT Lithuanian Text Corpus Stage 3, a large-scale, curated collection of contemporary Lithuanian texts compiled under the national project “Development of the General Lithuanian Language Corpus and Vectorized Lithuanian Language Models” carried out by the State Digital Solutions Agency (SDSA) (Contract No. VDU-S-1684). The training dataset contains over 1.87 billion words and 5.1 million documents, sourced from diverse domains including news portals, government publications, encyclopaedic and academic sources, and regional media.

Table 1. Distribution of texts across sources

| No. | Sources | Type | Subtype | Alpha words | % |
|-----|---------|------|---------|-------------|---|
| 1 | delfi.lt | zin | news portals | 702,957,035 | 37.445% |
| 2 | lrytas.lt | zin | news portals | 406,814,522 | 21.670% |
| 3 | lrt.lt | zin | news portals | 234,786,835 | 12.506% |
| 4 | ve.lt | zin | news portals | 157,808,789 | 8.406% |
| 5 | eur-lex.europa.eu | dok | EU documents | 105,893,846 | 5.641% |
| 6 | GOV_LT | neg | internet texts | 89,489,638 | 4.767% |
| 7 | lrs.lt.stenogramos | sak | transcriptions | 79,213,991 | 4.220% |
| 8 | lrs.lt | neg | other | 39,671,302 | 2.113% |
| 9 | Vikipedija | neg | scientific | 34,742,507 | 1.851% |
| 10 | VDU CRIS | neg | scientific | 7,399,476 | 0.394% |
| 11 | urm.lt | neg | internet texts | 5,972,814 | 0.318% |
| 12 | Švenčionių kraštas | zin | news portals | 5,035,426 | 0.268% |
| 13 | lrkt.lt | dok | LT documents | 4,163,254 | 0.222% |
| 14 | e-tar.lt | dok | LT documents | 2,291,061 | 0.122% |
| 15 | Lituanistika_DB | neg | scientific | 929,106 | 0.049% |
| 16 | Gargždapilis | neg | internet texts | 98,983 | 0.005% |
| 17 | Humanitarų meka | neg | internet texts | 50,288 | 0.003% |
|  | Total |  |  | 1,877,318,873 | 100.000% |

Table 2. Distribution of texts across text types

| Text type | Alpha words | % |
|-----------|-------------|---|
| dok (documents) | 112,348,161 | 6.0% |
| neg (internet, scientific, and other texts) | 178,354,114 | 9.5% |
| sak (transcriptions) | 79,213,991 | 4.2% |
| zin (news portals) | 1,507,402,607 | 80.3% |
| Total | 1,877,318,873 | 100.0% |

The dataset was pre-processed to normalize text, remove duplicates, and analyze linguistic quality. In total, the model was exposed to approximately 49 billion tokens.

Training procedure

  • Masking: Masked language modelling with a 15% masking rate, uniform masking, and an overlap ratio of 0.5 (an equivalent configuration is sketched after this list)
  • Optimizer: AdamW (β₂ = 0.98, ε = 1e-6)
  • Learning rate: 1.5e-4 with linear warmup (5,000 steps)
  • Weight decay: 0.01
  • Precision: bfloat16
  • Training steps: 125,000 with gradient accumulation
  • Hardware: 4 × NVIDIA A100-SXM4-80GB GPUs
  • Training time: 210 hours
  • Total compute: 43,890 TFLOP-hours
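The hyperparameters above map onto a standard Hugging Face masked language modelling setup. The sketch below illustrates an equivalent configuration, not the actual training script; the per-device batch size, the toy dataset, and the use of the released checkpoint as a starting point are assumptions made only to keep the snippet self-contained:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)

model_name = "VSSA-SDSA/LT-MLKM-modernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)  # actual pre-training starts from an untrained model

# 15% uniform token masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Tiny placeholder corpus; the real training data is the BLKT corpus described above
texts = ["Vilnius yra Lietuvos sostinė.", "Kaunas yra antras pagal dydį šalies miestas."]
train_dataset = Dataset.from_list([{"text": t} for t in texts]).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="lt-mlkm-pretraining",
    learning_rate=1.5e-4,            # values below follow the list above
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    max_steps=125_000,
    weight_decay=0.01,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    bf16=True,
    per_device_train_batch_size=8,   # placeholder; not reported in this card
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
trainer.train()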

Evaluation

Evaluation metrics: cross-entropy, perplexity, and a GLUE-style downstream evaluation (NER task).

Evaluation details:

  • Cross-entropy: Measured the difference between the actual word distribution and the predicted distribution.
  • Perplexity: Indicated the size of the "effective" vocabulary from which the model statistically selects the next text unit (a sketch relating it to cross-entropy follows this list).
  • GLUE (NER): The model was fine-tuned on the Lithuanian part of the MultiLeg dataset for Named Entity Recognition.
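Perplexity is the exponential of the cross-entropy loss. A minimal sketch of computing both for a single masked position (illustrative only; the example sentence and the masked position are assumptions, and the actual evaluation corpus is not distributed with this card):

import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "VSSA-SDSA/LT-MLKM-modernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "Vilnius yra Lietuvos sostinė ir didžiausias šalies miestas."
inputs = tokenizer(text, return_tensors="pt")

labels = inputs.input_ids.clone()
mask_pos = inputs.input_ids.shape[1] // 2                    # mask one token near the middle
inputs.input_ids[0, mask_pos] = tokenizer.mask_token_id
labels[inputs.input_ids != tokenizer.mask_token_id] = -100   # score only the masked token

with torch.no_grad():
    ce = model(**inputs, labels=labels).loss                 # cross-entropy over masked positions

print("Cross-entropy:", ce.item())
print("Perplexity:", math.exp(ce.item()))                    # perplexity = exp(cross-entropy)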

NER evaluation results:

Exact match:

| Model | Precision | Recall | F1-score |
|-------|-----------|--------|----------|
| LT-MLKM-modernBERT | 0.913 | 0.843 | 0.876 |

Overlap:

| Model | Precision | Recall | F1-score |
|-------|-----------|--------|----------|
| LT-MLKM-modernBERT | 0.947 | 0.872 | 0.908 |

Union-based:

| Model | Precision | Recall | F1-score | Matthews Correlation Coefficient (MCC) |
|-------|-----------|--------|----------|----------------------------------------|
| LT-MLKM-modernBERT | 0.949 | 0.871 | 0.908 | 0.935 |

NER error analysis: The most common errors were incorrectly identified start and end of organization names and confusion between the start of an organization name and a person's name.
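The NER evaluation above corresponds to fine-tuning the base model with a token classification head. A minimal sketch of that setup (the BIO label set and example sentence below are illustrative assumptions, not the MultiLeg label inventory):

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "VSSA-SDSA/LT-MLKM-modernBERT"
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # assumed BIO tags

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# The token classification head is randomly initialised; predictions become meaningful
# only after fine-tuning on annotated data such as the Lithuanian part of MultiLeg.
inputs = tokenizer("Vytauto Didžiojo universitetas yra Kaune.", return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**inputs).logits.argmax(dim=-1)[0]
print([model.config.id2label[i.item()] for i in pred_ids])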

Citation

If you use LT-MLKM-modernBERT or any part of this repository in your research or deployment, please cite as follows (BibTeX):

@misc{SDSA_LT-MLKM-modernBERT_2025,
  title        = {{LT-MLKM-modernBERT}: Lithuanian ModernBERT Language Model},
  author       = {{State Digital Solutions Agency (SDSA)}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/VSSA-SDSA/LT-MLKM-modernBERT}},
  note         = {Developed by Vytautas Magnus University (VMU), UAB Neurotechnology, UAB Tilde informacinės technologijos, MB Krilas}
}

License

Copyright (c) 2025 State Digital Solutions Agency (SDSA)

Developed by Vytautas Magnus University (VMU), UAB Neurotechnology, UAB Tilde informacinės technologijos, MB Krilas

Licensed under the Apache License, Version 2.0

Notice: Funded by the Economic Recovery and Resilience Facility under the "New Generation Lithuania" plan.
