Kazakh RoBERTa NER Model
Model Description
This model is a fine-tuned version of kz-transformers/kaz-roberta-conversational for Named Entity Recognition (NER) in the Kazakh language.
The model was trained on the KazNERD dataset augmented with crime-related news articles from the Kazakh News Articles Dataset.
Supported Entity Types
The model recognizes 26 entity types (the 25 KazNERD classes plus the augmented CRIME_TYPE category):
- ADAGE: Proverbs and sayings
- ART: Works of art
- CARDINAL: Numerical values
- CONTACT: Contact information
- CRIME_TYPE: Types of crimes (augmented category)
- DATE: Dates
- DISEASE: Disease names
- EVENT: Named events
- FACILITY: Buildings and facilities
- GPE: Geo-political entities
- LANGUAGE: Languages
- LAW: Laws and legal documents
- LOCATION: Locations
- MISCELLANEOUS: Miscellaneous entities
- MONEY: Monetary values
- NON_HUMAN: Non-human entities
- NORP: Nationalities, religious, or political groups
- ORDINAL: Ordinal numbers
- ORGANISATION: Organizations
- PERCENTAGE: Percentages
- PERSON: Person names
- POSITION: Job positions
- PRODUCT: Products
- PROJECT: Project names
- QUANTITY: Quantities
- TIME: Times
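The 26 types listed above are entity-level labels; a token classifier usually predicts per-token tags. The sketch below shows how such tags could be derived, assuming an IOB2 scheme (`B-`/`I-` prefixes plus `O`). That scheme and the label order are assumptions made for illustration; the authoritative mapping is whatever the checkpoint stores in `model.config.id2label`.

```python
# Entity types as listed in this model card.
ENTITY_TYPES = [
    "ADAGE", "ART", "CARDINAL", "CONTACT", "CRIME_TYPE", "DATE", "DISEASE",
    "EVENT", "FACILITY", "GPE", "LANGUAGE", "LAW", "LOCATION",
    "MISCELLANEOUS", "MONEY", "NON_HUMAN", "NORP", "ORDINAL", "ORGANISATION",
    "PERCENTAGE", "PERSON", "POSITION", "PRODUCT", "PROJECT", "QUANTITY",
    "TIME",
]


def build_label_maps(entity_types):
    """Return (label2id, id2label) for an assumed IOB2 tagging scheme."""
    labels = ["O"]  # tag for tokens outside any entity
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation token of an entity
    label2id = {label: index for index, label in enumerate(labels)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(ENTITY_TYPES)
print(len(ENTITY_TYPES), "entity types ->", len(label2id), "token labels")
# 26 entity types -> 53 token labels
```

With `aggregation_strategy="simple"`, the pipeline merges adjacent `B-`/`I-` tokens back into the entity-level groups listed above.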
Training Data
- Base Dataset: KazNERD - Kazakh Named Entity Recognition Dataset
- Augmentation: 1,734 crime-related news articles from the Kazakh News Articles Dataset on Kaggle
- Training Size: 92,602 sentences
- Validation Size: 11,675 sentences
- Test Size: 11,819 sentences
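The split sizes above work out to roughly an 80/10/10 division. A quick check of the arithmetic, using only the sentence counts stated in this card:

```python
# Sentence counts as reported in this model card.
splits = {"train": 92_602, "validation": 11_675, "test": 11_819}

total = sum(splits.values())
for name, size in splits.items():
    # Share of each split relative to all sentences.
    print(f"{name:<10} {size:>7,} sentences  {size / total:6.1%}")
print(f"{'total':<10} {total:>7,} sentences")
```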
Training Procedure
Training Hyperparameters
- Base Model: kz-transformers/kaz-roberta-conversational
- Learning Rate: 3e-5
- Batch Size: 32 (train) / 64 (eval)
- Epochs: 10
- Optimizer: AdamW (fused)
- Weight Decay: 0.01
- Warmup Steps: 500
- LR Scheduler: Cosine
- Gradient Accumulation Steps: 2
- Mixed Precision: BF16
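The hyperparameters above can be written as keyword arguments in the style of `transformers.TrainingArguments`. The following is a minimal sketch, not the author's training script: it assumes the batch sizes are per device, that training ran on a single device, and that the cosine schedule decays to zero after a linear warmup (the default behaviour of the Transformers cosine scheduler). The names `training_kwargs` and `learning_rate_at` are hypothetical.

```python
import math

# Keys follow transformers.TrainingArguments parameter names; values come
# from the list above. Pass as TrainingArguments(output_dir=..., **training_kwargs).
training_kwargs = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 32,  # assumption: reported size is per device
    "per_device_eval_batch_size": 64,
    "num_train_epochs": 10,
    "optim": "adamw_torch_fused",
    "weight_decay": 0.01,
    "warmup_steps": 500,
    "lr_scheduler_type": "cosine",
    "gradient_accumulation_steps": 2,
    "bf16": True,
}

TRAIN_SENTENCES = 92_602  # training split size from this card

# Sentences consumed per optimizer step (single-device assumption).
effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
steps_per_epoch = math.ceil(TRAIN_SENTENCES / effective_batch)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]


def learning_rate_at(step,
                     base_lr=training_kwargs["learning_rate"],
                     warmup_steps=training_kwargs["warmup_steps"],
                     total=total_steps):
    """Linear warmup followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"estimated optimizer steps: {total_steps}")
for step in (0, 250, 500, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {learning_rate_at(step):.2e}")
```

Under these assumptions the 500 warmup steps cover only the first few percent of training, so the learning rate follows the cosine decay for almost the entire run.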
Results
| Metric | Value |
|---|---|
| Precision | 0.7685 |
| Recall | 0.8382 |
| F1 Score | 0.8019 |
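The F1 score is the harmonic mean of precision and recall. Recomputing it from the rounded values in the table gives about 0.8018, which agrees with the reported 0.8019 up to rounding of the inputs:

```python
# Overall precision and recall as reported in the table above.
precision = 0.7685
recall = 0.8382

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```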
Detailed Performance by Entity Type
| Entity Type | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CARDINAL | 0.97 | 0.98 | 0.97 | 2824 |
| DATE | 0.96 | 0.98 | 0.97 | 2611 |
| GPE | 0.97 | 0.97 | 0.97 | 1742 |
| LANGUAGE | 1.00 | 0.98 | 0.99 | 46 |
| MONEY | 0.98 | 0.99 | 0.99 | 441 |
| NORP | 0.98 | 0.98 | 0.98 | 372 |
| ORDINAL | 0.96 | 0.96 | 0.96 | 386 |
| PERCENTAGE | 0.98 | 0.98 | 0.98 | 456 |
| POSITION | 0.93 | 0.97 | 0.95 | 603 |
| QUANTITY | 0.96 | 0.98 | 0.97 | 407 |
| PERSON | 0.77 | 0.84 | 0.80 | 2670 |
| ORGANISATION | 0.61 | 0.72 | 0.66 | 3893 |
| CRIME_TYPE | 0.28 | 0.37 | 0.32 | 1520 |
- CRIME_TYPE entities perform significantly worse than all other listed types (precision 0.28, recall 0.37, F1 0.32), so the model is not yet reliable at identifying crime-related terms.
- PERSON (F1 0.80) and ORGANISATION (F1 0.66) also score well below the other listed types, possibly due to the influence of crime-related examples in the training data.
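To see how much the weaker classes matter, the rows of the table above can be ranked by F1 and weighted by support. This sketch uses only the figures listed in the table; the 0.90 cut-off is an arbitrary illustrative threshold, and the table covers only some of the model's entity types, so the share computed here refers to the listed rows only.

```python
# (entity type, precision, recall, F1, support) copied from the table above.
rows = [
    ("CARDINAL", 0.97, 0.98, 0.97, 2824),
    ("DATE", 0.96, 0.98, 0.97, 2611),
    ("GPE", 0.97, 0.97, 0.97, 1742),
    ("LANGUAGE", 1.00, 0.98, 0.99, 46),
    ("MONEY", 0.98, 0.99, 0.99, 441),
    ("NORP", 0.98, 0.98, 0.98, 372),
    ("ORDINAL", 0.96, 0.96, 0.96, 386),
    ("PERCENTAGE", 0.98, 0.98, 0.98, 456),
    ("POSITION", 0.93, 0.97, 0.95, 603),
    ("QUANTITY", 0.96, 0.98, 0.97, 407),
    ("PERSON", 0.77, 0.84, 0.80, 2670),
    ("ORGANISATION", 0.61, 0.72, 0.66, 3893),
    ("CRIME_TYPE", 0.28, 0.37, 0.32, 1520),
]

F1_THRESHOLD = 0.90  # illustrative cut-off, not from the model card

# Entity types below the threshold, weakest first.
weak = sorted((row for row in rows if row[3] < F1_THRESHOLD), key=lambda r: r[3])
weak_names = [row[0] for row in weak]

total_support = sum(row[4] for row in rows)
weak_share = sum(row[4] for row in weak) / total_support

print("below threshold:", weak_names)
print(f"share of listed support: {weak_share:.0%}")
```

The three weakest types account for roughly 45% of the support in the listed rows, which is consistent with the overall F1 sitting well below the scores of the strongest classes.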
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
model_name = "shoplikov/kaz-roberta-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
text = """Атырауда сот 23 жастағы Светлана Легкодимованың өліміне қатысты атышулы істі қарап, айыпталушыларға үкім шығарды. Сот 24 жастағы Р.Х. және 25 жастағы А.К.-ны аса ауыр қылмыс жасағаны үшін кінәлі деп таныды: ҚР Қылмыстық кодексінің 99-бабы 2-бөлігі 7) тармағы – "Алдын ала сөз байласу арқылы адамдар тобы жасаған кісі өлтіру". ҚР Қылмыстық кодексінің 202-бабы 2-бөлігі 6) тармағы – "Алдын ала сөз байласу арқылы адамдар тобы жасаған бөтеннің мүлкін қасақана жою". Айыптардың жиынтығы бойынша сот әрқайсына өмір бойына бас бостандығынан айыру жазасын тағайындады."""
results = ner_pipeline(text)
# Print each aggregated entity with its type and confidence score
for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")
# Атырауда: LOCATION (score: 0.998)
# Светлана Легкодимованың: PERSON (score: 1.000)
# Р.Х.: PERSON (score: 0.881)
# А.К.-: PERSON (score: 0.999)
# аса ауыр қылмыс: CRIME_TYPE (score: 0.999)
# ҚР Қылмыстық кодексінің 99-бабы 2-бөлігі 7) тармағы: LAW (score: 1.000)
# кісі өлтіру: CRIME_TYPE (score: 0.813)
# ҚР Қылмыстық кодексінің 202-бабы 2-бөлігі 6) тармағы: LAW (score: 1.000)
# бөтеннің мүлкін қасақана жою: CRIME_TYPE (score: 0.989)
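Given the per-class results reported in this card, it may be useful to filter predictions by confidence and group them by entity type. The sketch below works on a list shaped like the pipeline output, using the `word`, `entity_group`, and `score` fields from the example above. The helper name `group_entities` and the 0.9 threshold are hypothetical choices for illustration, and the sample data is copied from the example output rather than produced by running the model.

```python
from collections import defaultdict

# Entries copied from the example output above (only the fields used here).
sample_results = [
    {"word": "Атырауда", "entity_group": "LOCATION", "score": 0.998},
    {"word": "Светлана Легкодимованың", "entity_group": "PERSON", "score": 1.000},
    {"word": "Р.Х.", "entity_group": "PERSON", "score": 0.881},
    {"word": "А.К.-", "entity_group": "PERSON", "score": 0.999},
    {"word": "аса ауыр қылмыс", "entity_group": "CRIME_TYPE", "score": 0.999},
    {"word": "ҚР Қылмыстық кодексінің 99-бабы 2-бөлігі 7) тармағы",
     "entity_group": "LAW", "score": 1.000},
    {"word": "кісі өлтіру", "entity_group": "CRIME_TYPE", "score": 0.813},
    {"word": "ҚР Қылмыстық кодексінің 202-бабы 2-бөлігі 6) тармағы",
     "entity_group": "LAW", "score": 1.000},
    {"word": "бөтеннің мүлкін қасақана жою", "entity_group": "CRIME_TYPE",
     "score": 0.989},
]


def group_entities(results, min_score=0.0):
    """Group entity strings by type, keeping those scoring at least min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


grouped = group_entities(sample_results, min_score=0.9)
for entity_type, words in grouped.items():
    print(f"{entity_type}: {words}")
```

A higher threshold trades recall for precision: here it drops the two lowest-scoring predictions (`Р.Х.` and `кісі өлтіру`), even though both appear to be valid entities in the example text.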
Citations
Base Model
@misc{Sagyndyk2025KazRobertaConversational,
title = {Kaz-RoBERTa Conversational Technical Report},
author = {Beksultan Sagyndyk and Sanzhar Murzakhmetov and Kirill Yakunin},
year = {2025},
publisher = {TechRxiv},
doi = {10.36227/techrxiv.175942902.25827042/v1},
url = {https://doi.org/10.36227/techrxiv.175942902.25827042/v1}
}
Training Dataset
@inproceedings{yeshpanov-etal-2022-kaznerd,
title = "{K}az{NERD}: {K}azakh Named Entity Recognition Dataset",
author = "Yeshpanov, Rustem and Khassanov, Yerbolat and Varol, Huseyin Atakan",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.44",
pages = "417--426"
}
License
This model is released under the CC BY 4.0 license, following the licensing of the KazNERD dataset.
Acknowledgments
- KazNERD dataset creators
- kz-transformers team for the base model
- Kazakh News Articles Dataset on Kaggle
Evaluation results
- F1 Score on KazNERD (Augmented), self-reported: 0.802
- Precision on KazNERD (Augmented), self-reported: 0.768
- Recall on KazNERD (Augmented), self-reported: 0.838
- Accuracy on KazNERD (Augmented), self-reported: 0.956