Kazakh RoBERTa NER Model

Model Description

This model is a fine-tuned version of kz-transformers/kaz-roberta-conversational for Named Entity Recognition (NER) in the Kazakh language.

The model was trained on the KazNERD dataset augmented with crime-related news articles from the Kazakh News Articles Dataset.

Supported Entity Types

The model recognizes 26 entity types (the KazNERD classes plus the augmented CRIME_TYPE category):

ADAGE: Proverbs and sayings
ART: Works of art
CARDINAL: Numerical values
CONTACT: Contact information
CRIME_TYPE: Types of crimes (augmented category)
DATE: Dates
DISEASE: Disease names
EVENT: Named events
FACILITY: Buildings and facilities
GPE: Geo-political entities
LANGUAGE: Languages
LAW: Laws and legal documents
LOCATION: Locations
MISCELLANEOUS: Miscellaneous entities
MONEY: Monetary values
NON_HUMAN: Non-human entities
NORP: Nationalities, religious, or political groups
ORDINAL: Ordinal numbers
ORGANISATION: Organizations
PERCENTAGE: Percentages
PERSON: Person names
POSITION: Job positions
PRODUCT: Products
PROJECT: Project names
QUANTITY: Quantities
TIME: Times

Training Data

Base Dataset: KazNERD - Kazakh Named Entity Recognition Dataset
Augmentation: 1,734 crime-related news articles from the Kazakh News Articles Dataset on Kaggle
Training Size: 92,602 sentences
Validation Size: 11,675 sentences
Test Size: 11,819 sentences

Training Procedure

Training Hyperparameters

Base Model: kz-transformers/kaz-roberta-conversational
Learning Rate: 3e-5
Batch Size: 32 (train) / 64 (eval)
Epochs: 10
Optimizer: AdamW (fused)
Weight Decay: 0.01
Warmup Steps: 500
LR Scheduler: Cosine
Gradient Accumulation Steps: 2
Mixed Precision: BF16
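
As a rough illustration of how these settings interact, the sketch below derives the effective batch size and the number of optimizer steps, then evaluates a linear-warmup plus cosine-decay learning rate at a few points. It assumes a single device and no dropped batches, neither of which is stated above. The helper name `lr_at` is hypothetical, and the schedule follows the common warmup-then-cosine form rather than a confirmed copy of the training script.

```python
import math

# Values taken from the hyperparameter list above.
TRAIN_SENTENCES = 92_602
PER_DEVICE_BATCH = 32
GRAD_ACCUM_STEPS = 2
EPOCHS = 10
WARMUP_STEPS = 500
PEAK_LR = 3e-5

# Assumption: one device, so the effective batch is batch size times accumulation steps.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
steps_per_epoch = math.ceil(TRAIN_SENTENCES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"warmup share: {WARMUP_STEPS / total_steps:.1%}")
for step in (0, 250, 500, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the 500 warmup steps cover only a small fraction of training, so the model spends almost all of its updates on the cosine decay phase.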

Results

Metric	Value
Precision	0.7685
Recall	0.8382
F1 Score	0.8019
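
The F1 score is the harmonic mean of precision and recall, so the three values above can be checked against each other. The short sketch below recomputes F1 from the rounded precision and recall; it is a consistency check on the reported numbers, not a re-evaluation of the model.

```python
# Overall precision and recall as reported in the table above.
precision = 0.7685
recall = 0.8382

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed from rounded precision and recall: {f1:.4f}")
```

The recomputed value agrees with the reported 0.8019 to within the rounding of the inputs. Because recall is higher than precision, the model produces more false positives than false negatives overall.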

Detailed Performance by Entity Type

Entity Type	Precision	Recall	F1-Score	Support
CARDINAL	0.97	0.98	0.97	2824
DATE	0.96	0.98	0.97	2611
GPE	0.97	0.97	0.97	1742
LANGUAGE	1.00	0.98	0.99	46
MONEY	0.98	0.99	0.99	441
NORP	0.98	0.98	0.98	372
ORDINAL	0.96	0.96	0.96	386
PERCENTAGE	0.98	0.98	0.98	456
POSITION	0.93	0.97	0.95	603
QUANTITY	0.96	0.98	0.97	407
PERSON	0.77	0.84	0.80	2670
ORGANISATION	0.61	0.72	0.66	3893
CRIME_TYPE	0.28	0.37	0.32	1520

CRIME_TYPE is by far the weakest category, with an F1 score of 0.32 (precision 0.28, recall 0.37): the model struggles to identify crime-related terms reliably.
PERSON (F1 0.80) and ORGANISATION (F1 0.66) also lag behind the other entity types, possibly due to the influence of crime-related examples in the training data.
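
To give a sense of how much the three weaker categories weigh on aggregate scores, the sketch below computes their share of the support in the table above, together with a macro and a support-weighted F1 over the listed rows. The table covers only 13 of the entity types and its values are rounded, so these figures are illustrative and are not expected to reproduce the overall metrics.

```python
# (entity type, F1, support) rows copied from the per-entity table above.
rows = [
    ("CARDINAL", 0.97, 2824),
    ("DATE", 0.97, 2611),
    ("GPE", 0.97, 1742),
    ("LANGUAGE", 0.99, 46),
    ("MONEY", 0.99, 441),
    ("NORP", 0.98, 372),
    ("ORDINAL", 0.96, 386),
    ("PERCENTAGE", 0.98, 456),
    ("POSITION", 0.95, 603),
    ("QUANTITY", 0.97, 407),
    ("PERSON", 0.80, 2670),
    ("ORGANISATION", 0.66, 3893),
    ("CRIME_TYPE", 0.32, 1520),
]

# The three categories discussed above as underperforming.
WEAK = {"PERSON", "ORGANISATION", "CRIME_TYPE"}

total_support = sum(support for _, _, support in rows)
weak_support = sum(support for name, _, support in rows if name in WEAK)

# Macro F1 treats every type equally; weighted F1 scales each type by its support.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(f"weak categories: {weak_support} of {total_support} entities "
      f"({weak_support / total_support:.1%})")
print(f"macro F1 over listed types: {macro_f1:.3f}")
print(f"support-weighted F1 over listed types: {weighted_f1:.3f}")
```

The three weaker types make up a large share of the listed entities, which is why the support-weighted figure falls below the macro average even though ten of the thirteen listed types score 0.95 or higher.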

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "shoplikov/kaz-roberta-ner"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

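# aggregation_strategy="simple" merges sub-word tokens into whole entity spans,
# so each result carries an "entity_group" label instead of per-token tags.
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)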

text = """Атырауда сот 23 жастағы Светлана Легкодимованың өліміне қатысты атышулы істі қарап, айыпталушыларға үкім шығарды. Сот 24 жастағы Р.Х. және 25 жастағы А.К.-ны аса ауыр қылмыс жасағаны үшін кінәлі деп таныды: ҚР Қылмыстық кодексінің 99-бабы 2-бөлігі 7) тармағы – "Алдын ала сөз байласу арқылы адамдар тобы жасаған кісі өлтіру". ҚР Қылмыстық кодексінің 202-бабы 2-бөлігі 6) тармағы – "Алдын ала сөз байласу арқылы адамдар тобы жасаған бөтеннің мүлкін қасақана жою". Айыптардың жиынтығы бойынша сот әрқайсына өмір бойына бас бостандығынан айыру жазасын тағайындады."""
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")

# Example output:
#  Атырауда: LOCATION (score: 0.998)
#  Светлана Легкодимованың: PERSON (score: 1.000)
#  Р.Х.: PERSON (score: 0.881)
#  А.К.-: PERSON (score: 0.999)
#  аса ауыр қылмыс: CRIME_TYPE (score: 0.999)
#  ҚР Қылмыстық кодексінің 99-бабы 2-бөлігі 7) тармағы: LAW (score: 1.000)
#  кісі өлтіру: CRIME_TYPE (score: 0.813)
#  ҚР Қылмыстық кодексінің 202-бабы 2-бөлігі 6) тармағы: LAW (score: 1.000)
#  бөтеннің мүлкін қасақана жою: CRIME_TYPE (score: 0.989)
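
Given the lower precision reported for some categories, it may be useful to filter predictions by confidence before using them. The sketch below groups entities by type and drops those under a score threshold. The function name `group_entities` and the 0.9 threshold are illustrative assumptions, and the sample list reuses a few values from the example output above in the dictionary shape the aggregated pipeline returns (real results also include `start` and `end` offsets).

```python
from collections import defaultdict

# Sample entries shaped like the aggregated "ner" pipeline output.
# Values are copied from the example output above.
sample_results = [
    {"entity_group": "LOCATION", "score": 0.998, "word": "Атырауда"},
    {"entity_group": "PERSON", "score": 0.881, "word": "Р.Х."},
    {"entity_group": "CRIME_TYPE", "score": 0.999, "word": "аса ауыр қылмыс"},
    {"entity_group": "CRIME_TYPE", "score": 0.813, "word": "кісі өлтіру"},
    {"entity_group": "LAW", "score": 1.000,
     "word": "ҚР Қылмыстық кодексінің 99-бабы 2-бөлігі 7) тармағы"},
]


def group_entities(results, min_score=0.9):
    """Group entity strings by type, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# The threshold is an arbitrary illustrative choice, not a tuned value.
grouped = group_entities(sample_results, min_score=0.9)
for entity_type, words in grouped.items():
    print(f"{entity_type}: {words}")
```

In practice the threshold would need to be chosen per entity type on held-out data, since a single cutoff trades precision against recall differently for each category.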

Citations

Base Model

@misc{Sagyndyk2025KazRobertaConversational,
  title  = {Kaz-RoBERTa Conversational Technical Report},
  author = {Beksultan Sagyndyk and Sanzhar Murzakhmetov and Kirill Yakunin},
  year   = {2025},
  publisher = {TechRxiv},
  doi    = {10.36227/techrxiv.175942902.25827042/v1},
  url    = {https://doi.org/10.36227/techrxiv.175942902.25827042/v1}
}

Training Dataset

@inproceedings{yeshpanov-etal-2022-kaznerd,
    title = "{K}az{NERD}: {K}azakh Named Entity Recognition Dataset",
    author = "Yeshpanov, Rustem and Khassanov, Yerbolat and Varol, Huseyin Atakan",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.44",
    pages = "417--426"
}

License

This model is released under the CC BY 4.0 license, following the licensing of the KazNERD dataset.

Acknowledgments

KazNERD dataset creators
kz-transformers team for the base model
Kazakh News Articles Dataset on Kaggle

Model Files

Format: Safetensors
Model Size: 82.9M parameters
Tensor Type: F32

Evaluation Results (self-reported, KazNERD Augmented)

Metric	Value
F1 Score	0.802
Precision	0.768
Recall	0.838
Accuracy	0.956