Kazakh RoBERTa NER Model

Model Description

This model is a fine-tuned version of kz-transformers/kaz-roberta-conversational for Named Entity Recognition (NER) in the Kazakh language.

The model was trained on the KazNERD dataset augmented with crime-related news articles from the Kazakh News Articles Dataset.

Supported Entity Types

The model recognizes 26 entity types, the 25 KazNERD classes plus the augmented CRIME_TYPE category:

  • ADAGE: Proverbs and sayings
  • ART: Works of art
  • CARDINAL: Numerical values
  • CONTACT: Contact information
  • CRIME_TYPE: Types of crimes (augmented category)
  • DATE: Dates
  • DISEASE: Disease names
  • EVENT: Named events
  • FACILITY: Buildings and facilities
  • GPE: Geo-political entities
  • LANGUAGE: Languages
  • LAW: Laws and legal documents
  • LOCATION: Locations
  • MISCELLANEOUS: Miscellaneous entities
  • MONEY: Monetary values
  • NON_HUMAN: Non-human entities
  • NORP: Nationalities, religious, or political groups
  • ORDINAL: Ordinal numbers
  • ORGANISATION: Organizations
  • PERCENTAGE: Percentages
  • PERSON: Person names
  • POSITION: Job positions
  • PRODUCT: Products
  • PROJECT: Project names
  • QUANTITY: Quantities
  • TIME: Times
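
Token-classification models usually predict per-token tags rather than bare entity types. The sketch below shows how a tag set could be derived from the types listed above, assuming a BIO (IOB2) scheme with a single "O" tag for non-entity tokens. The scheme and the ordering are assumptions for illustration; the authoritative mapping is whatever the model's own configuration stores as id2label.

```python
# Entity types copied from the list above.
ENTITY_TYPES = [
    "ADAGE", "ART", "CARDINAL", "CONTACT", "CRIME_TYPE", "DATE", "DISEASE",
    "EVENT", "FACILITY", "GPE", "LANGUAGE", "LAW", "LOCATION",
    "MISCELLANEOUS", "MONEY", "NON_HUMAN", "NORP", "ORDINAL", "ORGANISATION",
    "PERCENTAGE", "PERSON", "POSITION", "PRODUCT", "PROJECT", "QUANTITY",
    "TIME",
]

# Assumption: BIO tagging, i.e. one B- (begin) and one I- (inside) tag per
# type, plus "O" for tokens outside any entity. The real model may order or
# name its labels differently.
labels = ["O"]
for entity_type in ENTITY_TYPES:
    labels.append(f"B-{entity_type}")
    labels.append(f"I-{entity_type}")

label2id = {label: index for index, label in enumerate(labels)}
id2label = {index: label for label, index in label2id.items()}

print(len(ENTITY_TYPES), "entity types,", len(labels), "tags")
```

With aggregation_strategy="simple", the pipeline merges B-/I- tagged tokens back into the entity groups shown in the Usage section.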

Training Data

  • Base Dataset: KazNERD - Kazakh Named Entity Recognition Dataset
  • Augmentation: 1,734 crime-related news articles from the Kazakh News Articles Dataset on Kaggle
  • Training Size: 92,602 sentences
  • Validation Size: 11,675 sentences
  • Test Size: 11,819 sentences
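
As a quick check on the split sizes above, the following sketch computes the total number of sentences and each split's share. It uses only the three figures listed and makes no assumption about how the split was produced.

```python
# Split sizes copied from the list above.
split_sizes = {
    "train": 92_602,
    "validation": 11_675,
    "test": 11_819,
}

total = sum(split_sizes.values())
ratios = {name: size / total for name, size in split_sizes.items()}

print(f"total sentences: {total}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

The result is close to an 80/10/10 split.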

Training Procedure

Training Hyperparameters

  • Base Model: kz-transformers/kaz-roberta-conversational
  • Learning Rate: 3e-5
  • Batch Size: 32 (train) / 64 (eval)
  • Epochs: 10
  • Optimizer: AdamW (fused)
  • Weight Decay: 0.01
  • Warmup Steps: 500
  • LR Scheduler: Cosine
  • Gradient Accumulation Steps: 2
  • Mixed Precision: BF16
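
To make the interaction between batch size, gradient accumulation, warmup, and the cosine schedule concrete, here is a minimal sketch in plain Python. It assumes single-device training and the common "linear warmup, then cosine decay to zero" form of the schedule. Neither assumption is stated in this card, so the step counts are estimates rather than logged values.

```python
import math

# Values copied from the lists above.
TRAIN_SENTENCES = 92_602
TRAIN_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 2
EPOCHS = 10
WARMUP_STEPS = 500
PEAK_LR = 3e-5

# Assumption: one device, one training example per sentence.
NUM_DEVICES = 1

effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES
steps_per_epoch = math.ceil(TRAIN_SENTENCES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
for step in (0, 250, WARMUP_STEPS, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the 500 warmup steps cover only a few percent of training, so most updates happen on the cosine decay.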

Results

Metric      Value
Precision   0.7685
Recall      0.8382
F1 Score    0.8019
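
The F1 score is the harmonic mean of precision and recall. The short check below recomputes it from the two reported values, assuming the overall figures are micro-averaged at the entity level (for macro-averaged scores this identity does not have to hold).

```python
# Overall precision and recall as reported above.
precision = 0.7685
recall = 0.8382

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

reported_f1 = 0.8019
print(f"recomputed F1: {f1:.4f} (reported: {reported_f1})")
```

The recomputed value agrees with the reported F1 to about three decimal places. The small remaining gap is consistent with precision and recall having been rounded before publication.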

Detailed Performance by Entity Type

Entity Type    Precision  Recall  F1-Score  Support
CARDINAL       0.97       0.98    0.97      2824
DATE           0.96       0.98    0.97      2611
GPE            0.97       0.97    0.97      1742
LANGUAGE       1.00       0.98    0.99      46
MONEY          0.98       0.99    0.99      441
NORP           0.98       0.98    0.98      372
ORDINAL        0.96       0.96    0.96      386
PERCENTAGE     0.98       0.98    0.98      456
POSITION       0.93       0.97    0.95      603
QUANTITY       0.96       0.98    0.97      407
PERSON         0.77       0.84    0.80      2670
ORGANISATION   0.61       0.72    0.66      3893
CRIME_TYPE     0.28       0.37    0.32      1520
  • CRIME_TYPE entities show significantly lower performance than every other listed type (precision 0.28, recall 0.37, F1 0.32), so the model is unreliable at identifying crime-related terms.
  • PERSON (F1 0.80) and ORGANISATION (F1 0.66) also score well below the remaining listed types (F1 of 0.95 or higher), possibly due to the influence of crime-related examples in the training data.

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "shoplikov/kaz-roberta-ner"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

text = """Атырауда сот 23 жастағы Светлана Легкодимованың өліміне қатысты атышулы істі қарап, айыпталушыларға үкім шығарды. Сот 24 жастағы Р.Х. және 25 жастағы А.К.-ны аса ауыр қылмыс жасағаны үшін кінәлі деп таныды: ҚР Қылмыстық кодексінің 99-бабы 2-бөлігі 7) тармағы – "Алдын ала сөз байласу арқылы адамдар тобы жасаған кісі өлтіру". ҚР Қылмыстық кодексінің 202-бабы 2-бөлігі 6) тармағы – "Алдын ала сөз байласу арқылы адамдар тобы жасаған бөтеннің мүлкін қасақана жою". Айыптардың жиынтығы бойынша сот әрқайсына өмір бойына бас бостандығынан айыру жазасын тағайындады."""
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")

# Example output:
#  Атырауда: LOCATION (score: 0.998)
#  Светлана Легкодимованың: PERSON (score: 1.000)
#  Р.Х.: PERSON (score: 0.881)
#  А.К.-: PERSON (score: 0.999)
#  аса ауыр қылмыс: CRIME_TYPE (score: 0.999)
#  ҚР Қылмыстық кодексінің 99-бабы 2-бөлігі 7) тармағы: LAW (score: 1.000)
#  кісі өлтіру: CRIME_TYPE (score: 0.813)
#  ҚР Қылмыстық кодексінің 202-бабы 2-бөлігі 6) тармағы: LAW (score: 1.000)
#  бөтеннің мүлкін қасақана жою: CRIME_TYPE (score: 0.989)
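
Because CRIME_TYPE precision is low in the per-entity results, downstream code may want to keep only high-confidence predictions for that type. The sketch below filters pipeline-style results with per-type score thresholds. The helper filter_entities and both threshold values are hypothetical and untuned. The sample entries reuse values from the example output above, keeping only the keys the filter needs.

```python
# Entries shaped like the aggregated pipeline output shown above.
sample_results = [
    {"word": "Атырауда", "entity_group": "LOCATION", "score": 0.998},
    {"word": "Р.Х.", "entity_group": "PERSON", "score": 0.881},
    {"word": "кісі өлтіру", "entity_group": "CRIME_TYPE", "score": 0.813},
    {"word": "бөтеннің мүлкін қасақана жою", "entity_group": "CRIME_TYPE", "score": 0.989},
]

# Hypothetical thresholds for illustration; tune them on validation data.
DEFAULT_MIN_SCORE = 0.5
MIN_SCORE_BY_TYPE = {"CRIME_TYPE": 0.9}


def filter_entities(results, default_min=DEFAULT_MIN_SCORE, per_type=MIN_SCORE_BY_TYPE):
    """Keep entities whose score meets the threshold for their type."""
    kept = []
    for entity in results:
        threshold = per_type.get(entity["entity_group"], default_min)
        if entity["score"] >= threshold:
            kept.append(entity)
    return kept


kept = filter_entities(sample_results)
for entity in kept:
    print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")
```

Raising a threshold trades recall for precision, so a stricter CRIME_TYPE cutoff will also discard some correct predictions.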

Citations

Base Model

@misc{Sagyndyk2025KazRobertaConversational,
  title  = {Kaz-RoBERTa Conversational Technical Report},
  author = {Beksultan Sagyndyk and Sanzhar Murzakhmetov and Kirill Yakunin},
  year   = {2025},
  publisher = {TechRxiv},
  doi    = {10.36227/techrxiv.175942902.25827042/v1},
  url    = {https://doi.org/10.36227/techrxiv.175942902.25827042/v1}
}

Training Dataset

@inproceedings{yeshpanov-etal-2022-kaznerd,
    title = "{K}az{NERD}: {K}azakh Named Entity Recognition Dataset",
    author = "Yeshpanov, Rustem and Khassanov, Yerbolat and Varol, Huseyin Atakan",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.44",
    pages = "417--426"
}

License

This model is released under the CC BY 4.0 license, following the licensing of the KazNERD dataset.

Acknowledgments

  • KazNERD dataset creators
  • kz-transformers team for the base model
  • Kazakh News Articles Dataset on Kaggle