RuBERT-ruLaw

This model is the result of continued pretraining of DeepPavlov/rubert-base-cased on the RusLawOD dataset, a large corpus of Russian legal texts (court decisions and normative acts).
The goal of this training is to improve RuBERT's performance on legal-domain tasks such as classification, information extraction, and retrieval.

Repository: https://github.com/TryDotAtwo/ruBERT-ruLaw

Training Details

  • Base model: DeepPavlov/rubert-base-cased
  • Task: Masked Language Modeling (MLM)
  • Max sequence length: 512 tokens (stride 128)
  • Batch size: 160 per device
  • Gradient accumulation: 1
  • Epochs: 8 (3 in test mode)
  • Max steps: 40,000
  • Warmup steps: 2,000
  • Mixed precision: BF16 (on A100/H100)
  • Optimizer & scheduler: Default Hugging Face Trainer settings
  • Evaluation metric: eval_loss (best checkpoint loaded at end)
  • Hardware: 3× NVIDIA H200 GPUs
  • Final eval loss: < 0.3
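Taken together, these settings imply a sizable token budget. A back-of-the-envelope calculation (an upper bound, since it assumes every sequence is packed to the full 512 tokens):

```python
# Effective batch size and token budget implied by the settings above.
# Upper bound: assumes every sequence is filled to the full 512 tokens.
per_device_batch = 160
num_gpus = 3          # 3x H200
grad_accum = 1
max_steps = 40_000
seq_len = 512

effective_batch = per_device_batch * num_gpus * grad_accum  # sequences per optimizer step
token_budget = effective_batch * seq_len * max_steps        # tokens seen over training

print(effective_batch)  # 480
print(token_budget)     # 9_830_400_000, i.e. roughly 9.8B tokens
```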

Dataset

We use the RusLawOD dataset.
Before tokenization, empty or None entries are filtered out. Tokenization is performed with the RuBERT tokenizer using truncation and a sliding window (stride 128) to maximize coverage of long documents.
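Under Hugging Face `stride` semantics, consecutive 512-token windows share 128 tokens, so each new window advances by 384 tokens. A small arithmetic sketch (hypothetical helper, not part of the training code) of how many windows a long document produces:

```python
import math

WINDOW = 512   # max sequence length
OVERLAP = 128  # HF "stride": tokens shared between consecutive windows
STEP = WINDOW - OVERLAP  # each new window advances by 384 tokens

def num_windows(n_tokens: int) -> int:
    """Number of overlapping windows needed to cover a document of n_tokens."""
    if n_tokens <= WINDOW:
        return 1
    return 1 + math.ceil((n_tokens - WINDOW) / STEP)

# A 2,000-token court decision yields 5 overlapping windows:
print(num_windows(2000))  # 5
```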

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "TryDotAtwo/rubert-rulaw"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)  # includes the MLM head used during pretraining

Evaluation Overview

Models were tested on the sud-resh-benchmark legal texts using a masked language modeling setup. Tokens were randomly masked at varying probabilities (10–40%), and models predicted them using their pre-trained heads.

Note: The ruBERT-ruLaw model was pre-trained on legal texts such as laws and statutes, but not specifically on judicial decisions. The evaluation reflects how well it generalizes to predicting masked tokens in Russian court rulings.

  • Top-1 Accuracy: fraction of masked tokens predicted exactly.
  • Top-5 Accuracy: fraction of masked tokens predicted within the top 5 candidates.

Results reflect performance across all masked tokens, aggregated over the entire dataset.
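The two metrics above can be computed directly from the model's logits at the masked positions. A minimal, framework-free sketch (`topk_accuracy` is a hypothetical helper for illustration, not the actual evaluation script):

```python
def topk_accuracy(mask_logits, true_ids, k):
    """Fraction of masked positions whose true token id is among the
    k highest-scoring vocabulary entries (Top-k accuracy)."""
    hits = 0
    for scores, true_id in zip(mask_logits, true_ids):
        topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += true_id in topk
    return hits / len(true_ids)

# Two masked positions over a toy 4-token vocabulary:
logits = [[0.1, 0.9, 0.0, 0.2],   # model favors token 1
          [0.5, 0.2, 0.3, 0.1]]   # model favors token 0
print(topk_accuracy(logits, [1, 2], k=1))  # 0.5 (first position hit, second missed)
print(topk_accuracy(logits, [1, 2], k=3))  # 1.0 (token 2 is within the top 3)
```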

MLM Accuracy Comparison

| MLM Probability | Metric | ruBERT-ruLaw | rubert-base-cased | legal-bert-base-uncased |
|-----------------|--------|--------------|-------------------|-------------------------|
| 10% | Top-1 | 81.0% | 73.0% | 45.3% |
| 10% | Top-5 | 92.2% | 87.0% | 77.2% |
| 15% | Top-1 | 78.8% | 67.9% | 45.3% |
| 15% | Top-5 | 90.8% | 83.2% | 76.7% |
| 20% | Top-1 | 76.3% | 53.8% | 45.0% |
| 20% | Top-5 | 89.0% | 71.5% | 75.9% |
| 25% | Top-1 | 73.6% | 18.0% | 44.4% |
| 25% | Top-5 | 87.0% | 31.9% | 75.0% |
| 30% | Top-1 | 70.4% | 5.9%  | 43.8% |
| 30% | Top-5 | 84.6% | 10.9% | 74.0% |
| 35% | Top-1 | 66.9% | 6.0%  | 42.9% |
| 35% | Top-5 | 81.9% | 9.1%  | 72.9% |
| 40% | Top-1 | 62.9% | 6.0%  | 41.9% |
| 40% | Top-5 | 78.5% | 8.5%  | 71.7% |

Citation

A paper describing the dataset and training process will be released on arXiv soon. [Link — coming soon]
