# RuBERT-ruLaw
This model is the result of continued pretraining of DeepPavlov/rubert-base-cased
on the RusLawOD dataset, a large corpus of Russian legal texts (court decisions and normative acts).
The goal of this continued pretraining is to improve RuBERT's performance on legal-domain tasks such as classification, information extraction, and retrieval.
Repository: https://github.com/TryDotAtwo/ruBERT-ruLaw
## Training Details
- Base model: DeepPavlov/rubert-base-cased
- Task: Masked Language Modeling (MLM)
- Max sequence length: 512 tokens (stride 128)
- Batch size: 160 per device
- Gradient accumulation: 1
- Epochs: 8 (3 in test mode)
- Max steps: 40,000
- Warmup steps: 2,000
- Mixed precision: BF16 (on A100/H100)
- Optimizer & scheduler: Default Hugging Face Trainer settings
- Evaluation metric: eval_loss (best checkpoint loaded at end)
- Hardware: 3× NVIDIA H200 GPUs
- Final eval loss: < 0.3
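The released training script is not reproduced in this card, but a minimal reconstruction of the settings above with the Hugging Face Trainer might look like the sketch below. The `tokenized_dataset` variable (a pre-tokenized version of RusLawOD, see the Dataset section below) and the default eval/save intervals are assumptions.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("DeepPavlov/rubert-base-cased")

# Standard MLM collator (masks 15% of input tokens by default).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

training_args = TrainingArguments(
    output_dir="rubert-rulaw",
    per_device_train_batch_size=160,
    gradient_accumulation_steps=1,
    num_train_epochs=8,
    max_steps=40_000,           # overrides epochs when both are set
    warmup_steps=2_000,
    bf16=True,                  # BF16 mixed precision on Ampere/Hopper GPUs
    eval_strategy="steps",      # `evaluation_strategy` in older transformers
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],      # assumed pre-tokenized splits
    eval_dataset=tokenized_dataset["validation"],  # (see the Dataset section)
    data_collator=data_collator,
)
trainer.train()
```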
## Dataset
We use the RusLawOD dataset.
Before tokenization, empty or None entries are filtered out. Tokenization is performed with the RuBERT tokenizer using truncation and a sliding window (stride 128) to maximize coverage of long documents, as sketched below.
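A minimal sketch of this preprocessing step follows. The data file path and the `"text"` column name are placeholders, not the actual layout of RusLawOD.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")

# Placeholder: load RusLawOD from wherever it is stored locally.
dataset = load_dataset("json", data_files="ruslawod.jsonl", split="train")

def tokenize_fn(batch):
    # Filter out empty or None entries before tokenizing.
    texts = [t for t in batch["text"] if t]
    # Sliding window (stride 128): tokens overflowing the 512-token window
    # become additional examples with 128 tokens of overlap, so long
    # documents are covered end to end.
    return tokenizer(
        texts,
        truncation=True,
        max_length=512,
        stride=128,
        return_overflowing_tokens=True,
    )

tokenized_dataset = dataset.map(
    tokenize_fn, batched=True, remove_columns=dataset.column_names
)
```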
## Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "TryDotAtwo/rubert-rulaw"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
```
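As a quick check, the loaded model can fill a masked token in a legal sentence. The example sentence here is purely illustrative:

```python
import torch

# "The court [MASK] the claims." -- an illustrative legal sentence.
text = "Суд [MASK] исковые требования."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five most likely tokens.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5_ids.tolist()))
```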
## Evaluation Overview
Models were tested on the sud-resh-benchmark legal texts using a masked language modeling setup. Tokens were randomly masked at varying probabilities (10–40%), and models predicted them using their pre-trained heads.
Note: The ruBERT-ruLaw model was pre-trained on legal texts such as laws and statutes, but not specifically on judicial decisions. The evaluation reflects how well it generalizes to predicting masked tokens in Russian court rulings.
- Top-1 Accuracy: fraction of masked tokens predicted exactly.
- Top-5 Accuracy: fraction of masked tokens predicted within the top 5 candidates.
Results reflect performance across all masked tokens, aggregated for the dataset.
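The benchmark script itself is not reproduced here, but a minimal sketch of this protocol (random masking at a given probability, then top-1/top-k scoring) could look like the following; the exact masking and aggregation in sud-resh-benchmark may differ.

```python
import torch

def mlm_topk_accuracy(model, tokenizer, texts, mask_prob=0.15, k=5):
    """Randomly mask tokens and measure top-1 / top-k prediction accuracy."""
    model.eval()
    top1 = topk = total = 0
    for text in texts:
        enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        ids = enc["input_ids"].clone()
        # Never mask special tokens such as [CLS] and [SEP].
        special = torch.tensor(
            tokenizer.get_special_tokens_mask(
                ids[0].tolist(), already_has_special_tokens=True
            )
        ).bool()
        mask = (torch.rand(ids.shape) < mask_prob) & ~special.unsqueeze(0)
        if not mask.any():
            continue
        labels = ids[mask]
        enc["input_ids"] = ids.masked_fill(mask, tokenizer.mask_token_id)
        with torch.no_grad():
            logits = model(**enc).logits
        # Top-k predictions at each masked position.
        preds = logits[mask].topk(k, dim=-1).indices
        top1 += (preds[:, 0] == labels).sum().item()
        topk += (preds == labels.unsqueeze(1)).any(dim=1).sum().item()
        total += labels.numel()
    return top1 / total, topk / total
```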
### MLM Accuracy Comparison
| MLM Probability | Metric | ruBERT-ruLaw | rubert-base-cased | legal-bert-base-uncased |
|---|---|---|---|---|
| 10% | Top-1 | 81.0% | 73.0% | 45.3% |
| 10% | Top-5 | 92.2% | 87.0% | 77.2% |
| 15% | Top-1 | 78.8% | 67.9% | 45.3% |
| 15% | Top-5 | 90.8% | 83.2% | 76.7% |
| 20% | Top-1 | 76.3% | 53.8% | 45.0% |
| 20% | Top-5 | 89.0% | 71.5% | 75.9% |
| 25% | Top-1 | 73.6% | 18.0% | 44.4% |
| 25% | Top-5 | 87.0% | 31.9% | 75.0% |
| 30% | Top-1 | 70.4% | 5.9% | 43.8% |
| 30% | Top-5 | 84.6% | 10.9% | 74.0% |
| 35% | Top-1 | 66.9% | 6.0% | 42.9% |
| 35% | Top-5 | 81.9% | 9.1% | 72.9% |
| 40% | Top-1 | 62.9% | 6.0% | 41.9% |
| 40% | Top-5 | 78.5% | 8.5% | 71.7% |
## Citation
A paper describing the dataset and training process will be released on arXiv soon. [Link — coming soon]