Long-Context XLM-RoBERTa-Large
A long‑context variant of XLM‑RoBERTa Large, extended to 4096 tokens and further pretrained on CC100 with RoBERTa‑style sequence packing and dynamic masking (MLM).
- Context window: 4096 tokens (implemented as max_position_embeddings ≈ 4098, i.e., 4096 effective tokens + specials)
- Objective: Masked Language Modeling (MLM) with mlm_probability=0.15
- Data: CC100 (all languages) via 🤗 Datasets streaming, with weighted interleaving across languages
- Packing: concatenate documents and insert </s> separators; emit fixed-length blocks
- Positional embeddings: extended from the base model via interpolation, then the full model is fine-tuned
Intended uses
- General multilingual text representation and MLM pre‑training tasks
- Backbone for multilingual classification, retrieval, and sequence tagging
- Starting point for continued task‑specific fine‑tuning that benefits from longer context (documents, code, QA context passages)
How to use
from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM

ckpt = "pierre-tassel/longcontext-xlm-roberta-large-4096-tokens"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt)

# Reuse the loaded model and tokenizer instead of letting the pipeline reload them
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = "Paris is the capital of <mask>."
print(fill(text))
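For the backbone use cases listed under Intended uses (classification, retrieval, tagging), the encoder can also be used directly to embed long documents. The sketch below is illustrative rather than a recommended recipe: the placeholder document and the mean-pooling choice are assumptions, not part of the card.

import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "pierre-tassel/longcontext-xlm-roberta-large-4096-tokens"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
encoder = AutoModel.from_pretrained(ckpt)  # encoder only, without the MLM head

# Placeholder long document; anything up to 4096 tokens fits in a single forward pass
long_document = " ".join(["Ceci est une phrase d'exemple."] * 400)
inputs = tokenizer(long_document, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape (1, seq_len, 1024)

# Mean-pool over non-padding tokens to get one vector per document
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 1024])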
Model architecture
- Base: FacebookAI/xlm-roberta-large
- Tokenizer: XLM‑R SentencePiece
- Max positions: extended to ~4098 (4096 tokens effectively usable)
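The extended context can be confirmed directly from the checkpoint configuration; the printed values below reflect the description above (1024 hidden size and 24 layers are standard for the Large architecture).

from transformers import AutoConfig

ckpt = "pierre-tassel/longcontext-xlm-roberta-large-4096-tokens"
config = AutoConfig.from_pretrained(ckpt)
print(config.max_position_embeddings)                # ~4098 = 4096 usable positions + specials
print(config.hidden_size, config.num_hidden_layers)  # 1024, 24 (XLM-R Large)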
Data
- Source: cc100 (all available language configs)
- Access mode: streaming=True
- Mixing: per‑language shuffled streams, then weighted sampling (alpha=0.3)
- Preprocessing: whitespace trim
- Packing: concatenate tokenized docs; insert </s> between docs; emit fixed-length blocks
Because of interleaving and packing, individual blocks and batches can contain multiple languages.
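A minimal sketch of how such a streamed, weighted mixture can be built with 🤗 Datasets. The language subset and per-language sizes below are placeholders, and interpreting alpha=0.3 as exponentially smoothed sampling (p_i ∝ n_i^alpha) is an assumption, not something the card states.

from datasets import load_dataset, interleave_datasets

langs = ["en", "fr", "de", "sw"]  # illustrative subset; the card streams all CC100 languages
streams = [
    load_dataset("cc100", lang=code, split="train", streaming=True).shuffle(seed=42, buffer_size=10_000)
    for code in langs
]

# Weighted sampling, assumed to follow p_i ∝ n_i^alpha with alpha = 0.3.
# The relative sizes below are placeholders, not CC100 statistics.
sizes = {"en": 100.0, "fr": 60.0, "de": 60.0, "sw": 2.0}
alpha = 0.3
weights = [sizes[code] ** alpha for code in langs]
probs = [w / sum(weights) for w in weights]

mixed = interleave_datasets(streams, probabilities=probs, seed=42)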
Objective & masking
- Data collator: DataCollatorForLanguageModeling
- MLM with mlm_probability=0.15
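The collator below reproduces this masking configuration. The sample sentence is only for illustration, and the 80/10/10 replacement split is the 🤗 Transformers default rather than something stated in the card.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("pierre-tassel/longcontext-xlm-roberta-large-4096-tokens")

# Dynamic masking: 15% of tokens are selected each time a batch is built;
# of those, 80% become <mask>, 10% a random token, 10% are left unchanged (defaults)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("Paris est la capitale de la France.")])
print(batch["input_ids"].shape, batch["labels"].shape)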
Optimizer & schedule
- Optimizer: adamw_torch_fused
- Learning rate: 1e-4
- Weight decay: 0.01
- Warmup ratio: 0.005
- Max grad norm: 1.0
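For reference, these hyperparameters map onto 🤗 TrainingArguments roughly as follows; batch size, step budget, and output directory are assumptions, not values from the card.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="longcontext-xlmr-mlm",  # hypothetical
    optim="adamw_torch_fused",
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.005,
    max_grad_norm=1.0,
    per_device_train_batch_size=8,      # assumption
    max_steps=100_000,                  # assumption
)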
Sequence-length curriculum
- Start block size: 768
- Increment: +256 tokens every 1024 steps
- Target: 4096
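A small helper makes the schedule concrete; the function name is illustrative, and it assumes the increment is applied at fixed optimizer-step intervals.

def block_size_at_step(step, start=768, increment=256, every=1024, target=4096):
    """Packed block size at a given optimizer step under the reported curriculum."""
    return min(start + (step // every) * increment, target)

# The full 4096-token blocks are reached after (4096 - 768) / 256 = 13 increments,
# i.e. from roughly step 13 * 1024 = 13312 onward.
print(block_size_at_step(0), block_size_at_step(5000), block_size_at_step(20000))  # 768 1792 4096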
Positional embedding extension
- Extends absolute positional embeddings from the base model to the new max positions
- Method: interpolation of non‑pad rows with per‑dimension mean/std matching of the tail
- Preserves pad index semantics and gradient settings
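A rough sketch of the interpolation step, assuming XLM-R's convention that the first padding_idx + 1 rows are reserved. It is illustrative only: the per-dimension mean/std matching of the tail and the preservation of gradient settings mentioned above are omitted, and the exact procedure may differ.

import torch
import torch.nn.functional as F

def extend_position_embeddings(old_weight, new_num_positions, padding_idx=1):
    """Linearly interpolate the non-pad rows of an absolute position-embedding
    matrix to a longer length, leaving the reserved pad rows untouched."""
    num_special = padding_idx + 1                 # rows reserved for padding/specials
    old_rows = old_weight[num_special:]           # (old_len, hidden)
    new_len = new_num_positions - num_special
    # Interpolate along the position axis: (1, hidden, old_len) -> (1, hidden, new_len)
    interp = F.interpolate(
        old_rows.T.unsqueeze(0), size=new_len, mode="linear", align_corners=True
    ).squeeze(0).T
    new_weight = torch.empty(
        new_num_positions, old_weight.size(1), dtype=old_weight.dtype, device=old_weight.device
    )
    new_weight[:num_special] = old_weight[:num_special]
    new_weight[num_special:] = interp
    return new_weight

# Usage (illustrative): old = model.roberta.embeddings.position_embeddings.weight.data
# new = extend_position_embeddings(old, 4098)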