Long-Context XLM-RoBERTa-Large

A long‑context variant of XLM‑RoBERTa Large, extended to 4096 tokens and further pretrained on CC100 with RoBERTa‑style sequence packing and dynamic masking (MLM).

  • Context window: 4096 tokens (implemented as max_position_embeddings ≈ 4098, i.e., 4096 effective tokens + specials)
  • Objective: Masked Language Modeling (MLM) with mlm_probability=0.15
  • Data: CC100 (all languages) via 🤗 Datasets streaming, with weighted interleaving across languages
  • Packing: documents are concatenated with </s> separators and emitted as fixed‑length blocks
  • Positional embeddings: extended from the base model via interpolation, then the full model was fine‑tuned

Intended uses

  • General multilingual text representation and MLM pre‑training tasks
  • Backbone for multilingual classification, retrieval, and sequence tagging
  • Starting point for continued task‑specific fine‑tuning that benefits from longer context (documents, code, QA context passages)

How to use

from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForMaskedLM,
)
ckpt = "pierre-tassel/longcontext-xlm-roberta-large-4096-tokens"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt)

text = "Paris is the capital of <mask>."
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)  # reuse the objects loaded above
print(fill(text))
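
The longer context also makes the checkpoint usable as a plain encoder for full documents. A minimal sketch (the sample text and CLS pooling below are illustrative choices, not prescribed by the model):

import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "pierre-tassel/longcontext-xlm-roberta-large-4096-tokens"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
encoder = AutoModel.from_pretrained(ckpt)

# Any long document; replace with your own text.
long_text = " ".join(["Paris is the capital of France."] * 700)

inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# <s> (CLS) hidden state as a simple document representation.
doc_embedding = outputs.last_hidden_state[:, 0]
print(doc_embedding.shape)  # torch.Size([1, 1024])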

Model architecture

  • Base: FacebookAI/xlm-roberta-large
  • Tokenizer: XLM‑R SentencePiece
  • Max positions: extended to ~4098 (effective 4096 tokens usable)
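
The extended window can be checked directly from the configuration; a quick sanity check (the values in the comments reflect the figures above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("pierre-tassel/longcontext-xlm-roberta-large-4096-tokens")
print(config.max_position_embeddings)  # ≈ 4098 (4096 usable positions + offset)
print(config.vocab_size)               # 250002, the XLM-R SentencePiece vocabulary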

Data

  • Source: cc100 (all available language configs)
  • Access mode: streaming=True
  • Mixing: per‑language shuffled streams, then weighted sampling (alpha=0.3)
  • Preprocessing: whitespace trim
  • Packing: concatenate tokenized docs; insert </s> separators between docs; emit fixed‑length blocks. Because of interleaving and packing, individual blocks and batches can contain multiple languages (see the sketch below).
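
A minimal sketch of the streaming, weighted interleaving, and packing steps above. The language subset, per‑language document counts, and buffer sizes are placeholders for illustration, not the values used for training:

from datasets import interleave_datasets, load_dataset
from transformers import AutoTokenizer

# Placeholder language subset and document counts (assumed for illustration).
langs = ["en", "fr", "de", "sw"]
doc_counts = {"en": 55_000_000, "fr": 10_000_000, "de": 9_000_000, "sw": 300_000}

# Exponentially smoothed sampling weights: p_i proportional to n_i ** alpha, alpha = 0.3.
alpha = 0.3
weights = [doc_counts[lang] ** alpha for lang in langs]
probs = [w / sum(weights) for w in weights]

# Per-language shuffled streams, then weighted interleaving into one mixed stream.
streams = [
    load_dataset("cc100", lang=lang, split="train", streaming=True).shuffle(seed=0, buffer_size=10_000)
    for lang in langs
]
mixed = interleave_datasets(streams, probabilities=probs, seed=0)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
block_size = 4096                # final target; the curriculum below starts smaller
sep_id = tokenizer.eos_token_id  # </s> separator inserted between documents

def packed_blocks(stream):
    """Concatenate tokenized documents and emit fixed-length blocks."""
    buffer = []
    for example in stream:
        text = example["text"].strip()  # whitespace trim
        if not text:
            continue
        buffer.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
        buffer.append(sep_id)
        while len(buffer) >= block_size:
            yield {"input_ids": buffer[:block_size]}
            buffer = buffer[block_size:]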

Objective & masking

  • Data collator: DataCollatorForLanguageModeling
  • MLM with mlm_probability=0.15

Optimizer & schedule

  • Optimizer: adamw_torch_fused
  • Learning rate: 1e-4
  • Weight decay: 0.01
  • Warmup ratio: 0.005
  • Max grad norm: 1.0

Sequence‑length curriculum

  • Start block size: 768
  • Increment: +256 tokens every 1024 steps
  • Target: 4096
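
A sketch of the training setup implied by the settings above; batch size, total steps, and the dummy dataset are placeholders (the real runs consume the packed CC100 stream from the Data section):

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# Dynamic masking: a fresh 15% of tokens is masked every time a batch is collated.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="longcontext-xlmr-mlm",
    optim="adamw_torch_fused",
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.005,
    max_grad_norm=1.0,
    bf16=True,                      # the released weights are stored in BF16
    max_steps=100_000,              # placeholder, not reported above
    per_device_train_batch_size=1,  # placeholder, not reported above
)

# Sequence-length curriculum: the packing block size grows during training.
def block_size_at(step, start=768, increment=256, every=1024, target=4096):
    return min(start + (step // every) * increment, target)

# Stand-in for the packed CC100 blocks built in the Data section.
ids = tokenizer("Hello world.", add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
packed_dataset = Dataset.from_dict({"input_ids": [ids] * 8})

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=packed_dataset,
    data_collator=collator,
)
# trainer.train()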

Positional embedding extension

  • Extends absolute positional embeddings from the base model to the new max positions
  • Method: interpolation of non‑pad rows with per‑dimension mean/std matching of the tail
  • Preserves pad index semantics and gradient settings
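
A rough sketch of this extension applied to the base checkpoint; the exact interpolation and tail-statistics matching used for the released weights may differ in detail:

import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")
emb = model.roberta.embeddings.position_embeddings
old_weight = emb.weight.data              # [514, 1024] for XLM-R Large
pad_idx = emb.padding_idx                 # 1; rows 0..pad_idx are reserved
new_max = 4096 + pad_idx + 1              # 4098 = 4096 usable positions + offset

specials = old_weight[: pad_idx + 1]      # keep the reserved/pad rows untouched
body = old_weight[pad_idx + 1 :]          # learned position rows

# Linearly interpolate the learned rows along the position axis to the new length.
new_len = new_max - (pad_idx + 1)
new_body = (
    F.interpolate(body.t().unsqueeze(0), size=new_len, mode="linear", align_corners=True)
    .squeeze(0)
    .t()
)

# Per-dimension mean/std matching: rescale the stretched rows so each embedding
# dimension keeps the statistics of the original table (a simplified stand-in for
# the tail matching described above).
new_body = (new_body - new_body.mean(0)) / (new_body.std(0) + 1e-6)
new_body = new_body * body.std(0) + body.mean(0)

# Install the extended table; the new embedding stays trainable for fine-tuning.
new_emb = torch.nn.Embedding(new_max, old_weight.shape[1], padding_idx=pad_idx)
new_emb.weight.data.copy_(torch.cat([specials, new_body], dim=0))
model.roberta.embeddings.position_embeddings = new_emb
model.roberta.embeddings.register_buffer(
    "position_ids", torch.arange(new_max).unsqueeze(0), persistent=False
)
model.roberta.embeddings.register_buffer(
    "token_type_ids", torch.zeros(1, new_max, dtype=torch.long), persistent=False
)
model.config.max_position_embeddings = new_max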