Long-Context XLM-RoBERTa-Large

A long‑context variant of XLM‑RoBERTa Large, extended to 4096 tokens and further pretrained on CC100 with RoBERTa‑style sequence packing and dynamic masking (MLM).

  • Context window: 4096 tokens (implemented as max_position_embeddings ≈ 4098, i.e., 4096 effective tokens + specials)
  • Objective: Masked Language Modeling (MLM) with mlm_probability=0.15
  • Data: CC100 (all languages) via 🤗 Datasets streaming, with weighted interleaving across languages
  • Packing: documents are concatenated with </s> separators and emitted as fixed‑length blocks
  • Positional embeddings: extended from the base model via interpolation, then the full model was fine‑tuned

Intended uses

  • General multilingual text representation and MLM pre‑training tasks
  • Backbone for multilingual classification, retrieval, and sequence tagging
  • Starting point for continued task‑specific fine‑tuning that benefits from longer context (documents, code, QA context passages)

How to use

from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForMaskedLM,
)
ckpt = "pierre-tassel/longcontext-xlm-roberta-large-4096-tokens"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt)

text = "Paris is the capital of <mask>."
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)  # reuse the objects loaded above
print(fill(text))
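
The longer context also makes the checkpoint usable as a plain encoder for full documents. A minimal sketch (the sample text and CLS pooling below are illustrative choices, not prescribed by the model):

import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "pierre-tassel/longcontext-xlm-roberta-large-4096-tokens"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
encoder = AutoModel.from_pretrained(ckpt)

# Any long document; replace with your own text.
long_text = " ".join(["Paris is the capital of France."] * 700)

inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# <s> (CLS) hidden state as a simple document representation.
doc_embedding = outputs.last_hidden_state[:, 0]
print(doc_embedding.shape)  # torch.Size([1, 1024])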

Model architecture

  • Base: FacebookAI/xlm-roberta-large
  • Tokenizer: XLM‑R SentencePiece
  • Max positions: extended to ~4098 (effective 4096 tokens usable)
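
The extended window can be checked directly from the configuration; a quick sanity check (the values in the comments reflect the figures above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("pierre-tassel/longcontext-xlm-roberta-large-4096-tokens")
print(config.max_position_embeddings)  # ≈ 4098 (4096 usable positions + offset)
print(config.vocab_size)               # 250002, the XLM-R SentencePiece vocabulary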

Data

  • Source: cc100 (all available language configs)
  • Access mode: streaming=True
  • Mixing: per‑language shuffled streams, then weighted sampling (alpha=0.3)
  • Preprocessing: whitespace trim
  • Packing: concatenate tokenized docs; insert </s> separators between docs; emit fixed‑length blocks. Because of interleaving and packing, individual blocks and batches can contain multiple languages (see the sketch below).
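
A minimal sketch of the streaming, weighted interleaving, and packing steps above. The language subset, per‑language document counts, and buffer sizes are placeholders for illustration, not the values used for training:

from datasets import interleave_datasets, load_dataset
from transformers import AutoTokenizer

# Placeholder language subset and document counts (assumed for illustration).
langs = ["en", "fr", "de", "sw"]
doc_counts = {"en": 55_000_000, "fr": 10_000_000, "de": 9_000_000, "sw": 300_000}

# Exponentially smoothed sampling weights: p_i proportional to n_i ** alpha, alpha = 0.3.
alpha = 0.3
weights = [doc_counts[lang] ** alpha for lang in langs]
probs = [w / sum(weights) for w in weights]

# Per-language shuffled streams, then weighted interleaving into one mixed stream.
streams = [
    load_dataset("cc100", lang=lang, split="train", streaming=True).shuffle(seed=0, buffer_size=10_000)
    for lang in langs
]
mixed = interleave_datasets(streams, probabilities=probs, seed=0)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
block_size = 4096                # final target; the curriculum below starts smaller
sep_id = tokenizer.eos_token_id  # </s> separator inserted between documents

def packed_blocks(stream):
    """Concatenate tokenized documents and emit fixed-length blocks."""
    buffer = []
    for example in stream:
        text = example["text"].strip()  # whitespace trim
        if not text:
            continue
        buffer.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
        buffer.append(sep_id)
        while len(buffer) >= block_size:
            yield {"input_ids": buffer[:block_size]}
            buffer = buffer[block_size:]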

Objective & masking

  • Data collator: DataCollatorForLanguageModeling
  • MLM with mlm_probability=0.15

Optimizer & schedule

  • Optimizer: adamw_torch_fused
  • Learning rate: 1e-4
  • Weight decay: 0.01
  • Warmup ratio: 0.005
  • Max grad norm: 1.0

Sequence‑length curriculum

  • Start block size: 768
  • Increment: +256 tokens every 1024 steps
  • Target: 4096
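
A sketch of the training setup implied by the settings above; batch size, total steps, and the dummy dataset are placeholders (the real runs consume the packed CC100 stream from the Data section):

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# Dynamic masking: a fresh 15% of tokens is masked every time a batch is collated.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="longcontext-xlmr-mlm",
    optim="adamw_torch_fused",
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.005,
    max_grad_norm=1.0,
    bf16=True,                      # the released weights are stored in BF16
    max_steps=100_000,              # placeholder, not reported above
    per_device_train_batch_size=1,  # placeholder, not reported above
)

# Sequence-length curriculum: the packing block size grows during training.
def block_size_at(step, start=768, increment=256, every=1024, target=4096):
    return min(start + (step // every) * increment, target)

# Stand-in for the packed CC100 blocks built in the Data section.
ids = tokenizer("Hello world.", add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
packed_dataset = Dataset.from_dict({"input_ids": [ids] * 8})

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=packed_dataset,
    data_collator=collator,
)
# trainer.train()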

Positional embedding extension

  • Extends absolute positional embeddings from the base model to the new max positions
  • Method: interpolation of non‑pad rows with per‑dimension mean/std matching of the tail
  • Preserves pad index semantics and gradient settings
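
A rough sketch of this extension applied to the base checkpoint; the exact interpolation and tail-statistics matching used for the released weights may differ in detail:

import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")
emb = model.roberta.embeddings.position_embeddings
old_weight = emb.weight.data              # [514, 1024] for XLM-R Large
pad_idx = emb.padding_idx                 # 1; rows 0..pad_idx are reserved
new_max = 4096 + pad_idx + 1              # 4098 = 4096 usable positions + offset

specials = old_weight[: pad_idx + 1]      # keep the reserved/pad rows untouched
body = old_weight[pad_idx + 1 :]          # learned position rows

# Linearly interpolate the learned rows along the position axis to the new length.
new_len = new_max - (pad_idx + 1)
new_body = (
    F.interpolate(body.t().unsqueeze(0), size=new_len, mode="linear", align_corners=True)
    .squeeze(0)
    .t()
)

# Per-dimension mean/std matching: rescale the stretched rows so each embedding
# dimension keeps the statistics of the original table (a simplified stand-in for
# the tail matching described above).
new_body = (new_body - new_body.mean(0)) / (new_body.std(0) + 1e-6)
new_body = new_body * body.std(0) + body.mean(0)

# Install the extended table; the new embedding stays trainable for fine-tuning.
new_emb = torch.nn.Embedding(new_max, old_weight.shape[1], padding_idx=pad_idx)
new_emb.weight.data.copy_(torch.cat([specials, new_body], dim=0))
model.roberta.embeddings.position_embeddings = new_emb
model.roberta.embeddings.register_buffer(
    "position_ids", torch.arange(new_max).unsqueeze(0), persistent=False
)
model.roberta.embeddings.register_buffer(
    "token_type_ids", torch.zeros(1, new_max, dtype=torch.long), persistent=False
)
model.config.max_position_embeddings = new_max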