Note: This model serves as proof that a single individual, without any team or institutional backing, can develop an LLM that demonstrates competitive results.
LIME-1B
LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a compact, practical base model for:
- Building RAG systems (context + question → answer)
- Assistant-style Q&A and task completion
- Summarization, explanation, and rewriting tasks in English
⚠️ LIME-1B is not RLHF/DPO-aligned and does not have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.
1. Model architecture
LIME-1B is a decoder-only Transformer with several quality-oriented design choices:
| Component | Value |
|---|---|
| Architecture | Decoder-only Transformer |
| Parameters | 1.0B |
| Layers (decoder blocks) | 32 |
| d_model | 1536 |
| FFN dimension (d_ff) | 6144 |
| Attention heads | 24 |
| Vocabulary size | 50,000 |
| Max sequence length | 512 tokens |
| Positional encoding | Sinusoidal |
| Norm | RMSNorm |
| FFN | SiLU MLP |
| Attention | FlashAttention |
| Embedding tying | Output head tied to input embedding |
| Precision (training) | Mixed fp32/bf16 (autocast) + grad clipping |
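As a sanity check, the table implies roughly 1.0B parameters. The back-of-the-envelope estimate below assumes a plain (non-gated) two-matrix SiLU MLP, bias-free projections, and tied input/output embeddings; these details are inferred from the table, not confirmed elsewhere in the card.

# Rough parameter count implied by the table above (assumptions: plain
# two-matrix SiLU MLP, bias-free linear layers, tied input/output embeddings).
d_model, d_ff, n_layers, vocab = 1536, 6144, 32, 50_000

embedding_params = vocab * d_model          # shared with the output head (tied)
attn_per_layer = 4 * d_model * d_model      # Q, K, V and output projections
ffn_per_layer = 2 * d_model * d_ff          # up- and down-projections
total = embedding_params + n_layers * (attn_per_layer + ffn_per_layer)

print(f"~{total / 1e9:.2f}B parameters")    # ~0.98B, consistent with the stated 1.0B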
2. Training data
2.1 Pretraining
The base model is pretrained as a standard causal language model on English web data:
- Corpus: FineWeb-Edu (CC-MAIN-2025-05 split)
- Language filter: English-only subset
- Objective: next-token prediction (causal LM)
- Token budget: 20B tokens
- Context length: 512 tokens
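The card does not publish the pretraining data pipeline. As a rough illustration of how the corpus and context length above fit together, the sketch below streams FineWeb-Edu and packs tokens into fixed 512-token blocks; the dataset path/config and the use of the LIME tokenizer are assumptions for illustration, not the author's actual script.

# Hypothetical data pipeline: stream FineWeb-Edu and pack into 512-token blocks.
# Dataset path/config and tokenizer choice are assumptions, not the training code.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("anarlavrenov/LIME-1b")
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="CC-MAIN-2025-05",
                      split="train", streaming=True)

def packed_blocks(seq_len=512):
    # Concatenate documents (separated by the EOS token) and cut fixed-length blocks.
    buffer = []
    for doc in stream:
        buffer.extend(tokenizer(doc["text"])["input_ids"] + [tokenizer.eos_token_id])
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]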
2.2 Instruction fine-tuning (SFT)
After pretraining, the model is fine-tuned on a unified instruction schema:
[context (optional)] <user> instruction_text <assistant> response_text <eos>
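As a concrete illustration of this schema, the minimal helper below renders one instruction/response pair (with optional context) into a training string. The field names are Dolly-style and the literal <eos> marker stands in for however the EOS token is actually appended during tokenization.

# Illustrative rendering of one SFT example into the unified schema above.
def format_example(instruction, response, context=""):
    return f"{context}<user>{instruction}<assistant>{response}<eos>"

print(format_example(
    instruction="Summarize the paragraph in one sentence.",
    context="LIME-1B is a 1B-parameter decoder-only Transformer trained on FineWeb-Edu.",
    response="LIME-1B is a compact 1B-parameter decoder-only language model.",
))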
SFT Data Mixture (~97k examples total):
- projecte-aina/RAG_Multilingual
- databricks/databricks-dolly-15k
- HuggingFaceH4/no_robots
- CohereLabs/aya_dataset
- yahma/alpaca-cleaned
3. Training details
3.1 Hardware
- GPUs: 8 × NVIDIA A100 80GB (data parallel)
- Precision: bfloat16 with gradient clipping (max_norm = 1.0)
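The card states the hardware and precision setup but not the training loop itself. The fragment below is a minimal sketch of one data-parallel training step with bf16 autocast and gradient clipping, assuming an HF-style model that returns a .loss and pre-built model, batch, and optimizer objects.

# Minimal sketch of one DDP training step with bf16 autocast and grad clipping.
# `model`, `batch`, and `optimizer` are assumed to exist, and the distributed
# process group is assumed to be initialized; this is not the author's code.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model.cuda(), device_ids=[torch.cuda.current_device()])

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = ddp_model(**batch).loss        # cross-entropy on next-token prediction

loss.backward()
torch.nn.utils.clip_grad_norm_(ddp_model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)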
3.2 Pretraining
Objective: Cross-entropy loss on next-token prediction
Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters
Learning Rate Schedule:
- Peak LR: ~5e-4
- Polynomial decay to 5e-6
- Warmup: ~5% of total steps
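These settings map naturally onto PyTorch's AdamW plus transformers' polynomial-decay schedule with warmup. The sketch below shows one way to reproduce them; the weight-decay coefficient and total step count are placeholders, since the card does not report them.

# Sketch of the pretraining optimizer and LR schedule described above.
# The weight_decay value and total_steps are placeholders (not given in the card).
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Norm weights and biases are 1-D; they receive no weight decay.
    (no_decay if param.ndim == 1 or name.endswith("bias") else decay).append(param)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=5e-4, betas=(0.9, 0.95),
)

total_steps = 100_000  # placeholder
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),   # ~5% warmup
    num_training_steps=total_steps,
    lr_end=5e-6,                                # decay floor from the card
)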
3.3 Instruction fine-tuning (SFT)
Objective: Cross-entropy loss on next-token prediction
Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters
Learning Rate Schedule:
- Peak LR: 8e-5
- Polynomial decay to 1e-5
- Warmup: 10% of total steps
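The SFT phase uses the same objective and optimizer family with the smaller learning rates above. The sketch below shows the loss on a single formatted example, reusing format_example from section 2.2 and a model/tokenizer loaded as in the Usage section; the card does not say whether prompt tokens are masked out of the loss, so this sketch trains on the full sequence.

# Sketch of the SFT objective on one example in the unified schema.
# Assumes `model` and `tokenizer` are loaded as in the Usage section below and
# that the forward accepts `labels`; prompt masking is not documented, so the
# loss here covers the full sequence.
text = format_example(
    instruction="Explain what a decoder-only Transformer is.",
    response="It predicts each token from the tokens that come before it.",
)
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
out = model(**enc, labels=enc["input_ids"].clone())   # cross-entropy next-token loss
out.loss.backward()                                   # optimized with AdamW at peak LR 8e-5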
4. Usage
# Example usage
# pip install -U transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "anarlavrenov/LIME-1b"

# Load the tokenizer and the bf16 model (custom model code requires trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

def build_prompt(context, question):
    # Assemble the instruction schema: [context] <user> question <assistant>
    uid = "<user>"
    aid = "<assistant>"
    return context + uid + question + aid

context = ""  # optional retrieval context; leave empty for plain instructions
question = "Write five questions for a Data Scientist interview."
prompt = build_prompt(context, question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs['input_ids'].shape[1]

# Deterministic beam-search decoding with light repetition control
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    min_new_tokens=16,
    do_sample=False,
    top_p=None,
    temperature=None,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt
generated_tokens = outputs[0][input_length:]
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(output)
# 1. Can you tell us about your experience with data analysis and modeling?
# 2. How do you approach data cleaning and preprocessing?
# 3. How do you approach data visualization and storytelling?
# 4. Can you walk us through a time when you used data to solve a problem?
# 5. How do you approach the ethical considerations of data science and machine learning?
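For the RAG use case listed at the top of this card, the same build_prompt helper takes a retrieved passage as context. The snippet below reuses the model and tokenizer loaded above; the context string is illustrative and the output is not reproduced here.

# RAG-style prompt: pass the retrieved passage as `context`.
context = "LIME-1B is a 1B-parameter decoder-only Transformer trained on FineWeb-Edu."
question = "How many parameters does LIME-1B have?"

inputs = tokenizer(build_prompt(context, question), return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=4,
    no_repeat_ngram_size=3,
    repetition_penalty=1.15,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))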
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets (FineWeb-Edu, Dolly, No Robots, Aya, Alpaca, RAG_Multilingual, etc.) according to their respective licenses and documentation.
Anar Lavrenov
Feel free to reach out with questions or feedback about LIME-1B!
5. Citation
@misc{lime1b2025,
  title        = {LIME-1B: A 1B-parameter English Causal Language Model},
  author       = {Anar Lavrenov},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}