LIME-1B Model Card


Note: This model serves as proof that a single individual, without any team or institutional backing, can develop an LLM that demonstrates competitive results.


LIME-1B

LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a compact, practical base model for:

  • Building RAG systems (context + question → answer)
  • Assistant-style Q&A and task completion
  • Summarization, explanation, and rewriting tasks in English

⚠️ LIME-1B is not RLHF/DPO-aligned and does not have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.


1. Model architecture

LIME-1B is a decoder-only Transformer with several quality-oriented design choices (a rough parameter-count check follows the table):

  • Architecture: Decoder-only Transformer
  • Parameters: 1.0B
  • Layers (decoder blocks): 32
  • d_model: 1536
  • FFN dimension (d_ff): 6144
  • Attention heads: 24
  • Vocabulary size: 50,000
  • Max sequence length: 512 tokens
  • Positional encoding: Sinusoidal
  • Normalization: RMSNorm
  • FFN: SiLU MLP
  • Attention: FlashAttention
  • Embedding tying: Output head tied to the input embedding
  • Precision (training): Mixed fp32/bf16 (autocast) + gradient clipping
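
As a quick sanity check on the 1.0B figure, the hyperparameters above can be plugged into a back-of-the-envelope count. The sketch below assumes a standard two-matrix SiLU MLP and ignores RMSNorm and bias parameters (assumptions not stated in this card), so it is an approximation rather than an exact count.

# Back-of-the-envelope parameter count from the table above.
# Assumptions (not stated in the card): non-gated SiLU MLP (two projection
# matrices), no bias terms, RMSNorm weights ignored, output head tied to the
# input embedding so it is counted once.
d_model, d_ff, n_layers, vocab = 1536, 6144, 32, 50_000

embedding = vocab * d_model                # ~76.8M, shared with the output head
attn_per_layer = 4 * d_model * d_model     # Q, K, V and output projections
ffn_per_layer = 2 * d_model * d_ff         # up- and down-projection
total = embedding + n_layers * (attn_per_layer + ffn_per_layer)

print(f"~{total / 1e9:.2f}B parameters")   # ~0.98B, consistent with the 1.0B figure

With tied embeddings the vocabulary matrix is counted once, which is why the total lands just under 1.0B.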

2. Training data

2.1 Pretraining

The base model is pretrained as a standard causal language model on English web data:

  • Corpus: FineWeb-Edu (CC-MAIN-2025-05 split)
  • Language filter: English-only subset
  • Objective: next-token prediction (causal LM); a loss sketch follows this list
  • Token budget: 20B tokens
  • Context length: 512 tokens
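
For concreteness, the next-token objective listed above is the usual shifted-label cross-entropy. The sketch below uses random tensors in place of the real model and data, so it illustrates only the loss computation, not LIME-1B's actual training loop.

import torch
import torch.nn.functional as F

# Standard next-token prediction: score position t against the token at t+1.
batch, seq_len, vocab = 4, 512, 50_000
input_ids = torch.randint(0, vocab, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab)   # stand-in for model(input_ids)

shift_logits = logits[:, :-1, :].contiguous() # predictions for positions 0..T-2
shift_labels = input_ids[:, 1:].contiguous()  # targets are the next tokens

loss = F.cross_entropy(
    shift_logits.view(-1, vocab),
    shift_labels.view(-1),
)
print(loss.item())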

2.2 Instruction fine-tuning (SFT)

After pretraining, the model is fine-tuned on a unified instruction schema:

[context (optional)] <user> instruction_text <assistant> response_text <eos>
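
As a minimal illustration of this schema, a single training example could be assembled as below. format_sft_example is a hypothetical helper (not part of this repository), and the literal end-of-sequence string is assumed; in practice it would be whatever the tokenizer uses as its eos token.

def format_sft_example(instruction, response, context="", eos_token="<eos>"):
    # Assemble one example: [context] <user> instruction <assistant> response <eos>
    return f"{context}<user>{instruction}<assistant>{response}{eos_token}"

print(format_sft_example(
    instruction="Summarize the passage in one sentence.",
    response="The passage explains how LIME-1B was trained.",
    context="(retrieved passage text, optional) ",
))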

SFT Data Mixture: ~97k examples in total, drawn from assistant-style instruction datasets with and without retrieval context (including Dolly, No Robots, Aya, Alpaca, and RAG_Multilingual).

3. Training details

3.1 Hardware

  • GPUs: 8 × NVIDIA A100 80GB (data parallel)
  • Precision: bfloat16 with gradient clipping (max_norm = 1.0); a training-step sketch follows this list
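
For reference, a bf16-autocast training step with gradient clipping typically looks like the sketch below; the linear layer, batch, and learning-rate values are placeholders rather than LIME-1B's actual training code.

import torch

# Placeholder model and data; the real setup is the 1B decoder trained with AdamW (below).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1536, 50_000).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, betas=(0.9, 0.95))

x = torch.randn(8, 1536, device=device)
y = torch.randint(0, 50_000, (8,), device=device)

optimizer.zero_grad(set_to_none=True)
# Forward pass runs in bf16 under autocast; parameters stay in fp32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
# Clip the global gradient norm to 1.0, matching the card.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()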

3.2 Pretraining

Objective: Cross-entropy loss on next-token prediction

Optimizer: AdamW

  • β₁ = 0.9
  • β₂ = 0.95
  • Weight decay applied to non-norm/non-bias parameters (see the parameter-group sketch after this list)
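
The decay/no-decay split above is commonly implemented with two optimizer parameter groups, as in the sketch below; the stock PyTorch encoder stands in for the real model, and the 0.1 weight-decay value is a placeholder because this card does not state the actual value.

import torch

# Placeholder module with the card's dimensions; the real model is the 1B decoder.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1536, nhead=24, dim_feedforward=6144),
    num_layers=2,
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # Biases and norm weights (all 1-D tensors) are excluded from weight decay.
    if param.ndim == 1 or name.endswith(".bias") or "norm" in name.lower():
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},   # placeholder decay value
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=5e-4,
    betas=(0.9, 0.95),
)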

Learning Rate Schedule (sketched after this list):

  • Peak LR: ~5e-4
  • Polynomial decay to 5e-6
  • Warmup: ~5% of total steps
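
The warmup-plus-polynomial-decay schedule can be written as a simple function of the step index. Since the card does not give the polynomial power or total step count, the sketch below assumes the linear case (power = 1.0) over a hypothetical 100k-step run; the SFT schedule has the same shape with its own peak and final values.

def lr_at_step(step, total_steps, peak_lr=5e-4, final_lr=5e-6,
               warmup_frac=0.05, power=1.0):
    # Linear warmup to peak_lr, then polynomial decay down to final_lr.
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + (peak_lr - final_lr) * (1.0 - progress) ** power

# Schedule values at a few points of a hypothetical 100k-step run.
for s in (0, 2_500, 5_000, 50_000, 99_999):
    print(s, f"{lr_at_step(s, 100_000):.2e}")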

3.3 Instruction fine-tuning (SFT)

Objective: Cross-entropy loss on next-token prediction

Optimizer: AdamW

  • β₁ = 0.9
  • β₂ = 0.95
  • Weight decay applied to non-norm/non-bias parameters

Learning Rate Schedule:

  • Peak LR: 8e-5
  • Polynomial decay to 1e-5
  • Warmup: 10% of total steps

4. Usage

# Example usage
# pip install -U transformers torch

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "anarlavrenov/LIME-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

def build_prompt(context, question):
    # Follow the SFT schema: [context (optional)] <user> question <assistant>
    uid = "<user>"
    aid = "<assistant>"
    return context + uid + question + aid

context = ""  # optional context
question = "Write five questions for a Data Scientist interview."
prompt = build_prompt(context, question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs['input_ids'].shape[1]

# Deterministic beam-search decoding (sampling disabled)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    min_new_tokens=16,
    do_sample=False,
    top_p=None,
    temperature=None,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

generated_tokens = outputs[0][input_length:]
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(output)

# 1. Can you tell us about your experience with data analysis and modeling? 
# 2. How do you approach data cleaning and preprocessing? 
# 3. How do you approach data visualization and storytelling? 
# 4. Can you walk us through a time when you used data to solve a problem? 
# 5. How do you approach the ethical considerations of data science and machine learning?
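
Because RAG-style prompting (context + question → answer) is a primary intended use, the same build_prompt helper can be reused with a retrieved passage as the optional context. The passage and question below are invented placeholders, and the generation settings are a shortened variant of the example above.

# RAG-style prompt: the retrieved passage goes in front of the <user> marker.
context = (
    "Retrieved passage: LIME-1B is a 1B-parameter decoder-only Transformer "
    "pretrained on 20B tokens of English web data. "
)
question = "How many tokens was the model pretrained on?"

inputs = tokenizer(build_prompt(context, question), return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=4,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))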

If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets (FineWeb-Edu, Dolly, No Robots, Aya, Alpaca, RAG_Multilingual, etc.) according to their respective licenses and documentation.

Anar Lavrenov

LinkedIn

Feel free to reach out with questions or feedback about LIME-1B!

5. Citation

@misc{lime1b2025,
  title         = {LIME-1B: A 1B-parameter English Causal Language Model},
  author        = {Anar Lavrenov},
  year          = {2025},
  howpublished  = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}