Note: This model serves as proof that a single individual, without any team or institutional backing, can develop an LLM that demonstrates competitive results.
LIME-1B
LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a compact, practical base model for:
- Building RAG systems (context + question → answer)
- Assistant-style Q&A and task completion
- Summarization, explanation, and rewriting tasks in English
⚠️ LIME-1B is not RLHF/DPO-aligned and does not have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.
1. Model architecture
LIME-1B is a decoder-only Transformer with several quality-oriented design choices:
| Component | Value |
|---|---|
| Architecture | Decoder-only Transformer |
| Parameters | 1.0B |
| Layers (decoder blocks) | 32 |
| d_model | 1536 |
| FFN dimension (d_ff) | 6144 |
| Attention heads | 24 |
| Vocabulary size | 50,000 |
| Max sequence length | 512 tokens |
| Positional encoding | Sinusoidal |
| Norm | RMSNorm |
| FFN | SiLU MLP |
| Attention | FlashAttention |
| Embedding tying | Output head tied to input embedding |
| Precision (training) | Mixed fp32/bf16 (autocast) + grad clipping |
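As a sanity check, the table implies roughly 1.0B parameters. The back-of-the-envelope estimate below assumes a plain (non-gated) two-matrix SiLU MLP, bias-free projections, and tied input/output embeddings; these details are inferred from the table, not confirmed elsewhere in the card.

# Rough parameter count implied by the table above (assumptions: plain
# two-matrix SiLU MLP, bias-free linear layers, tied input/output embeddings).
d_model, d_ff, n_layers, vocab = 1536, 6144, 32, 50_000

embedding_params = vocab * d_model          # shared with the output head (tied)
attn_per_layer = 4 * d_model * d_model      # Q, K, V and output projections
ffn_per_layer = 2 * d_model * d_ff          # up- and down-projections
total = embedding_params + n_layers * (attn_per_layer + ffn_per_layer)

print(f"~{total / 1e9:.2f}B parameters")    # ~0.98B, consistent with the stated 1.0B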
2. Training data
2.1 Pretraining
The base model is pretrained as a standard causal language model on English web data:
- Corpus: FineWeb-Edu (CC-MAIN-2025-05 split)
- Language filter: English-only subset
- Objective: next-token prediction (causal LM)
- Token budget: 20B tokens
- Context length: 512 tokens
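The card does not publish the pretraining data pipeline. As a rough illustration of how the corpus and context length above fit together, the sketch below streams FineWeb-Edu and packs tokens into fixed 512-token blocks; the dataset path/config and the use of the LIME tokenizer are assumptions for illustration, not the author's actual script.

# Hypothetical data pipeline: stream FineWeb-Edu and pack into 512-token blocks.
# Dataset path/config and tokenizer choice are assumptions, not the training code.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("anarlavrenov/LIME-1b")
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="CC-MAIN-2025-05",
                      split="train", streaming=True)

def packed_blocks(seq_len=512):
    # Concatenate documents (separated by the EOS token) and cut fixed-length blocks.
    buffer = []
    for doc in stream:
        buffer.extend(tokenizer(doc["text"])["input_ids"] + [tokenizer.eos_token_id])
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]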
2.2 Instruction fine-tuning (SFT)
After pretraining, the model is fine-tuned on a unified instruction schema:
[context (optional)] <user> instruction_text <assistant> response_text <eos>
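As a concrete illustration of this schema, the minimal helper below renders one instruction/response pair (with optional context) into a training string. The field names are Dolly-style and the literal <eos> marker stands in for however the EOS token is actually appended during tokenization.

# Illustrative rendering of one SFT example into the unified schema above.
def format_example(instruction, response, context=""):
    return f"{context}<user>{instruction}<assistant>{response}<eos>"

print(format_example(
    instruction="Summarize the paragraph in one sentence.",
    context="LIME-1B is a 1B-parameter decoder-only Transformer trained on FineWeb-Edu.",
    response="LIME-1B is a compact 1B-parameter decoder-only language model.",
))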
SFT Data Mixture (~97k examples total):
- projecte-aina/RAG_Multilingual
- databricks/databricks-dolly-15k
- HuggingFaceH4/no_robots
- CohereLabs/aya_dataset
- yahma/alpaca-cleaned
3. Training details
3.1 Hardware
- GPUs: 8 × NVIDIA A100 80GB (data parallel)
- Precision: bfloat16 with gradient clipping (max_norm = 1.0)
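The card states the hardware and precision setup but not the training loop itself. The fragment below is a minimal sketch of one data-parallel training step with bf16 autocast and gradient clipping, assuming an HF-style model that returns a .loss and pre-built model, batch, and optimizer objects.

# Minimal sketch of one DDP training step with bf16 autocast and grad clipping.
# `model`, `batch`, and `optimizer` are assumed to exist, and the distributed
# process group is assumed to be initialized; this is not the author's code.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model.cuda(), device_ids=[torch.cuda.current_device()])

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = ddp_model(**batch).loss        # cross-entropy on next-token prediction

loss.backward()
torch.nn.utils.clip_grad_norm_(ddp_model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)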
3.2 Pretraining
Objective: Cross-entropy loss on next-token prediction
Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters
Learning Rate Schedule:
- Peak LR: ~5e-4
- Polynomial decay to 5e-6
- Warmup: ~5% of total steps
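These settings map naturally onto PyTorch's AdamW plus transformers' polynomial-decay schedule with warmup. The sketch below shows one way to reproduce them; the weight-decay coefficient and total step count are placeholders, since the card does not report them.

# Sketch of the pretraining optimizer and LR schedule described above.
# The weight_decay value and total_steps are placeholders (not given in the card).
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Norm weights and biases are 1-D; they receive no weight decay.
    (no_decay if param.ndim == 1 or name.endswith("bias") else decay).append(param)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=5e-4, betas=(0.9, 0.95),
)

total_steps = 100_000  # placeholder
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),   # ~5% warmup
    num_training_steps=total_steps,
    lr_end=5e-6,                                # decay floor from the card
)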
3.3 Instruction fine-tuning (SFT)
Objective: Cross-entropy loss on next-token prediction
Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters
Learning Rate Schedule:
- Peak LR: 8e-5
- Polynomial decay to 1e-5
- Warmup: 10% of total steps
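The SFT phase uses the same objective and optimizer family with the smaller learning rates above. The sketch below shows the loss on a single formatted example, reusing format_example from section 2.2 and a model/tokenizer loaded as in the Usage section; the card does not say whether prompt tokens are masked out of the loss, so this sketch trains on the full sequence.

# Sketch of the SFT objective on one example in the unified schema.
# Assumes `model` and `tokenizer` are loaded as in the Usage section below and
# that the forward accepts `labels`; prompt masking is not documented, so the
# loss here covers the full sequence.
text = format_example(
    instruction="Explain what a decoder-only Transformer is.",
    response="It predicts each token from the tokens that come before it.",
)
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
out = model(**enc, labels=enc["input_ids"].clone())   # cross-entropy next-token loss
out.loss.backward()                                   # optimized with AdamW at peak LR 8e-5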
4. Usage
# Example usage
# pip install -U transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "anarlavrenov/LIME-1b"

# Load the tokenizer and the bf16 model (custom model code requires trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

def build_prompt(context, question):
    # Assemble the instruction schema: [context] <user> question <assistant>
    uid = "<user>"
    aid = "<assistant>"
    return context + uid + question + aid

context = ""  # optional retrieval context; leave empty for plain instructions
question = "Write five questions for a Data Scientist interview."
prompt = build_prompt(context, question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs['input_ids'].shape[1]

# Deterministic beam-search decoding with light repetition control
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    min_new_tokens=16,
    do_sample=False,
    top_p=None,
    temperature=None,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt
generated_tokens = outputs[0][input_length:]
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(output)
# 1. Can you tell us about your experience with data analysis and modeling?
# 2. How do you approach data cleaning and preprocessing?
# 3. How do you approach data visualization and storytelling?
# 4. Can you walk us through a time when you used data to solve a problem?
# 5. How do you approach the ethical considerations of data science and machine learning?
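For the RAG use case listed at the top of this card, the same build_prompt helper takes a retrieved passage as context. The snippet below reuses the model and tokenizer loaded above; the context string is illustrative and the output is not reproduced here.

# RAG-style prompt: pass the retrieved passage as `context`.
context = "LIME-1B is a 1B-parameter decoder-only Transformer trained on FineWeb-Edu."
question = "How many parameters does LIME-1B have?"

inputs = tokenizer(build_prompt(context, question), return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=4,
    no_repeat_ngram_size=3,
    repetition_penalty=1.15,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))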
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets (FineWeb-Edu, Dolly, No Robots, Aya, Alpaca, RAG_Multilingual, etc.) according to their respective licenses and documentation.
Anar Lavrenov
Feel free to reach out with questions or feedback about LIME-1B!
5. Citation
@misc{lime1b2025,
  title        = {LIME-1B: A 1B-parameter English Causal Language Model},
  author       = {Anar Lavrenov},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}