---
license: mit
language:
- en
- hi
- kn
- te
- ta
- mr
base_model:
- microsoft/Phi-mini-MoE-instruct
library_name: transformers
pipeline_tag: text-generation
tags:
- Conversational
- Indic Dataset
- Multilingual
- MoE
datasets:
- SandLogicTechnologies/Indic_Chat_Dataset
---

# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data

## Overview

**IndicPhi-mini** is a fine-tuned version of **Microsoft's Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources.

By leveraging efficient fine-tuning techniques such as **QLoRA-based quantization** and **LoRA adapters**, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent **3–4 percentage-point accuracy improvements** across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data.

---

## Key Contributions

- Curated one of the **largest Indic corpora** to date: 561M samples → cleaned into **29M high-quality samples** across **13 Indic languages**.
- Fine-tuned **Phi-mini-MoE** (7.6B parameters, 2.4B active) using **QLoRA (4-bit)** and **LoRA adapters**, making training feasible on a single **A100-80GB GPU**.
- Achieved **+3–4 percentage-point accuracy improvements** on major Indic benchmarks:
  - **ARC-Challenge-Indic** (reasoning tasks)
  - **MMLU-Indic** (knowledge & domain understanding)
- Improved **generalization across multiple Indic languages**, including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.

---

## Model Architecture

- **Base model:** Phi-mini-MoE-Instruct (Microsoft)
- **Parameters:** 7.6B total (2.4B active per token)
- **Layers:** 32 decoder-only transformer blocks
- **Attention:** Grouped Query Attention (GQA)
- **Experts per layer:** 16 (top-2 active per token)
- **Context length:** 4,096 tokens

---

## Usage

To load the fine-tuned model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "SandLogicTechnologies/IndicPhi-mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # requires bitsandbytes
)

# "What are the problems of online education in rural areas?"
prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Dataset Preparation

### Data Sources

- **Total collected:** 561M samples from **53 datasets** on Hugging Face.
- **Languages covered:** 13 Indic languages: Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, and Urdu.
- **Categories:** general text, translation, instruction, conversational.

### Processing Pipeline

1. **Manual filtering** – removed noisy, irrelevant, and malformed samples.
2. **Preprocessing** – deduplication, language identification, normalization, and minimum-length filtering.
3. **Format conversion** – standardized into the **UltraChat JSON schema** (multi-turn conversations); an illustrative record is shown below.
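For reference, the snippet below sketches what a single converted record can look like. It assumes the `messages`-style layout used by common UltraChat-format releases (a list of role/content turns stored alongside the originating prompt); the exact field names in the published dataset may differ.

```python
import json

# Illustrative record in an UltraChat-style multi-turn schema.
# Field names ("prompt", "messages", "role", "content") are assumptions based on
# common UltraChat-format releases, not a guaranteed match for the released dataset.
record = {
    "prompt": "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?",
    "messages": [
        {"role": "user", "content": "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"},
        {"role": "assistant", "content": "..."},  # placeholder assistant reply
    ],
}

print(json.dumps(record, ensure_ascii=False, indent=2))
```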
### Final Cleaned Dataset

- **Size:** 29M samples

### Dataset Distribution (Final Cleaned)

| Language  | Samples |
|-----------|---------|
| Hindi     | 4.63M   |
| Kannada   | 3.54M   |
| Telugu    | 3.72M   |
| Tamil     | 3.86M   |
| Marathi   | 3.79M   |
| Malayalam | 2.81M   |
| Gujarati  | 2.94M   |
| Bengali   | 1.82M   |
| Odia      | 438K    |
| Punjabi   | 1.21M   |
| Assamese  | 185K    |
| Sinhala   | 64K     |
| Urdu      | 58K     |

**Total curated dataset:** ~29 million high-quality samples

---

## Training Details

- **Hardware:** 1 × NVIDIA A100-80GB
- **Precision:** QLoRA (4-bit quantization)
- **Batching:** effective batch size 256 (per-device batch size 32 × 8 gradient-accumulation steps)
- **Steps:** 8,500
- **Optimizer:** AdamW (8-bit) with a cosine LR schedule and 1k warmup steps
- **LoRA configuration:**
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - r=128, α=128, dropout=0
- **Final training loss:** 0.48

A minimal configuration sketch corresponding to these settings appears in the appendix at the end of this card.

---

## Evaluation & Results

### Benchmarks

1. **ARC-Challenge-Indic** (reasoning)
2. **MMLU-Indic** (knowledge & domain understanding)

### Improvements

- **ARC-Challenge-Indic**
  - Accuracy: **21.03 → 24.46 (+3.43 points)**
  - Normalized accuracy: **24.69 → 28.86 (+4.17 points)**
- **MMLU-Indic**
  - Accuracy: **27.47 → 30.95 (+3.48 points)**

### Results

#### ARC-Challenge-Indic

| Language  | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|-----------|-------------------------|--------------------------|
| Hindi     | 22.61                   | 26.17                    |
| Kannada   | 20.96                   | 25.83                    |
| Tamil     | 20.78                   | 24.61                    |
| Telugu    | 20.70                   | 26.00                    |
| Bengali   | 21.91                   | 25.04                    |
| Gujarati  | 18.17                   | 21.30                    |
| Malayalam | 22.26                   | 23.91                    |
| Marathi   | 19.65                   | 25.22                    |
| Odia      | 22.26                   | 24.17                    |

Overall accuracy: **21.03 (Phi-mini-MoE) → 24.46 (IndicPhi-mini), +3.43 points**

#### MMLU-Indic

| Language  | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|-----------|-------------------------|--------------------------|
| Hindi     | 28.01                   | 31.45                    |
| Kannada   | 26.74                   | 30.12                    |
| Tamil     | 27.53                   | 30.84                    |
| Telugu    | 27.20                   | 31.02                    |
| Bengali   | 28.36                   | 31.44                    |
| Gujarati  | 25.91                   | 29.28                    |
| Malayalam | 26.65                   | 29.77                    |
| Marathi   | 27.12                   | 30.63                    |
| Odia      | 27.05                   | 30.45                    |
| Punjabi   | 26.42                   | 29.61                    |
| Assamese  | 25.98                   | 29.23                    |
| Sinhala   | 24.87                   | 27.66                    |
| Urdu      | 25.44                   | 28.71                    |

Overall accuracy: **27.47 (Phi-mini-MoE) → 30.95 (IndicPhi-mini), +3.48 points**

## Acknowledgments

**IndicPhi-mini** is based on the **Phi-mini-MoE-Instruct** model originally developed by **Microsoft** and was fine-tuned by the **SandLogic** development team.

Special thanks to:

- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model.
- The authors and organizations behind the **53 open-source datasets** that made this work possible. The complete list of dataset sources and citations is available [here](https://github.com/sandlogic/SandLogic-Lexicons/blob/main/Images/dataset_citation.md).

---

## Contact

For any inquiries or support, please contact us at support@sandlogic.com or visit our [website](https://www.sandlogic.com/).
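---

## Appendix: Fine-Tuning Configuration Sketch

The following is a minimal sketch of the setup described under Training Details, written against the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries. It reflects the hyperparameters listed on this card; details not stated there (the NF4 quantization type, bfloat16 compute, the 8-bit AdamW variant name, and the output path) are assumptions, and the dataset loading and trainer loop are omitted.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "microsoft/Phi-mini-MoE-instruct"

# 4-bit (QLoRA-style) loading; NF4 and bfloat16 compute are assumptions,
# as the card only states "4-bit quantization".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the modules listed under Training Details.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Effective batch size 256 = 32 per device × 8 gradient-accumulation steps.
training_args = TrainingArguments(
    output_dir="indicphi-mini-sft",   # illustrative path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    max_steps=8500,
    optim="adamw_bnb_8bit",           # 8-bit AdamW via bitsandbytes
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    logging_steps=50,                 # illustrative
    bf16=True,                        # assumption
)

# Dataset loading, chat-template tokenization, and the Trainer/SFTTrainer call
# are omitted here; see the dataset card for the conversational schema.
```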