---
license: mit
language:
- en
- hi
- kn
- te
- ta
- mr
base_model:
- microsoft/Phi-mini-MoE-instruct
library_name: transformers
pipeline_tag: text-generation
tags:
- Conversational
- Indic Dataset
- Multilingual
- MoE
datasets:
- SandLogicTechnologies/Indic_Chat_Dataset
---

# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data

## Overview

**IndicPhi-mini** is a fine-tuned version of **Microsoft's Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources.

By leveraging efficient fine-tuning techniques such as **QLoRA-based quantization** and **LoRA adapters**, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent **3–4 percentage-point accuracy improvements** across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data.

---

## Key Contributions

- Curated one of the **largest Indic corpora** to date: 561M samples → cleaned into **29M high-quality samples** across **13 Indic languages**.
- Fine-tuned **Phi-mini-MoE** (7.6B parameters, 2.4B active) using **QLoRA (4-bit)** and **LoRA adapters**, making training feasible on a single **A100-80GB GPU**.
- Achieved **+3–4 percentage-point accuracy improvements** on major Indic benchmarks:
  - **ARC-Challenge-Indic** (reasoning tasks)
  - **MMLU-Indic** (knowledge & domain understanding)
- Improved **generalization across multiple Indic languages**, including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.

---

## Model Architecture

- **Base model:** Phi-mini-MoE-Instruct (Microsoft)
- **Parameters:** 7.6B total (2.4B active per token)
- **Layers:** 32 decoder-only transformer blocks
- **Attention:** Grouped Query Attention (GQA)
- **Experts per layer:** 16 (top-2 active per token)
- **Context length:** 4,096 tokens

---

## Usage

To load the fine-tuned model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "SandLogicTechnologies/IndicPhi-mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # requires bitsandbytes
)

# "What are the problems of online education in rural areas?"
prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Dataset Preparation

### Data Sources

- **Total collected:** 561M samples from **53 datasets** on Hugging Face.
- **Languages covered:** 13 Indic languages: Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, and Urdu.
- **Categories:** general text, translation, instruction, conversational.

### Processing Pipeline

1. **Manual filtering** – removed noisy, irrelevant, and malformed samples.
2. **Preprocessing** – deduplication, language identification, normalization, and minimum-length filtering.
3. **Format conversion** – standardized into the **UltraChat JSON schema** (multi-turn conversations); an illustrative record is shown below.
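For reference, the snippet below sketches what a single converted record can look like. It assumes the `messages`-style layout used by common UltraChat-format releases (a list of role/content turns stored alongside the originating prompt); the exact field names in the published dataset may differ.

```python
import json

# Illustrative record in an UltraChat-style multi-turn schema.
# Field names ("prompt", "messages", "role", "content") are assumptions based on
# common UltraChat-format releases, not a guaranteed match for the released dataset.
record = {
    "prompt": "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?",
    "messages": [
        {"role": "user", "content": "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"},
        {"role": "assistant", "content": "..."},  # placeholder assistant reply
    ],
}

print(json.dumps(record, ensure_ascii=False, indent=2))
```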
### Final Cleaned Dataset

- **Size:** 29M samples

### Dataset Distribution (Final Cleaned)

| Language  | Samples |
|-----------|---------|
| Hindi     | 4.63M   |
| Kannada   | 3.54M   |
| Telugu    | 3.72M   |
| Tamil     | 3.86M   |
| Marathi   | 3.79M   |
| Malayalam | 2.81M   |
| Gujarati  | 2.94M   |
| Bengali   | 1.82M   |
| Odia      | 438K    |
| Punjabi   | 1.21M   |
| Assamese  | 185K    |
| Sinhala   | 64K     |
| Urdu      | 58K     |

**Total curated dataset:** ~29 million high-quality samples

---

## Training Details

- **Hardware:** 1 × NVIDIA A100-80GB
- **Precision:** QLoRA (4-bit quantization)
- **Batching:** effective batch size 256 (per-device batch size 32 × 8 gradient-accumulation steps)
- **Steps:** 8,500
- **Optimizer:** AdamW (8-bit) with a cosine LR schedule and 1k warmup steps
- **LoRA configuration:**
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - r=128, α=128, dropout=0
- **Final training loss:** 0.48

A minimal configuration sketch corresponding to these settings appears in the appendix at the end of this card.

---

## Evaluation & Results

### Benchmarks

1. **ARC-Challenge-Indic** (reasoning)
2. **MMLU-Indic** (knowledge & domain understanding)

### Improvements

- **ARC-Challenge-Indic**
  - Accuracy: **21.03 → 24.46 (+3.43 points)**
  - Normalized accuracy: **24.69 → 28.86 (+4.17 points)**
- **MMLU-Indic**
  - Accuracy: **27.47 → 30.95 (+3.48 points)**

### Results

#### ARC-Challenge-Indic

| Language  | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|-----------|-------------------------|--------------------------|
| Hindi     | 22.61                   | 26.17                    |
| Kannada   | 20.96                   | 25.83                    |
| Tamil     | 20.78                   | 24.61                    |
| Telugu    | 20.70                   | 26.00                    |
| Bengali   | 21.91                   | 25.04                    |
| Gujarati  | 18.17                   | 21.30                    |
| Malayalam | 22.26                   | 23.91                    |
| Marathi   | 19.65                   | 25.22                    |
| Odia      | 22.26                   | 24.17                    |

Overall accuracy: **21.03 (Phi-mini-MoE) → 24.46 (IndicPhi-mini), +3.43 points**

#### MMLU-Indic

| Language  | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|-----------|-------------------------|--------------------------|
| Hindi     | 28.01                   | 31.45                    |
| Kannada   | 26.74                   | 30.12                    |
| Tamil     | 27.53                   | 30.84                    |
| Telugu    | 27.20                   | 31.02                    |
| Bengali   | 28.36                   | 31.44                    |
| Gujarati  | 25.91                   | 29.28                    |
| Malayalam | 26.65                   | 29.77                    |
| Marathi   | 27.12                   | 30.63                    |
| Odia      | 27.05                   | 30.45                    |
| Punjabi   | 26.42                   | 29.61                    |
| Assamese  | 25.98                   | 29.23                    |
| Sinhala   | 24.87                   | 27.66                    |
| Urdu      | 25.44                   | 28.71                    |

Overall accuracy: **27.47 (Phi-mini-MoE) → 30.95 (IndicPhi-mini), +3.48 points**

## Acknowledgments

**IndicPhi-mini** is based on the **Phi-mini-MoE-Instruct** model originally developed by **Microsoft** and was fine-tuned by the **SandLogic** development team.

Special thanks to:

- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model.
- The authors and organizations behind the **53 open-source datasets** that made this work possible. The complete list of dataset sources and citations is available [here](https://github.com/sandlogic/SandLogic-Lexicons/blob/main/Images/dataset_citation.md).

---

## Contact

For any inquiries or support, please contact us at support@sandlogic.com or visit our [website](https://www.sandlogic.com/).
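---

## Appendix: Fine-Tuning Configuration Sketch

The following is a minimal sketch of the setup described under Training Details, written against the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries. It reflects the hyperparameters listed on this card; details not stated there (the NF4 quantization type, bfloat16 compute, the 8-bit AdamW variant name, and the output path) are assumptions, and the dataset loading and trainer loop are omitted.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "microsoft/Phi-mini-MoE-instruct"

# 4-bit (QLoRA-style) loading; NF4 and bfloat16 compute are assumptions,
# as the card only states "4-bit quantization".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the modules listed under Training Details.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Effective batch size 256 = 32 per device × 8 gradient-accumulation steps.
training_args = TrainingArguments(
    output_dir="indicphi-mini-sft",   # illustrative path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    max_steps=8500,
    optim="adamw_bnb_8bit",           # 8-bit AdamW via bitsandbytes
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    logging_steps=50,                 # illustrative
    bf16=True,                        # assumption
)

# Dataset loading, chat-template tokenization, and the Trainer/SFTTrainer call
# are omitted here; see the dataset card for the conversational schema.
```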