---
library_name: transformers
tags:
- gpt2
- assamese
- language-model
- text-generation
- low-resource
- educational
- research
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: Assamese GPT-2
  results: []
---

# Assamese GPT-2 Model

This is a GPT-2 language model trained from scratch on Assamese monolingual text, using data from **IndicCorpV2**. The model is developed for **educational and research purposes** to support natural language understanding and generation tasks in Assamese, a low-resource language.

## 📖 Model Description

The Assamese GPT-2 model is based on the standard GPT-2 decoder-only transformer architecture, with 12 layers, 12 attention heads, and a hidden size of 768. It generates grammatically coherent and contextually relevant Assamese text and serves as a foundation for downstream NLP tasks such as:

- Language modeling
- Text completion/generation
- Fine-tuning for classification or summarization (an illustrative fine-tuning sketch is given near the end of this card)

## ✅ Intended Uses

- Academic research on Assamese NLP
- Training and benchmarking in educational settings
- Exploration of low-resource language modeling

## 🚫 Limitations

- Trained on general-domain monolingual data; it may not perform well on domain-specific texts (e.g., legal, medical).
- May generate biased, incomplete, or hallucinated outputs.
- Not suitable for production use or deployment in sensitive applications.

## 📚 Training and Evaluation Data

The model was trained on Assamese monolingual data collected from:

- **IndicCorpV2**: a curated collection of web-crawled and processed data for Indic languages.

Data preprocessing included:

- Unicode normalization
- Removal of noisy characters and malformed tokens
- Sentence segmentation using Assamese-specific heuristics

A minimal illustrative sketch of such a preprocessing pipeline is given near the end of this card.

## 🧪 Training Procedure

### Hyperparameters

- Architecture: GPT-2 (12 layers, 12 heads, 768 hidden size)
- Tokenizer vocabulary size: 50,000
- Context window: 1,024 tokens
- Learning rate: 5e-5
- Epochs: 20
- Batch size: 64
- Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8)
- Scheduler: Linear
- Mixed precision: Native AMP
- Seed: 42

### Results

- Final evaluation loss: -29.1890
- Accuracy: 0.3452

## 🚀 Example Usage

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("BharatVLM/AssameseGPT2")
tokenizer = GPT2Tokenizer.from_pretrained("BharatVLM/AssameseGPT2")

prompt = "অসমৰ ইতিহাস"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 📄 License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Commercial use is not permitted; use is allowed for academic and research purposes only.

## 📬 Citation

Please cite this model as:

```bibtex
@misc{assamesegpt2,
  author       = {BharatVLM},
  title        = {Assamese GPT-2 Model},
  year         = 2025,
  howpublished = {\url{https://huggingface.co/BharatVLM/AssameseGPT2}},
  note         = {Trained using IndicCorpV2 and OSCAR corpora}
}
```

## 🧰 Framework Versions

- Transformers: 4.52.0.dev0
- PyTorch: 2.5.1+cu121
- Datasets: 3.6.0
- Tokenizers: 0.21.1

## Contact Us

For questions or academic collaboration, please contact ai.bharatvlm@gmail.com.
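
## 🧹 Preprocessing Sketch (Illustrative)

The preprocessing steps listed above (Unicode normalization, noise removal, sentence segmentation) are described only at a high level; the snippet below is a minimal sketch of what such a pipeline might look like. The regular expression and the danda-based splitting heuristic are assumptions for illustration, not the exact rules used during training.

```python
import re
import unicodedata

def preprocess(text: str) -> list[str]:
    # Unicode normalization (NFC keeps Assamese characters composed consistently).
    text = unicodedata.normalize("NFC", text)
    # Keep the Bengali/Assamese block, digits, whitespace, and basic punctuation;
    # everything else is treated as "noisy characters" (an assumed rule).
    text = re.sub(r"[^\u0980-\u09FF0-9\s.,!?।]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Split on the danda (।) and Western sentence enders -- a simple stand-in
    # for the Assamese-specific heuristics mentioned above.
    sentences = re.split(r"(?<=[।.!?])\s+", text)
    return [s for s in sentences if s]

print(preprocess("অসম ভাৰতৰ এখন ৰাজ্য। ইয়াৰ ৰাজধানী দিছপুৰ।"))
```

## 🔧 Fine-tuning Sketch (Illustrative)

As referenced in the downstream-task list above, the snippet below is a minimal sketch of fine-tuning this checkpoint for Assamese sequence classification with the standard Hugging Face `Trainer` API. The dataset name (`your_assamese_dataset`), the label count, and the training arguments are placeholders/assumptions, not part of the released model.

```python
from datasets import load_dataset
from transformers import (
    GPT2ForSequenceClassification,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

# Load the pretrained Assamese GPT-2 backbone with a fresh classification head.
# num_labels=2 assumes a binary task (e.g., sentiment).
model = GPT2ForSequenceClassification.from_pretrained(
    "BharatVLM/AssameseGPT2", num_labels=2
)
tokenizer = GPT2Tokenizer.from_pretrained("BharatVLM/AssameseGPT2")

# GPT-2 has no pad token by default; reuse EOS so batched inputs can be padded.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# "your_assamese_dataset" is a placeholder; any dataset with "text" and "label"
# columns works here.
dataset = load_dataset("your_assamese_dataset")

def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=256
    )

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="assamese-gpt2-classifier",
    learning_rate=5e-5,              # matches the pretraining LR; tune as needed
    per_device_train_batch_size=8,
    num_train_epochs=3,
    fp16=True,                       # mixed precision, as in pretraining
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation"),
)
trainer.train()
```

Hyperparameters here are illustrative only; for generation-style tasks such as summarization, the causal LM head (`GPT2LMHeadModel`) would typically be kept instead of the classification head.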