---
language: en
license: apache-2.0
tags:
- i3-architecture
- hybrid-model
- rwkv-mamba
- custom_code
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---

# i3-80M - Hybrid Architecture Language Model

## Model Description

The **i3-80M Model** is a novel hybrid-architecture language model that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The architecture blends RWKV-style time-mixing with Mamba state-space dynamics in the early layers, followed by standard multi-head attention in the deeper layers.

This is the second model in the i3 series, scaling up from the original [i3-22M](https://huggingface.co/FlameF0X/i3-22m) with an improved architecture and multi-dataset training.

> [!NOTE]
> You can try the model [here](https://huggingface.co/spaces/FlameF0X/i3-80m).
>
> A Romanian translation of this card is available [here](https://huggingface.co/FlameF0X/i3-80m/blob/main/CITE%C8%98TEM%C4%82.md).

## Model Statistics

- **Total Parameters**: ~82.77M (82,765,160)
- **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full attention layers = 16 layers total
- **Vocabulary Size**: 35,560 tokens (variable-length chunks plus a dedicated unknown-token entry)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)

### Architecture Breakdown

```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```

## Comparison with i3-22M

| Feature | i3-22M | i3-80M (This Model) |
|---------|--------|---------------------|
| **Parameters** | 22.6M | 82.77M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| **Training Data Size** | ~1M conversations | 3M+ tokens |
| **Final Loss** | ~2.0 | ~2.0 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 |
| **Training Time** | ~17 hours | ~2-4 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers |

### Key Improvements Over i3-22M

1. **Hybrid Architecture**: Introduces full multi-head attention in the upper layers for better long-range dependencies
2. **Larger Vocabulary**: ~8x larger vocabulary (35,560 vs. 4,466) for better token coverage
3. **Multi-Dataset Training**: Trained on 3 diverse datasets instead of a single dataset
4. **Better Generalization**: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences)
5. **Enhanced Unknown Token Handling**: Robust unknown-token handling for out-of-vocabulary words

### When to Use Each Model

**Use i3-22M if you need:**

- Smaller model size (~22M params)
- Pure conversational focus (TinyChat specialized)
- Lower memory footprint
- Faster inference

**Use i3-80M if you need:**

- Better general-purpose text generation
- Stronger attention-based reasoning (6 attention layers)
- Larger vocabulary coverage
- Multi-domain text understanding (stories, chat, formal text)

### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
   - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition
2. **Memory-Optimized Training**:
   - Streaming vocabulary building (no full text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup
3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
   - TinyStories: narrative and storytelling
   - TinyChat: conversational dynamics
   - High-Quality English Sentences: linguistic diversity
4. **Smart Tokenization**: Variable-length chunking (2-3 characters) with common-trigram optimization (see the sketch after this list)
   - Total tokens processed: **3,000,000+**
   - Handles unknown tokens gracefully via a dedicated unknown-token entry
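The tokenizer implementation is not reproduced in this card, so the following is only a minimal sketch of how greedy variable-length (2-3 character) chunking with a trigram preference and an unknown-token fallback could work. `chunk_tokenize`, `vocab`, and the `"<unk>"` literal are hypothetical names for illustration; the released vocabulary lives in `chunk_vocab_combined.json`.

```python
from typing import Dict, List

def chunk_tokenize(text: str, vocab: Dict[str, int], unk_token: str = "<unk>") -> List[int]:
    """Greedily split text into known 3-character chunks, then 2-character
    chunks, falling back to the unknown-token id for anything else."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        tri, bi = text[i:i + 3], text[i:i + 2]
        if len(tri) == 3 and tri in vocab:      # prefer common trigrams
            ids.append(vocab[tri])
            i += 3
        elif len(bi) == 2 and bi in vocab:      # otherwise fall back to a 2-character chunk
            ids.append(vocab[bi])
            i += 2
        else:                                   # out-of-vocabulary: emit the unknown-token id
            ids.append(vocab[unk_token])
            i += 1
    return ids

# Toy example with a made-up vocabulary
toy_vocab = {"<unk>": 0, "he": 1, "llo": 2}
print(chunk_tokenize("hello!", toy_vocab))  # -> [1, 2, 0]
```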
## Training Details

### Training Configuration

- **Datasets**:
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4 (with warmup and cosine decay)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA P100 (16GB VRAM)
- **Training Time**: ~2-4 hours
- **Framework**: PyTorch

### Training Dynamics

- **GPU Utilization**: Stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2GB / 12GB)
- **Power Usage**: ~40W average
- **Throughput**: ~100-550 tokens/sec

### Performance Metrics

| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | ~1.7 |
| Perplexity | ~4,000 | ~6 |

![image](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/ugtJGyEkQfbGieURP2W78.png)

> [!NOTE]
> I don't know why the logging starts at step 4.6k.

Comparison of **i3-22m** and **i3-80m**:

![image](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/utj6B7AE_gMMI9jnHc37Z.png)

The model shows strong convergence with stable training dynamics and efficient GPU utilization.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (the repository ships custom architecture code, see the custom_code tag)
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # enable sampling so temperature and top_k take effect
    temperature=0.8,
    top_k=40
)

generated_text = tokenizer.decode(outputs[0])
print(generated_text)
```

## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics
   - Linear complexity for long sequences
   - Efficient recurrent processing
   - State-space modeling for temporal dependencies
2. **Hierarchical Processing**:
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)
3. **Memory Efficiency** (see the sketch after this list):
   - Streaming tokenization during vocabulary building
   - No full-dataset storage in RAM
   - Automatic cleanup of intermediate data
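The streaming, memory-efficient vocabulary construction described above is not included in this repository, so here is a minimal sketch of the general idea, assuming the Hugging Face `datasets` streaming API and a `text` column in each dataset. `build_chunk_vocab` and its parameters are hypothetical names; the released vocabulary is the one shipped in `chunk_vocab_combined.json`.

```python
from collections import Counter
from datasets import load_dataset

def build_chunk_vocab(dataset_names, text_field="text", vocab_size=35_560, unk_token="<unk>"):
    """Stream each dataset once, count 2- and 3-character chunks, and keep only
    the most frequent ones -- the full corpus is never held in RAM."""
    counts = Counter()
    for name in dataset_names:
        stream = load_dataset(name, split="train", streaming=True)  # examples arrive one at a time
        for example in stream:
            text = example[text_field]
            for size in (3, 2):  # count non-overlapping 3- and 2-character chunks
                counts.update(text[i:i + size] for i in range(0, len(text) - size + 1, size))
    # Keep the most frequent chunks; reserve one slot for the unknown token.
    chunks = [unk_token] + [c for c, _ in counts.most_common(vocab_size - 1)]
    return {chunk: idx for idx, chunk in enumerate(chunks)}

# Example call streaming the three training datasets (the `text` field name is an assumption):
# vocab = build_chunk_vocab([
#     "roneneldan/TinyStories",
#     "starhopp3r/TinyChat",
#     "agentlans/high-quality-english-sentences",
# ])
```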
## Model Files

- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `chunk_vocab_combined.json`: Tokenizer vocabulary

## Training Tracking

This model was tracked using Weights & Biases (WandB) with comprehensive metrics:

- Real-time loss and perplexity tracking
- Gradient norm monitoring
- Learning rate scheduling visualization
- Generation samples logged to tables
- Model checkpoints as artifacts
- System resource monitoring

## Limitations

- Trained on English text only
- Limited to a 256-token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by the TinyChat dataset

## Model Series

- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with a pure hybrid architecture
- **i3-80M** (this model) - Scaled-up version with attention layers and multi-dataset training

## Citation

```bibtex
@misc{i3-80m,
  author = {FlameF0X},
  title = {i3-80M: Hybrid Architecture Language Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```