---
title: Transformer Edge Optimization
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
tags:
  - quantization
  - optimization
  - edge-ai
  - mobile
  - transformers
  - onnx
  - sentiment-analysis
duplicated_from: null
---

# 🚀 Transformer Edge Optimization Demo
[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/mtkaya/transformer-edge-optimization)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/mtkaya/transformer-edge-optimization/blob/main/LICENSE)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)

**Interactive demo comparing Original vs. Quantized transformer models**

[Try Demo](#) • [GitHub Repo](https://github.com/mtkaya/transformer-edge-optimization) • [Notebooks](https://github.com/mtkaya/transformer-edge-optimization/tree/main/notebooks)
---

## 🎯 What Does This Demo Do?

This interactive demo showcases **model quantization**, a technique that makes AI models smaller and faster for mobile and edge devices.

### Try It:

1. **Quick Prediction** - Test sentiment analysis with the quantized model
2. **Model Comparison** - Compare the Original (FP32) and Quantized (INT8) models side by side
3. **Documentation** - Learn about the techniques

---

## ✨ Key Results

| Metric | Original | Quantized | Improvement |
|--------|----------|-----------|-------------|
| **Size** | 255 MB | 68 MB | **3.75x smaller** ⬇️ |
| **Speed** | 12.3 ms | 5.8 ms | **2.1x faster** ⚡ |
| **Accuracy** | 91.8% | 90.2% | **-1.6%** 📊 |

**Conclusion:** A nearly **4x smaller** model with **2x faster** inference and only a **1.6% accuracy loss**!

---

## 🧪 What is Quantization?

**Quantization** reduces model size by converting weights from 32-bit floating point (FP32) to 8-bit integers (INT8).

### How It Works:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load the FP32 model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Quantize: FP32 → INT8 (applies to all Linear layer weights)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Now ~4x smaller! 🎉
```

### Why Quantization?

- ✅ **Smaller models** - Fit on mobile devices
- ✅ **Faster inference** - Better user experience
- ✅ **Lower power** - Longer battery life
- ✅ **Easy to implement** - Post-training, no retraining needed

---

## 📊 Optimization Techniques

This project demonstrates **3 major techniques**:

### 1. **Quantization** (This Demo)
- **Compression:** 4x
- **Speed:** 2-3x faster
- **Difficulty:** ⭐ Easy

### 2. **ONNX Runtime**
- **Compression:** 3.8x
- **Speed:** 2.2x faster
- **Difficulty:** ⭐⭐ Medium
- **Benefit:** Cross-platform deployment

### 3. **Knowledge Distillation**
- **Compression:** 6-10x
- **Speed:** 3x faster
- **Difficulty:** ⭐⭐⭐ Advanced
- **Benefit:** Student model learns from the teacher

---

## 🚀 Try The Full Toolkit

### Interactive Notebooks (Google Colab):

#### 1. Quantization Basics (15 minutes)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)

**Learn:**
- Dynamic quantization
- Static quantization
- Model size comparison
- Performance benchmarking

---

#### 2. ONNX Runtime Optimization (20 minutes)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/02_huggingface_optimum.ipynb)

**Learn:**
- PyTorch → ONNX conversion
- Hugging Face Optimum
- Cross-platform deployment
- Hardware acceleration
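As a preview of what this notebook covers, here is a minimal sketch of exporting the same DistilBERT checkpoint to ONNX with Hugging Face Optimum and running it through a standard pipeline. It assumes `optimum[onnxruntime]` is installed and is not the notebook's exact code; export options and file layout may differ.

```python
# pip install optimum[onnxruntime]
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Export the PyTorch checkpoint to ONNX and load it with ONNX Runtime
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the exported model for later deployment (e.g. on an edge device)
ort_model.save_pretrained("onnx_model/")
tokenizer.save_pretrained("onnx_model/")

# Use it exactly like a regular Transformers model
classifier = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)
print(classifier("Great app!"))
```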
---

#### 3. Knowledge Distillation (30 minutes)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/05_distilbert_training.ipynb)

**Learn:**
- Teacher-student training
- Distillation loss
- Creating tiny models
- BERT → TinyBERT

---

## 💻 Use Cases

### 📱 Mobile Apps

```kotlin
// Android with TFLite
val analyzer = SentimentAnalyzer(context)
val result = analyzer.predict("Great app!")
```

### 🌐 Web Apps

```javascript
// Browser with Transformers.js
import { pipeline } from '@xenova/transformers';
const classifier = await pipeline('sentiment-analysis');
```

### 🤖 Edge Devices

```python
# Raspberry Pi with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
```

---

## 📚 Full Documentation

### GitHub Repository

**[mtkaya/transformer-edge-optimization](https://github.com/mtkaya/transformer-edge-optimization)**

Contains:
- ✅ 3 Jupyter notebooks
- ✅ Example code (Python, Kotlin, JavaScript)
- ✅ Comprehensive documentation
- ✅ CI/CD pipeline
- ✅ Docker support

### Quick Links:
- [Installation Guide](https://github.com/mtkaya/transformer-edge-optimization#-installation)
- [Usage Examples](https://github.com/mtkaya/transformer-edge-optimization#-examples)
- [API Reference](https://github.com/mtkaya/transformer-edge-optimization#-api-reference)
- [Contributing](https://github.com/mtkaya/transformer-edge-optimization/blob/main/CONTRIBUTING.md)

---

## 🎓 Technical Details

### Model Used:
**DistilBERT** fine-tuned on SST-2 (Stanford Sentiment Treebank)
- Base Model: `distilbert-base-uncased-finetuned-sst-2-english`
- Parameters: 67M
- Task: Binary sentiment classification (Positive/Negative)

### Quantization Approach:
**Dynamic Quantization** with PyTorch
- Weights: INT8 (8-bit integers)
- Activations: FP32, quantized on the fly at runtime
- Method: `torch.quantization.quantize_dynamic()`

### Benchmark Hardware:
- **CPU:** Intel Xeon (Colab)
- **Input:** 128 tokens on average
- **Iterations:** 100 runs per test

---

## 📊 Detailed Benchmark

### Model Size:
```
Original (FP32):    255.43 MB
Quantized (INT8):    68.12 MB
Compression Ratio:   3.75x
Space Saved:         187.31 MB (73.3%)
```

### Inference Speed (CPU):
```
Original:    12.34 ± 0.45 ms
Quantized:    5.78 ± 0.23 ms
Speedup:      2.13x
Time Saved:   6.56 ms per inference (53.2%)
```

### Accuracy (SST-2 Test Set):
```
Original:    91.8% accuracy
Quantized:   90.2% accuracy
Difference:  -1.6%
```

### Memory Usage:
```
Original:   ~280 MB
Quantized:   ~95 MB
Reduction:   2.95x
```
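For reference, the sketch below shows one way latency numbers like those above can be reproduced on CPU. It is a minimal example rather than the exact benchmark script behind this demo: the models follow the quantization snippet shown earlier, and the fixed 128-token input and 100 timed iterations mirror the settings listed under Benchmark Hardware.

```python
import statistics
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# Dynamic quantization, as in the snippet above
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Fixed-length 128-token input, matching the benchmark setup
inputs = tokenizer(
    "This movie was absolutely fantastic!",
    padding="max_length", max_length=128, truncation=True, return_tensors="pt",
)

def benchmark(m, runs=100, warmup=10):
    """Return mean and std latency in milliseconds over `runs` timed iterations."""
    with torch.no_grad():
        for _ in range(warmup):  # warm-up runs are not timed
            m(**inputs)
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            m(**inputs)
            times.append((time.perf_counter() - start) * 1000)
    return statistics.mean(times), statistics.stdev(times)

for name, m in [("Original (FP32)", model), ("Quantized (INT8)", quantized)]:
    mean_ms, std_ms = benchmark(m)
    print(f"{name}: {mean_ms:.2f} ± {std_ms:.2f} ms")
```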
--- ## ๐Ÿ™ Acknowledgments Built with: - [Hugging Face Transformers](https://github.com/huggingface/transformers) - [PyTorch](https://pytorch.org/) - [Gradio](https://gradio.app/) Inspired by: - [DistilBERT paper](https://arxiv.org/abs/1910.01108) (Sanh et al., 2019) - [Q8BERT](https://arxiv.org/abs/1910.06188) (Zafrir et al., 2021) --- ## ๐Ÿ“ง Contact - **GitHub:** [@mtkaya](https://github.com/mtkaya) - **Issues:** [Report here](https://github.com/mtkaya/transformer-edge-optimization/issues) ---
**โญ Star the repo if you find this useful! โญ** [GitHub Repository](https://github.com/mtkaya/transformer-edge-optimization) โ€ข [Documentation](https://github.com/mtkaya/transformer-edge-optimization#readme) โ€ข [Notebooks](https://github.com/mtkaya/transformer-edge-optimization/tree/main/notebooks) **Made with โค๏ธ for the AI community**