---
title: Transformer Edge Optimization
emoji: ๐
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
tags:
- quantization
- optimization
- edge-ai
- mobile
- transformers
- onnx
- sentiment-analysis
duplicated_from: null
---
# ๐ Transformer Edge Optimization Demo
[](https://github.com/mtkaya/transformer-edge-optimization)
[](https://github.com/mtkaya/transformer-edge-optimization/blob/main/LICENSE)
[](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)
**Interactive demo comparing Original vs Quantized transformer models**
[Try Demo](#) โข [GitHub Repo](https://github.com/mtkaya/transformer-edge-optimization) โข [Notebooks](https://github.com/mtkaya/transformer-edge-optimization/tree/main/notebooks)
---
## ๐ฏ What Does This Demo Do?
This interactive demo showcases **model quantization** - a technique to make AI models smaller and faster for mobile/edge devices.
### Try It:
1. **Quick Prediction** - Test sentiment analysis with quantized model
2. **Model Comparison** - Compare Original (FP32) vs Quantized (INT8) side by side
3. **Documentation** - Learn about the techniques
---
## โจ Key Results
| Metric | Original | Quantized | Improvement |
|--------|----------|-----------|-------------|
| **Size** | 255 MB | 68 MB | **3.75x smaller** โฌ๏ธ |
| **Speed** | 12.3 ms | 5.8 ms | **2.1x faster** โก |
| **Accuracy** | 91.8% | 90.2% | **-1.6%** ๐ |
**Conclusion:** Nearly **4x smaller** model with **2x faster** inference and only **1.6% accuracy loss**!
---
## ๐งช What is Quantization?
**Quantization** reduces model size by converting weights from 32-bit floating point (FP32) to 8-bit integers (INT8).
### How It Works:
```python
import torch
from transformers import AutoModelForSequenceClassification
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
# Quantize: FP32 โ INT8
quantized = torch.quantization.quantize_dynamic(
model, {torch.nn.Linear}, dtype=torch.qint8
)
# Now 4x smaller! ๐
```
### Why Quantization?
- โ
**Smaller models** - Fit on mobile devices
- โ
**Faster inference** - Better user experience
- โ
**Lower power** - Longer battery life
- โ
**Easy to implement** - Post-training, no retraining
---
## ๐ Optimization Techniques
This project demonstrates **3 major techniques**:
### 1. **Quantization** (This Demo)
- **Compression:** 4x
- **Speed:** 2-3x faster
- **Difficulty:** โญ Easy
### 2. **ONNX Runtime**
- **Compression:** 3.8x
- **Speed:** 2.2x faster
- **Difficulty:** โญโญ Medium
- **Benefit:** Cross-platform deployment
### 3. **Knowledge Distillation**
- **Compression:** 6-10x
- **Speed:** 3x faster
- **Difficulty:** โญโญโญ Advanced
- **Benefit:** Student model learns from teacher
---
## ๐ Try The Full Toolkit
### Interactive Notebooks (Google Colab):
#### 1. Quantization Basics (15 minutes)
[](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)
**Learn:**
- Dynamic quantization
- Static quantization
- Model size comparison
- Performance benchmarking
---
#### 2. ONNX Runtime Optimization (20 minutes)
[](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/02_huggingface_optimum.ipynb)
**Learn:**
- PyTorch โ ONNX conversion
- Hugging Face Optimum
- Cross-platform deployment
- Hardware acceleration
---
#### 3. Knowledge Distillation (30 minutes)
[](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/05_distilbert_training.ipynb)
**Learn:**
- Teacher-student training
- Distillation loss
- Creating tiny models
- BERT โ TinyBERT
---
## ๐ป Use Cases
### ๐ฑ Mobile Apps
```kotlin
// Android with TFLite
val analyzer = SentimentAnalyzer(context)
val result = analyzer.predict("Great app!")
```
### ๐ Web Apps
```javascript
// Browser with Transformers.js
import { pipeline } from '@xenova/transformers';
const classifier = await pipeline('sentiment-analysis');
```
### ๐ค Edge Devices
```python
# Raspberry Pi with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
```
---
## ๐ Full Documentation
### GitHub Repository
**[mtkaya/transformer-edge-optimization](https://github.com/mtkaya/transformer-edge-optimization)**
Contains:
- โ
3 Jupyter notebooks
- โ
Example code (Python, Kotlin, JavaScript)
- โ
Comprehensive documentation
- โ
CI/CD pipeline
- โ
Docker support
### Quick Links:
- [Installation Guide](https://github.com/mtkaya/transformer-edge-optimization#-installation)
- [Usage Examples](https://github.com/mtkaya/transformer-edge-optimization#-examples)
- [API Reference](https://github.com/mtkaya/transformer-edge-optimization#-api-reference)
- [Contributing](https://github.com/mtkaya/transformer-edge-optimization/blob/main/CONTRIBUTING.md)
---
## ๐ Technical Details
### Model Used:
**DistilBERT** fine-tuned on SST-2 (Stanford Sentiment Treebank)
- Base Model: `distilbert-base-uncased-finetuned-sst-2-english`
- Parameters: 67M
- Task: Binary sentiment classification (Positive/Negative)
### Quantization Approach:
**Dynamic Quantization** with PyTorch
- Weights: INT8 (8-bit integers)
- Activations: FP32 (computed at runtime)
- Method: `torch.quantization.quantize_dynamic()`
### Benchmark Hardware:
- **CPU:** Intel Xeon (Colab)
- **Input:** 128 tokens average
- **Iterations:** 100 runs per test
---
## ๐ Detailed Benchmark
### Model Size:
```
Original (FP32): 255.43 MB
Quantized (INT8): 68.12 MB
Compression Ratio: 3.75x
Space Saved: 187.31 MB (73.3%)
```
### Inference Speed (CPU):
```
Original: 12.34 ยฑ 0.45 ms
Quantized: 5.78 ยฑ 0.23 ms
Speedup: 2.13x
Time Saved: 6.56 ms per inference (53.2%)
```
### Accuracy (SST-2 Test Set):
```
Original: 91.8% accuracy
Quantized: 90.2% accuracy
Difference: -1.6%
```
### Memory Usage:
```
Original: ~280 MB
Quantized: ~95 MB
Reduction: 2.95x
```
---
## ๐ Features of This Demo
### ๐ฏ Quick Prediction
- Enter any text
- Toggle between Original/Quantized
- See prediction + confidence + model info
### โ๏ธ Model Comparison
- Side-by-side comparison
- Same input, both models
- Performance metrics
### ๐ Documentation
- Learn about quantization
- See benchmark results
- Access notebooks
- Quick start code
---
## ๐ค Contributing
We welcome contributions! Check out:
- **GitHub Issues:** [Report bugs](https://github.com/mtkaya/transformer-edge-optimization/issues)
- **Discussions:** [Ask questions](https://github.com/mtkaya/transformer-edge-optimization/discussions)
- **Pull Requests:** [Contribute code](https://github.com/mtkaya/transformer-edge-optimization/pulls)
---
## ๐ License
This project is licensed under the **MIT License**.
See [LICENSE](https://github.com/mtkaya/transformer-edge-optimization/blob/main/LICENSE) for details.
---
## ๐ Acknowledgments
Built with:
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [PyTorch](https://pytorch.org/)
- [Gradio](https://gradio.app/)
Inspired by:
- [DistilBERT paper](https://arxiv.org/abs/1910.01108) (Sanh et al., 2019)
- [Q8BERT](https://arxiv.org/abs/1910.06188) (Zafrir et al., 2021)
---
## ๐ง Contact
- **GitHub:** [@mtkaya](https://github.com/mtkaya)
- **Issues:** [Report here](https://github.com/mtkaya/transformer-edge-optimization/issues)
---
**โญ Star the repo if you find this useful! โญ**
[GitHub Repository](https://github.com/mtkaya/transformer-edge-optimization) โข
[Documentation](https://github.com/mtkaya/transformer-edge-optimization#readme) โข
[Notebooks](https://github.com/mtkaya/transformer-edge-optimization/tree/main/notebooks)
**Made with โค๏ธ for the AI community**