---
title: Language Detection App
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
python_version: 3.9
app_file: app.py
license: mit
---
# 🌍 Language Detection App
A powerful, elegant language detection application built with a Gradio frontend and a modular backend featuring multiple state-of-the-art ML models, organized by architecture and training dataset.
## ✨ Features
- **Clean Gradio Interface**: Simple, intuitive web interface for language detection
- **Multiple Model Architectures**: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- **Multiple Training Datasets**: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- **Centralized Configuration**: All model configurations and settings in one place
- **Modular Backend**: Easy-to-extend architecture for plugging in your own ML models
- **Real-time Detection**: Instant language detection with confidence scores
- **Multiple Predictions**: Shows top 5 language predictions with confidence levels
- **100+ Languages**: Support for major world languages (varies by model)
- **Example Texts**: Pre-loaded examples in various languages for testing
- **Model Switching**: Seamlessly switch between different models
- **Extensible**: Abstract base class for implementing custom models
## 🚀 Quick Start
### 1. Setup Environment
```bash
# Create virtual environment
python -m venv venv
# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
### 2. Test the Backend
```bash
# Run tests to verify everything works
python test_app.py
# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```
### 3. Launch the App
```bash
# Start the Gradio app
python app.py
```
The app will be available at `http://localhost:7860`.
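For orientation, here is a minimal sketch of how the Gradio wiring in `app.py` could look. The `LanguageDetector` import path matches the project structure below, but the `detect` method name and the shape of its result are assumptions made for this illustration, not the verified API:

```python
# Minimal illustrative wiring; the `detect` method and its result
# shape ({"predictions": [(lang, score), ...]}) are assumptions.
import gradio as gr

from backend.language_detector import LanguageDetector

detector = LanguageDetector(model_key="model-a-dataset-a")


def detect_language(text: str) -> str:
    result = detector.detect(text)  # assumed API
    # Format the top predictions as "language: confidence" lines.
    return "\n".join(
        f"{lang}: {score:.2%}" for lang, score in result["predictions"][:5]
    )


demo = gr.Interface(
    fn=detect_language,
    inputs=gr.Textbox(lines=4, label="Text to analyze"),
    outputs=gr.Textbox(label="Top predictions"),
    title="Language Detection App",
)

if __name__ == "__main__":
    demo.launch(server_port=7860)
```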
## 🧩 Model Architecture
The system is organized around two dimensions:
### 🏗️ Model Architectures
- **Model A**: XLM-RoBERTa-based architectures - Excellent cross-lingual transfer capabilities
- **Model B**: BERT-based architectures - Efficient and fast processing
### 📊 Training Datasets
- **Dataset A**: Standard multilingual language detection dataset - Broad language coverage
- **Dataset B**: Enhanced/specialized language detection dataset - Ultra-high accuracy focus
### 🤖 Available Model Combinations
1. **Model A Dataset A** - XLM-RoBERTa + Standard Dataset ✅
- **Architecture**: XLM-RoBERTa (Model A)
- **Training**: Dataset A (standard multilingual)
- **Accuracy**: 97.9%
- **Size**: 278M parameters
- **Languages**: 100+ languages
- **Strengths**: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
- **Use Cases**: General-purpose language detection, multilingual content processing
2. **Model B Dataset A** - BERT + Standard Dataset ✅
- **Architecture**: BERT (Model B)
- **Training**: Dataset A (standard multilingual)
- **Accuracy**: 96.17%
- **Size**: 178M parameters
- **Languages**: 100+ languages
- **Strengths**: Fast inference, broad language support, efficient processing
- **Use Cases**: High-throughput detection, real-time applications, resource-constrained environments
3. **Model A Dataset B** - XLM-RoBERTa + Enhanced Dataset ✅
- **Architecture**: XLM-RoBERTa (Model A)
- **Training**: Dataset B (enhanced/specialized)
- **Accuracy**: 99.72%
- **Size**: 278M parameters
- **Training Loss**: 0.0176
- **Languages**: 20 carefully selected languages
- **Strengths**: Exceptional accuracy, focused language support, state-of-the-art results
- **Use Cases**: Research applications, high-precision detection, critical accuracy requirements
4. **Model B Dataset B** - BERT + Enhanced Dataset ✅
- **Architecture**: BERT (Model B)
- **Training**: Dataset B (enhanced/specialized)
- **Accuracy**: 99.85%
- **Size**: 178M parameters
- **Training Loss**: 0.0125
- **Languages**: 20 carefully selected languages
- **Strengths**: Highest accuracy, ultra-low training loss, precision-optimized
- **Use Cases**: Maximum precision applications, research requiring highest accuracy
### 🏗️ Core Components
- **`BaseLanguageModel`**: Abstract interface that all models must implement
- **`ModelRegistry`**: Manages model registration and creation with centralized configuration
- **`LanguageDetector`**: Main orchestrator for language detection
- **`model_config.py`**: Centralized configuration for all models and language mappings
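For reference, the abstract interface plausibly looks like the sketch below. The method names are taken from the example later in this README; the docstrings and exact signatures in `base_model.py` are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseLanguageModel(ABC):
    """Abstract interface every model combination implements."""

    @abstractmethod
    def predict(self, text: str) -> Dict[str, Any]:
        """Return predicted language(s) with confidence scores."""

    @abstractmethod
    def get_supported_languages(self) -> List[str]:
        """Return the language codes this model supports."""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Return model metadata (architecture, dataset, size, ...)."""
```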
### 🔧 Adding New Models
To add a new model combination, simply:
1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add configuration to `model_config.py`
5. Register it in `ModelRegistry`
Example:
```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config


class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model here

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement prediction logic
        ...

    def get_supported_languages(self) -> List[str]:
        # Return supported language codes
        ...

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from config
        ...
```
Then add configuration in `model_config.py` and register in `language_detector.py`.
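As a rough illustration of those last two steps (the actual `MODEL_CONFIGS` schema and `ModelRegistry` API in this repo may differ; the checkpoint name below is hypothetical):

```python
# backend/models/model_config.py -- illustrative entry only
MODEL_CONFIGS = {
    # ... existing entries ...
    "model-c-dataset-a": {
        "architecture": "Model C",
        "dataset": "Dataset A",
        "checkpoint": "your-org/your-model",  # hypothetical model ID
        "languages": ["en", "fr", "de", "es"],
    },
}

# backend/language_detector.py -- illustrative registration; the exact
# ModelRegistry.register signature is an assumption
from .models.model_c_dataset_a import ModelCDatasetA

ModelRegistry.register("model-c-dataset-a", ModelCDatasetA)
```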
## 🧪 Testing
The project includes comprehensive test suites:
- **`test_app.py`**: General app functionality tests
- **`test_model_a_dataset_a.py`**: Tests for XLM-RoBERTa + standard dataset
- **`test_model_b_dataset_b.py`**: Tests for BERT + enhanced dataset (highest accuracy)
- **Model comparison tests**: Automated testing across all model combinations
- **Model switching tests**: Verify seamless model switching
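A test for a new model combination can follow the same pattern. Here is a minimal sketch; the `detect` and `switch_model` method names are assumptions based on the features described above, not the verified API:

```python
# Illustrative tests in the spirit of test_app.py; the detector API
# used here (detect, switch_model) is assumed, not verified.
from backend.language_detector import LanguageDetector


def test_basic_detection():
    detector = LanguageDetector(model_key="model-b-dataset-b")
    result = detector.detect("Bonjour tout le monde")
    assert result["predictions"], "expected at least one prediction"


def test_model_switching():
    detector = LanguageDetector(model_key="model-a-dataset-a")
    detector.switch_model("model-b-dataset-a")  # assumed method name
    assert detector.detect("Hello, world!")["predictions"]
```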
## 🌍 Supported Languages
The models support different language sets based on their training:
- **Model A/B + Dataset A**: 100+ languages including major European, Asian, African, and other world languages based on the CC-100 dataset
- **Model A/B + Dataset B**: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)
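For reference, the Dataset B language set expressed as ISO 639-1 codes (a plausible mapping for illustration; verify against the actual entries in `model_config.py`):

```python
# Plausible ISO 639-1 codes for the 20 Dataset B languages; check
# backend/models/model_config.py before relying on them.
DATASET_B_LANGUAGES = {
    "ar": "Arabic",     "bg": "Bulgarian", "de": "German",     "el": "Greek",
    "en": "English",    "es": "Spanish",   "fr": "French",     "hi": "Hindi",
    "it": "Italian",    "ja": "Japanese",  "nl": "Dutch",      "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian",   "sw": "Swahili",    "th": "Thai",
    "tr": "Turkish",    "ur": "Urdu",      "vi": "Vietnamese", "zh": "Chinese",
}
```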
## 📊 Model Comparison
| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---------|-------------------|-------------------|-------------------|-------------------|
| **Architecture** | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| **Dataset** | Standard | Standard | Enhanced | Enhanced |
| **Accuracy** | 97.9% | 96.17% | 99.72% | **99.85%** 🏆 |
| **Model Size** | 278M | 178M | 278M | 178M |
| **Languages** | 100+ | 100+ | 20 (curated) | 20 (curated) |
| **Training Loss** | N/A | N/A | 0.0176 | **0.0125** |
| **Speed** | Moderate | **Fast** | Moderate | **Fast** |
| **Memory Usage** | Higher | **Lower** | Higher | **Lower** |
| **Best For** | Balanced performance | Speed & broad coverage | Ultra-high accuracy | **Maximum precision** |
### 🎯 Model Selection Guide
- **🏆 Model B Dataset B**: Choose for maximum accuracy on 20 core languages (99.85%)
- **🔬 Model A Dataset B**: Choose for ultra-high accuracy on 20 core languages (99.72%)
- **⚖️ Model A Dataset A**: Choose for balanced performance and comprehensive language coverage (97.9%)
- **⚡ Model B Dataset A**: Choose for fast inference and broad language coverage (96.17%)
## 🔧 Configuration
You can configure models using the centralized configuration system:
```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a") # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a") # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b") # Ultra-high accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b") # Maximum precision BERT
# All configurations are centralized in backend/models/model_config.py
```
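A quick usage example (again, `detect` and the result shape are assumptions made for this sketch; adapt to the actual `LanguageDetector` API):

```python
from backend.language_detector import LanguageDetector

detector = LanguageDetector(model_key="model-b-dataset-b")
result = detector.detect("Guten Morgen, wie geht es dir?")  # assumed method
for lang, score in result["predictions"][:5]:
    print(f"{lang}: {score:.2%}")
```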
## 📁 Project Structure
```
language-detection/
├── backend/
│   ├── models/
│   │   ├── model_config.py        # Centralized configuration
│   │   ├── base_model.py          # Abstract base class
│   │   ├── model_a_dataset_a.py   # XLM-RoBERTa + Standard
│   │   ├── model_b_dataset_a.py   # BERT + Standard
│   │   ├── model_a_dataset_b.py   # XLM-RoBERTa + Enhanced
│   │   ├── model_b_dataset_b.py   # BERT + Enhanced
│   │   └── __init__.py
│   └── language_detector.py       # Main orchestrator
├── tests/
├── app.py                         # Gradio interface
└── README.md
```
## 🤝 Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request
## 📄 License
This project is open source and available under the MIT License.
## 🙏 Acknowledgments
- **Hugging Face** for the transformers library and model hosting platform
- **Model providers** for the fine-tuned language detection models used in this project
- **Gradio** for the excellent web interface framework
- **Open source community** for the foundational technologies that make this project possible