---
title: Language Detection App
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
python_version: 3.9
app_file: app.py
license: mit
---

# 🌍 Language Detection App

A powerful and elegant language detection application built with a Gradio frontend and a modular backend featuring multiple state-of-the-art ML models, organized by architecture and training dataset.

## ✨ Features

- **Clean Gradio Interface**: Simple, intuitive web interface for language detection
- **Multiple Model Architectures**: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- **Multiple Training Datasets**: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- **Centralized Configuration**: All model configurations and settings in one place
- **Modular Backend**: Easy-to-extend architecture for plugging in your own ML models
- **Real-time Detection**: Instant language detection with confidence scores
- **Multiple Predictions**: Shows top 5 language predictions with confidence levels
- **100+ Languages**: Support for major world languages (varies by model)
- **Example Texts**: Pre-loaded examples in various languages for testing
- **Model Switching**: Seamlessly switch between different models
- **Extensible**: Abstract base class for implementing custom models

## πŸš€ Quick Start

### 1. Setup Environment

```bash
# Create virtual environment
python -m venv venv

# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Test the Backend

```bash
# Run tests to verify everything works
python test_app.py

# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```

### 3. Launch the App

```bash
# Start the Gradio app
python app.py
```

The app will be available at `http://localhost:7860`.

## 🧩 Model Architecture

The system is organized around two dimensions:

### πŸ—οΈ Model Architectures
- **Model A**: XLM-RoBERTa-based architecture - Excellent cross-lingual transfer capabilities
- **Model B**: BERT-based architecture - Efficient and fast processing

### πŸ“Š Training Datasets  
- **Dataset A**: Standard multilingual language detection dataset - Broad language coverage
- **Dataset B**: Enhanced/specialized language detection dataset - Ultra-high accuracy focus

### πŸ€– Available Model Combinations

1. **Model A Dataset A** - XLM-RoBERTa + Standard Dataset βœ…
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 97.9%
   - **Size**: 278M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
   - **Use Cases**: General-purpose language detection, multilingual content processing

2. **Model B Dataset A** - BERT + Standard Dataset βœ…
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 96.17%
   - **Size**: 178M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Fast inference, broad language support, efficient processing
   - **Use Cases**: High-throughput detection, real-time applications, resource-constrained environments

3. **Model A Dataset B** - XLM-RoBERTa + Enhanced Dataset βœ…
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.72%
   - **Size**: 278M parameters
   - **Training Loss**: 0.0176
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Exceptional accuracy, focused language support, state-of-the-art results
   - **Use Cases**: Research applications, high-precision detection, critical accuracy requirements

4. **Model B Dataset B** - BERT + Enhanced Dataset βœ…
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.85%
   - **Size**: 178M parameters
   - **Training Loss**: 0.0125
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Highest accuracy, ultra-low training loss, precision-optimized
   - **Use Cases**: Maximum precision applications, research requiring highest accuracy

### πŸ—οΈ Core Components

- **`BaseLanguageModel`**: Abstract interface that all models must implement (sketched after this list)
- **`ModelRegistry`**: Manages model registration and creation with centralized configuration
- **`LanguageDetector`**: Main orchestrator for language detection
- **`model_config.py`**: Centralized configuration for all models and language mappings
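
As a rough illustration, the abstract interface might look like the sketch below. The method signatures mirror the example in the "Adding New Models" section; everything else (docstrings, exact file contents) is an assumption:

```python
# backend/models/base_model.py -- illustrative sketch, not the actual file
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseLanguageModel(ABC):
    """Abstract interface that every model combination implements."""

    @abstractmethod
    def predict(self, text: str) -> Dict[str, Any]:
        """Return the detected language and confidence scores for `text`."""

    @abstractmethod
    def get_supported_languages(self) -> List[str]:
        """Return the language codes this model can detect."""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Return model metadata (architecture, dataset, accuracy, ...)."""
```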

### πŸ”§ Adding New Models

To add a new model combination, simply:

1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add configuration to `model_config.py`
5. Register it in `ModelRegistry`

Example:
```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config

class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement prediction logic
        pass

    def get_supported_languages(self) -> List[str]:
        # Return supported language codes
        pass

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from config
        pass
```

Then add the configuration to `model_config.py` and register the model in `language_detector.py`, as sketched below.
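
As a rough sketch, the configuration entry and registration might look like this. The dictionary layout, key names, and `ModelRegistry.register` call are illustrative assumptions; mirror the existing entries in your copies of `model_config.py` and `language_detector.py`:

```python
# backend/models/model_config.py -- hypothetical new entry
MODEL_CONFIGS = {
    # ... existing entries ...
    "model-c-dataset-a": {
        "name": "Model C Dataset A",
        "architecture": "Model C",
        "dataset": "Dataset A",
        "hf_model_id": "your-org/your-model",  # placeholder model ID
    },
}

# backend/language_detector.py -- hypothetical registration
# (ModelRegistry manages model registration per the Core Components list)
from backend.models.model_c_dataset_a import ModelCDatasetA

ModelRegistry.register("model-c-dataset-a", ModelCDatasetA)
```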

## πŸ§ͺ Testing

The project includes comprehensive test suites; a minimal switching-test sketch follows the list:

- **`test_app.py`**: General app functionality tests
- **`test_model_a_dataset_a.py`**: Tests for XLM-RoBERTa + standard dataset
- **`test_model_b_dataset_b.py`**: Tests for BERT + enhanced dataset (highest accuracy)
- **Model comparison tests**: Automated testing across all model combinations
- **Model switching tests**: Verify seamless model switching
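
A minimal model-switching test might look like the sketch below. The `LanguageDetector` constructor matches the Configuration section further down; the `predict` method name follows the `BaseLanguageModel` interface and is an assumption here:

```python
# test_model_switching.py -- illustrative sketch
from backend.language_detector import LanguageDetector

def test_model_switching():
    keys = ["model-a-dataset-a", "model-b-dataset-a",
            "model-a-dataset-b", "model-b-dataset-b"]
    for model_key in keys:
        detector = LanguageDetector(model_key=model_key)
        result = detector.predict("Bonjour tout le monde")  # assumed method name
        assert result is not None, f"{model_key} returned no prediction"

if __name__ == "__main__":
    test_model_switching()
    print("All model combinations produced predictions.")
```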

## 🌐 Supported Languages

The models support different language sets based on their training:

- **Model A/B + Dataset A**: 100+ languages including major European, Asian, African, and other world languages based on the CC-100 dataset
- **Model A/B + Dataset B**: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)

## πŸ“Š Model Comparison

| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---------|-------------------|-------------------|-------------------|-------------------|
| **Architecture** | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| **Dataset** | Standard | Standard | Enhanced | Enhanced |
| **Accuracy** | 97.9% | 96.17% | 99.72% | **99.85%** πŸ† |
| **Model Size** | 278M | 178M | 278M | 178M |
| **Languages** | 100+ | 100+ | 20 (curated) | 20 (curated) |
| **Training Loss** | N/A | N/A | 0.0176 | **0.0125** |
| **Speed** | Moderate | **Fast** | Moderate | **Fast** |
| **Memory Usage** | Higher | **Lower** | Higher | **Lower** |
| **Best For** | Balanced performance | Speed & broad coverage | Ultra-high accuracy | **Maximum precision** |

### 🎯 Model Selection Guide

- **πŸ† Model B Dataset B**: Choose for maximum accuracy on 20 core languages (99.85%)
- **πŸ”¬ Model A Dataset B**: Choose for ultra-high accuracy on 20 core languages (99.72%)
- **βš–οΈ Model A Dataset A**: Choose for balanced performance and comprehensive language coverage (97.9%)
- **⚑ Model B Dataset A**: Choose for fast inference and broad language coverage (96.17%)

## πŸ”§ Configuration

You can configure models using the centralized configuration system:

```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a")  # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a")  # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b")  # Ultra-high accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b")  # Maximum precision BERT

# All configurations are centralized in backend/models/model_config.py
```
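
A typical call might then look like the following; the `predict` method name and the result shape are assumptions based on the `BaseLanguageModel` interface above:

```python
from backend.language_detector import LanguageDetector

detector = LanguageDetector(model_key="model-b-dataset-b")
result = detector.predict("Guten Morgen, wie geht es dir?")  # assumed method name

# Hypothetical result shape: detected language plus top-5 confidence scores,
# e.g. {"language": "de", "confidence": 0.99, "top_predictions": [...]}
print(result)
```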

## πŸ“ Project Structure

```
language-detection/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ model_config.py          # Centralized configuration
β”‚   β”‚   β”œβ”€β”€ base_model.py            # Abstract base class
β”‚   β”‚   β”œβ”€β”€ model_a_dataset_a.py     # XLM-RoBERTa + Standard
β”‚   β”‚   β”œβ”€β”€ model_b_dataset_a.py     # BERT + Standard
β”‚   β”‚   β”œβ”€β”€ model_a_dataset_b.py     # XLM-RoBERTa + Enhanced
β”‚   β”‚   β”œβ”€β”€ model_b_dataset_b.py     # BERT + Enhanced
β”‚   β”‚   └── __init__.py
β”‚   └── language_detector.py         # Main orchestrator
β”œβ”€β”€ tests/
β”œβ”€β”€ app.py                           # Gradio interface
└── README.md
```

## 🀝 Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request

## πŸ“ License

This project is open source and available under the MIT License.

## πŸ™ Acknowledgments

- **Hugging Face** for the transformers library and model hosting platform
- **Model providers** for the fine-tuned language detection models used in this project
- **Gradio** for the excellent web interface framework
- **Open source community** for the foundational technologies that make this project possible