---
title: Language Detection App
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
python_version: 3.9
app_file: app.py
license: mit
---
# 🌍 Language Detection App
A powerful and elegant language detection application built with a Gradio frontend and a modular backend featuring multiple state-of-the-art ML models, organized by architecture and training dataset.
## ✨ Features
- **Clean Gradio Interface**: Simple, intuitive web interface for language detection
- **Multiple Model Architectures**: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- **Multiple Training Datasets**: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- **Centralized Configuration**: All model configurations and settings in one place
- **Modular Backend**: Easy-to-extend architecture for plugging in your own ML models
- **Real-time Detection**: Instant language detection with confidence scores
- **Multiple Predictions**: Shows top 5 language predictions with confidence levels
- **100+ Languages**: Support for major world languages (varies by model)
- **Example Texts**: Pre-loaded examples in various languages for testing
- **Model Switching**: Seamlessly switch between different models
- **Extensible**: Abstract base class for implementing custom models
## 🚀 Quick Start
### 1. Setup Environment
```bash
# Create virtual environment
python -m venv venv
# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
### 2. Test the Backend
```bash
# Run tests to verify everything works
python test_app.py
# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```
### 3. Launch the App
```bash
# Start the Gradio app
python app.py
```
The app will be available at `http://localhost:7860`.
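If port 7860 is already in use or you need to bind another interface, Gradio's standard `launch()` arguments cover it. The snippet below is a minimal sketch of such an entry point, not the actual contents of `app.py` (the real app wires the interface to the backend detector):

```python
# Minimal illustrative entry point; the real app.py builds a richer interface
# backed by backend/language_detector.py.
import gradio as gr

def detect(text: str) -> str:
    # Placeholder prediction; the actual app delegates to the selected model.
    return "en"

demo = gr.Interface(fn=detect, inputs="text", outputs="text")

if __name__ == "__main__":
    # server_name/server_port are standard Gradio launch() options.
    demo.launch(server_name="0.0.0.0", server_port=7860)
```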
## 🧩 Model Architecture
The system is organized around two dimensions:
### 🏗️ Model Architectures
- **Model A**: XLM-RoBERTa based architectures - Excellent cross-lingual transfer capabilities
- **Model B**: BERT based architectures - Efficient and fast processing
### 📊 Training Datasets
- **Dataset A**: Standard multilingual language detection dataset - Broad language coverage
- **Dataset B**: Enhanced/specialized language detection dataset - Ultra-high accuracy focus
### 🤖 Available Model Combinations
1. **Model A Dataset A** - XLM-RoBERTa + Standard Dataset ✅
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 97.9%
   - **Size**: 278M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
   - **Use Cases**: General-purpose language detection, multilingual content processing
2. **Model B Dataset A** - BERT + Standard Dataset ✅
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 96.17%
   - **Size**: 178M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Fast inference, broad language support, efficient processing
   - **Use Cases**: High-throughput detection, real-time applications, resource-constrained environments
3. **Model A Dataset B** - XLM-RoBERTa + Enhanced Dataset ✅
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.72%
   - **Size**: 278M parameters
   - **Training Loss**: 0.0176
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Exceptional accuracy, focused language support, state-of-the-art results
   - **Use Cases**: Research applications, high-precision detection, critical accuracy requirements
4. **Model B Dataset B** - BERT + Enhanced Dataset ✅
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.85%
   - **Size**: 178M parameters
   - **Training Loss**: 0.0125
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Highest accuracy, ultra-low training loss, precision-optimized
   - **Use Cases**: Maximum precision applications, research requiring highest accuracy
### 🏗️ Core Components
- **`BaseLanguageModel`**: Abstract interface that all models must implement
- **`ModelRegistry`**: Manages model registration and creation with centralized configuration
- **`LanguageDetector`**: Main orchestrator for language detection
- **`model_config.py`**: Centralized configuration for all models and language mappings
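The abstract interface is small. The following is a sketch of what `base_model.py` defines; the three method names are the ones used throughout this README, but check the file itself for the exact signatures:

```python
# backend/models/base_model.py (sketch; see the repository file for the real definition)
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseLanguageModel(ABC):
    """Contract that every model/dataset combination implements."""

    @abstractmethod
    def predict(self, text: str) -> Dict[str, Any]:
        """Return the detected language and confidence scores for `text`."""

    @abstractmethod
    def get_supported_languages(self) -> List[str]:
        """Return the language codes this model can detect."""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Return model metadata (name, accuracy, size) from the central config."""
```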
### 🔧 Adding New Models
To add a new model combination, simply:
1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add configuration to `model_config.py`
5. Register it in `ModelRegistry`
Example:
```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config


class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model here (e.g. load tokenizer and weights)

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement prediction logic
        pass

    def get_supported_languages(self) -> List[str]:
        # Return supported language codes
        pass

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from config
        pass
```
Then add its configuration to `model_config.py` and register it with the `ModelRegistry` in `language_detector.py`.
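For orientation, those two steps might look roughly like the sketch below. The dictionary keys and the `ModelRegistry.register` call are assumptions, so mirror the existing entries in `model_config.py` and `language_detector.py` rather than copying this verbatim:

```python
# backend/models/model_config.py (hypothetical entry; mirror the existing ones)
MODEL_CONFIGS = {
    "model-c-dataset-a": {
        "display_name": "Model C Dataset A",
        "architecture": "Model C",
        "dataset": "Dataset A",
        "hf_model_id": "your-org/your-language-detection-model",  # placeholder
    },
    # ... existing entries ...
}

# backend/language_detector.py (hypothetical registration; ModelRegistry is
# assumed to live in this module)
from backend.models.model_c_dataset_a import ModelCDatasetA

ModelRegistry.register("model-c-dataset-a", ModelCDatasetA)
```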
## 🧪 Testing
The project includes comprehensive test suites:
- **`test_app.py`**: General app functionality tests
- **`test_model_a_dataset_a.py`**: Tests for XLM-RoBERTa + standard dataset
- **`test_model_b_dataset_b.py`**: Tests for BERT + enhanced dataset (highest accuracy)
- **Model comparison tests**: Automated testing across all model combinations
- **Model switching tests**: Verify seamless model switching
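A minimal smoke test in the same spirit might look like the sketch below; the `predict` method name and the result schema are assumptions to verify against `backend/language_detector.py`:

```python
# smoke_test.py — illustrative only; the real suites listed above are more thorough.
from backend.language_detector import LanguageDetector

def test_english_detection():
    detector = LanguageDetector(model_key="model-b-dataset-b")
    # Assumes the detector exposes predict() returning a dict with a language code.
    result = detector.predict("The quick brown fox jumps over the lazy dog.")
    assert result["language"] in ("en", "English")

def test_model_switching():
    # Every registered combination should construct without errors.
    for key in ("model-a-dataset-a", "model-b-dataset-a",
                "model-a-dataset-b", "model-b-dataset-b"):
        assert LanguageDetector(model_key=key) is not None
```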
## 🌐 Supported Languages
The models support different language sets based on their training:
- **Model A/B + Dataset A**: 100+ languages including major European, Asian, African, and other world languages based on the CC-100 dataset
- **Model A/B + Dataset B**: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)
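To check coverage programmatically, each model exposes `get_supported_languages()`. A short sketch, assuming the model classes can be instantiated directly with no arguments:

```python
from backend.models.model_b_dataset_b import ModelBDatasetB

# Direct no-argument instantiation is an assumption; see the class definition.
# The enhanced-dataset models cover 20 curated languages; printing the codes
# shows exactly which ones the loaded model reports.
model = ModelBDatasetB()
codes = model.get_supported_languages()
print(len(codes), sorted(codes))
```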
## 📊 Model Comparison
| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---------|-------------------|-------------------|-------------------|-------------------|
| **Architecture** | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| **Dataset** | Standard | Standard | Enhanced | Enhanced |
| **Accuracy** | 97.9% | 96.17% | 99.72% | **99.85%** 🏆 |
| **Model Size** | 278M | 178M | 278M | 178M |
| **Languages** | 100+ | 100+ | 20 (curated) | 20 (curated) |
| **Training Loss** | N/A | N/A | 0.0176 | **0.0125** |
| **Speed** | Moderate | **Fast** | Moderate | **Fast** |
| **Memory Usage** | Higher | **Lower** | Higher | **Lower** |
| **Best For** | Balanced performance | Speed & broad coverage | Ultra-high accuracy | **Maximum precision** |
### 🎯 Model Selection Guide
- **🏆 Model B Dataset B**: Choose for maximum accuracy on 20 core languages (99.85%)
- **🔬 Model A Dataset B**: Choose for ultra-high accuracy on 20 core languages (99.72%)
- **⚖️ Model A Dataset A**: Choose for balanced performance and comprehensive language coverage (97.9%)
- **⚡ Model B Dataset A**: Choose for fast inference and broad language coverage (96.17%)
## 🔧 Configuration
You can configure models using the centralized configuration system:
```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a") # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a") # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b") # Ultra-high accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b") # Maximum precision BERT
# All configurations are centralized in backend/models/model_config.py
```
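Once a detector is constructed, a detection call might look like the following sketch; the method name and the exact shape of the result (top prediction plus ranked alternatives, as described under Features) are assumptions to confirm against `backend/language_detector.py`:

```python
from backend.language_detector import LanguageDetector

detector = LanguageDetector(model_key="model-b-dataset-b")
result = detector.predict("Bonjour tout le monde, comment allez-vous ?")

# Illustrative result shape: top language plus ranked alternatives with confidences.
# {"language": "fr", "confidence": 0.998, "top_predictions": [("fr", 0.998), ...]}
print(result)
```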
## 📁 Project Structure
```
language-detection/
├── backend/
│   ├── models/
│   │   ├── model_config.py        # Centralized configuration
│   │   ├── base_model.py          # Abstract base class
│   │   ├── model_a_dataset_a.py   # XLM-RoBERTa + Standard
│   │   ├── model_b_dataset_a.py   # BERT + Standard
│   │   ├── model_a_dataset_b.py   # XLM-RoBERTa + Enhanced
│   │   ├── model_b_dataset_b.py   # BERT + Enhanced
│   │   └── __init__.py
│   └── language_detector.py       # Main orchestrator
├── tests/
├── app.py                         # Gradio interface
└── README.md
```
## 🤝 Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request
## 📝 License
This project is open source and available under the MIT License.
## 🙏 Acknowledgments
- **Hugging Face** for the transformers library and model hosting platform
- **Model providers** for the fine-tuned language detection models used in this project
- **Gradio** for the excellent web interface framework
- **Open source community** for the foundational technologies that make this project possible