Spaces: Build error
Upload 3 files
Browse files
- README.md +250 -7
- app.py +260 -0
- requirements.txt +64 -0
README.md
CHANGED
@@ -1,12 +1,255 @@
  ---
- title: Language Detection
- emoji:
- colorFrom:
- colorTo:
  sdk: gradio
  app_file: app.py
  ---
  ---
+ title: Language Detection App
+ emoji:
+ colorFrom: indigo
+ colorTo: blue
  sdk: gradio
+ python_version: 3.9
  app_file: app.py
+ license: mit
  ---

# Language Detection App

A powerful and elegant language detection application built with a Gradio frontend and a modular backend featuring multiple state-of-the-art ML models, organized by architecture and training dataset.

## Features

- **Clean Gradio Interface**: Simple, intuitive web interface for language detection
- **Multiple Model Architectures**: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- **Multiple Training Datasets**: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- **Centralized Configuration**: All model configurations and settings in one place
- **Modular Backend**: Easy-to-extend architecture for plugging in your own ML models
- **Real-time Detection**: Instant language detection with confidence scores
- **Multiple Predictions**: Shows the top 5 language predictions with confidence levels
- **100+ Languages**: Support for major world languages (varies by model)
- **Example Texts**: Pre-loaded examples in various languages for testing
- **Model Switching**: Seamlessly switch between different models
- **Extensible**: Abstract base class for implementing custom models

## Quick Start

### 1. Setup Environment

```bash
# Create virtual environment
python -m venv venv

# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Test the Backend

```bash
# Run tests to verify everything works
python test_app.py

# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```

### 3. Launch the App

```bash
# Start the Gradio app
python app.py
```

The app will be available at `http://localhost:7860`.

## Model Architecture

The system is organized along two dimensions:

### Model Architectures

- **Model A**: XLM-RoBERTa-based architectures with excellent cross-lingual transfer capabilities
- **Model B**: BERT-based architectures with efficient, fast processing

### Training Datasets

- **Dataset A**: Standard multilingual language detection dataset with broad language coverage
- **Dataset B**: Enhanced/specialized language detection dataset focused on ultra-high accuracy

### Available Model Combinations

1. **Model A Dataset A** - XLM-RoBERTa + Standard Dataset
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 97.9%
   - **Size**: 278M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
   - **Use Cases**: General-purpose language detection, multilingual content processing

2. **Model B Dataset A** - BERT + Standard Dataset
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 96.17%
   - **Size**: 178M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Fast inference, broad language support, efficient processing
   - **Use Cases**: High-throughput detection, real-time applications, resource-constrained environments

3. **Model A Dataset B** - XLM-RoBERTa + Enhanced Dataset
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.72%
   - **Size**: 278M parameters
   - **Training Loss**: 0.0176
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Exceptional accuracy, focused language support, state-of-the-art results
   - **Use Cases**: Research applications, high-precision detection, critical accuracy requirements

4. **Model B Dataset B** - BERT + Enhanced Dataset
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.85%
   - **Size**: 178M parameters
   - **Training Loss**: 0.0125
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Highest accuracy, ultra-low training loss, precision-optimized
   - **Use Cases**: Maximum-precision applications, research requiring the highest accuracy

### Core Components

- **`BaseLanguageModel`**: Abstract interface that all models must implement
- **`ModelRegistry`**: Manages model registration and creation with centralized configuration
- **`LanguageDetector`**: Main orchestrator for language detection
- **`model_config.py`**: Centralized configuration for all models and language mappings

### Adding New Models

To add a new model combination:

1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add its configuration to `model_config.py`
5. Register it in `ModelRegistry`

Example:

```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config

class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model here

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement the prediction logic
        pass

    def get_supported_languages(self) -> List[str]:
        # Return the supported language codes
        pass

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from the config
        pass
```

Then add its configuration in `model_config.py` and register it in `language_detector.py`.

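The base class and registry themselves are not shown in this diff; the following is a minimal, self-contained sketch of what a `BaseLanguageModel` interface plus `ModelRegistry` could look like. The stub model and the `"english-only"` key are hypothetical, included only to show registration end to end.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Type

class BaseLanguageModel(ABC):
    """Sketch of the abstract interface every model combination implements."""

    @abstractmethod
    def predict(self, text: str) -> Dict[str, Any]: ...

    @abstractmethod
    def get_supported_languages(self) -> List[str]: ...

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]: ...

class ModelRegistry:
    """Maps model keys to implementation classes."""
    _models: Dict[str, Type[BaseLanguageModel]] = {}

    @classmethod
    def register(cls, key: str, model_cls: Type[BaseLanguageModel]) -> None:
        cls._models[key] = model_cls

    @classmethod
    def create(cls, key: str) -> BaseLanguageModel:
        if key not in cls._models:
            raise KeyError(f"Unknown model key: {key}")
        return cls._models[key]()

# Trivial stand-in implementation (hypothetical), just to exercise the registry:
class EnglishOnly(BaseLanguageModel):
    def predict(self, text: str) -> Dict[str, Any]:
        return {"language": "English", "language_code": "en", "confidence": 1.0}

    def get_supported_languages(self) -> List[str]:
        return ["en"]

    def get_model_info(self) -> Dict[str, Any]:
        return {"name": "english-only-stub"}

ModelRegistry.register("english-only", EnglishOnly)
detector = ModelRegistry.create("english-only")
```

Because abstract methods are enforced at instantiation time, forgetting to implement one of the three methods fails immediately rather than at first prediction.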
## Testing

The project includes comprehensive test suites:

- **`test_app.py`**: General app functionality tests
- **`test_model_a_dataset_a.py`**: Tests for XLM-RoBERTa + the standard dataset
- **`test_model_b_dataset_b.py`**: Tests for BERT + the enhanced dataset (highest accuracy)
- **Model comparison tests**: Automated testing across all model combinations
- **Model switching tests**: Verify seamless model switching

## Supported Languages

The models support different language sets based on their training:

- **Model A/B + Dataset A**: 100+ languages, including major European, Asian, African, and other world languages, based on the CC-100 dataset
- **Model A/B + Dataset B**: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)

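Since the app reports ISO 639-1 codes, the Dataset B set can be written down as a name-to-code mapping. The pairs below are standard ISO 639-1; whether the models emit exactly these codes should be confirmed against `model_config.py`, which is not included in this diff.

```python
# ISO 639-1 codes for the 20 Dataset B languages listed above (assumed mapping)
DATASET_B_LANGUAGES = {
    "Arabic": "ar", "Bulgarian": "bg", "German": "de", "Greek": "el",
    "English": "en", "Spanish": "es", "French": "fr", "Hindi": "hi",
    "Italian": "it", "Japanese": "ja", "Dutch": "nl", "Polish": "pl",
    "Portuguese": "pt", "Russian": "ru", "Swahili": "sw", "Thai": "th",
    "Turkish": "tr", "Urdu": "ur", "Vietnamese": "vi", "Chinese": "zh",
}

def is_supported(language_code: str) -> bool:
    """Check whether a detected ISO 639-1 code is in the Dataset B set."""
    return language_code in DATASET_B_LANGUAGES.values()
```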
## Model Comparison

| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---------|-------------------|-------------------|-------------------|-------------------|
| **Architecture** | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| **Dataset** | Standard | Standard | Enhanced | Enhanced |
| **Accuracy** | 97.9% | 96.17% | 99.72% | **99.85%** |
| **Model Size** | 278M | 178M | 278M | 178M |
| **Languages** | 100+ | 100+ | 20 (curated) | 20 (curated) |
| **Training Loss** | N/A | N/A | 0.0176 | **0.0125** |
| **Speed** | Moderate | **Fast** | Moderate | **Fast** |
| **Memory Usage** | Higher | **Lower** | Higher | **Lower** |
| **Best For** | Balanced performance | Speed & broad coverage | Ultra-high accuracy | **Maximum precision** |

### Model Selection Guide

- **Model B Dataset B**: Choose for maximum accuracy on 20 core languages (99.85%)
- **Model A Dataset B**: Choose for ultra-high accuracy on 20 core languages (99.72%)
- **Model A Dataset A**: Choose for balanced performance and comprehensive language coverage (97.9%)
- **Model B Dataset A**: Choose for fast inference and broad language coverage (96.17%)

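The selection guide boils down to two questions: do you need broad (100+) language coverage, and do you prioritize speed? An illustrative helper (not part of the repository) that maps those answers to the model keys used by `LanguageDetector`:

```python
def pick_model(need_broad_coverage: bool, prioritize_speed: bool) -> str:
    """Map the selection guide to a model key (illustrative helper only)."""
    if need_broad_coverage:
        # Dataset A models cover 100+ languages; BERT (Model B) is the faster one
        return "model-b-dataset-a" if prioritize_speed else "model-a-dataset-a"
    # Dataset B models trade coverage (20 languages) for accuracy
    return "model-b-dataset-b" if prioritize_speed else "model-a-dataset-b"
```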
## Configuration

You can configure models using the centralized configuration system:

```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a")  # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a")  # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b")  # Ultra-high-accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b")  # Maximum-precision BERT

# All configurations are centralized in backend/models/model_config.py
```

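`model_config.py` itself is not part of this diff; a plausible shape for the centralized configuration and the `get_model_config` lookup used by the models above might be the following (field names and values here are assumptions based on the README's model descriptions):

```python
from typing import Any, Dict

# Hypothetical sketch of backend/models/model_config.py
MODEL_CONFIGS: Dict[str, Dict[str, Any]] = {
    "model-a-dataset-a": {
        "name": "Model A Dataset A",
        "architecture": "XLM-RoBERTa",
        "dataset": "Dataset A (standard multilingual)",
        "accuracy": "97.9%",
        "model_size": "278M parameters",
    },
    "model-b-dataset-b": {
        "name": "Model B Dataset B",
        "architecture": "BERT",
        "dataset": "Dataset B (enhanced/specialized)",
        "accuracy": "99.85%",
        "model_size": "178M parameters",
    },
    # ...the remaining combinations follow the same shape
}

def get_model_config(model_key: str) -> Dict[str, Any]:
    """Look up one model's configuration by key, failing loudly on typos."""
    try:
        return MODEL_CONFIGS[model_key]
    except KeyError:
        raise KeyError(f"No configuration for model key {model_key!r}") from None
```

Keeping the metadata in one dictionary means new model combinations only touch this file plus their own module, which is what makes step 4 of "Adding New Models" a one-line change.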
## Project Structure

```
language-detection/
├── backend/
│   ├── models/
│   │   ├── model_config.py        # Centralized configuration
│   │   ├── base_model.py          # Abstract base class
│   │   ├── model_a_dataset_a.py   # XLM-RoBERTa + Standard
│   │   ├── model_b_dataset_a.py   # BERT + Standard
│   │   ├── model_a_dataset_b.py   # XLM-RoBERTa + Enhanced
│   │   ├── model_b_dataset_b.py   # BERT + Enhanced
│   │   └── __init__.py
│   └── language_detector.py       # Main orchestrator
├── tests/
├── app.py                         # Gradio interface
└── README.md
```

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add its configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request

## License

This project is open source and available under the MIT License.

## Acknowledgments

- **Hugging Face** for the transformers library and model hosting platform
- **Model providers** for the fine-tuned language detection models used in this project
- **Gradio** for the excellent web interface framework
- **Open source community** for the foundational technologies that make this project possible
app.py
ADDED
@@ -0,0 +1,260 @@
```python
import gradio as gr
from backend.language_detector import LanguageDetector

def main():
    # Initialize the language detector with the default model (Model A Dataset A)
    detector = LanguageDetector()

    # Create the Gradio interface
    with gr.Blocks(title="Language Detection App", theme=gr.themes.Soft()) as app:
        gr.Markdown("# Language Detection App")
        gr.Markdown("Select a model and enter text below to detect its language with confidence scores.")

        # Model selection section with visual styling
        with gr.Group():
            gr.Markdown(
                "<div style='text-align: center; padding: 16px 0 8px 0; margin-bottom: 16px; font-size: 18px; font-weight: 600; border-bottom: 2px solid; background: linear-gradient(90deg, transparent, rgba(99, 102, 241, 0.1), transparent); border-radius: 8px 8px 0 0;'>Model Selection</div>"
            )

            # Get the available models
            available_models = detector.get_available_models()
            model_choices = []
            model_info_map = {}

            for key, info in available_models.items():
                if info["status"] == "available":
                    model_choices.append((info["display_name"], key))
                else:
                    model_choices.append((f"{info['display_name']} (Coming Soon)", key))
                model_info_map[key] = info

            model_selector = gr.Dropdown(
                choices=model_choices,
                value="model-a-dataset-a",  # Default to Model A Dataset A
                label="Choose Language Detection Model",
                interactive=True
            )

            # Model information display
            model_info_display = gr.Markdown(
                value=_format_model_info(detector.get_current_model_info()),
                label="Model Information"
            )

        # Visual separator
        gr.Markdown(
            "<div style='margin: 24px 0; border-top: 3px solid rgba(99, 102, 241, 0.2); background: linear-gradient(90deg, transparent, rgba(99, 102, 241, 0.05), transparent); height: 2px;'></div>"
        )

        # Analysis section
        with gr.Group():
            gr.Markdown(
                "<div style='text-align: center; padding: 16px 0 8px 0; margin-bottom: 16px; font-size: 18px; font-weight: 600; border-bottom: 2px solid; background: linear-gradient(90deg, transparent, rgba(34, 197, 94, 0.1), transparent); border-radius: 8px 8px 0 0;'>Language Analysis</div>"
            )

            with gr.Row():
                with gr.Column(scale=2):
                    # Input section
                    text_input = gr.Textbox(
                        label="Text to Analyze",
                        placeholder="Enter text here to detect its language...",
                        lines=5,
                        max_lines=10
                    )

                    detect_btn = gr.Button("Detect Language", variant="primary", size="lg")

                    # Example texts
                    gr.Examples(
                        examples=[
                            ["Hello, how are you today?"],
                            ["Bonjour, comment allez-vous?"],
                            ["Hola, ¿cómo estás?"],
                            ["Guten Tag, wie geht es Ihnen?"],
                            ["こんにちは、元気ですか？"],
                            ["Привет, как дела?"],
                            ["Ciao, come stai?"],
                            ["Olá, como você está?"],
                            ["你好！你好吗？"],
                            ["안녕하세요, 어떻게 지내세요?"]
                        ],
                        inputs=text_input,
                        label="Try these examples:"
                    )

                with gr.Column(scale=2):
                    # Output section
                    with gr.Group():
                        gr.Markdown(
                            "<div style='text-align: center; padding: 16px 0 8px 0; margin-bottom: 12px; font-size: 18px; font-weight: 600; border-bottom: 2px solid; background: linear-gradient(90deg, transparent, rgba(168, 85, 247, 0.1), transparent); border-radius: 8px 8px 0 0;'>Detection Results</div>"
                        )

                        detected_language = gr.Textbox(
                            label="Detected Language",
                            interactive=False
                        )

                        confidence_score = gr.Number(
                            label="Confidence Score",
                            interactive=False,
                            precision=4
                        )

                        language_code = gr.Textbox(
                            label="Language Code (ISO 639-1)",
                            interactive=False
                        )

                        # Top predictions table
                        top_predictions = gr.Dataframe(
                            headers=["Language", "Code", "Confidence"],
                            label="Top 5 Predictions",
                            interactive=False,
                            wrap=True
                        )

        # Status/info section
        with gr.Row():
            status_text = gr.Textbox(
                label="Status",
                interactive=False,
                visible=False
            )

        # Event handlers
        def detect_language_wrapper(text, selected_model):
            if not text.strip():
                return (
                    "No text provided",
                    0.0,
                    "",
                    [],
                    gr.update(value="Please enter some text to analyze.", visible=True)
                )

            try:
                # Switch models if needed
                if detector.current_model_key != selected_model:
                    try:
                        detector.switch_model(selected_model)
                    except NotImplementedError:
                        return (
                            "Model unavailable",
                            0.0,
                            "",
                            [],
                            gr.update(value="This model is not yet implemented. Please select an available model.", visible=True)
                        )
                    except Exception as e:
                        return (
                            "Model error",
                            0.0,
                            "",
                            [],
                            gr.update(value=f"Error loading model: {str(e)}", visible=True)
                        )

                result = detector.detect_language(text)

                # Extract the main prediction
                main_lang = result['language']
                main_confidence = result['confidence']
                main_code = result['language_code']

                # Format the top predictions for the table
                predictions_table = [
                    [pred['language'], pred['language_code'], f"{pred['confidence']:.4f}"]
                    for pred in result['top_predictions']
                ]

                model_info = result.get('metadata', {}).get('model_info', {})
                model_name = model_info.get('name', 'Unknown Model')

                return (
                    main_lang,
                    main_confidence,
                    main_code,
                    predictions_table,
                    gr.update(value=f"Analysis Complete\n\nInput Text: {text[:100]}{'...' if len(text) > 100 else ''}\n\nDetected Language: {main_lang} ({main_code})\nConfidence: {main_confidence:.2%}\n\nModel: {model_name}", visible=True)
                )

            except Exception as e:
                return (
                    "Error occurred",
                    0.0,
                    "",
                    [],
                    gr.update(value=f"Error: {str(e)}", visible=True)
                )

        def update_model_info(selected_model):
            """Update the model information display when the model selection changes."""
            try:
                if detector.current_model_key != selected_model:
                    detector.switch_model(selected_model)
                model_info = detector.get_current_model_info()
                return _format_model_info(model_info)
            except NotImplementedError:
                return "**This model is not yet implemented.** Please select an available model."
            except Exception as e:
                return f"**Error loading model information:** {str(e)}"

        # Connect the button to the detection function
        detect_btn.click(
            fn=detect_language_wrapper,
            inputs=[text_input, model_selector],
            outputs=[detected_language, confidence_score, language_code, top_predictions, status_text]
        )

        # Also trigger detection on Enter in the text input
        text_input.submit(
            fn=detect_language_wrapper,
            inputs=[text_input, model_selector],
            outputs=[detected_language, confidence_score, language_code, top_predictions, status_text]
        )

        # Update the model info when the selection changes
        model_selector.change(
            fn=update_model_info,
            inputs=[model_selector],
            outputs=[model_info_display]
        )

    return app


def _format_model_info(model_info):
    """Format model information for display."""
    if not model_info:
        return "No model information available."

    formatted_info = f"""
**{model_info.get('name', 'Unknown Model')}**

{model_info.get('description', 'No description available.')}

**Performance:**
- Accuracy: {model_info.get('accuracy', 'N/A')}
- Model Size: {model_info.get('model_size', 'N/A')}

**Architecture:**
- Model Architecture: {model_info.get('architecture', 'N/A')}
- Base Model: {model_info.get('base_model', 'N/A')}
- Training Dataset: {model_info.get('dataset', 'N/A')}

**Languages:** {model_info.get('languages_supported', 'N/A')}

**Training Details:** {model_info.get('training_details', 'N/A')}

**Use Cases:** {model_info.get('use_cases', 'N/A')}

**Strengths:** {model_info.get('strengths', 'N/A')}

**Limitations:** {model_info.get('limitations', 'N/A')}
"""
    return formatted_info


if __name__ == "__main__":
    app = main()
    app.launch()
```
requirements.txt
ADDED
@@ -0,0 +1,64 @@
```
aiofiles==24.1.0
annotated-types==0.7.0
anyio==4.9.0
audioop-lts==0.2.1
certifi==2025.4.26
charset-normalizer==3.4.2
click==8.1.8
fastapi==0.115.12
ffmpy==0.5.0
filelock==3.18.0
fsspec==2025.5.1
gradio==5.31.0
gradio_client==1.10.1
groovy==0.1.2
h11==0.16.0
hf-xet==1.1.2
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.32.0
idna==3.10
Jinja2==3.1.6
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
mpmath==1.3.0
networkx==3.4.2
numpy==2.2.6
orjson==3.10.18
packaging==25.0
pandas==2.2.3
pillow==11.2.1
pydantic==2.11.5
pydantic_core==2.33.2
pydub==0.25.1
Pygments==2.19.1
python-dateutil==2.9.0.post0
python-multipart==0.0.20
pytz==2025.2
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
rich==14.0.0
ruff==0.11.11
safehttpx==0.1.6
safetensors==0.5.3
semantic-version==2.10.0
setuptools==80.8.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
starlette==0.46.2
sympy==1.14.0
tokenizers==0.21.1
tomlkit==0.13.2
torch==2.7.0
tqdm==4.67.1
transformers==4.52.3
typer==0.15.4
typing-inspection==0.4.1
typing_extensions==4.13.2
tzdata==2025.2
urllib3==2.4.0
uvicorn==0.34.2
websockets==15.0.1
```