|
# Sema Translation API - Complete Documentation |
|
|
|
Welcome to the comprehensive documentation for the Sema Translation API - an enterprise-grade translation service supporting 200+ languages with custom HuggingFace models and a focus on African languages. |
|
|
|
## ๐ Documentation Overview |
|
|
|
This documentation covers all aspects of the Sema Translation API, from custom model implementation to advanced deployment scenarios and future application ideas. |
|
|
|
### ๐ Core Documentation |
|
|
|
#### **[Custom Models Implementation](CUSTOM_MODELS_IMPLEMENTATION.md)** |
|
**Essential Reading** - Detailed documentation of how we implemented custom HuggingFace models: |
|
- Unified `sematech/sema-utils` repository structure |
|
- CTranslate2 optimization for 2-4x faster inference |
|
- Model loading pipeline and caching strategy |
|
- Performance benchmarks and monitoring |
|
- Model update and versioning process |
|
|
|
#### **[API Capabilities](API_CAPABILITIES.md)** |
|
Complete overview of enhanced API features: |
|
- 55+ African languages (updated from 23) |
|
- Server-side performance timing |
|
- Language detection with confidence scores |
|
- Comprehensive language metadata system |
|
|
|
#### **[Future Considerations](FUTURE_CONSIDERATIONS.md)** |
|
Roadmap and application ideas: |
|
- Authentication & user management with Supabase |
|
- Database integration and caching strategies |
|
- Document translation and real-time streaming |
|
- Innovative application ideas (chatbots, education, government services) |
|
|
|
#### **[Deployment Architecture](DEPLOYMENT_ARCHITECTURE.md)** |
|
Infrastructure and deployment details: |
|
- HuggingFace Spaces deployment process |
|
- Performance characteristics and resource requirements |
|
- Monitoring with Prometheus and structured logging |
|
- CI/CD pipeline and scaling considerations |
|
|
|
### ๐ Additional Documentation |
|
|
|
#### **[Project Overview](PROJECT_OVERVIEW.md)** |
|
High-level project introduction and goals |
|
|
|
#### **[API Reference](API_REFERENCE.md)** |
|
Complete endpoint documentation with examples |
|
|
|
## ๐ Key Achievements & Features |
|
|
|
### Custom HuggingFace Models Integration |
|
- **Unified Repository**: `sematech/sema-utils` containing all models |
|
- **Optimized Performance**: CTranslate2 INT8 quantization (75% size reduction) |
|
- **Automatic Updates**: HuggingFace Hub integration with version management |
|
- **Enterprise Caching**: Intelligent model caching and loading strategies |
|
|
|
### Enhanced African Language Support |
|
- **55+ African Languages**: Complete FLORES-200 African language coverage |
|
- **Regional Distribution**: West, East, Southern, Central, and North Africa |
|
- **Multiple Scripts**: Latin, Arabic, Ethiopic, Tifinagh support |
|
- **Cultural Context**: Native names and regional information |
|
|
|
### Performance & Monitoring |
|
- **Server-Side Timing**: Request performance tracking in headers and responses |
|
- **Prometheus Metrics**: Comprehensive monitoring and analytics |
|
- **Request Tracking**: Unique request IDs for debugging |
|
- **Health Monitoring**: System status and model availability checks |
|
|
|
## ๐ง Technical Implementation Highlights |
|
|
|
### Model Architecture |
|
``` |
|
Custom HuggingFace Models (sematech/sema-utils) |
|
โโโ Translation: NLLB-200 3.3B (CTranslate2 optimized) |
|
โโโ Language Detection: FastText LID.176 |
|
โโโ Tokenization: SentencePiece |
|
โโโ Language Database: FLORES-200 complete |
|
``` |
|
|
|
### Performance Metrics |
|
- **Model Size**: 2.5GB (optimized from 6.6GB) |
|
- **Inference Speed**: 0.2-2.5 seconds depending on text length |
|
- **Memory Usage**: ~3.2GB for models, 50-100MB per request |
|
- **Language Detection**: 0.01-0.05 seconds with 99%+ accuracy |
|
|
|
### API Enhancements |
|
- **Request Timing**: Server-side performance measurement |
|
- **Language Metadata**: Complete language information system |
|
- **Error Handling**: Comprehensive validation and error responses |
|
- **Rate Limiting**: 60 requests/minute with graceful degradation |
|
|
|
## ๐ Quick Start Examples |
|
|
|
### Basic Translation with Timing |
|
```bash |
|
curl -v -X POST "https://sematech-sema-api.hf.space/api/v1/translate" \ |
|
-H "Content-Type: application/json" \ |
|
-d '{"text": "Habari ya asubuhi", "target_language": "eng_Latn"}' |
|
|
|
# Response includes timing information: |
|
# X-Response-Time: 1.234s |
|
# X-Request-ID: 550e8400-e29b-41d4-a716-446655440000 |
|
``` |
|
|
|
### African Languages Discovery |
|
```bash |
|
# Get all 55+ African languages |
|
curl "https://sematech-sema-api.hf.space/api/v1/languages/african" |
|
|
|
# Search for specific African languages |
|
curl "https://sematech-sema-api.hf.space/api/v1/languages/search?q=Akan" |
|
curl "https://sematech-sema-api.hf.space/api/v1/languages/search?q=Bambara" |
|
``` |
|
|
|
### Language Detection with Confidence |
|
```bash |
|
curl -X POST "https://sematech-sema-api.hf.space/api/v1/detect-language" \ |
|
-H "Content-Type: application/json" \ |
|
-d '{"text": "Habari ya asubuhi"}' |
|
|
|
# Returns: detected language, confidence score, timing information |
|
``` |
|
|
|
## ๐ฏ Application Use Cases |
|
|
|
### 1. Multilingual Chatbot Implementation |
|
```python |
|
async def process_user_input(user_text): |
|
# 1. Detect language |
|
detection = await detect_language(user_text) |
|
|
|
# 2. Decide processing flow |
|
if detection.is_english: |
|
response = await llm_chat(user_text) |
|
else: |
|
# Translate โ Process โ Translate back |
|
english_input = await translate(user_text, "eng_Latn") |
|
english_response = await llm_chat(english_input) |
|
response = await translate(english_response, detection.detected_language) |
|
|
|
return response |
|
``` |
|
|
|
### 2. African News Platform |
|
- Aggregate news from multiple African countries |
|
- Translate between African languages |
|
- Provide summaries in user's preferred language |
|
|
|
### 3. Educational Platform |
|
- Interactive language learning with African languages |
|
- Cultural context and pronunciation guides |
|
- Progress tracking across multiple languages |
|
|
|
### 4. Government Services |
|
- Multilingual official document translation |
|
- Emergency notifications in local languages |
|
- Citizen services in preferred languages |
|
|
|
## ๐ API Statistics & Metrics |
|
|
|
### Language Coverage |
|
- **Total Languages**: 200+ (FLORES-200 complete) |
|
- **African Languages**: 55+ (updated from 23) |
|
- **Writing Scripts**: Latin, Arabic, Ethiopic, Tifinagh, Cyrillic, Han, etc. |
|
- **Geographic Regions**: Comprehensive global coverage |
|
|
|
### Performance Benchmarks |
|
- **Translation Speed**: 0.2-2.5s depending on text length |
|
- **Language Detection**: 0.01-0.05s with 99%+ accuracy |
|
- **Model Efficiency**: 75% size reduction with maintained quality |
|
- **Concurrent Handling**: Linear scaling with available resources |
|
|
|
### Quality Metrics |
|
- **BLEU Scores**: Industry-standard translation quality |
|
- **African Languages**: Specialized cultural context preservation |
|
- **Uptime**: 99.9% target availability |
|
- **Error Rate**: <1% under normal load |
|
|
|
## ๐ฎ Future Roadmap |
|
|
|
### Immediate (3-6 months) |
|
- User authentication and usage tracking |
|
- Database integration with PostgreSQL |
|
- Redis caching for improved performance |
|
- Advanced monitoring dashboards |
|
|
|
### Medium-term (6-12 months) |
|
- Document translation with formatting preservation |
|
- Real-time translation streaming via WebSocket |
|
- Domain-specific models (medical, legal, technical) |
|
- Mobile SDK development |
|
|
|
### Long-term (1-2 years) |
|
- AI-powered translation ecosystem |
|
- Enterprise integration platform |
|
- African language research contributions |
|
- Voice-to-voice translation capabilities |
|
|
|
## ๐ ๏ธ Development & Deployment |
|
|
|
### Local Development |
|
```bash |
|
# Clone and setup |
|
git clone https://github.com/lewiskimaru/sema.git |
|
cd sema/backend/sema-api |
|
|
|
# Install dependencies |
|
pip install -r requirements.txt |
|
|
|
# Run locally |
|
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000 |
|
``` |
|
|
|
### Testing |
|
```bash |
|
# Run comprehensive tests |
|
python tests/test_african_languages_update.py |
|
python tests/test_performance_timing.py |
|
python tests/simple_test.py |
|
``` |
|
|
|
### Deployment |
|
- **Platform**: HuggingFace Spaces |
|
- **Auto-deployment**: Git integration |
|
- **Model Updates**: Automatic from `sematech/sema-utils` |
|
- **Monitoring**: Prometheus metrics and health checks |
|
|
|
## ๐ Support & Resources |
|
|
|
### Documentation Links |
|
- **Live API**: https://sematech-sema-api.hf.space |
|
- **Interactive Docs**: https://sematech-sema-api.hf.space/ (Swagger UI) |
|
- **Health Status**: https://sematech-sema-api.hf.space/health |
|
- **Metrics**: https://sematech-sema-api.hf.space/metrics |
|
|
|
### Model Repository |
|
- **HuggingFace**: https://huggingface.co/sematech/sema-utils |
|
- **Model Documentation**: Comprehensive model usage and optimization guides |
|
- **Version History**: Track model updates and improvements |
|
|
|
### Community & Support |
|
- **GitHub Repository**: Complete source code and issue tracking |
|
- **Model Contributions**: Community-driven improvements |
|
- **Research Collaboration**: Academic partnerships for African language research |
|
|
|
--- |
|
|
|
**The Sema Translation API represents a significant advancement in African language technology, combining custom HuggingFace models with enterprise-grade infrastructure to serve diverse global communities.** |
|
|
|
*Documentation last updated: June 2024 | API Version: 2.0.0* |
|
|