Spaces:

sematech
/

sema-api

Sleeping

App Files Files Community

sema-api / docs /README.md

kamau1

Added documentation for using custom models

0745795 2 months ago

preview code

raw

history blame contribute delete

8.97 kB

	# Sema Translation API - Complete Documentation

	Welcome to the comprehensive documentation for the Sema Translation API - an enterprise-grade translation service supporting 200+ languages with custom HuggingFace models and a focus on African languages.

	## 📚 Documentation Overview

	This documentation covers all aspects of the Sema Translation API, from custom model implementation to advanced deployment scenarios and future application ideas.

	### 🚀 Core Documentation

	#### [Custom Models Implementation](CUSTOM_MODELS_IMPLEMENTATION.md)
	Essential Reading - Detailed documentation of how we implemented custom HuggingFace models:
	- Unified `sematech/sema-utils` repository structure
	- CTranslate2 optimization for 2-4x faster inference
	- Model loading pipeline and caching strategy
	- Performance benchmarks and monitoring
	- Model update and versioning process

	#### [API Capabilities](API_CAPABILITIES.md)
	Complete overview of enhanced API features:
	- 55+ African languages (updated from 23)
	- Server-side performance timing
	- Language detection with confidence scores
	- Comprehensive language metadata system

	#### [Future Considerations](FUTURE_CONSIDERATIONS.md)
	Roadmap and application ideas:
	- Authentication & user management with Supabase
	- Database integration and caching strategies
	- Document translation and real-time streaming
	- Innovative application ideas (chatbots, education, government services)

	#### [Deployment Architecture](DEPLOYMENT_ARCHITECTURE.md)
	Infrastructure and deployment details:
	- HuggingFace Spaces deployment process
	- Performance characteristics and resource requirements
	- Monitoring with Prometheus and structured logging
	- CI/CD pipeline and scaling considerations

	### 📖 Additional Documentation

	#### [Project Overview](PROJECT_OVERVIEW.md)
	High-level project introduction and goals

	#### [API Reference](API_REFERENCE.md)
	Complete endpoint documentation with examples

	## 🌟 Key Achievements & Features

	### Custom HuggingFace Models Integration
	- Unified Repository: `sematech/sema-utils` containing all models
	- Optimized Performance: CTranslate2 INT8 quantization (75% size reduction)
	- Automatic Updates: HuggingFace Hub integration with version management
	- Enterprise Caching: Intelligent model caching and loading strategies

	### Enhanced African Language Support
	- 55+ African Languages: Complete FLORES-200 African language coverage
	- Regional Distribution: West, East, Southern, Central, and North Africa
	- Multiple Scripts: Latin, Arabic, Ethiopic, Tifinagh support
	- Cultural Context: Native names and regional information

	### Performance & Monitoring
	- Server-Side Timing: Request performance tracking in headers and responses
	- Prometheus Metrics: Comprehensive monitoring and analytics
	- Request Tracking: Unique request IDs for debugging
	- Health Monitoring: System status and model availability checks

	## 🔧 Technical Implementation Highlights

	### Model Architecture
	```
	Custom HuggingFace Models (sematech/sema-utils)
	├── Translation: NLLB-200 3.3B (CTranslate2 optimized)
	├── Language Detection: FastText LID.176
	├── Tokenization: SentencePiece
	└── Language Database: FLORES-200 complete
	```

	### Performance Metrics
	- Model Size: 2.5GB (optimized from 6.6GB)
	- Inference Speed: 0.2-2.5 seconds depending on text length
	- Memory Usage: ~3.2GB for models, 50-100MB per request
	- Language Detection: 0.01-0.05 seconds with 99%+ accuracy

	### API Enhancements
	- Request Timing: Server-side performance measurement
	- Language Metadata: Complete language information system
	- Error Handling: Comprehensive validation and error responses
	- Rate Limiting: 60 requests/minute with graceful degradation

	## 🚀 Quick Start Examples

	### Basic Translation with Timing
	```bash
	curl -v -X POST "https://sematech-sema-api.hf.space/api/v1/translate" \
	-H "Content-Type: application/json" \
	-d '{"text": "Habari ya asubuhi", "target_language": "eng_Latn"}'

	# Response includes timing information:
	# X-Response-Time: 1.234s
	# X-Request-ID: 550e8400-e29b-41d4-a716-446655440000
	```

	### African Languages Discovery
	```bash
	# Get all 55+ African languages
	curl "https://sematech-sema-api.hf.space/api/v1/languages/african"

	# Search for specific African languages
	curl "https://sematech-sema-api.hf.space/api/v1/languages/search?q=Akan"
	curl "https://sematech-sema-api.hf.space/api/v1/languages/search?q=Bambara"
	```

	### Language Detection with Confidence
	```bash
	curl -X POST "https://sematech-sema-api.hf.space/api/v1/detect-language" \
	-H "Content-Type: application/json" \
	-d '{"text": "Habari ya asubuhi"}'

	# Returns: detected language, confidence score, timing information
	```

	## 🎯 Application Use Cases

	### 1. Multilingual Chatbot Implementation
	```python
	async def process_user_input(user_text):
	# 1. Detect language
	detection = await detect_language(user_text)

	# 2. Decide processing flow
	if detection.is_english:
	response = await llm_chat(user_text)
	else:
	# Translate → Process → Translate back
	english_input = await translate(user_text, "eng_Latn")
	english_response = await llm_chat(english_input)
	response = await translate(english_response, detection.detected_language)

	return response
	```

	### 2. African News Platform
	- Aggregate news from multiple African countries
	- Translate between African languages
	- Provide summaries in user's preferred language

	### 3. Educational Platform
	- Interactive language learning with African languages
	- Cultural context and pronunciation guides
	- Progress tracking across multiple languages

	### 4. Government Services
	- Multilingual official document translation
	- Emergency notifications in local languages
	- Citizen services in preferred languages

	## 📊 API Statistics & Metrics

	### Language Coverage
	- Total Languages: 200+ (FLORES-200 complete)
	- African Languages: 55+ (updated from 23)
	- Writing Scripts: Latin, Arabic, Ethiopic, Tifinagh, Cyrillic, Han, etc.
	- Geographic Regions: Comprehensive global coverage

	### Performance Benchmarks
	- Translation Speed: 0.2-2.5s depending on text length
	- Language Detection: 0.01-0.05s with 99%+ accuracy
	- Model Efficiency: 75% size reduction with maintained quality
	- Concurrent Handling: Linear scaling with available resources

	### Quality Metrics
	- BLEU Scores: Industry-standard translation quality
	- African Languages: Specialized cultural context preservation
	- Uptime: 99.9% target availability
	- Error Rate: <1% under normal load

	## 🔮 Future Roadmap

	### Immediate (3-6 months)
	- User authentication and usage tracking
	- Database integration with PostgreSQL
	- Redis caching for improved performance
	- Advanced monitoring dashboards

	### Medium-term (6-12 months)
	- Document translation with formatting preservation
	- Real-time translation streaming via WebSocket
	- Domain-specific models (medical, legal, technical)
	- Mobile SDK development

	### Long-term (1-2 years)
	- AI-powered translation ecosystem
	- Enterprise integration platform
	- African language research contributions
	- Voice-to-voice translation capabilities

	## 🛠️ Development & Deployment

	### Local Development
	```bash
	# Clone and setup
	git clone https://github.com/lewiskimaru/sema.git
	cd sema/backend/sema-api

	# Install dependencies
	pip install -r requirements.txt

	# Run locally
	uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
	```

	### Testing
	```bash
	# Run comprehensive tests
	python tests/test_african_languages_update.py
	python tests/test_performance_timing.py
	python tests/simple_test.py
	```

	### Deployment
	- Platform: HuggingFace Spaces
	- Auto-deployment: Git integration
	- Model Updates: Automatic from `sematech/sema-utils`
	- Monitoring: Prometheus metrics and health checks

	## 📞 Support & Resources

	### Documentation Links
	- Live API: https://sematech-sema-api.hf.space
	- Interactive Docs: https://sematech-sema-api.hf.space/ (Swagger UI)
	- Health Status: https://sematech-sema-api.hf.space/health
	- Metrics: https://sematech-sema-api.hf.space/metrics

	### Model Repository
	- HuggingFace: https://huggingface.co/sematech/sema-utils
	- Model Documentation: Comprehensive model usage and optimization guides
	- Version History: Track model updates and improvements

	### Community & Support
	- GitHub Repository: Complete source code and issue tracking
	- Model Contributions: Community-driven improvements
	- Research Collaboration: Academic partnerships for African language research

	---

	The Sema Translation API represents a significant advancement in African language technology, combining custom HuggingFace models with enterprise-grade infrastructure to serve diverse global communities.

	Documentation last updated: June 2024 \| API Version: 2.0.0