kamau1 commited on
Commit
a7d24e3
·
1 Parent(s): 7dd7ca0

update: Fastapi codebase structure with api endpoints

Browse files
Dockerfile CHANGED
@@ -45,11 +45,11 @@ RUN pip install --no-cache-dir --user -r requirements.txt
45
  COPY --chown=user --from=model-builder /root/.cache/huggingface $HOME/.cache/huggingface
46
 
47
  # Copy the application code
48
- COPY --chown=user ./sema_translation_api.py sema_translation_api.py
49
 
50
  # Expose port 7860 (HuggingFace Spaces standard)
51
  EXPOSE 7860
52
 
53
  # Tell uvicorn to run on port 7860, which is the standard for HF Spaces
54
  # Use 0.0.0.0 to make it accessible from outside the container
55
- CMD ["uvicorn", "sema_translation_api:app", "--host", "0.0.0.0", "--port", "7860"]
 
45
  COPY --chown=user --from=model-builder /root/.cache/huggingface $HOME/.cache/huggingface
46
 
47
  # Copy the application code
48
+ COPY --chown=user ./app app
49
 
50
  # Expose port 7860 (HuggingFace Spaces standard)
51
  EXPOSE 7860
52
 
53
  # Tell uvicorn to run on port 7860, which is the standard for HF Spaces
54
  # Use 0.0.0.0 to make it accessible from outside the container
55
+ CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -6,46 +6,103 @@ colorTo: green
6
  sdk: docker
7
  pinned: false
8
  license: mit
9
- short_description: Translation API using consolidated sema-utils models
10
  ---
11
 
12
  # Sema Translation API 🌍
13
 
14
- A powerful translation API that supports multiple African languages using the consolidated `sematech/sema-utils` model repository.
15
 
16
- ## Features
17
 
 
18
  - **Automatic Language Detection**: Detects source language automatically if not provided
19
- - **Multi-language Support**: Supports 200+ languages via FLORES-200 codes
20
- - **Fast Translation**: Uses CTranslate2 for optimized inference
21
- - **RESTful API**: Clean FastAPI interface with automatic documentation
22
- - **Consolidated Models**: Uses models from the unified `sematech/sema-utils` repository
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23
 
24
- ## API Endpoints
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
- ### `GET /`
27
- Health check endpoint that returns API status and version information.
 
28
 
29
- ### `POST /translate`
30
- Main translation endpoint that accepts:
31
 
32
- **Request Body:**
33
  ```json
34
  {
35
  "text": "Habari ya asubuhi",
36
  "target_language": "eng_Latn",
37
- "source_language": "swh_Latn"
38
  }
39
  ```
40
 
41
- **Response:**
42
  ```json
43
  {
44
  "translated_text": "Good morning",
45
  "source_language": "swh_Latn",
46
  "target_language": "eng_Latn",
47
  "inference_time": 0.234,
48
- "timestamp": "Monday | 2024-06-21 | 14:30:25"
 
 
49
  }
50
  ```
51
 
 
6
  sdk: docker
7
  pinned: false
8
  license: mit
9
+ short_description: Enterprise-grade translation API with 200+ language support
10
  ---
11
 
12
  # Sema Translation API 🌍
13
 
14
+ Enterprise-grade translation API supporting 200+ languages with automatic language detection, rate limiting, usage tracking, and comprehensive monitoring. Built with FastAPI and powered by the consolidated `sematech/sema-utils` model repository.
15
 
16
+ ## 🚀 Features
17
 
18
+ ### Core Translation
19
  - **Automatic Language Detection**: Detects source language automatically if not provided
20
+ - **200+ Language Support**: Supports all FLORES-200 language codes
21
+ - **High-Performance Translation**: Uses CTranslate2 for optimized inference
22
+ - **Character Count Tracking**: Monitors usage for billing and analytics
23
+
24
+ ### Enterprise Features
25
+ - **Rate Limiting**: 60 requests/minute, 1000 requests/hour per IP
26
+ - **Request Tracking**: Unique request IDs for debugging and monitoring
27
+ - **Usage Analytics**: Comprehensive metrics with Prometheus integration
28
+ - **Structured Logging**: JSON-formatted logs for easy parsing
29
+ - **Health Monitoring**: Detailed health checks for system monitoring
30
+
31
+ ### Security & Reliability
32
+ - **Input Validation**: Comprehensive request validation with Pydantic
33
+ - **Error Handling**: Graceful error handling with detailed error responses
34
+ - **CORS Support**: Configurable cross-origin resource sharing
35
+ - **Future-Ready Auth**: Designed for Supabase authentication integration
36
+
37
+ ### API Quality
38
+ - **OpenAPI Documentation**: Auto-generated Swagger UI and ReDoc
39
+ - **Type Safety**: Full TypeScript-compatible API schemas
40
+ - **Production Ready**: Follows FastAPI production best practices
41
+
42
+ ## 📁 Project Structure
43
 
44
+ ```
45
+ app/
46
+ ├── __init__.py
47
+ ├── main.py # Application entry point
48
+ ├── api/ # API route definitions
49
+ │ ├── __init__.py
50
+ │ └── v1/ # Versioned API routes
51
+ │ ├── __init__.py
52
+ │ └── endpoints.py # Route handlers
53
+ ├── core/ # Core configuration
54
+ │ ├── __init__.py
55
+ │ ├── config.py # Settings and configuration
56
+ │ ├── logging.py # Logging configuration
57
+ │ └── metrics.py # Prometheus metrics
58
+ ├── middleware/ # Custom middleware
59
+ │ ├── __init__.py
60
+ │ └── request_middleware.py # Request tracking middleware
61
+ ├── models/ # Data models
62
+ │ ├── __init__.py
63
+ │ └── schemas.py # Pydantic models
64
+ ├── services/ # Business logic
65
+ │ ├── __init__.py
66
+ │ └── translation.py # Translation service
67
+ └── utils/ # Utility functions
68
+ ├── __init__.py
69
+ └── helpers.py # Helper functions
70
+ ```
71
+
72
+ ## 🔗 API Endpoints
73
+
74
+ ### Health & Monitoring
75
+ - **`GET /`** - Basic health check
76
+ - **`GET /health`** - Detailed health monitoring
77
+ - **`GET /metrics`** - Prometheus metrics
78
+ - **`GET /docs`** - Swagger UI documentation
79
+ - **`GET /redoc`** - ReDoc documentation
80
 
81
+ ### Translation
82
+ - **`POST /translate`** - Main translation endpoint
83
+ - **`POST /api/v1/translate`** - Versioned translation endpoint
84
 
85
+ ### Request/Response Examples
 
86
 
87
+ **Translation Request:**
88
  ```json
89
  {
90
  "text": "Habari ya asubuhi",
91
  "target_language": "eng_Latn",
92
+ "source_language": "swh_Latn" // Optional
93
  }
94
  ```
95
 
96
+ **Translation Response:**
97
  ```json
98
  {
99
  "translated_text": "Good morning",
100
  "source_language": "swh_Latn",
101
  "target_language": "eng_Latn",
102
  "inference_time": 0.234,
103
+ "character_count": 17,
104
+ "timestamp": "Monday | 2024-06-21 | 14:30:25",
105
+ "request_id": "550e8400-e29b-41d4-a716-446655440000"
106
  }
107
  ```
108
 
app/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # Sema Translation API Package
app/api/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # API routes and endpoints
app/api/v1/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # API v1 routes
app/api/v1/endpoints.py ADDED
@@ -0,0 +1,579 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ API v1 endpoints
3
+ """
4
+
5
+ import time
6
+ from fastapi import APIRouter, HTTPException, Request
7
+ from slowapi import Limiter
8
+ from slowapi.util import get_remote_address
9
+ from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
10
+ from fastapi.responses import Response
11
+
12
+ from ...models.schemas import (
13
+ TranslationRequest,
14
+ TranslationResponse,
15
+ HealthResponse,
16
+ LanguagesResponse,
17
+ LanguageStatsResponse,
18
+ LanguageInfo
19
+ )
20
+ from ...services.translation import (
21
+ translate_with_detection,
22
+ translate_with_source,
23
+ models_loaded
24
+ )
25
+ from ...services.languages import (
26
+ get_all_languages,
27
+ get_languages_by_region,
28
+ get_language_info,
29
+ is_language_supported,
30
+ get_popular_languages,
31
+ get_african_languages,
32
+ search_languages,
33
+ get_language_statistics
34
+ )
35
+ from ...core.config import settings
36
+ from ...core.logging import get_logger
37
+ from ...core.metrics import TRANSLATION_COUNT, CHARACTER_COUNT, ERROR_COUNT
38
+ from ...utils.helpers import get_nairobi_time
39
+
40
+ logger = get_logger()
41
+ limiter = Limiter(key_func=get_remote_address)
42
+
43
+ # Application start time for uptime calculation
44
+ app_start_time = time.time()
45
+
46
+ # Create router
47
+ router = APIRouter()
48
+
49
+
50
+ @router.get(
51
+ "/",
52
+ response_model=HealthResponse,
53
+ tags=["Health & Monitoring"],
54
+ summary="Basic Health Check",
55
+ description="Quick health check endpoint that returns basic API status information."
56
+ )
57
+ async def root():
58
+ """
59
+ ## Basic Health Check
60
+
61
+ Returns essential API status information including:
62
+ - ✅ API operational status
63
+ - 📦 Model loading status
64
+ - ⏱️ System uptime
65
+ - 🏷️ API version
66
+
67
+ **Use this endpoint for:**
68
+ - Load balancer health checks
69
+ - Basic monitoring
70
+ - API availability verification
71
+ """
72
+ uptime = time.time() - app_start_time
73
+ full_date, _ = get_nairobi_time()
74
+
75
+ return HealthResponse(
76
+ status="healthy" if models_loaded() else "degraded",
77
+ version=settings.app_version,
78
+ models_loaded=models_loaded(),
79
+ uptime=uptime,
80
+ timestamp=full_date
81
+ )
82
+
83
+
84
+ @router.get(
85
+ "/health",
86
+ response_model=HealthResponse,
87
+ tags=["Health & Monitoring"],
88
+ summary="Detailed Health Check",
89
+ description="Comprehensive health check with detailed system status for monitoring systems.",
90
+ responses={
91
+ 200: {"description": "System is healthy"},
92
+ 503: {"description": "System is unhealthy - models not loaded"}
93
+ }
94
+ )
95
+ async def health_check():
96
+ """
97
+ ## Detailed Health Check
98
+
99
+ Comprehensive health check endpoint designed for monitoring systems like:
100
+ - 📊 Prometheus/Grafana
101
+ - 🚨 Alerting systems
102
+ - 🔍 APM tools
103
+ - 🏥 Health monitoring dashboards
104
+
105
+ **Returns detailed information about:**
106
+ - System health status
107
+ - Model loading status
108
+ - API uptime
109
+ - Timestamp information
110
+
111
+ **HTTP Status Codes:**
112
+ - `200`: All systems operational
113
+ - `503`: Service unavailable (models not loaded)
114
+ """
115
+ uptime = time.time() - app_start_time
116
+ full_date, _ = get_nairobi_time()
117
+
118
+ # Perform additional health checks here
119
+ models_healthy = models_loaded()
120
+
121
+ return HealthResponse(
122
+ status="healthy" if models_healthy else "unhealthy",
123
+ version=settings.app_version,
124
+ models_loaded=models_healthy,
125
+ uptime=uptime,
126
+ timestamp=full_date
127
+ )
128
+
129
+
130
+ @router.get(
131
+ "/metrics",
132
+ tags=["Health & Monitoring"],
133
+ summary="Prometheus Metrics",
134
+ description="Prometheus-compatible metrics endpoint for monitoring and alerting.",
135
+ responses={
136
+ 200: {"description": "Metrics in Prometheus format", "content": {"text/plain": {}}},
137
+ 404: {"description": "Metrics disabled"}
138
+ }
139
+ )
140
+ async def get_metrics():
141
+ """
142
+ ## Prometheus Metrics
143
+
144
+ Returns metrics in Prometheus format for monitoring and alerting.
145
+
146
+ **Available Metrics:**
147
+ - 📊 `sema_requests_total` - Total API requests by endpoint and status
148
+ - ⏱️ `sema_request_duration_seconds` - Request duration histogram
149
+ - 🌍 `sema_translations_total` - Translation count by language pair
150
+ - 📝 `sema_characters_translated_total` - Total characters translated
151
+ - ❌ `sema_errors_total` - Error count by type
152
+
153
+ **Integration Examples:**
154
+ ```yaml
155
+ # Prometheus scrape config
156
+ scrape_configs:
157
+ - job_name: 'sema-api'
158
+ static_configs:
159
+ - targets: ['your-api-url:port']
160
+ metrics_path: '/metrics'
161
+ ```
162
+ """
163
+ if not settings.enable_metrics:
164
+ raise HTTPException(status_code=404, detail="Metrics disabled")
165
+
166
+ return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
167
+
168
+
169
+ @router.post(
170
+ "/translate",
171
+ response_model=TranslationResponse,
172
+ tags=["Translation"],
173
+ summary="Translate Text",
174
+ description="Translate text between 200+ languages with automatic language detection.",
175
+ responses={
176
+ 200: {"description": "Translation successful"},
177
+ 400: {"description": "Invalid request - empty text or invalid language code"},
178
+ 413: {"description": "Text too long - exceeds character limit"},
179
+ 429: {"description": "Rate limit exceeded"},
180
+ 500: {"description": "Translation service error"}
181
+ }
182
+ )
183
+ @limiter.limit(f"{settings.max_requests_per_minute}/minute")
184
+ async def translate_endpoint(
185
+ request: TranslationRequest,
186
+ http_request: Request
187
+ ):
188
+ """
189
+ ## 🌍 Translate Text
190
+
191
+ Translate text between 200+ languages using state-of-the-art neural machine translation.
192
+
193
+ ### ✨ Features
194
+ - **Automatic Language Detection**: Leave `source_language` empty for auto-detection
195
+ - **200+ Languages**: Full FLORES-200 language support
196
+ - **High Performance**: Optimized CTranslate2 inference engine
197
+ - **Usage Tracking**: Character count and request metrics
198
+ - **Request Tracking**: Unique request IDs for debugging
199
+
200
+ ### 🔒 Limits & Constraints
201
+ - **Rate Limit**: 60 requests per minute per IP address
202
+ - **Character Limit**: Maximum 5000 characters per request
203
+ - **Language Codes**: Must use FLORES-200 format (e.g., `eng_Latn`, `swh_Latn`)
204
+
205
+ ### 📝 Language Code Examples
206
+ | Language | Code | Example |
207
+ |----------|------|---------|
208
+ | English | `eng_Latn` | "Hello world" |
209
+ | Swahili | `swh_Latn` | "Habari ya dunia" |
210
+ | French | `fra_Latn` | "Bonjour le monde" |
211
+ | Kikuyu | `kik_Latn` | "Wĩ mwega?" |
212
+ | Spanish | `spa_Latn` | "Hola mundo" |
213
+
214
+ ### 🚀 Usage Examples
215
+
216
+ **Auto-detect source language:**
217
+ ```json
218
+ {
219
+ "text": "Habari ya asubuhi",
220
+ "target_language": "eng_Latn"
221
+ }
222
+ ```
223
+
224
+ **Specify source language:**
225
+ ```json
226
+ {
227
+ "text": "Good morning",
228
+ "source_language": "eng_Latn",
229
+ "target_language": "swh_Latn"
230
+ }
231
+ ```
232
+
233
+ ### 📊 Response Information
234
+ The response includes:
235
+ - Translated text
236
+ - Detected/provided source language
237
+ - Character count for usage tracking
238
+ - Inference time for performance monitoring
239
+ - Unique request ID for debugging
240
+ - Timestamp in Nairobi timezone
241
+ """
242
+ request_id = http_request.state.request_id
243
+
244
+ # Validate text length
245
+ if len(request.text) > settings.max_text_length:
246
+ raise HTTPException(
247
+ status_code=413,
248
+ detail=f"Text too long. Maximum {settings.max_text_length} characters allowed."
249
+ )
250
+
251
+ full_date, _ = get_nairobi_time()
252
+ character_count = len(request.text)
253
+
254
+ # Log translation request
255
+ logger.info(
256
+ "translation_started",
257
+ request_id=request_id,
258
+ source_language=request.source_language,
259
+ target_language=request.target_language,
260
+ character_count=character_count
261
+ )
262
+
263
+ try:
264
+ if request.source_language:
265
+ # Use provided source language
266
+ translated_text, inference_time = translate_with_source(
267
+ request.text,
268
+ request.source_language,
269
+ request.target_language
270
+ )
271
+ source_lang = request.source_language
272
+ else:
273
+ # Auto-detect source language
274
+ source_lang, translated_text, inference_time = translate_with_detection(
275
+ request.text,
276
+ request.target_language
277
+ )
278
+
279
+ # Update metrics
280
+ TRANSLATION_COUNT.labels(
281
+ source_lang=source_lang,
282
+ target_lang=request.target_language
283
+ ).inc()
284
+
285
+ CHARACTER_COUNT.inc(character_count)
286
+
287
+ # Log successful translation
288
+ logger.info(
289
+ "translation_completed",
290
+ request_id=request_id,
291
+ source_language=source_lang,
292
+ target_language=request.target_language,
293
+ character_count=character_count,
294
+ inference_time=inference_time
295
+ )
296
+
297
+ return TranslationResponse(
298
+ translated_text=translated_text,
299
+ source_language=source_lang,
300
+ target_language=request.target_language,
301
+ inference_time=inference_time,
302
+ character_count=character_count,
303
+ timestamp=full_date,
304
+ request_id=request_id
305
+ )
306
+
307
+ except Exception as e:
308
+ # Log translation error
309
+ logger.error(
310
+ "translation_failed",
311
+ request_id=request_id,
312
+ error=str(e),
313
+ error_type=type(e).__name__,
314
+ source_language=request.source_language,
315
+ target_language=request.target_language
316
+ )
317
+
318
+ # Update error metrics
319
+ ERROR_COUNT.labels(error_type="translation_error").inc()
320
+
321
+ raise HTTPException(
322
+ status_code=500,
323
+ detail="Translation service temporarily unavailable. Please try again later."
324
+ )
325
+
326
+
327
+ @router.get(
328
+ "/languages",
329
+ response_model=LanguagesResponse,
330
+ tags=["Languages"],
331
+ summary="Get All Supported Languages",
332
+ description="Retrieve a complete list of all supported languages with metadata."
333
+ )
334
+ async def get_languages():
335
+ """
336
+ ## 🌍 Get All Supported Languages
337
+
338
+ Returns a comprehensive list of all 200+ supported languages with detailed metadata.
339
+
340
+ ### 📋 Response Information
341
+ Each language includes:
342
+ - **English Name**: Standard English name
343
+ - **Native Name**: Name in the language's native script
344
+ - **Region**: Geographic region (Africa, Europe, Asia, etc.)
345
+ - **Script**: Writing system (Latin, Arabic, Cyrillic, etc.)
346
+
347
+ ### 🎯 Use Cases
348
+ - **Frontend Language Selectors**: Populate dropdown menus
349
+ - **API Integration**: Validate language codes before translation
350
+ - **Documentation**: Generate language support documentation
351
+ - **Analytics**: Track language usage patterns
352
+
353
+ ### 📊 Language Coverage
354
+ - **African Languages**: 25+ languages including Swahili, Hausa, Yoruba
355
+ - **European Languages**: 40+ languages including major EU languages
356
+ - **Asian Languages**: 80+ languages including Chinese, Japanese, Hindi
357
+ - **Middle Eastern**: 15+ languages including Arabic, Hebrew, Persian
358
+ - **Americas**: 30+ languages including indigenous languages
359
+ """
360
+ languages = get_all_languages()
361
+ return LanguagesResponse(
362
+ languages={code: LanguageInfo(**info) for code, info in languages.items()},
363
+ total_count=len(languages)
364
+ )
365
+
366
+
367
+ @router.get(
368
+ "/languages/popular",
369
+ response_model=LanguagesResponse,
370
+ tags=["Languages"],
371
+ summary="Get Popular Languages",
372
+ description="Get the most commonly used languages for quick access."
373
+ )
374
+ async def get_popular_languages_endpoint():
375
+ """
376
+ ## ⭐ Get Popular Languages
377
+
378
+ Returns the most commonly requested languages for quick access and better UX.
379
+
380
+ ### 🔥 Included Languages
381
+ - **Global**: English, Spanish, French, German, Portuguese, Russian
382
+ - **Asian**: Chinese, Japanese, Korean, Hindi, Arabic
383
+ - **African**: Swahili, Hausa, Yoruba, Amharic, Somali, Kikuyu
384
+
385
+ ### 💡 Perfect For
386
+ - **Quick Selection**: Show popular options first
387
+ - **Mobile Apps**: Reduced list for smaller screens
388
+ - **Default Options**: Pre-populate common language pairs
389
+ """
390
+ languages = get_popular_languages()
391
+ return LanguagesResponse(
392
+ languages={code: LanguageInfo(**info) for code, info in languages.items()},
393
+ total_count=len(languages)
394
+ )
395
+
396
+
397
+ @router.get(
398
+ "/languages/african",
399
+ response_model=LanguagesResponse,
400
+ tags=["Languages"],
401
+ summary="Get African Languages",
402
+ description="Get all supported African languages."
403
+ )
404
+ async def get_african_languages_endpoint():
405
+ """
406
+ ## 🌍 Get African Languages
407
+
408
+ Returns all supported African languages - our specialty!
409
+
410
+ ### 🎯 Featured African Languages
411
+ - **East Africa**: Swahili, Kikuyu, Luo, Amharic, Somali, Tigrinya
412
+ - **West Africa**: Hausa, Yoruba, Igbo, Wolof, Lingala
413
+ - **Southern Africa**: Zulu, Xhosa, Afrikaans, Tswana, Sotho, Shona
414
+ - **Central Africa**: Lingala, Umbundu
415
+
416
+ ### ✨ Special Features
417
+ - High-quality translations for African languages
418
+ - Cultural context preservation
419
+ - Support for various scripts (Latin, Ethiopic)
420
+ """
421
+ languages = get_african_languages()
422
+ return LanguagesResponse(
423
+ languages={code: LanguageInfo(**info) for code, info in languages.items()},
424
+ total_count=len(languages)
425
+ )
426
+
427
+
428
+ @router.get(
429
+ "/languages/region/{region}",
430
+ response_model=LanguagesResponse,
431
+ tags=["Languages"],
432
+ summary="Get Languages by Region",
433
+ description="Get all languages from a specific geographic region."
434
+ )
435
+ async def get_languages_by_region_endpoint(region: str):
436
+ """
437
+ ## 🗺️ Get Languages by Region
438
+
439
+ Filter languages by geographic region for targeted language support.
440
+
441
+ ### 🌍 Available Regions
442
+ - **Africa**: African languages (Swahili, Hausa, Yoruba, etc.)
443
+ - **Europe**: European languages (English, French, German, etc.)
444
+ - **Asia**: Asian languages (Chinese, Japanese, Hindi, etc.)
445
+ - **Middle East**: Middle Eastern languages (Arabic, Hebrew, Persian, etc.)
446
+ - **Americas**: Languages from the Americas
447
+
448
+ ### 📍 Usage Examples
449
+ ```
450
+ GET /languages/region/Africa
451
+ GET /languages/region/Europe
452
+ GET /languages/region/Asia
453
+ ```
454
+ """
455
+ languages = get_languages_by_region(region)
456
+ if not languages:
457
+ raise HTTPException(
458
+ status_code=404,
459
+ detail=f"No languages found for region: {region}. Available regions: Africa, Europe, Asia, Middle East, Americas"
460
+ )
461
+
462
+ return LanguagesResponse(
463
+ languages={code: LanguageInfo(**info) for code, info in languages.items()},
464
+ total_count=len(languages)
465
+ )
466
+
467
+
468
+ @router.get(
469
+ "/languages/search",
470
+ response_model=LanguagesResponse,
471
+ tags=["Languages"],
472
+ summary="Search Languages",
473
+ description="Search for languages by name, native name, or language code."
474
+ )
475
+ async def search_languages_endpoint(q: str):
476
+ """
477
+ ## 🔍 Search Languages
478
+
479
+ Search for languages using flexible text matching.
480
+
481
+ ### 🎯 Search Capabilities
482
+ - **English Names**: "Swahili", "French", "Chinese"
483
+ - **Native Names**: "Kiswahili", "Français", "中文"
484
+ - **Language Codes**: "swh_Latn", "fra_Latn", "cmn_Hans"
485
+ - **Partial Matches**: "Span" matches "Spanish"
486
+
487
+ ### 💡 Perfect For
488
+ - **Autocomplete**: Real-time language search
489
+ - **User Input**: Find languages by any name variation
490
+ - **Validation**: Check if a language exists
491
+
492
+ ### 📝 Query Examples
493
+ ```
494
+ GET /languages/search?q=Swahili
495
+ GET /languages/search?q=中文
496
+ GET /languages/search?q=ara
497
+ ```
498
+ """
499
+ if not q or len(q.strip()) < 2:
500
+ raise HTTPException(
501
+ status_code=400,
502
+ detail="Search query must be at least 2 characters long"
503
+ )
504
+
505
+ languages = search_languages(q.strip())
506
+ return LanguagesResponse(
507
+ languages={code: LanguageInfo(**info) for code, info in languages.items()},
508
+ total_count=len(languages)
509
+ )
510
+
511
+
512
+ @router.get(
513
+ "/languages/stats",
514
+ response_model=LanguageStatsResponse,
515
+ tags=["Languages"],
516
+ summary="Get Language Statistics",
517
+ description="Get comprehensive statistics about supported languages."
518
+ )
519
+ async def get_language_stats():
520
+ """
521
+ ## 📊 Language Statistics
522
+
523
+ Get comprehensive statistics about our language support coverage.
524
+
525
+ ### 📈 Statistics Include
526
+ - **Total Languages**: Complete count of supported languages
527
+ - **Regional Distribution**: Languages per geographic region
528
+ - **Script Coverage**: Number of writing systems supported
529
+ - **Detailed Breakdown**: Languages by region with counts
530
+
531
+ ### 🎯 Use Cases
532
+ - **Analytics Dashboards**: Display language coverage metrics
533
+ - **Marketing Materials**: Showcase translation capabilities
534
+ - **API Documentation**: Provide coverage statistics
535
+ - **Business Intelligence**: Track language support growth
536
+ """
537
+ stats = get_language_statistics()
538
+ return LanguageStatsResponse(**stats)
539
+
540
+
541
+ @router.get(
542
+ "/languages/{language_code}",
543
+ response_model=LanguageInfo,
544
+ tags=["Languages"],
545
+ summary="Get Language Information",
546
+ description="Get detailed information about a specific language."
547
+ )
548
+ async def get_language_info_endpoint(language_code: str):
549
+ """
550
+ ## 🔍 Get Language Information
551
+
552
+ Get detailed metadata about a specific language using its FLORES-200 code.
553
+
554
+ ### 📋 Information Provided
555
+ - **English Name**: Standard English name
556
+ - **Native Name**: Name in native script
557
+ - **Region**: Geographic region
558
+ - **Script**: Writing system used
559
+
560
+ ### 🎯 Use Cases
561
+ - **Language Validation**: Check if a code is supported
562
+ - **UI Display**: Show language names in interfaces
563
+ - **Documentation**: Generate language-specific docs
564
+
565
+ ### 📝 Example Codes
566
+ ```
567
+ GET /languages/swh_Latn # Swahili
568
+ GET /languages/eng_Latn # English
569
+ GET /languages/cmn_Hans # Chinese (Simplified)
570
+ ```
571
+ """
572
+ language_info = get_language_info(language_code)
573
+ if not language_info:
574
+ raise HTTPException(
575
+ status_code=404,
576
+ detail=f"Language code '{language_code}' not supported. Use /languages to see all supported languages."
577
+ )
578
+
579
+ return LanguageInfo(**language_info)
app/core/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # Core configuration and settings
app/core/config.py ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Application configuration and settings
3
+ """
4
+
5
+ from typing import List
6
+ from pydantic_settings import BaseSettings
7
+
8
+
9
+ class Settings(BaseSettings):
10
+ """Application settings and configuration"""
11
+
12
+ # Application Info
13
+ app_name: str = "Sema Translation API"
14
+ app_version: str = "2.0.0"
15
+ description: str = "Enterprise-grade translation API supporting 200+ languages"
16
+ environment: str = "development"
17
+ debug: bool = True
18
+
19
+ # API Configuration
20
+ max_text_length: int = 5000
21
+ max_requests_per_minute: int = 60
22
+ max_requests_per_hour: int = 1000
23
+
24
+ # Security
25
+ allowed_hosts: List[str] = ["*"]
26
+ cors_origins: List[str] = ["*"]
27
+
28
+ # Models
29
+ model_repo_id: str = "sematech/sema-utils"
30
+ translation_model: str = "sematrans-3.3B"
31
+ beam_size: int = 1
32
+ device: str = "cpu"
33
+
34
+ # Monitoring
35
+ enable_metrics: bool = True
36
+ log_level: str = "INFO"
37
+
38
+ class Config:
39
+ env_file = ".env"
40
+ env_prefix = "SEMA_"
41
+
42
+
43
+ # Global settings instance
44
+ settings = Settings()
app/core/logging.py ADDED
@@ -0,0 +1,33 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Logging configuration and setup
3
+ """
4
+
5
+ import structlog
6
+ from .config import settings
7
+
8
+
9
+ def configure_logging():
10
+ """Configure structured logging for the application"""
11
+
12
+ structlog.configure(
13
+ processors=[
14
+ structlog.stdlib.filter_by_level,
15
+ structlog.stdlib.add_logger_name,
16
+ structlog.stdlib.add_log_level,
17
+ structlog.stdlib.PositionalArgumentsFormatter(),
18
+ structlog.processors.TimeStamper(fmt="iso"),
19
+ structlog.processors.StackInfoRenderer(),
20
+ structlog.processors.format_exc_info,
21
+ structlog.processors.UnicodeDecoder(),
22
+ structlog.processors.JSONRenderer()
23
+ ],
24
+ context_class=dict,
25
+ logger_factory=structlog.stdlib.LoggerFactory(),
26
+ wrapper_class=structlog.stdlib.BoundLogger,
27
+ cache_logger_on_first_use=True,
28
+ )
29
+
30
+
31
+ def get_logger():
32
+ """Get a configured logger instance"""
33
+ return structlog.get_logger()
app/core/metrics.py ADDED
@@ -0,0 +1,38 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Prometheus metrics configuration
3
+ """
4
+
5
+ from prometheus_client import Counter, Histogram
6
+
7
+
8
+ # Request metrics
9
+ REQUEST_COUNT = Counter(
10
+ 'sema_requests_total',
11
+ 'Total requests',
12
+ ['method', 'endpoint', 'status']
13
+ )
14
+
15
+ REQUEST_DURATION = Histogram(
16
+ 'sema_request_duration_seconds',
17
+ 'Request duration',
18
+ ['method', 'endpoint']
19
+ )
20
+
21
+ # Translation metrics
22
+ TRANSLATION_COUNT = Counter(
23
+ 'sema_translations_total',
24
+ 'Total translations',
25
+ ['source_lang', 'target_lang']
26
+ )
27
+
28
+ CHARACTER_COUNT = Counter(
29
+ 'sema_characters_translated_total',
30
+ 'Total characters translated'
31
+ )
32
+
33
+ # Error metrics
34
+ ERROR_COUNT = Counter(
35
+ 'sema_errors_total',
36
+ 'Total errors',
37
+ ['error_type']
38
+ )
app/main.py ADDED
@@ -0,0 +1,168 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Sema Translation API - Main Application
3
+ Enterprise-grade translation API with proper FastAPI structure
4
+ """
5
+
6
+ from fastapi import FastAPI
7
+ from fastapi.middleware.cors import CORSMiddleware
8
+ from fastapi.middleware.trustedhost import TrustedHostMiddleware
9
+ from slowapi import _rate_limit_exceeded_handler
10
+ from slowapi.errors import RateLimitExceeded
11
+
12
+ from .core.config import settings
13
+ from .core.logging import configure_logging, get_logger
14
+ from .middleware.request_middleware import request_middleware
15
+ from .services.translation import load_models
16
+ from .api.v1.endpoints import router as v1_router, limiter
17
+
18
+ # Configure logging
19
+ configure_logging()
20
+ logger = get_logger()
21
+
22
+
23
+ def create_application() -> FastAPI:
24
+ """Create and configure the FastAPI application"""
25
+
26
+ app = FastAPI(
27
+ title=settings.app_name,
28
+ description="""
29
+ ## 🌍 Enterprise Translation API
30
+
31
+ A powerful, production-ready translation API supporting 200+ languages with automatic language detection.
32
+
33
+ ### 🚀 Key Features
34
+ - **Automatic Language Detection**: Detects source language if not provided
35
+ - **200+ Language Support**: Full FLORES-200 language code support
36
+ - **Rate Limiting**: 60 requests/minute per IP address
37
+ - **Usage Tracking**: Character count and request metrics
38
+ - **High Performance**: CTranslate2 optimized inference
39
+ - **Enterprise Monitoring**: Prometheus metrics and structured logging
40
+
41
+ ### 🔒 Rate Limits
42
+ - **Per IP**: 60 requests per minute
43
+ - **Character Limit**: 5000 characters per request
44
+ - **Concurrent Requests**: Async processing for optimal performance
45
+
46
+ ### 📊 Monitoring
47
+ - **Health Checks**: `/health` endpoint for system monitoring
48
+ - **Metrics**: `/metrics` endpoint for Prometheus integration
49
+ - **Request Tracking**: Unique request IDs for debugging
50
+
51
+ ### 🌐 Language Support
52
+ Supports all FLORES-200 language codes including:
53
+ - **African Languages**: Swahili (swh_Latn), Kikuyu (kik_Latn), Luo (luo_Latn)
54
+ - **European Languages**: English (eng_Latn), French (fra_Latn), Spanish (spa_Latn)
55
+ - **And 190+ more languages**
56
+
57
+ ### 📝 Usage Examples
58
+ ```bash
59
+ # Basic translation with auto-detection
60
+ curl -X POST "/translate" \\
61
+ -H "Content-Type: application/json" \\
62
+ -d '{"text": "Habari ya asubuhi", "target_language": "eng_Latn"}'
63
+
64
+ # Translation with specified source language
65
+ curl -X POST "/translate" \\
66
+ -H "Content-Type: application/json" \\
67
+ -d '{"text": "Hello world", "source_language": "eng_Latn", "target_language": "swh_Latn"}'
68
+ ```
69
+ """,
70
+ version=settings.app_version,
71
+ docs_url="/docs",
72
+ redoc_url="/redoc",
73
+ openapi_url="/openapi.json",
74
+ contact={
75
+ "name": "Sema AI Team",
76
+ "url": "https://github.com/lewiskimaru/sema",
77
+ "email": "support@sema.ai"
78
+ },
79
+ license_info={
80
+ "name": "MIT License",
81
+ "url": "https://opensource.org/licenses/MIT"
82
+ },
83
+ servers=[
84
+ {
85
+ "url": "https://sematech-sema-api.hf.space",
86
+ "description": "Production server"
87
+ },
88
+ {
89
+ "url": "http://localhost:8000",
90
+ "description": "Development server"
91
+ }
92
+ ]
93
+ )
94
+
95
+ # Add rate limiting
96
+ app.state.limiter = limiter
97
+ app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
98
+
99
+ # Security middleware
100
+ if settings.allowed_hosts != ["*"]:
101
+ app.add_middleware(TrustedHostMiddleware, allowed_hosts=settings.allowed_hosts)
102
+
103
+ # CORS middleware
104
+ app.add_middleware(
105
+ CORSMiddleware,
106
+ allow_origins=settings.cors_origins,
107
+ allow_credentials=True,
108
+ allow_methods=["GET", "POST", "OPTIONS"],
109
+ allow_headers=["*"],
110
+ )
111
+
112
+ # Request middleware
113
+ app.middleware("http")(request_middleware)
114
+
115
+ # Include API routes
116
+ app.include_router(v1_router, prefix="/api/v1")
117
+ app.include_router(v1_router) # Also include at root for backward compatibility
118
+
119
+ return app
120
+
121
+
122
+ # Create the application instance
123
+ app = create_application()
124
+
125
+
126
+ @app.on_event("startup")
127
+ async def startup_event():
128
+ """Initialize the application on startup"""
129
+ logger.info("application_startup", version=settings.app_version, environment=settings.environment)
130
+
131
+ print(f"\n🎵 Starting {settings.app_name} v{settings.app_version}")
132
+ print("🎼 Loading the Orchestra... 🦋")
133
+
134
+ try:
135
+ load_models()
136
+ logger.info("models_loaded_successfully")
137
+ print("🎉 API started successfully!")
138
+ print(f"📊 Metrics enabled: {settings.enable_metrics}")
139
+ print(f"🔒 Environment: {settings.environment}")
140
+ print(f"📝 Documentation: /docs")
141
+ print(f"📈 Metrics: /metrics")
142
+ print(f"❤️ Health: /health")
143
+ print(f"🔗 API v1: /api/v1/")
144
+ print()
145
+
146
+ except Exception as e:
147
+ logger.error("startup_failed", error=str(e))
148
+ print(f"❌ Startup failed: {e}")
149
+ raise
150
+
151
+
152
+ @app.on_event("shutdown")
153
+ async def shutdown_event():
154
+ """Cleanup on application shutdown"""
155
+ logger.info("application_shutdown")
156
+ print("\n👋 Shutting down Sema Translation API...")
157
+ print("🧹 Cleaning up resources...")
158
+ print("✅ Shutdown complete\n")
159
+
160
+
161
+ if __name__ == "__main__":
162
+ import uvicorn
163
+ uvicorn.run(
164
+ "app.main:app",
165
+ host="0.0.0.0",
166
+ port=8000,
167
+ reload=settings.debug
168
+ )
app/middleware/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # Middleware components
app/middleware/request_middleware.py ADDED
@@ -0,0 +1,78 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Request middleware for logging, metrics, and request tracking
3
+ """
4
+
5
+ import time
6
+ from fastapi import Request
7
+ from ..core.logging import get_logger
8
+ from ..core.metrics import REQUEST_COUNT, REQUEST_DURATION, ERROR_COUNT
9
+ from ..utils.helpers import generate_request_id
10
+
11
+ logger = get_logger()
12
+
13
+
14
+ async def request_middleware(request: Request, call_next):
15
+ """Middleware for request tracking, metrics, and logging"""
16
+ start_time = time.time()
17
+ request_id = generate_request_id()
18
+
19
+ # Add request ID to request state
20
+ request.state.request_id = request_id
21
+
22
+ # Log request
23
+ logger.info(
24
+ "request_started",
25
+ request_id=request_id,
26
+ method=request.method,
27
+ url=str(request.url),
28
+ client_ip=request.client.host if request.client else "unknown",
29
+ user_agent=request.headers.get("user-agent", "unknown")
30
+ )
31
+
32
+ try:
33
+ response = await call_next(request)
34
+
35
+ # Calculate duration
36
+ duration = time.time() - start_time
37
+
38
+ # Update metrics
39
+ REQUEST_COUNT.labels(
40
+ method=request.method,
41
+ endpoint=request.url.path,
42
+ status=response.status_code
43
+ ).inc()
44
+
45
+ REQUEST_DURATION.labels(
46
+ method=request.method,
47
+ endpoint=request.url.path
48
+ ).observe(duration)
49
+
50
+ # Log response
51
+ logger.info(
52
+ "request_completed",
53
+ request_id=request_id,
54
+ status_code=response.status_code,
55
+ duration=duration
56
+ )
57
+
58
+ # Add request ID to response headers
59
+ response.headers["X-Request-ID"] = request_id
60
+
61
+ return response
62
+
63
+ except Exception as e:
64
+ duration = time.time() - start_time
65
+
66
+ # Update error metrics
67
+ ERROR_COUNT.labels(error_type=type(e).__name__).inc()
68
+
69
+ # Log error
70
+ logger.error(
71
+ "request_failed",
72
+ request_id=request_id,
73
+ error=str(e),
74
+ error_type=type(e).__name__,
75
+ duration=duration
76
+ )
77
+
78
+ raise
app/models/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # Data models and schemas
app/models/schemas.py ADDED
@@ -0,0 +1,235 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Pydantic models for request/response validation
3
+ """
4
+
5
+ from typing import Optional, Dict
6
+ from pydantic import BaseModel, Field, validator
7
+
8
+
9
+ class TranslationRequest(BaseModel):
10
+ """
11
+ Translation request model
12
+
13
+ Validates input for the translation endpoint with proper FLORES-200 language codes.
14
+ """
15
+
16
+ text: str = Field(
17
+ ...,
18
+ example="Habari ya asubuhi",
19
+ description="Text to translate (1-5000 characters)",
20
+ min_length=1,
21
+ max_length=5000,
22
+ title="Input Text"
23
+ )
24
+ target_language: str = Field(
25
+ ...,
26
+ example="eng_Latn",
27
+ description="Target language in FLORES-200 format (e.g., eng_Latn for English)",
28
+ regex=r"^[a-z]{3}_[A-Z][a-z]{3}$",
29
+ title="Target Language Code"
30
+ )
31
+ source_language: Optional[str] = Field(
32
+ None,
33
+ example="swh_Latn",
34
+ description="Source language in FLORES-200 format. If not provided, language will be auto-detected",
35
+ regex=r"^[a-z]{3}_[A-Z][a-z]{3}$",
36
+ title="Source Language Code (Optional)"
37
+ )
38
+
39
+ class Config:
40
+ schema_extra = {
41
+ "examples": [
42
+ {
43
+ "summary": "Auto-detect source language",
44
+ "description": "Translate Swahili to English with automatic language detection",
45
+ "value": {
46
+ "text": "Habari ya asubuhi",
47
+ "target_language": "eng_Latn"
48
+ }
49
+ },
50
+ {
51
+ "summary": "Specify source language",
52
+ "description": "Translate English to Swahili with specified source language",
53
+ "value": {
54
+ "text": "Good morning",
55
+ "source_language": "eng_Latn",
56
+ "target_language": "swh_Latn"
57
+ }
58
+ },
59
+ {
60
+ "summary": "African language translation",
61
+ "description": "Translate Kikuyu to English",
62
+ "value": {
63
+ "text": "Wĩ mwega?",
64
+ "source_language": "kik_Latn",
65
+ "target_language": "eng_Latn"
66
+ }
67
+ }
68
+ ]
69
+ }
70
+
71
+ @validator('text')
72
+ def validate_text(cls, v):
73
+ if not v.strip():
74
+ raise ValueError('Text cannot be empty or only whitespace')
75
+ return v.strip()
76
+
77
+
78
+ class TranslationResponse(BaseModel):
79
+ """
80
+ Translation response model
81
+
82
+ Contains the translated text and metadata about the translation process.
83
+ """
84
+
85
+ translated_text: str = Field(
86
+ ...,
87
+ description="The translated text result",
88
+ example="Good morning",
89
+ title="Translated Text"
90
+ )
91
+ source_language: str = Field(
92
+ ...,
93
+ description="Detected or provided source language code",
94
+ example="swh_Latn",
95
+ title="Source Language"
96
+ )
97
+ target_language: str = Field(
98
+ ...,
99
+ description="Target language code as requested",
100
+ example="eng_Latn",
101
+ title="Target Language"
102
+ )
103
+ inference_time: float = Field(
104
+ ...,
105
+ description="Time taken for translation in seconds",
106
+ example=0.234,
107
+ ge=0,
108
+ title="Inference Time (seconds)"
109
+ )
110
+ character_count: int = Field(
111
+ ...,
112
+ description="Number of characters in the input text",
113
+ example=17,
114
+ ge=1,
115
+ title="Character Count"
116
+ )
117
+ timestamp: str = Field(
118
+ ...,
119
+ description="Timestamp of the translation in Nairobi timezone",
120
+ example="Monday | 2024-06-21 | 14:30:25",
121
+ title="Timestamp"
122
+ )
123
+ request_id: str = Field(
124
+ ...,
125
+ description="Unique request identifier for debugging and tracking",
126
+ example="550e8400-e29b-41d4-a716-446655440000",
127
+ title="Request ID"
128
+ )
129
+
130
+ class Config:
131
+ schema_extra = {
132
+ "example": {
133
+ "translated_text": "Good morning",
134
+ "source_language": "swh_Latn",
135
+ "target_language": "eng_Latn",
136
+ "inference_time": 0.234,
137
+ "character_count": 17,
138
+ "timestamp": "Monday | 2024-06-21 | 14:30:25",
139
+ "request_id": "550e8400-e29b-41d4-a716-446655440000"
140
+ }
141
+ }
142
+
143
+
144
+ class HealthResponse(BaseModel):
145
+ """Response model for health check endpoints"""
146
+
147
+ status: str = Field(..., description="API health status")
148
+ version: str = Field(..., description="API version")
149
+ models_loaded: bool = Field(..., description="Whether models are loaded")
150
+ uptime: float = Field(..., description="API uptime in seconds")
151
+ timestamp: str = Field(..., description="Current timestamp")
152
+
153
+
154
+ class ErrorResponse(BaseModel):
155
+ """Response model for error responses"""
156
+
157
+ error: str = Field(..., description="Error type")
158
+ message: str = Field(..., description="Error message")
159
+ request_id: str = Field(..., description="Request identifier")
160
+ timestamp: str = Field(..., description="Error timestamp")
161
+
162
+
163
+ class LanguageInfo(BaseModel):
164
+ """
165
+ Language information model
166
+
167
+ Contains metadata about a supported language.
168
+ """
169
+
170
+ name: str = Field(..., description="English name of the language", example="Swahili")
171
+ native_name: str = Field(..., description="Native name of the language", example="Kiswahili")
172
+ region: str = Field(..., description="Geographic region", example="Africa")
173
+ script: str = Field(..., description="Writing script", example="Latin")
174
+
175
+
176
+ class LanguagesResponse(BaseModel):
177
+ """
178
+ Languages list response model
179
+
180
+ Contains a dictionary of supported languages with their metadata.
181
+ """
182
+
183
+ languages: Dict[str, LanguageInfo] = Field(..., description="Dictionary of language codes to language info")
184
+ total_count: int = Field(..., description="Total number of languages")
185
+
186
+ class Config:
187
+ schema_extra = {
188
+ "example": {
189
+ "languages": {
190
+ "swh_Latn": {
191
+ "name": "Swahili",
192
+ "native_name": "Kiswahili",
193
+ "region": "Africa",
194
+ "script": "Latin"
195
+ },
196
+ "eng_Latn": {
197
+ "name": "English",
198
+ "native_name": "English",
199
+ "region": "Europe",
200
+ "script": "Latin"
201
+ }
202
+ },
203
+ "total_count": 2
204
+ }
205
+ }
206
+
207
+
208
+ class LanguageStatsResponse(BaseModel):
209
+ """
210
+ Language statistics response model
211
+
212
+ Contains statistics about supported languages.
213
+ """
214
+
215
+ total_languages: int = Field(..., description="Total number of supported languages")
216
+ regions: int = Field(..., description="Number of geographic regions covered")
217
+ scripts: int = Field(..., description="Number of writing scripts supported")
218
+ by_region: Dict[str, int] = Field(..., description="Language count by region")
219
+
220
+ class Config:
221
+ schema_extra = {
222
+ "example": {
223
+ "total_languages": 200,
224
+ "regions": 6,
225
+ "scripts": 15,
226
+ "by_region": {
227
+ "Africa": 25,
228
+ "Europe": 40,
229
+ "Asia": 80,
230
+ "Middle East": 15,
231
+ "Americas": 30,
232
+ "Oceania": 10
233
+ }
234
+ }
235
+ }
app/services/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # Business logic and services
app/services/languages.py ADDED
@@ -0,0 +1,162 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Language support service - provides information about supported languages
3
+ """
4
+
5
+ from typing import Dict, List, Optional
6
+ from ..core.logging import get_logger
7
+
8
+ logger = get_logger()
9
+
10
+ # FLORES-200 language codes with human-readable names and regions
11
+ SUPPORTED_LANGUAGES = {
12
+ # African Languages
13
+ "afr_Latn": {"name": "Afrikaans", "native_name": "Afrikaans", "region": "Africa", "script": "Latin"},
14
+ "amh_Ethi": {"name": "Amharic", "native_name": "አማርኛ", "region": "Africa", "script": "Ethiopic"},
15
+ "hau_Latn": {"name": "Hausa", "native_name": "Hausa", "region": "Africa", "script": "Latin"},
16
+ "ibo_Latn": {"name": "Igbo", "native_name": "Igbo", "region": "Africa", "script": "Latin"},
17
+ "kik_Latn": {"name": "Kikuyu", "native_name": "Gĩkũyũ", "region": "Africa", "script": "Latin"},
18
+ "lin_Latn": {"name": "Lingala", "native_name": "Lingála", "region": "Africa", "script": "Latin"},
19
+ "lug_Latn": {"name": "Luganda", "native_name": "Luganda", "region": "Africa", "script": "Latin"},
20
+ "luo_Latn": {"name": "Luo", "native_name": "Dholuo", "region": "Africa", "script": "Latin"},
21
+ "nya_Latn": {"name": "Chichewa", "native_name": "Chichewa", "region": "Africa", "script": "Latin"},
22
+ "orm_Latn": {"name": "Oromo", "native_name": "Afaan Oromoo", "region": "Africa", "script": "Latin"},
23
+ "sna_Latn": {"name": "Shona", "native_name": "ChiShona", "region": "Africa", "script": "Latin"},
24
+ "som_Latn": {"name": "Somali", "native_name": "Soomaali", "region": "Africa", "script": "Latin"},
25
+ "sot_Latn": {"name": "Southern Sotho", "native_name": "Sesotho", "region": "Africa", "script": "Latin"},
26
+ "ssw_Latn": {"name": "Swati", "native_name": "SiSwati", "region": "Africa", "script": "Latin"},
27
+ "swh_Latn": {"name": "Swahili", "native_name": "Kiswahili", "region": "Africa", "script": "Latin"},
28
+ "tir_Ethi": {"name": "Tigrinya", "native_name": "ትግርኛ", "region": "Africa", "script": "Ethiopic"},
29
+ "tsn_Latn": {"name": "Tswana", "native_name": "Setswana", "region": "Africa", "script": "Latin"},
30
+ "tso_Latn": {"name": "Tsonga", "native_name": "Xitsonga", "region": "Africa", "script": "Latin"},
31
+ "umb_Latn": {"name": "Umbundu", "native_name": "Umbundu", "region": "Africa", "script": "Latin"},
32
+ "wol_Latn": {"name": "Wolof", "native_name": "Wolof", "region": "Africa", "script": "Latin"},
33
+ "xho_Latn": {"name": "Xhosa", "native_name": "isiXhosa", "region": "Africa", "script": "Latin"},
34
+ "yor_Latn": {"name": "Yoruba", "native_name": "Yorùbá", "region": "Africa", "script": "Latin"},
35
+ "zul_Latn": {"name": "Zulu", "native_name": "isiZulu", "region": "Africa", "script": "Latin"},
36
+
37
+ # European Languages
38
+ "eng_Latn": {"name": "English", "native_name": "English", "region": "Europe", "script": "Latin"},
39
+ "fra_Latn": {"name": "French", "native_name": "Français", "region": "Europe", "script": "Latin"},
40
+ "deu_Latn": {"name": "German", "native_name": "Deutsch", "region": "Europe", "script": "Latin"},
41
+ "spa_Latn": {"name": "Spanish", "native_name": "Español", "region": "Europe", "script": "Latin"},
42
+ "ita_Latn": {"name": "Italian", "native_name": "Italiano", "region": "Europe", "script": "Latin"},
43
+ "por_Latn": {"name": "Portuguese", "native_name": "Português", "region": "Europe", "script": "Latin"},
44
+ "rus_Cyrl": {"name": "Russian", "native_name": "Русский", "region": "Europe", "script": "Cyrillic"},
45
+ "nld_Latn": {"name": "Dutch", "native_name": "Nederlands", "region": "Europe", "script": "Latin"},
46
+ "pol_Latn": {"name": "Polish", "native_name": "Polski", "region": "Europe", "script": "Latin"},
47
+ "ces_Latn": {"name": "Czech", "native_name": "Čeština", "region": "Europe", "script": "Latin"},
48
+ "hun_Latn": {"name": "Hungarian", "native_name": "Magyar", "region": "Europe", "script": "Latin"},
49
+ "ron_Latn": {"name": "Romanian", "native_name": "Română", "region": "Europe", "script": "Latin"},
50
+ "bul_Cyrl": {"name": "Bulgarian", "native_name": "Български", "region": "Europe", "script": "Cyrillic"},
51
+ "hrv_Latn": {"name": "Croatian", "native_name": "Hrvatski", "region": "Europe", "script": "Latin"},
52
+ "srp_Cyrl": {"name": "Serbian", "native_name": "Српски", "region": "Europe", "script": "Cyrillic"},
53
+ "slk_Latn": {"name": "Slovak", "native_name": "Slovenčina", "region": "Europe", "script": "Latin"},
54
+ "slv_Latn": {"name": "Slovenian", "native_name": "Slovenščina", "region": "Europe", "script": "Latin"},
55
+ "est_Latn": {"name": "Estonian", "native_name": "Eesti", "region": "Europe", "script": "Latin"},
56
+ "lav_Latn": {"name": "Latvian", "native_name": "Latviešu", "region": "Europe", "script": "Latin"},
57
+ "lit_Latn": {"name": "Lithuanian", "native_name": "Lietuvių", "region": "Europe", "script": "Latin"},
58
+
59
+ # Asian Languages
60
+ "cmn_Hans": {"name": "Chinese (Simplified)", "native_name": "中文 (简体)", "region": "Asia", "script": "Han"},
61
+ "cmn_Hant": {"name": "Chinese (Traditional)", "native_name": "中文 (繁體)", "region": "Asia", "script": "Han"},
62
+ "jpn_Jpan": {"name": "Japanese", "native_name": "日本語", "region": "Asia", "script": "Japanese"},
63
+ "kor_Hang": {"name": "Korean", "native_name": "한국어", "region": "Asia", "script": "Hangul"},
64
+ "hin_Deva": {"name": "Hindi", "native_name": "हिन्दी", "region": "Asia", "script": "Devanagari"},
65
+ "ben_Beng": {"name": "Bengali", "native_name": "বাংলা", "region": "Asia", "script": "Bengali"},
66
+ "urd_Arab": {"name": "Urdu", "native_name": "اردو", "region": "Asia", "script": "Arabic"},
67
+ "tam_Taml": {"name": "Tamil", "native_name": "தமிழ்", "region": "Asia", "script": "Tamil"},
68
+ "tel_Telu": {"name": "Telugu", "native_name": "తెలుగు", "region": "Asia", "script": "Telugu"},
69
+ "mar_Deva": {"name": "Marathi", "native_name": "मराठी", "region": "Asia", "script": "Devanagari"},
70
+ "guj_Gujr": {"name": "Gujarati", "native_name": "ગુજરાતી", "region": "Asia", "script": "Gujarati"},
71
+ "kan_Knda": {"name": "Kannada", "native_name": "ಕನ್ನಡ", "region": "Asia", "script": "Kannada"},
72
+ "mal_Mlym": {"name": "Malayalam", "native_name": "മലയാളം", "region": "Asia", "script": "Malayalam"},
73
+ "ori_Orya": {"name": "Odia", "native_name": "ଓଡ଼ିଆ", "region": "Asia", "script": "Odia"},
74
+ "pan_Guru": {"name": "Punjabi", "native_name": "ਪੰਜਾਬੀ", "region": "Asia", "script": "Gurmukhi"},
75
+ "tha_Thai": {"name": "Thai", "native_name": "ไทย", "region": "Asia", "script": "Thai"},
76
+ "vie_Latn": {"name": "Vietnamese", "native_name": "Tiếng Việt", "region": "Asia", "script": "Latin"},
77
+ "ind_Latn": {"name": "Indonesian", "native_name": "Bahasa Indonesia", "region": "Asia", "script": "Latin"},
78
+ "msa_Latn": {"name": "Malay", "native_name": "Bahasa Melayu", "region": "Asia", "script": "Latin"},
79
+ "tgl_Latn": {"name": "Tagalog", "native_name": "Tagalog", "region": "Asia", "script": "Latin"},
80
+
81
+ # Middle Eastern Languages
82
+ "ara_Arab": {"name": "Arabic", "native_name": "العربية", "region": "Middle East", "script": "Arabic"},
83
+ "heb_Hebr": {"name": "Hebrew", "native_name": "עברית", "region": "Middle East", "script": "Hebrew"},
84
+ "fas_Arab": {"name": "Persian", "native_name": "فارسی", "region": "Middle East", "script": "Arabic"},
85
+ "tur_Latn": {"name": "Turkish", "native_name": "Türkçe", "region": "Middle East", "script": "Latin"},
86
+
87
+ # Americas Languages
88
+ "spa_Latn": {"name": "Spanish", "native_name": "Español", "region": "Americas", "script": "Latin"},
89
+ "por_Latn": {"name": "Portuguese", "native_name": "Português", "region": "Americas", "script": "Latin"},
90
+ "eng_Latn": {"name": "English", "native_name": "English", "region": "Americas", "script": "Latin"},
91
+ "fra_Latn": {"name": "French", "native_name": "Français", "region": "Americas", "script": "Latin"},
92
+ }
93
+
94
+
95
+ def get_all_languages() -> Dict[str, Dict[str, str]]:
96
+ """Get all supported languages with their metadata"""
97
+ return SUPPORTED_LANGUAGES
98
+
99
+
100
+ def get_languages_by_region(region: str) -> Dict[str, Dict[str, str]]:
101
+ """Get languages filtered by region"""
102
+ return {
103
+ code: info for code, info in SUPPORTED_LANGUAGES.items()
104
+ if info["region"].lower() == region.lower()
105
+ }
106
+
107
+
108
+ def get_language_info(language_code: str) -> Optional[Dict[str, str]]:
109
+ """Get information about a specific language"""
110
+ return SUPPORTED_LANGUAGES.get(language_code)
111
+
112
+
113
+ def is_language_supported(language_code: str) -> bool:
114
+ """Check if a language code is supported"""
115
+ return language_code in SUPPORTED_LANGUAGES
116
+
117
+
118
+ def get_popular_languages() -> Dict[str, Dict[str, str]]:
119
+ """Get most commonly used languages"""
120
+ popular_codes = [
121
+ "eng_Latn", "spa_Latn", "fra_Latn", "deu_Latn", "ita_Latn", "por_Latn",
122
+ "rus_Cyrl", "cmn_Hans", "jpn_Jpan", "kor_Hang", "ara_Arab", "hin_Deva",
123
+ "swh_Latn", "hau_Latn", "yor_Latn", "amh_Ethi", "som_Latn", "kik_Latn"
124
+ ]
125
+ return {code: SUPPORTED_LANGUAGES[code] for code in popular_codes if code in SUPPORTED_LANGUAGES}
126
+
127
+
128
+ def get_african_languages() -> Dict[str, Dict[str, str]]:
129
+ """Get African languages specifically"""
130
+ return get_languages_by_region("Africa")
131
+
132
+
133
+ def search_languages(query: str) -> Dict[str, Dict[str, str]]:
134
+ """Search languages by name or native name"""
135
+ query_lower = query.lower()
136
+ results = {}
137
+
138
+ for code, info in SUPPORTED_LANGUAGES.items():
139
+ if (query_lower in info["name"].lower() or
140
+ query_lower in info["native_name"].lower() or
141
+ query_lower in code.lower()):
142
+ results[code] = info
143
+
144
+ return results
145
+
146
+
147
+ def get_language_statistics() -> Dict[str, int]:
148
+ """Get statistics about supported languages"""
149
+ stats = {
150
+ "total_languages": len(SUPPORTED_LANGUAGES),
151
+ "regions": len(set(info["region"] for info in SUPPORTED_LANGUAGES.values())),
152
+ "scripts": len(set(info["script"] for info in SUPPORTED_LANGUAGES.values()))
153
+ }
154
+
155
+ # Count by region
156
+ region_counts = {}
157
+ for info in SUPPORTED_LANGUAGES.values():
158
+ region = info["region"]
159
+ region_counts[region] = region_counts.get(region, 0) + 1
160
+
161
+ stats["by_region"] = region_counts
162
+ return stats
sema_translation_api.py → app/services/translation.py RENAMED
@@ -1,159 +1,110 @@
1
  """
2
- Sema Translation API - New Implementation
3
- Created for testing consolidated sema-utils repository
4
- Uses HuggingFace Hub for model downloading
5
  """
6
 
7
  import os
8
  import time
9
- from datetime import datetime
10
- import pytz
11
- from typing import Optional
12
-
13
- from fastapi import FastAPI, HTTPException, Request
14
- from fastapi.middleware.cors import CORSMiddleware
15
- from pydantic import BaseModel, Field
16
  from huggingface_hub import hf_hub_download
17
  import ctranslate2
18
  import sentencepiece as spm
19
  import fasttext
20
 
21
- # --- Data Models ---
22
- class TranslationRequest(BaseModel):
23
- text: str = Field(..., example="Habari ya asubuhi", description="Text to translate")
24
- target_language: str = Field(..., example="eng_Latn", description="FLORES-200 target language code")
25
- source_language: Optional[str] = Field(None, example="swh_Latn", description="Optional FLORES-200 source language code")
26
-
27
- class TranslationResponse(BaseModel):
28
- translated_text: str
29
- source_language: str
30
- target_language: str
31
- inference_time: float
32
- timestamp: str
33
-
34
- # --- FastAPI App Setup ---
35
- app = FastAPI(
36
- title="Sema Translation API",
37
- description="Translation API using consolidated sema-utils models from HuggingFace",
38
- version="2.0.0"
39
- )
40
-
41
- # CORS middleware
42
- app.add_middleware(
43
- CORSMiddleware,
44
- allow_origins=["*"],
45
- allow_credentials=False,
46
- allow_methods=["*"],
47
- allow_headers=["*"],
48
- )
49
-
50
- # --- Global Variables ---
51
- REPO_ID = "sematech/sema-utils"
52
- beam_size = 1
53
- device = "cpu"
54
-
55
- # Model instances (will be loaded on startup)
56
- lang_model = None
57
- sp_model = None
58
- translator = None
59
-
60
- def get_nairobi_time():
61
- """Get current time in Nairobi timezone"""
62
- nairobi_timezone = pytz.timezone('Africa/Nairobi')
63
- current_time_nairobi = datetime.now(nairobi_timezone)
64
-
65
- curr_day = current_time_nairobi.strftime('%A')
66
- curr_date = current_time_nairobi.strftime('%Y-%m-%d')
67
- curr_time = current_time_nairobi.strftime('%H:%M:%S')
68
-
69
- full_date = f"{curr_day} | {curr_date} | {curr_time}"
70
- return full_date, curr_time
71
-
72
- def get_model_paths():
73
  """Get model paths from HuggingFace cache (models pre-downloaded in Docker)"""
74
- print("🔄 Loading models from cache...")
75
 
76
  try:
77
  # Check if we're in offline mode (Docker environment)
78
  offline_mode = os.environ.get("HF_HUB_OFFLINE", "0") == "1"
79
 
80
  if offline_mode:
81
- print("📦 Running in offline mode - using cached models")
82
  # In offline mode, models are already downloaded and cached
83
- # We need to find them in the cache directory
84
-
85
- # Get paths from cache using hf_hub_download with local_files_only=True
86
  spm_path = hf_hub_download(
87
- repo_id=REPO_ID,
88
  filename="spm.model",
89
  local_files_only=True
90
  )
91
 
92
  ft_path = hf_hub_download(
93
- repo_id=REPO_ID,
94
  filename="lid218e.bin",
95
  local_files_only=True
96
  )
97
 
98
  # Get the translation model path
99
  model_bin_path = hf_hub_download(
100
- repo_id=REPO_ID,
101
- filename="translation_models/sematrans-3.3B/model.bin",
102
  local_files_only=True
103
  )
104
 
105
- # The model directory is the parent of the model.bin file
106
  ct_model_full_path = os.path.dirname(model_bin_path)
107
 
108
  else:
109
- print("🌐 Running in online mode - downloading models")
110
  # Online mode - download models (for local development)
111
  spm_path = hf_hub_download(
112
- repo_id=REPO_ID,
113
  filename="spm.model"
114
  )
115
 
116
  ft_path = hf_hub_download(
117
- repo_id=REPO_ID,
118
  filename="lid218e.bin"
119
  )
120
 
121
  # Download all necessary CTranslate2 files
122
  model_bin_path = hf_hub_download(
123
- repo_id=REPO_ID,
124
- filename="translation_models/sematrans-3.3B/model.bin"
125
  )
126
 
127
  hf_hub_download(
128
- repo_id=REPO_ID,
129
- filename="translation_models/sematrans-3.3B/config.json"
130
  )
131
 
132
  hf_hub_download(
133
- repo_id=REPO_ID,
134
- filename="translation_models/sematrans-3.3B/shared_vocabulary.txt"
135
  )
136
 
137
  ct_model_full_path = os.path.dirname(model_bin_path)
138
 
139
- print(f"📁 Model paths:")
140
- print(f" SentencePiece: {spm_path}")
141
- print(f" Language detection: {ft_path}")
142
- print(f" Translation model: {ct_model_full_path}")
 
 
143
 
144
  return spm_path, ft_path, ct_model_full_path
145
 
146
  except Exception as e:
147
- print(f" Error loading models: {e}")
148
  raise e
149
 
 
150
  def load_models():
151
  """Load all models into memory"""
152
  global lang_model, sp_model, translator
153
 
154
- print("🚀 Loading models into memory...")
155
 
156
- # Get model paths (from cache or download)
157
  spm_path, ft_path, ct_model_path = get_model_paths()
158
 
159
  # Suppress fasttext warnings
@@ -161,25 +112,26 @@ def load_models():
161
 
162
  try:
163
  # Load language detection model
164
- print("1️⃣ Loading language detection model...")
165
  lang_model = fasttext.load_model(ft_path)
166
 
167
  # Load SentencePiece model
168
- print("2️⃣ Loading SentencePiece model...")
169
  sp_model = spm.SentencePieceProcessor()
170
  sp_model.load(spm_path)
171
 
172
  # Load translation model
173
- print("3️⃣ Loading translation model...")
174
- translator = ctranslate2.Translator(ct_model_path, device)
175
 
176
- print("✅ All models loaded successfully!")
177
 
178
  except Exception as e:
179
- print(f" Error loading models: {e}")
180
  raise e
181
 
182
- def translate_with_detection(text: str, target_lang: str):
 
183
  """Translate text with automatic source language detection"""
184
  start_time = time.time()
185
 
@@ -200,7 +152,7 @@ def translate_with_detection(text: str, target_lang: str):
200
  source_sents_subworded,
201
  batch_type="tokens",
202
  max_batch_size=2048,
203
- beam_size=beam_size,
204
  target_prefix=target_prefix,
205
  )
206
 
@@ -213,7 +165,8 @@ def translate_with_detection(text: str, target_lang: str):
213
 
214
  return source_lang, translated_text, inference_time
215
 
216
- def translate_with_source(text: str, source_lang: str, target_lang: str):
 
217
  """Translate text with provided source language"""
218
  start_time = time.time()
219
 
@@ -230,7 +183,7 @@ def translate_with_source(text: str, source_lang: str, target_lang: str):
230
  source_sents_subworded,
231
  batch_type="tokens",
232
  max_batch_size=2048,
233
- beam_size=beam_size,
234
  target_prefix=target_prefix
235
  )
236
 
@@ -243,72 +196,7 @@ def translate_with_source(text: str, source_lang: str, target_lang: str):
243
 
244
  return translated_text, inference_time
245
 
246
- # --- API Endpoints ---
247
-
248
- @app.get("/")
249
- async def root():
250
- """Health check endpoint"""
251
- return {
252
- "status": "ok",
253
- "message": "Sema Translation API is running",
254
- "version": "2.0.0",
255
- "models_loaded": all([lang_model, sp_model, translator])
256
- }
257
-
258
- @app.post("/translate", response_model=TranslationResponse)
259
- async def translate_endpoint(request: TranslationRequest):
260
- """
261
- Main translation endpoint.
262
- Automatically detects source language if not provided.
263
- """
264
- if not request.text.strip():
265
- raise HTTPException(status_code=400, detail="Input text cannot be empty")
266
-
267
- full_date, current_time = get_nairobi_time()
268
- print(f"\n🔄 Request: {full_date}")
269
- print(f"Target: {request.target_language}, Text: {request.text[:50]}...")
270
 
271
- try:
272
- if request.source_language:
273
- # Use provided source language
274
- translated_text, inference_time = translate_with_source(
275
- request.text,
276
- request.source_language,
277
- request.target_language
278
- )
279
- source_lang = request.source_language
280
- else:
281
- # Auto-detect source language
282
- source_lang, translated_text, inference_time = translate_with_detection(
283
- request.text,
284
- request.target_language
285
- )
286
-
287
- _, response_time = get_nairobi_time()
288
- print(f"✅ Response: {response_time}")
289
- print(f"Source: {source_lang}, Translation: {translated_text[:50]}...\n")
290
-
291
- return TranslationResponse(
292
- translated_text=translated_text,
293
- source_language=source_lang,
294
- target_language=request.target_language,
295
- inference_time=inference_time,
296
- timestamp=full_date
297
- )
298
-
299
- except Exception as e:
300
- print(f"❌ Translation error: {e}")
301
- raise HTTPException(status_code=500, detail=f"Translation failed: {str(e)}")
302
-
303
- # --- Startup Event ---
304
- @app.on_event("startup")
305
- async def startup_event():
306
- """Load models when the application starts"""
307
- print("\n🎵 Starting Sema Translation API...")
308
- print("🎼 Loading the Orchestra... 🦋")
309
- load_models()
310
- print("🎉 API started successfully!\n")
311
-
312
- if __name__ == "__main__":
313
- import uvicorn
314
- uvicorn.run(app, host="0.0.0.0", port=8000)
 
1
  """
2
+ Translation service - handles model loading and translation logic
 
 
3
  """
4
 
5
  import os
6
  import time
7
+ from typing import Tuple, Optional
 
 
 
 
 
 
8
  from huggingface_hub import hf_hub_download
9
  import ctranslate2
10
  import sentencepiece as spm
11
  import fasttext
12
 
13
+ from ..core.config import settings
14
+ from ..core.logging import get_logger
15
+
16
+ logger = get_logger()
17
+
18
+ # Global model instances
19
+ lang_model: Optional[fasttext.FastText._FastText] = None
20
+ sp_model: Optional[spm.SentencePieceProcessor] = None
21
+ translator: Optional[ctranslate2.Translator] = None
22
+
23
+
24
+ def get_model_paths() -> Tuple[str, str, str]:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
  """Get model paths from HuggingFace cache (models pre-downloaded in Docker)"""
26
+ logger.info("loading_models_from_cache")
27
 
28
  try:
29
  # Check if we're in offline mode (Docker environment)
30
  offline_mode = os.environ.get("HF_HUB_OFFLINE", "0") == "1"
31
 
32
  if offline_mode:
33
+ logger.info("running_in_offline_mode")
34
  # In offline mode, models are already downloaded and cached
 
 
 
35
  spm_path = hf_hub_download(
36
+ repo_id=settings.model_repo_id,
37
  filename="spm.model",
38
  local_files_only=True
39
  )
40
 
41
  ft_path = hf_hub_download(
42
+ repo_id=settings.model_repo_id,
43
  filename="lid218e.bin",
44
  local_files_only=True
45
  )
46
 
47
  # Get the translation model path
48
  model_bin_path = hf_hub_download(
49
+ repo_id=settings.model_repo_id,
50
+ filename=f"translation_models/{settings.translation_model}/model.bin",
51
  local_files_only=True
52
  )
53
 
 
54
  ct_model_full_path = os.path.dirname(model_bin_path)
55
 
56
  else:
57
+ logger.info("running_in_online_mode")
58
  # Online mode - download models (for local development)
59
  spm_path = hf_hub_download(
60
+ repo_id=settings.model_repo_id,
61
  filename="spm.model"
62
  )
63
 
64
  ft_path = hf_hub_download(
65
+ repo_id=settings.model_repo_id,
66
  filename="lid218e.bin"
67
  )
68
 
69
  # Download all necessary CTranslate2 files
70
  model_bin_path = hf_hub_download(
71
+ repo_id=settings.model_repo_id,
72
+ filename=f"translation_models/{settings.translation_model}/model.bin"
73
  )
74
 
75
  hf_hub_download(
76
+ repo_id=settings.model_repo_id,
77
+ filename=f"translation_models/{settings.translation_model}/config.json"
78
  )
79
 
80
  hf_hub_download(
81
+ repo_id=settings.model_repo_id,
82
+ filename=f"translation_models/{settings.translation_model}/shared_vocabulary.txt"
83
  )
84
 
85
  ct_model_full_path = os.path.dirname(model_bin_path)
86
 
87
+ logger.info(
88
+ "model_paths_resolved",
89
+ spm_path=spm_path,
90
+ ft_path=ft_path,
91
+ ct_model_path=ct_model_full_path
92
+ )
93
 
94
  return spm_path, ft_path, ct_model_full_path
95
 
96
  except Exception as e:
97
+ logger.error("model_path_resolution_failed", error=str(e))
98
  raise e
99
 
100
+
101
  def load_models():
102
  """Load all models into memory"""
103
  global lang_model, sp_model, translator
104
 
105
+ logger.info("starting_model_loading")
106
 
107
+ # Get model paths
108
  spm_path, ft_path, ct_model_path = get_model_paths()
109
 
110
  # Suppress fasttext warnings
 
112
 
113
  try:
114
  # Load language detection model
115
+ logger.info("loading_language_detection_model")
116
  lang_model = fasttext.load_model(ft_path)
117
 
118
  # Load SentencePiece model
119
+ logger.info("loading_sentencepiece_model")
120
  sp_model = spm.SentencePieceProcessor()
121
  sp_model.load(spm_path)
122
 
123
  # Load translation model
124
+ logger.info("loading_translation_model")
125
+ translator = ctranslate2.Translator(ct_model_path, settings.device)
126
 
127
+ logger.info("all_models_loaded_successfully")
128
 
129
  except Exception as e:
130
+ logger.error("model_loading_failed", error=str(e))
131
  raise e
132
 
133
+
134
+ def translate_with_detection(text: str, target_lang: str) -> Tuple[str, str, float]:
135
  """Translate text with automatic source language detection"""
136
  start_time = time.time()
137
 
 
152
  source_sents_subworded,
153
  batch_type="tokens",
154
  max_batch_size=2048,
155
+ beam_size=settings.beam_size,
156
  target_prefix=target_prefix,
157
  )
158
 
 
165
 
166
  return source_lang, translated_text, inference_time
167
 
168
+
169
+ def translate_with_source(text: str, source_lang: str, target_lang: str) -> Tuple[str, float]:
170
  """Translate text with provided source language"""
171
  start_time = time.time()
172
 
 
183
  source_sents_subworded,
184
  batch_type="tokens",
185
  max_batch_size=2048,
186
+ beam_size=settings.beam_size,
187
  target_prefix=target_prefix
188
  )
189
 
 
196
 
197
  return translated_text, inference_time
198
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
199
 
200
+ def models_loaded() -> bool:
201
+ """Check if all models are loaded"""
202
+ return all([lang_model, sp_model, translator])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
app/utils/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # Utility functions
app/utils/helpers.py ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Utility helper functions
3
+ """
4
+
5
+ import uuid
6
+ from datetime import datetime
7
+ import pytz
8
+
9
+
10
+ def get_nairobi_time():
11
+ """Get current time in Nairobi timezone"""
12
+ nairobi_timezone = pytz.timezone('Africa/Nairobi')
13
+ current_time_nairobi = datetime.now(nairobi_timezone)
14
+
15
+ curr_day = current_time_nairobi.strftime('%A')
16
+ curr_date = current_time_nairobi.strftime('%Y-%m-%d')
17
+ curr_time = current_time_nairobi.strftime('%H:%M:%S')
18
+
19
+ full_date = f"{curr_day} | {curr_date} | {curr_time}"
20
+ return full_date, curr_time
21
+
22
+
23
+ def generate_request_id() -> str:
24
+ """Generate a unique request ID"""
25
+ return str(uuid.uuid4())
docs/API_CAPABILITIES.md ADDED
@@ -0,0 +1,237 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Sema Translation API - Complete Capabilities
2
+
3
+ ## 🌍 **What Our API Can Do**
4
+
5
+ Your Sema Translation API is now a comprehensive, enterprise-grade translation service with extensive language support and developer-friendly features.
6
+
7
+ ## 🚀 **Core Translation Features**
8
+
9
+ ### **1. Text Translation**
10
+ - **200+ Languages**: Full FLORES-200 language support
11
+ - **Automatic Language Detection**: Smart source language detection
12
+ - **High-Quality Translation**: CTranslate2 optimized neural translation
13
+ - **Bidirectional Translation**: Translate between any supported language pair
14
+ - **Character Limit**: Up to 5000 characters per request
15
+ - **Performance**: ~0.2-0.5 seconds inference time
16
+
17
+ ### **2. Language Detection**
18
+ - **Automatic Detection**: Identifies source language when not specified
19
+ - **High Accuracy**: FastText-based language identification
20
+ - **200+ Language Support**: Detects all supported languages
21
+ - **Confidence Scoring**: Internal confidence metrics
22
+
23
+ ## 🗣️ **Language Support System**
24
+
25
+ ### **Complete Language Information**
26
+ Your API now knows everything about its supported languages:
27
+
28
+ #### **Language Metadata**
29
+ - **English Names**: "Swahili", "French", "Chinese"
30
+ - **Native Names**: "Kiswahili", "Français", "中文"
31
+ - **Geographic Regions**: Africa, Europe, Asia, Middle East, Americas
32
+ - **Writing Scripts**: Latin, Arabic, Cyrillic, Han, Devanagari, etc.
33
+ - **Language Codes**: FLORES-200 standard codes
34
+
35
+ #### **Regional Coverage**
36
+ - **African Languages** (25+): Swahili, Hausa, Yoruba, Kikuyu, Zulu, Xhosa, Amharic, Somali
37
+ - **European Languages** (40+): English, French, German, Spanish, Italian, Russian, Polish
38
+ - **Asian Languages** (80+): Chinese, Japanese, Korean, Hindi, Bengali, Thai, Vietnamese
39
+ - **Middle Eastern** (15+): Arabic, Hebrew, Persian, Turkish
40
+ - **Americas** (30+): Spanish, Portuguese, English, French, Indigenous languages
41
+
42
+ ## 📡 **API Endpoints**
43
+
44
+ ### **Translation Endpoints**
45
+ ```
46
+ POST /translate # Main translation endpoint
47
+ POST /api/v1/translate # Versioned endpoint
48
+ ```
49
+
50
+ ### **Language Information Endpoints**
51
+ ```
52
+ GET /languages # All supported languages
53
+ GET /languages/popular # Most commonly used languages
54
+ GET /languages/african # African languages specifically
55
+ GET /languages/region/{region} # Languages by geographic region
56
+ GET /languages/search?q={query} # Search languages by name/code
57
+ GET /languages/stats # Language statistics and coverage
58
+ GET /languages/{code} # Specific language information
59
+ ```
60
+
61
+ ### **Monitoring & Health**
62
+ ```
63
+ GET / # Basic health check
64
+ GET /health # Detailed health monitoring
65
+ GET /metrics # Prometheus metrics
66
+ GET /docs # Interactive API documentation
67
+ GET /redoc # Alternative documentation
68
+ ```
69
+
70
+ ## 🎯 **Developer Experience Features**
71
+
72
+ ### **1. Language Discovery**
73
+ - **Complete Language List**: Get all 200+ supported languages
74
+ - **Popular Languages**: Quick access to commonly used languages
75
+ - **Regional Filtering**: Filter by geographic region
76
+ - **Search Functionality**: Find languages by name, native name, or code
77
+ - **Language Validation**: Check if a language code is supported
78
+
79
+ ### **2. Frontend Integration Ready**
80
+ ```javascript
81
+ // Get all languages for dropdown
82
+ const languages = await fetch('/languages').then(r => r.json());
83
+
84
+ // Get popular languages for quick selection
85
+ const popular = await fetch('/languages/popular').then(r => r.json());
86
+
87
+ // Search languages for autocomplete
88
+ const results = await fetch('/languages/search?q=Swah').then(r => r.json());
89
+
90
+ // Validate language code
91
+ const langInfo = await fetch('/languages/swh_Latn').then(r => r.json());
92
+ ```
93
+
94
+ ### **3. Rich Metadata**
95
+ Each language includes:
96
+ ```json
97
+ {
98
+ "swh_Latn": {
99
+ "name": "Swahili",
100
+ "native_name": "Kiswahili",
101
+ "region": "Africa",
102
+ "script": "Latin"
103
+ }
104
+ }
105
+ ```
106
+
107
+ ## 📊 **Analytics & Monitoring**
108
+
109
+ ### **Usage Tracking**
110
+ - **Request Counting**: Total API requests by endpoint
111
+ - **Translation Metrics**: Translations by language pair
112
+ - **Character Counting**: Total characters translated
113
+ - **Performance Metrics**: Request duration and inference time
114
+ - **Error Tracking**: Error rates by type
115
+
116
+ ### **Language Statistics**
117
+ - **Coverage Stats**: Languages by region and script
118
+ - **Usage Patterns**: Most translated language pairs
119
+ - **Performance Data**: Translation speed by language
120
+ - **Regional Analytics**: Usage by geographic region
121
+
122
+ ## 🔒 **Enterprise Features**
123
+
124
+ ### **Rate Limiting**
125
+ - **60 requests/minute** per IP address
126
+ - **5000 characters** maximum per request
127
+ - **Graceful degradation** with clear error messages
128
+
129
+ ### **Request Tracking**
130
+ - **Unique Request IDs**: For debugging and support
131
+ - **Structured Logging**: JSON logs for analysis
132
+ - **Request/Response Logging**: Complete audit trail
133
+ - **Performance Monitoring**: Response time tracking
134
+
135
+ ### **Error Handling**
136
+ - **Comprehensive Validation**: Input validation with clear messages
137
+ - **HTTP Status Codes**: Standard REST API responses
138
+ - **Error Details**: Specific error information
139
+ - **Graceful Failures**: Service continues despite individual failures
140
+
141
+ ## 🎨 **Frontend Integration Examples**
142
+
143
+ ### **Language Selector Component**
144
+ ```javascript
145
+ // React component example
146
+ function LanguageSelector({ onSelect }) {
147
+ const [languages, setLanguages] = useState([]);
148
+ const [popular, setPopular] = useState([]);
149
+
150
+ useEffect(() => {
151
+ // Load popular languages first
152
+ fetch('/languages/popular')
153
+ .then(r => r.json())
154
+ .then(data => setPopular(Object.entries(data.languages)));
155
+
156
+ // Load all languages for search
157
+ fetch('/languages')
158
+ .then(r => r.json())
159
+ .then(data => setLanguages(Object.entries(data.languages)));
160
+ }, []);
161
+
162
+ return (
163
+ <select onChange={e => onSelect(e.target.value)}>
164
+ <optgroup label="Popular Languages">
165
+ {popular.map(([code, info]) => (
166
+ <option key={code} value={code}>
167
+ {info.name} ({info.native_name})
168
+ </option>
169
+ ))}
170
+ </optgroup>
171
+ <optgroup label="All Languages">
172
+ {languages.map(([code, info]) => (
173
+ <option key={code} value={code}>
174
+ {info.name} - {info.region}
175
+ </option>
176
+ ))}
177
+ </optgroup>
178
+ </select>
179
+ );
180
+ }
181
+ ```
182
+
183
+ ### **Translation Interface**
184
+ ```javascript
185
+ // Translation function with language validation
186
+ async function translateText(text, targetLang, sourceLang = null) {
187
+ // Validate target language
188
+ const langInfo = await fetch(`/languages/${targetLang}`);
189
+ if (!langInfo.ok) {
190
+ throw new Error(`Unsupported language: ${targetLang}`);
191
+ }
192
+
193
+ // Perform translation
194
+ const response = await fetch('/translate', {
195
+ method: 'POST',
196
+ headers: { 'Content-Type': 'application/json' },
197
+ body: JSON.stringify({
198
+ text,
199
+ target_language: targetLang,
200
+ source_language: sourceLang
201
+ })
202
+ });
203
+
204
+ return response.json();
205
+ }
206
+ ```
207
+
208
+ ## 🎯 **Perfect For**
209
+
210
+ ### **Web Applications**
211
+ - **Language Selectors**: Rich dropdowns with native names
212
+ - **Translation Interfaces**: Real-time translation with validation
213
+ - **Multi-language Support**: Dynamic language switching
214
+ - **Search & Autocomplete**: Find languages quickly
215
+
216
+ ### **Mobile Applications**
217
+ - **Offline Language Lists**: Cache language data locally
218
+ - **Quick Selection**: Popular languages for faster UX
219
+ - **Regional Filtering**: Show relevant languages by location
220
+ - **Voice Input**: Validate detected languages
221
+
222
+ ### **Business Intelligence**
223
+ - **Usage Analytics**: Track translation patterns
224
+ - **Language Coverage**: Monitor supported languages
225
+ - **Performance Metrics**: API response times and success rates
226
+ - **Regional Insights**: Usage by geographic region
227
+
228
+ ## 🚀 **Ready for Production**
229
+
230
+ Your API now provides:
231
+ - ✅ **Complete Language Awareness**: Knows all its capabilities
232
+ - ✅ **Developer-Friendly**: Easy integration with comprehensive docs
233
+ - ✅ **Frontend-Ready**: Perfect for building user interfaces
234
+ - ✅ **Enterprise-Grade**: Monitoring, logging, and analytics
235
+ - ✅ **Scalable**: Clean architecture for future enhancements
236
+
237
+ The API is now a complete translation platform that developers will love to work with! 🎉
docs/ARCHITECTURE.md ADDED
@@ -0,0 +1,151 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Sema Translation API - Architecture Overview
2
+
3
+ ## 🏗️ Project Structure
4
+
5
+ This FastAPI application follows industry best practices for maintainable, scalable APIs:
6
+
7
+ ### Directory Structure
8
+ ```
9
+ app/
10
+ ├── main.py # Application entry point & FastAPI instance
11
+ ├── api/v1/endpoints.py # API route handlers (versioned)
12
+ ├── core/ # Core configuration & setup
13
+ │ ├── config.py # Settings management
14
+ │ ├── logging.py # Structured logging setup
15
+ │ └── metrics.py # Prometheus metrics definitions
16
+ ├── middleware/ # Custom middleware
17
+ │ └── request_middleware.py # Request tracking & metrics
18
+ ├── models/schemas.py # Pydantic data models
19
+ ├── services/translation.py # Business logic & model management
20
+ └── utils/helpers.py # Utility functions
21
+ ```
22
+
23
+ ## 🔧 Design Principles
24
+
25
+ ### 1. Separation of Concerns
26
+ - **API Layer**: Route definitions and request/response handling
27
+ - **Service Layer**: Business logic and model operations
28
+ - **Core Layer**: Configuration, logging, and metrics
29
+ - **Models Layer**: Data validation and serialization
30
+
31
+ ### 2. Dependency Injection
32
+ - Settings injected via Pydantic Settings
33
+ - Services accessed through proper imports
34
+ - Middleware applied declaratively
35
+
36
+ ### 3. Configuration Management
37
+ - Environment-based configuration
38
+ - Type-safe settings with Pydantic
39
+ - Centralized configuration in `core/config.py`
40
+
41
+ ### 4. Observability
42
+ - Structured JSON logging with structlog
43
+ - Prometheus metrics for monitoring
44
+ - Request tracking with unique IDs
45
+ - Health check endpoints
46
+
47
+ ## 🚀 Key Features
48
+
49
+ ### Enterprise-Grade Features
50
+ - **Rate Limiting**: IP-based rate limiting with SlowAPI
51
+ - **Request Tracking**: Unique request IDs for debugging
52
+ - **Metrics Collection**: Prometheus metrics for monitoring
53
+ - **Structured Logging**: JSON logs for easy parsing
54
+ - **Health Checks**: Comprehensive health monitoring
55
+
56
+ ### API Design
57
+ - **Versioned Routes**: `/api/v1/` for future compatibility
58
+ - **OpenAPI Documentation**: Auto-generated Swagger UI
59
+ - **Type Safety**: Full Pydantic validation
60
+ - **Error Handling**: Graceful error responses
61
+
62
+ ### Performance
63
+ - **Async/Await**: Full asynchronous request handling
64
+ - **Model Caching**: Models loaded once at startup
65
+ - **Efficient Translation**: CTranslate2 optimization
66
+
67
+ ## 🔒 Security (Testing Phase)
68
+
69
+ ### Current State
70
+ - Authentication **removed** for testing phase
71
+ - Rate limiting active (60 req/min per IP)
72
+ - Input validation with Pydantic
73
+ - CORS configured for development
74
+
75
+ ### Future Integration Points
76
+ - Supabase authentication ready
77
+ - User tracking infrastructure in place
78
+ - Usage analytics for billing prepared
79
+
80
+ ## 📊 Monitoring & Observability
81
+
82
+ ### Metrics Available
83
+ - Request count by endpoint and status
84
+ - Request duration histograms
85
+ - Translation count by language pair
86
+ - Character count tracking
87
+ - Error count by type
88
+
89
+ ### Logging
90
+ - Structured JSON logs
91
+ - Request/response tracking
92
+ - Translation event logging
93
+ - Error logging with context
94
+
95
+ ## 🔄 Development Workflow
96
+
97
+ ### Local Development
98
+ ```bash
99
+ cd backend/sema-api
100
+ pip install -r requirements.txt
101
+ uvicorn app.main:app --reload
102
+ ```
103
+
104
+ ### Docker Development
105
+ ```bash
106
+ docker build -t sema-api .
107
+ docker run -p 8000:8000 sema-api
108
+ ```
109
+
110
+ ### Testing
111
+ - Health check: `GET /health`
112
+ - Documentation: `GET /docs`
113
+ - Metrics: `GET /metrics`
114
+ - Translation: `POST /translate`
115
+
116
+ ## 🎯 Future Enhancements
117
+
118
+ ### Authentication Integration
119
+ - Supabase JWT validation
120
+ - User-based rate limiting
121
+ - API key authentication
122
+
123
+ ### Scaling Considerations
124
+ - Database integration for usage tracking
125
+ - Redis caching for performance
126
+ - Load balancer compatibility
127
+ - Horizontal scaling support
128
+
129
+ ### Monitoring Enhancements
130
+ - Grafana dashboards
131
+ - Alerting rules
132
+ - Performance profiling
133
+ - Usage analytics
134
+
135
+ ## 📝 Maintenance
136
+
137
+ ### Code Organization Benefits
138
+ - **Testability**: Each component can be tested independently
139
+ - **Maintainability**: Clear separation of concerns
140
+ - **Scalability**: Easy to add new features and endpoints
141
+ - **Debugging**: Structured logging and request tracking
142
+ - **Documentation**: Self-documenting code structure
143
+
144
+ ### Adding New Features
145
+ 1. **New Endpoints**: Add to `api/v1/endpoints.py`
146
+ 2. **Business Logic**: Add to appropriate service in `services/`
147
+ 3. **Data Models**: Add to `models/schemas.py`
148
+ 4. **Configuration**: Add to `core/config.py`
149
+ 5. **Middleware**: Add to `middleware/`
150
+
151
+ This architecture provides a solid foundation for a production-ready translation API that can scale and evolve with your needs.
docs/PROJECT_OVERVIEW.md ADDED
@@ -0,0 +1,202 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Sema Translation API - Project Overview
2
+
3
+ ## 🎯 Project Summary
4
+
5
+ Enterprise-grade translation API supporting 200+ languages with automatic language detection, built with FastAPI and powered by the consolidated `sematech/sema-utils` model repository.
6
+
7
+ ## 📁 Project Structure
8
+
9
+ ```
10
+ backend/sema-api/
11
+ ├── app/ # Main application package
12
+ │ ├── main.py # Application entry point & FastAPI instance
13
+ │ ├── api/v1/endpoints.py # API route handlers (versioned)
14
+ │ ├── core/ # Core configuration & setup
15
+ │ │ ├── config.py # Settings management
16
+ │ │ ├── logging.py # Structured logging setup
17
+ │ │ └── metrics.py # Prometheus metrics definitions
18
+ │ ├── middleware/ # Custom middleware
19
+ │ │ └── request_middleware.py # Request tracking & metrics
20
+ │ ├── models/schemas.py # Pydantic data models
21
+ │ ├── services/translation.py # Business logic & model management
22
+ │ └── utils/helpers.py # Utility functions
23
+ ├── tests/ # Test suite
24
+ │ ├── test_model_download.py # Model download & loading tests
25
+ │ ├── test_api_client.py # API endpoint tests
26
+ │ └── README.md # Test documentation
27
+ ├── Dockerfile # Multi-stage Docker build
28
+ ├── requirements.txt # Python dependencies
29
+ ├── README.md # API documentation
30
+ ├── ARCHITECTURE.md # Technical architecture
31
+ └── PROJECT_OVERVIEW.md # This file
32
+ ```
33
+
34
+ ## 🚀 Key Features
35
+
36
+ ### Core Translation
37
+ - **200+ Language Support**: Full FLORES-200 language codes
38
+ - **Automatic Language Detection**: Optional source language detection
39
+ - **High Performance**: CTranslate2 optimized inference
40
+ - **Character Tracking**: Usage monitoring for billing/analytics
41
+
42
+ ### Enterprise Features
43
+ - **Rate Limiting**: 60 requests/minute per IP
44
+ - **Request Tracking**: Unique request IDs for debugging
45
+ - **Structured Logging**: JSON logs for easy parsing
46
+ - **Prometheus Metrics**: Comprehensive monitoring
47
+ - **Health Checks**: System status monitoring
48
+
49
+ ### API Quality
50
+ - **Comprehensive Swagger UI**: Interactive documentation
51
+ - **Type Safety**: Full Pydantic validation
52
+ - **Versioned Endpoints**: `/api/v1/` for future compatibility
53
+ - **Error Handling**: Graceful error responses
54
+ - **CORS Support**: Cross-origin resource sharing
55
+
56
+ ## 🔧 Technical Stack
57
+
58
+ ### Core Technologies
59
+ - **FastAPI**: Modern Python web framework
60
+ - **CTranslate2**: Optimized neural machine translation
61
+ - **SentencePiece**: Subword tokenization
62
+ - **FastText**: Language detection
63
+ - **HuggingFace Hub**: Model repository integration
64
+
65
+ ### Monitoring & Observability
66
+ - **Prometheus**: Metrics collection
67
+ - **Structlog**: Structured JSON logging
68
+ - **SlowAPI**: Rate limiting
69
+ - **Uvicorn**: ASGI server
70
+
71
+ ### Development & Deployment
72
+ - **Docker**: Multi-stage containerization
73
+ - **Pydantic**: Data validation and settings
74
+ - **Pytest**: Testing framework (ready)
75
+ - **HuggingFace Spaces**: Cloud deployment
76
+
77
+ ## 📊 API Endpoints
78
+
79
+ ### Health & Monitoring
80
+ | Endpoint | Method | Description |
81
+ |----------|--------|-------------|
82
+ | `/` | GET | Basic health check |
83
+ | `/health` | GET | Detailed health monitoring |
84
+ | `/metrics` | GET | Prometheus metrics |
85
+ | `/docs` | GET | Swagger UI documentation |
86
+ | `/redoc` | GET | ReDoc documentation |
87
+
88
+ ### Translation
89
+ | Endpoint | Method | Description |
90
+ |----------|--------|-------------|
91
+ | `/translate` | POST | Main translation endpoint |
92
+ | `/api/v1/translate` | POST | Versioned translation endpoint |
93
+
94
+ ## 🔒 Security & Reliability
95
+
96
+ ### Current Implementation
97
+ - **Input Validation**: Comprehensive Pydantic validation
98
+ - **Rate Limiting**: IP-based request limiting
99
+ - **Error Handling**: Graceful error responses
100
+ - **Request Tracking**: Unique IDs for debugging
101
+
102
+ ### Future-Ready Features
103
+ - **Authentication Framework**: Ready for Supabase integration
104
+ - **Usage Analytics**: Character count and request tracking
105
+ - **Audit Logging**: Request/response logging
106
+
107
+ ## 📈 Performance & Scalability
108
+
109
+ ### Optimization Features
110
+ - **Async/Await**: Full asynchronous processing
111
+ - **Model Caching**: Models loaded once at startup
112
+ - **Efficient Translation**: CTranslate2 optimization
113
+ - **Connection Pooling**: Ready for database integration
114
+
115
+ ### Monitoring Metrics
116
+ - Request count by endpoint and status
117
+ - Request duration histograms
118
+ - Translation count by language pair
119
+ - Character count tracking
120
+ - Error count by type
121
+
122
+ ## 🧪 Testing
123
+
124
+ ### Test Coverage
125
+ - **Model Tests**: Download, loading, and translation pipeline
126
+ - **API Tests**: All endpoints, error handling, performance
127
+ - **Integration Tests**: End-to-end workflow validation
128
+
129
+ ### Test Commands
130
+ ```bash
131
+ # Model download and loading tests
132
+ cd tests && python test_model_download.py
133
+
134
+ # API endpoint tests (local)
135
+ cd tests && python test_api_client.py
136
+
137
+ # API endpoint tests (production)
138
+ cd tests && python test_api_client.py https://sematech-sema-api.hf.space
139
+ ```
140
+
141
+ ## 🚀 Deployment
142
+
143
+ ### Local Development
144
+ ```bash
145
+ cd backend/sema-api
146
+ pip install -r requirements.txt
147
+ uvicorn app.main:app --reload
148
+ ```
149
+
150
+ ### Docker Development
151
+ ```bash
152
+ docker build -t sema-api .
153
+ docker run -p 8000:8000 sema-api
154
+ ```
155
+
156
+ ### HuggingFace Spaces
157
+ - Automatic deployment from git push
158
+ - Multi-stage Docker build for optimization
159
+ - Model pre-downloading for faster startup
160
+
161
+ ## 🔮 Future Enhancements
162
+
163
+ ### Planned Features
164
+ - **Supabase Authentication**: User management and API keys
165
+ - **Database Integration**: Usage tracking and analytics
166
+ - **Redis Caching**: Performance optimization
167
+ - **Advanced Monitoring**: Grafana dashboards and alerting
168
+
169
+ ### Scaling Considerations
170
+ - **Load Balancing**: Stateless design for horizontal scaling
171
+ - **Database Sharding**: For high-volume usage tracking
172
+ - **CDN Integration**: For global performance
173
+ - **Auto-scaling**: Based on request volume
174
+
175
+ ## 📝 Development Guidelines
176
+
177
+ ### Code Organization
178
+ - **Separation of Concerns**: Clear module boundaries
179
+ - **Type Safety**: Full type hints and Pydantic validation
180
+ - **Error Handling**: Comprehensive exception management
181
+ - **Documentation**: Inline docs and comprehensive README
182
+
183
+ ### Adding New Features
184
+ 1. **Endpoints**: Add to `app/api/v1/endpoints.py`
185
+ 2. **Business Logic**: Add to appropriate service in `app/services/`
186
+ 3. **Data Models**: Add to `app/models/schemas.py`
187
+ 4. **Configuration**: Add to `app/core/config.py`
188
+ 5. **Tests**: Add to `tests/` directory
189
+
190
+ ## 📞 Support & Maintenance
191
+
192
+ ### Documentation
193
+ - **API Docs**: Available at `/docs` endpoint
194
+ - **Architecture**: See `ARCHITECTURE.md`
195
+ - **Tests**: See `tests/README.md`
196
+
197
+ ### Monitoring
198
+ - **Health**: Monitor `/health` endpoint
199
+ - **Metrics**: Scrape `/metrics` for Prometheus
200
+ - **Logs**: Structured JSON logs for analysis
201
+
202
+ This project provides a solid foundation for a production-ready translation API that can scale and evolve with your needs.
deploy_to_hf.md → docs/deploy_to_hf.md RENAMED
File without changes
requirements.txt CHANGED
@@ -1,8 +1,23 @@
1
- fastapi
2
- uvicorn[standard]
3
- ctranslate2
4
- sentencepiece
5
- fasttext-wheel
6
- huggingface_hub
7
- pydantic
8
- pytz
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Core FastAPI and server
2
+ fastapi>=0.104.0
3
+ uvicorn[standard]>=0.24.0
4
+ pydantic>=2.0.0
5
+ pydantic-settings>=2.0.0
6
+
7
+ # Translation models and processing
8
+ ctranslate2>=4.0.0
9
+ sentencepiece>=0.1.99
10
+ fasttext-wheel>=0.9.2
11
+ huggingface_hub>=0.17.0
12
+
13
+ # Security and rate limiting
14
+ slowapi>=0.1.9
15
+ python-jose[cryptography]>=3.3.0
16
+ passlib[bcrypt]>=1.7.4
17
+
18
+ # Monitoring and logging
19
+ prometheus-client>=0.17.0
20
+ structlog>=23.0.0
21
+
22
+ # Utilities
23
+ pytz>=2023.3
tests/README.md ADDED
@@ -0,0 +1,140 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Sema Translation API - Tests
2
+
3
+ This directory contains test scripts for the Sema Translation API.
4
+
5
+ ## Test Files
6
+
7
+ ### `test_model_download.py`
8
+ Tests the model downloading and loading functionality:
9
+ - Downloads models from `sematech/sema-utils` repository
10
+ - Tests model loading (SentencePiece, FastText, CTranslate2)
11
+ - Validates complete translation pipeline
12
+ - Includes cleanup functionality
13
+
14
+ **Usage:**
15
+ ```bash
16
+ cd tests
17
+ python test_model_download.py
18
+ ```
19
+
20
+ ### `test_api_client.py`
21
+ Tests the API endpoints and functionality:
22
+ - Health check endpoints
23
+ - Translation with auto-detection
24
+ - Translation with specified source language
25
+ - Error handling validation
26
+ - Performance testing with multiple requests
27
+ - Documentation endpoint testing
28
+
29
+ **Usage:**
30
+ ```bash
31
+ # Test local development server
32
+ cd tests
33
+ python test_api_client.py
34
+
35
+ # Test production server
36
+ python test_api_client.py https://sematech-sema-api.hf.space
37
+ ```
38
+
39
+ ## Running Tests
40
+
41
+ ### Prerequisites
42
+ ```bash
43
+ pip install requests huggingface_hub ctranslate2 sentencepiece fasttext-wheel
44
+ ```
45
+
46
+ ### Local Testing
47
+ 1. Start the API server:
48
+ ```bash
49
+ cd backend/sema-api
50
+ uvicorn app.main:app --reload
51
+ ```
52
+
53
+ 2. Run API tests:
54
+ ```bash
55
+ cd tests
56
+ python test_api_client.py
57
+ ```
58
+
59
+ ### Production Testing
60
+ ```bash
61
+ cd tests
62
+ python test_api_client.py https://sematech-sema-api.hf.space
63
+ ```
64
+
65
+ ## Test Coverage
66
+
67
+ ### Model Tests
68
+ - ✅ Model downloading from HuggingFace Hub
69
+ - ✅ SentencePiece model loading
70
+ - ✅ FastText language detection model loading
71
+ - ✅ CTranslate2 translation model loading
72
+ - ✅ End-to-end translation pipeline
73
+
74
+ ### API Tests
75
+ - ✅ Health check endpoints (`/` and `/health`)
76
+ - ✅ Translation endpoint (`/translate`)
77
+ - ✅ Auto language detection
78
+ - ✅ Manual source language specification
79
+ - ✅ Error handling (empty text, invalid requests)
80
+ - ✅ Rate limiting behavior
81
+ - ✅ Documentation endpoints (`/docs`, `/openapi.json`)
82
+ - ✅ Metrics endpoint (`/metrics`)
83
+
84
+ ### Performance Tests
85
+ - ✅ Multiple concurrent requests
86
+ - ✅ Response time measurement
87
+ - ✅ Character count validation
88
+ - ✅ Request tracking with unique IDs
89
+
90
+ ## Expected Results
91
+
92
+ ### Model Download Test
93
+ ```
94
+ 🚀 Starting Sema Utils Model Test
95
+
96
+ 🧪 Testing model download from sematech/sema-utils...
97
+ ✅ SentencePiece model downloaded
98
+ ✅ Language detection model downloaded
99
+ ✅ Translation model downloaded
100
+ ✅ All models loaded successfully
101
+ 🎉 Translation successful!
102
+ ```
103
+
104
+ ### API Client Test
105
+ ```
106
+ 🧪 Testing Sema Translation API
107
+
108
+ ✅ Health check passed
109
+ ✅ Auto-detection translation successful
110
+ ✅ Specified source translation successful
111
+ ✅ Empty text error handling works correctly
112
+ ✅ Performance test completed
113
+ ✅ OpenAPI docs accessible
114
+ 🎉 All API tests passed!
115
+ ```
116
+
117
+ ## Troubleshooting
118
+
119
+ ### Common Issues
120
+
121
+ **Model Download Fails:**
122
+ - Check internet connection
123
+ - Verify HuggingFace Hub access
124
+ - Ensure sufficient disk space
125
+
126
+ **API Tests Fail:**
127
+ - Verify API server is running
128
+ - Check correct URL/port
129
+ - Ensure all dependencies installed
130
+
131
+ **Permission Errors:**
132
+ - Check file permissions in test directory
133
+ - Ensure write access for model downloads
134
+
135
+ ### Debug Mode
136
+ Add debug prints to test scripts for detailed troubleshooting:
137
+ ```python
138
+ import logging
139
+ logging.basicConfig(level=logging.DEBUG)
140
+ ```
tests/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # Test package
test_api_client.py → tests/test_api_client.py RENAMED
File without changes
tests/test_language_endpoints.py ADDED
@@ -0,0 +1,210 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Test script for language information endpoints
3
+ """
4
+
5
+ import requests
6
+ import json
7
+
8
+ def test_language_endpoints(base_url="http://localhost:8000"):
9
+ """Test all language-related endpoints"""
10
+
11
+ print("🌍 Testing Language Information Endpoints\n")
12
+
13
+ # Test 1: Get all languages
14
+ print("1️⃣ Testing /languages endpoint...")
15
+ try:
16
+ response = requests.get(f"{base_url}/languages")
17
+ if response.status_code == 200:
18
+ data = response.json()
19
+ print(f"✅ All languages: {data['total_count']} languages found")
20
+ print(f" Sample: {list(data['languages'].keys())[:5]}")
21
+ else:
22
+ print(f"❌ Failed: {response.status_code}")
23
+ return False
24
+ except Exception as e:
25
+ print(f"❌ Error: {e}")
26
+ return False
27
+
28
+ # Test 2: Get popular languages
29
+ print("\n2️⃣ Testing /languages/popular endpoint...")
30
+ try:
31
+ response = requests.get(f"{base_url}/languages/popular")
32
+ if response.status_code == 200:
33
+ data = response.json()
34
+ print(f"✅ Popular languages: {data['total_count']} languages")
35
+ for code, info in list(data['languages'].items())[:3]:
36
+ print(f" {code}: {info['name']} ({info['native_name']})")
37
+ else:
38
+ print(f"❌ Failed: {response.status_code}")
39
+ return False
40
+ except Exception as e:
41
+ print(f"❌ Error: {e}")
42
+ return False
43
+
44
+ # Test 3: Get African languages
45
+ print("\n3️⃣ Testing /languages/african endpoint...")
46
+ try:
47
+ response = requests.get(f"{base_url}/languages/african")
48
+ if response.status_code == 200:
49
+ data = response.json()
50
+ print(f"✅ African languages: {data['total_count']} languages")
51
+ for code, info in list(data['languages'].items())[:3]:
52
+ print(f" {code}: {info['name']} ({info['native_name']})")
53
+ else:
54
+ print(f"❌ Failed: {response.status_code}")
55
+ return False
56
+ except Exception as e:
57
+ print(f"❌ Error: {e}")
58
+ return False
59
+
60
+ # Test 4: Get languages by region
61
+ print("\n4️⃣ Testing /languages/region/Europe endpoint...")
62
+ try:
63
+ response = requests.get(f"{base_url}/languages/region/Europe")
64
+ if response.status_code == 200:
65
+ data = response.json()
66
+ print(f"✅ European languages: {data['total_count']} languages")
67
+ for code, info in list(data['languages'].items())[:3]:
68
+ print(f" {code}: {info['name']} ({info['native_name']})")
69
+ else:
70
+ print(f"❌ Failed: {response.status_code}")
71
+ return False
72
+ except Exception as e:
73
+ print(f"❌ Error: {e}")
74
+ return False
75
+
76
+ # Test 5: Search languages
77
+ print("\n5️⃣ Testing /languages/search?q=Swahili endpoint...")
78
+ try:
79
+ response = requests.get(f"{base_url}/languages/search?q=Swahili")
80
+ if response.status_code == 200:
81
+ data = response.json()
82
+ print(f"✅ Search results: {data['total_count']} languages found")
83
+ for code, info in data['languages'].items():
84
+ print(f" {code}: {info['name']} ({info['native_name']})")
85
+ else:
86
+ print(f"❌ Failed: {response.status_code}")
87
+ return False
88
+ except Exception as e:
89
+ print(f"❌ Error: {e}")
90
+ return False
91
+
92
+ # Test 6: Get language statistics
93
+ print("\n6️⃣ Testing /languages/stats endpoint...")
94
+ try:
95
+ response = requests.get(f"{base_url}/languages/stats")
96
+ if response.status_code == 200:
97
+ data = response.json()
98
+ print(f"✅ Language statistics:")
99
+ print(f" Total languages: {data['total_languages']}")
100
+ print(f" Regions: {data['regions']}")
101
+ print(f" Scripts: {data['scripts']}")
102
+ print(f" By region: {data['by_region']}")
103
+ else:
104
+ print(f"❌ Failed: {response.status_code}")
105
+ return False
106
+ except Exception as e:
107
+ print(f"❌ Error: {e}")
108
+ return False
109
+
110
+ # Test 7: Get specific language info
111
+ print("\n7️⃣ Testing /languages/swh_Latn endpoint...")
112
+ try:
113
+ response = requests.get(f"{base_url}/languages/swh_Latn")
114
+ if response.status_code == 200:
115
+ data = response.json()
116
+ print(f"✅ Swahili info:")
117
+ print(f" Name: {data['name']}")
118
+ print(f" Native: {data['native_name']}")
119
+ print(f" Region: {data['region']}")
120
+ print(f" Script: {data['script']}")
121
+ else:
122
+ print(f"❌ Failed: {response.status_code}")
123
+ return False
124
+ except Exception as e:
125
+ print(f"❌ Error: {e}")
126
+ return False
127
+
128
+ # Test 8: Test invalid language code
129
+ print("\n8️⃣ Testing invalid language code...")
130
+ try:
131
+ response = requests.get(f"{base_url}/languages/invalid_code")
132
+ if response.status_code == 404:
133
+ print("✅ Invalid language code properly rejected")
134
+ else:
135
+ print(f"❌ Expected 404, got: {response.status_code}")
136
+ return False
137
+ except Exception as e:
138
+ print(f"❌ Error: {e}")
139
+ return False
140
+
141
+ return True
142
+
143
+ def test_frontend_integration_example(base_url="http://localhost:8000"):
144
+ """Test a realistic frontend integration scenario"""
145
+
146
+ print("\n🎨 Testing Frontend Integration Scenario\n")
147
+
148
+ # Scenario: Building a language selector
149
+ print("📋 Scenario: Building a language selector for a translation app")
150
+
151
+ # Step 1: Get popular languages for quick selection
152
+ print("\n1️⃣ Getting popular languages for quick selection...")
153
+ popular_response = requests.get(f"{base_url}/languages/popular")
154
+ popular_langs = popular_response.json()['languages']
155
+ print(f" Found {len(popular_langs)} popular languages")
156
+
157
+ # Step 2: Get all languages for comprehensive search
158
+ print("\n2️⃣ Getting all languages for search functionality...")
159
+ all_response = requests.get(f"{base_url}/languages")
160
+ all_langs = all_response.json()['languages']
161
+ print(f" Found {len(all_langs)} total languages")
162
+
163
+ # Step 3: Validate a user's language selection
164
+ print("\n3️⃣ Validating user's language selection (swh_Latn)...")
165
+ validation_response = requests.get(f"{base_url}/languages/swh_Latn")
166
+ if validation_response.status_code == 200:
167
+ lang_info = validation_response.json()
168
+ print(f" ✅ Valid: {lang_info['name']} ({lang_info['native_name']})")
169
+
170
+ # Step 4: Perform translation with validated languages
171
+ print("\n4️⃣ Performing translation with validated languages...")
172
+ translation_data = {
173
+ "text": "Habari ya asubuhi",
174
+ "target_language": "eng_Latn"
175
+ }
176
+
177
+ translation_response = requests.post(
178
+ f"{base_url}/translate",
179
+ headers={"Content-Type": "application/json"},
180
+ data=json.dumps(translation_data)
181
+ )
182
+
183
+ if translation_response.status_code == 200:
184
+ result = translation_response.json()
185
+ print(f" ✅ Translation: '{translation_data['text']}' → '{result['translated_text']}'")
186
+ print(f" 🔍 Detected source: {result['source_language']}")
187
+
188
+ print("\n🎉 Frontend integration scenario completed successfully!")
189
+
190
+ if __name__ == "__main__":
191
+ import sys
192
+
193
+ # Allow custom base URL
194
+ base_url = "http://localhost:8000"
195
+ if len(sys.argv) > 1:
196
+ base_url = sys.argv[1]
197
+
198
+ print(f"🎯 Testing Language Endpoints at: {base_url}")
199
+ print("⚠️ Make sure the API server is running!\n")
200
+
201
+ # Run language endpoint tests
202
+ success = test_language_endpoints(base_url)
203
+
204
+ if success:
205
+ # Run frontend integration test
206
+ test_frontend_integration_example(base_url)
207
+ print("\n🎉 All language endpoint tests passed!")
208
+ else:
209
+ print("\n❌ Some language endpoint tests failed!")
210
+ sys.exit(1)
test_model_download.py → tests/test_model_download.py RENAMED
File without changes