cesear64 posted an update 13 days ago
Just published: how we built production Sango (Central African Republic) translation without fine-tuning, a parallel corpus, or training compute.

The method — vocabulary-augmented prompting with a 581-entry native-speaker-verified lexicon — generalizes to any of the ~2,000 African languages at the same data-poverty level. Recipe, dataset, and code template all included.

📄 Blog: https://huggingface.co/blog/MEYNG/sangoai
📦 Dataset: MEYNG/sango-vocabulary

Would especially value feedback from anyone working on other low-resource African languages: Ewondo, Lingala, and Wolof are next on our roadmap.

I'm not working on low-resource African languages but this method sounds interesting.

So you put the orthography, grammar, and vocabulary in the prompt and then get the LLM to translate a language that it doesn't know. Clever!

Then once you have enough native speaker-verified Sango-French translations, you can bootstrap it to a full-fledged dataset...


Exactly right on the bootstrapping loop — that's precisely the progression we're running.

One clarification on the mechanism: the model has seen some Sango during pretraining (it appears in Common Crawl), but not enough to produce coherent translations cold. The vocabulary injection doesn't teach the language from scratch; it gives the model enough anchoring signal to activate what it weakly learned. The grammar rules and orthography notes handle the parts pretraining didn't cover reliably (tonal distinctions, diacritics, Sango-specific syntax).
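For readers who want to see the shape of the idea, here is a minimal sketch of vocabulary-augmented prompt assembly. It is not the template from the blog post: the function names, the prompt wording, and the selection rule (exact word match on the French side) are all assumptions for illustration, and the lexicon values are placeholders rather than real Sango forms.

```python
import re

# Hypothetical lexicon: French headword -> placeholder for the
# native-speaker-verified Sango form (real entries would come from the dataset).
LEXICON = {
    "eau": "<SANGO:eau>",
    "maison": "<SANGO:maison>",
    "manger": "<SANGO:manger>",
}

# Hypothetical language notes; the second one stands in for a real rule.
NOTES = [
    "Keep all diacritics from the vocabulary entries unchanged.",
    "<grammar or orthography rule supplied by a native speaker>",
]


def select_entries(source, lexicon):
    """Keep only lexicon entries whose French headword occurs in the source."""
    words = set(re.findall(r"\w+", source.lower()))
    return {fr: sg for fr, sg in lexicon.items() if fr in words}


def build_prompt(source, lexicon, notes):
    """Assemble a translation prompt from notes plus the matching vocabulary."""
    entries = select_entries(source, lexicon)
    vocab = "\n".join(f"- {fr} = {sg}" for fr, sg in sorted(entries.items()))
    rules = "\n".join(f"- {note}" for note in notes)
    return (
        "Translate the French sentence into Sango.\n"
        f"Language notes:\n{rules}\n"
        f"Verified vocabulary (use these forms exactly):\n{vocab}\n"
        f"French: {source}\n"
        "Sango:"
    )


print(build_prompt("L'eau de la maison", LEXICON, NOTES))
```

Exact word matching will miss inflected forms, so a real pipeline would likely need lemmatization or fuzzy matching on the source side.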

And yes, the loop you're describing is live: the vocabulary-augmented outputs → native-speaker verification → parallel corpus → fine-tuned NMT model. We just published BENCH-001 results on the fine-tune: +5.70 BLEU over baseline on French→Sango, +9.10 on Sango→French. The vocabulary-augmented prompting approach (BLEU 2.92 on the same task, zero fine-tuning) is the floor; the fine-tune is what you get once the dataset is big enough.

The data pipeline post documenting that second step just went up here: https://huggingface.co/blog/MEYNG/sango-vocabulary-pipeline
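As a rough illustration of the verification step in that loop, the sketch below turns prompted drafts into parallel-corpus pairs only after a reviewer has signed off. The `Candidate` type, its field names, and the output keys are hypothetical; the actual pipeline is described in the linked post and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    french: str
    sango_draft: str                   # output of the vocabulary-augmented prompt
    verified: bool = False             # set by a native-speaker reviewer
    correction: Optional[str] = None   # reviewer's fix, if the draft was wrong


def to_parallel_corpus(candidates):
    """Return French/Sango pairs for verified candidates only.

    A reviewer's correction, when present, replaces the model draft.
    """
    pairs = []
    for c in candidates:
        if not c.verified:
            continue  # unreviewed drafts never enter the training data
        pairs.append({"french": c.french, "sango": c.correction or c.sango_draft})
    return pairs


# Placeholder strings stand in for real sentences.
batch = [
    Candidate("<phrase 1>", "<draft 1>", verified=True),
    Candidate("<phrase 2>", "<draft 2>", verified=True, correction="<fixed 2>"),
    Candidate("<phrase 3>", "<draft 3>"),  # not yet reviewed
]
corpus = to_parallel_corpus(batch)
```

Keeping unverified drafts out of the corpus is the point of the design: the fine-tune only ever trains on text a native speaker has approved.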

The interesting open question is where the ceiling is for a 600M-parameter model on a language with ~5M speakers and sparse digitized text. We're nowhere near it yet.
