Dhwani - Indic Speech To Text Translation
Introduction
Dhwani enables speech-to-text translation for Indic languages. It supports translation from an Indic language (X) → English and vice versa.
Model Summary
The current model is trained using the SALMONN architecture.
PreTraining
- Speech Encoder: Utilizes the Whisper model's speech encoder to process speech inputs.
- Audio Encoder: Employs the BEATs audio encoder for non-speech audio inputs, such as environmental sounds and music.
- Connection Module: Uses the Window-Level Query Transformer (Q-Former) to bridge the audio encoders and the Large Language Model (LLM).
- Large Language Model (LLM): The Krutrim LLM receives the processed tokens, handling audio-derived information.
- Adaptation Mechanism: Low-Rank Adaptation (LoRA) is applied to fine-tune the LLM to align the audio inputs with the model's output.
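The connection module above compresses variable-length encoder output into a small, fixed number of tokens per audio window before they reach the LLM. The following is a minimal NumPy sketch of that idea, not the actual SALMONN implementation: all shapes, weights, and the single-head attention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_qformer(frames, queries, Wq, Wk, Wv, window=8):
    """Toy window-level Q-Former: trainable queries cross-attend to each
    window of encoder frames, yielding a fixed token count per window.
    frames: (T, d), queries: (nq, d) -> output: (T // window * nq, d)."""
    out = []
    d = frames.shape[1]
    for start in range(0, frames.shape[0], window):
        win = frames[start:start + window]           # (w, d) frames in window
        q = queries @ Wq                             # (nq, d) projected queries
        k = win @ Wk                                 # (w, d) keys from frames
        v = win @ Wv                                 # (w, d) values from frames
        att = softmax(q @ k.T / np.sqrt(d))          # (nq, w) attention weights
        out.append(att @ v)                          # (nq, d) tokens for window
    return np.concatenate(out, axis=0)

# Toy shapes: 96 encoder frames of dim 32, 1 query per window of 8 frames.
T, d, nq, w = 96, 32, 1, 8
frames = rng.normal(size=(T, d))
queries = rng.normal(size=(nq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
tokens = window_qformer(frames, queries, Wq, Wk, Wv, window=w)
print(tokens.shape)  # (12, 32): 12 windows x 1 query each
```

The key property is that the LLM sees a token count proportional to the audio duration (one small group per window) rather than one token per encoder frame.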
PostTraining
To adapt the Q-Former and LoRA weights, we used the techniques described in the IndicST paper. Along with the IndicST translation dataset, we also used in-house translation data to further improve translation performance.
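Post-training updates only the small LoRA factors while the pretrained LLM weights stay frozen. A minimal NumPy sketch of the LoRA idea (toy sizes, not the actual training code; `alpha` and `r` are illustrative hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 4, 8     # toy sizes; rank r << d

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, r))                 # init to zero so the update starts at 0

def lora_forward(x):
    # frozen base path + low-rank update scaled by alpha / r
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
print("trainable params:", A.size + B.size, "vs full:", W.size)
# prints: trainable params: 512 vs full: 4096
```

Only `A` and `B` (512 parameters here, versus 4096 for the full matrix) would receive gradients, which is what makes adapting a large LLM to the audio modality tractable.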
Evaluation Results
En → Indic (X) BLEU Scores:
| Language Pair | BLEU Score |
|---|---|
| en → hin | 57.7 |
| en → guj | 44.3 |
| en → mar | 43.3 |
| en → ben | 49.0 |
| en → tam | 47.0 |
| en → tel | 40.8 |
| en → mal | 39.0 |
| en → kan | 47.0 |
| Average | 46.0 |
Indic (X) → En BLEU Scores:
| Language Pair | BLEU Score |
|---|---|
| hin → en | 35.7 |
| guj → en | 34.6 |
| mar → en | 33.2 |
| ben → en | 19.2 |
| tam → en | 25.4 |
| tel → en | 17.4 |
| mal → en | 38.9 |
| kan → en | 28.0 |
| Average | 30.0 |
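The scores above are standard corpus-level BLEU (the model card does not name the scoring tool; sacreBLEU is the common choice). For intuition about what the metric measures, here is a minimal pure-Python sketch of single-reference corpus BLEU with uniform 1–4-gram weights and a brevity penalty; it is illustrative, not the evaluation script used for the table.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU (single reference, uniform 1..max_n-gram weights)."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: punish hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

hyp = ["the cat sat on the mat"]
ref = ["the cat sat on the mat"]
print(round(corpus_bleu(hyp, ref), 1))  # a perfect match scores 100.0
```

Real evaluations should use a standard tool such as sacreBLEU, whose tokenization and smoothing choices make scores comparable across papers.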
API Platform
Visit Dhwani Online to access the model via the web interface.
How to run inference from the CLI
- Clone the repository: `git clone https://github.com/ola-krutrim/Dhwani`
- Create the environment: `conda create -n dhwani_env python=3.9.17`
- Activate the environment: `conda activate dhwani_env`
- Install the requirements: `pip install -r requirements.txt`
- Run the CLI on an A100-SXM-80GB GPU: `python3 cli_inference.py --cfg-path configs/decode_config.yaml`. You can then input a `wav_path` and a `prompt`.
How to infer the model
- Follow steps 1–3 of the CLI inference instructions above.
- Run on an A100-SXM-80GB GPU: `python3 infer.py --cfg-path configs/decode_config.yaml`
License
This code repository and the model weights are licensed under the Krutrim Community License.
Citation
@inproceedings{sanket2025IndicST,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Sanket Shah and Kavya Ranjan Saxena and Kancharana Manideep Bharadwaj and Sharath Adavanne and Nagaraj Adiga},
  booktitle={Proc. ICASSP},
  year={2025}
}
Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.