Dhwani - Indic Speech To Text Translation
Introduction
Dhwani enables speech-to-text translation for Indic languages. It supports translation from an Indic language (X) → English and vice versa.
Model Summary
The current model is trained using the SALMONN architecture.
PreTraining
- Speech Encoder: Utilizes the Whisper model's speech encoder to process speech inputs.
- Audio Encoder: Employs the BEATs audio encoder for non-speech audio inputs, such as environmental sounds and music.
- Connection Module: Uses the Window-Level Query Transformer (Q-Former) to bridge the audio encoders and the Large Language Model (LLM).
- Large Language Model (LLM): The Krutrim LLM receives the processed tokens, handling audio-derived information.
- Adaptation Mechanism: Low-Rank Adaptation (LoRA) is applied to fine-tune the LLM to align the audio inputs with the model's output.
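The connection module above compresses variable-length encoder output into a small, fixed number of tokens per audio window before they reach the LLM. The following is a minimal NumPy sketch of that idea, not the actual SALMONN implementation: all shapes, weights, and the single-head attention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_qformer(frames, queries, Wq, Wk, Wv, window=8):
    """Toy window-level Q-Former: trainable queries cross-attend to each
    window of encoder frames, yielding a fixed token count per window.
    frames: (T, d), queries: (nq, d) -> output: (T // window * nq, d)."""
    out = []
    d = frames.shape[1]
    for start in range(0, frames.shape[0], window):
        win = frames[start:start + window]           # (w, d) frames in window
        q = queries @ Wq                             # (nq, d) projected queries
        k = win @ Wk                                 # (w, d) keys from frames
        v = win @ Wv                                 # (w, d) values from frames
        att = softmax(q @ k.T / np.sqrt(d))          # (nq, w) attention weights
        out.append(att @ v)                          # (nq, d) tokens for window
    return np.concatenate(out, axis=0)

# Toy shapes: 96 encoder frames of dim 32, 1 query per window of 8 frames.
T, d, nq, w = 96, 32, 1, 8
frames = rng.normal(size=(T, d))
queries = rng.normal(size=(nq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
tokens = window_qformer(frames, queries, Wq, Wk, Wv, window=w)
print(tokens.shape)  # (12, 32): 12 windows x 1 query each
```

The key property is that the LLM sees a token count proportional to the audio duration (one small group per window) rather than one token per encoder frame.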
PostTraining
To adapt the Q-Former and LoRA weights, we used the techniques described in the IndicST paper. Along with the IndicST translation dataset, we also used in-house translation data to further improve translation performance.
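Post-training updates only the small LoRA factors while the pretrained LLM weights stay frozen. A minimal NumPy sketch of the LoRA idea (toy sizes, not the actual training code; `alpha` and `r` are illustrative hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 4, 8     # toy sizes; rank r << d

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, r))                 # init to zero so the update starts at 0

def lora_forward(x):
    # frozen base path + low-rank update scaled by alpha / r
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
print("trainable params:", A.size + B.size, "vs full:", W.size)
# prints: trainable params: 512 vs full: 4096
```

Only `A` and `B` (512 parameters here, versus 4096 for the full matrix) would receive gradients, which is what makes adapting a large LLM to the audio modality tractable.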
Evaluation Results
En → Indic (X) BLEU Scores:
| Language Pair | BLEU Score |
|---|---|
| en → hin | 57.7 |
| en → guj | 44.3 |
| en → mar | 43.3 |
| en → ben | 49.0 |
| en → tam | 47.0 |
| en → tel | 40.8 |
| en → mal | 39.0 |
| en → kan | 47.0 |
| Average | 46.0 |
Indic (X) → En BLEU Scores:
| Language Pair | BLEU Score |
|---|---|
| hin → en | 35.7 |
| guj → en | 34.6 |
| mar → en | 33.2 |
| ben → en | 19.2 |
| tam → en | 25.4 |
| tel → en | 17.4 |
| mal → en | 38.9 |
| kan → en | 28.0 |
| Average | 30.0 |
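The scores above are standard corpus-level BLEU (the model card does not name the scoring tool; sacreBLEU is the common choice). For intuition about what the metric measures, here is a minimal pure-Python sketch of single-reference corpus BLEU with uniform 1–4-gram weights and a brevity penalty; it is illustrative, not the evaluation script used for the table.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU (single reference, uniform 1..max_n-gram weights)."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: punish hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

hyp = ["the cat sat on the mat"]
ref = ["the cat sat on the mat"]
print(round(corpus_bleu(hyp, ref), 1))  # a perfect match scores 100.0
```

Real evaluations should use a standard tool such as sacreBLEU, whose tokenization and smoothing choices make scores comparable across papers.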
API Platform
Visit Dhwani Online to access the model via the web interface.
How to run inference from the CLI
- Clone the repository: `git clone https://github.com/ola-krutrim/Dhwani`
- Create the environment: `conda create -n dhwani_env python=3.9.17`
- Activate the environment: `conda activate dhwani_env`
- Install the requirements: `pip install -r requirements.txt`
- Run the CLI on an A100-SXM-80GB GPU: `python3 cli_inference.py --cfg-path configs/decode_config.yaml`. You can then input a `wav_path` and a `prompt`.
How to infer the model
- Follow steps 1–3 of the CLI inference instructions above.
- Run on an A100-SXM-80GB GPU: `python3 infer.py --cfg-path configs/decode_config.yaml`
License
This code repository and the model weights are licensed under the Krutrim Community License.
Citation
@inproceedings{sanket2025IndicST,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Sanket Shah and Kavya Ranjan Saxena and Kancharana Manideep Bharadwaj and Sharath Adavanne and Nagaraj Adiga},
  booktitle={Proc. ICASSP},
  year={2025}
}
Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.