A few questions that came to mind
First of all, a big thanks for training and sharing this version. It looks very promising, though it still lacks some generalization (I know it's 0.1)!
I have been waiting for a German version of Chatterbox since the beginning.
- Did you think about using an open-source dataset such as Emilia for training?
- What about the license of the provided audio? Do you intend to use only material that is cleared for commercial use?
- How long did the training take on your 2xRTX3090?
If I can contribute something to your project, I am happy to assist.
- I only used a subset of my own dataset: 600k out of 2.4M data points. Adding open-source data such as the Emilia-Yodas subset, Common Voice, etc. is possible. However, really high-quality German data is the missing piece. It should include vocal expressions ([laugh], [yawn]) and more diverse transcripts (including "..." for pauses, different writing styles, enumerations, ...).
- I only scraped permissively licensed audio files (CC-BY and CC-BY-SA), and I will not use any NC data.
- Training takes around 8h for the 600k samples.
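For anyone planning their own runs, the numbers above give a rough throughput figure. The sketch below is only back-of-the-envelope arithmetic: it assumes the roughly 8 h covers a single pass over the 600k samples and that time scales linearly with sample count, neither of which is confirmed in this thread.

```python
# Rough throughput estimate from the figures quoted above.
# Assumptions (not confirmed in the thread): the ~8 h is one pass over
# the 600k samples, and training time scales linearly with sample count.

SAMPLES_TRAINED = 600_000   # subset used for the 0.1 release
HOURS_TAKEN = 8.0           # "around 8h", so treat results as approximate
FULL_DATASET = 2_400_000    # full size of the private dataset


def samples_per_second(samples: int, hours: float) -> float:
    """Average number of samples processed per second."""
    return samples / (hours * 3600)


def estimated_hours(samples: int, rate: float) -> float:
    """Hours needed for `samples` at `rate` samples per second."""
    return samples / rate / 3600


rate = samples_per_second(SAMPLES_TRAINED, HOURS_TAKEN)
full_hours = estimated_hours(FULL_DATASET, rate)

print(f"{rate:.1f} samples/s")            # 20.8 samples/s
print(f"{full_hours:.0f} h for 2.4M")     # 32 h for 2.4M
```

On a single GPU the figure would of course be different; the point is only to have a baseline to compare against.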
For the training pipeline, I still need to address a few issues, such as training T3CondEnc. Currently it is frozen, and training it results in strange audio being generated at the start. The weather is getting better, but with temperatures nearing 40 °C inside due to the GPUs, it's becoming difficult to enjoy model training. So expect new results later rather than sooner 😃.
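For readers unfamiliar with what "frozen" means here, this is a minimal PyTorch sketch of freezing one submodule while training the rest. `TinyT3`, `cond_enc`, and `backbone` are hypothetical stand-ins invented for illustration; this is not the actual Chatterbox model or the training pipeline used for this release.

```python
import torch
from torch import nn


class TinyT3(nn.Module):
    """Hypothetical stand-in model, NOT the real Chatterbox T3."""

    def __init__(self) -> None:
        super().__init__()
        self.cond_enc = nn.Linear(8, 8)   # plays the role of T3CondEnc
        self.backbone = nn.Linear(8, 8)   # everything else that is trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(self.cond_enc(x))


def freeze(module: nn.Module) -> None:
    """Stop gradient updates for every parameter in `module`."""
    for param in module.parameters():
        param.requires_grad_(False)


model = TinyT3()
freeze(model.cond_enc)

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One dummy step: the frozen encoder receives no gradients.
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
optimizer.step()
```

Unfreezing later is the same loop with `requires_grad_(True)`, often with a smaller learning rate for the newly unfrozen part.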
Thanks for the detailed reply. Funnily enough, I just realized that we are from the same region after checking your profile.
I currently have only one RTX 4090, but if we manage to fit training into 24 GB of VRAM, I am happy to perform some training runs.