A few questions that come to mind

#1
by SL1D - opened

First of all, a big thanks for training and sharing this version. It looks very promising, though it still lacks some generalization (I know it's only 0.1)!
I have been waiting for a German version of Chatterbox since the beginning.

  1. Have you thought about using an open-source dataset such as Emilia for training?
  2. What about the license of the audio you used? Do you want to use only material that is cleared for commercial use?
  3. How long did the training take on your 2x RTX 3090?

If I can contribute something to your project, I'm happy to help.

  1. I only used a subset of my own dataset: 600k out of 2.4M data points. Adding open-source data such as the Emilia-Yodas subset, Common Voice, etc. is possible. However, really high-quality German data is the missing piece. It should include vocal expressions ([laugh], [yawn]) and more diverse transcripts (including "..." for pauses, different writing styles, enumerations, ...).
  2. I only scraped audio files with permissive licenses (cc-by(-sa)), and I will not use any NC data.
  3. Training takes around 8h for the 600k samples.
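
As a rough illustration of the license rule in point 2, here is a minimal sketch that keeps only CC-BY and CC-BY-SA clips and drops everything else, including NC variants. The metadata field name `license`, the tag spellings, and the helper names are assumptions for illustration only, not the actual scraping pipeline.

```python
# Hypothetical license tags; the thread only names cc-by and cc-by-sa as allowed.
ALLOWED_LICENSES = {"cc-by", "cc-by-sa"}


def is_permissive(license_tag: str) -> bool:
    """Return True for CC-BY / CC-BY-SA tags, ignoring case and version suffixes."""
    tag = license_tag.strip().lower().replace("_", "-")
    parts = tag.split("-")
    # Drop a trailing version such as "4.0" so "cc-by-sa-4.0" matches "cc-by-sa".
    if parts and parts[-1].replace(".", "").isdigit():
        parts = parts[:-1]
    return "-".join(parts) in ALLOWED_LICENSES


def filter_clips(clips):
    """Keep only clips whose (assumed) 'license' field is permissive."""
    return [clip for clip in clips if is_permissive(clip.get("license", ""))]


if __name__ == "__main__":
    sample = [
        {"id": "a", "license": "CC-BY-4.0"},
        {"id": "b", "license": "cc-by-nc-4.0"},
        {"id": "c", "license": "cc-by-sa"},
        {"id": "d"},  # missing license: treated as not permissive
    ]
    print([clip["id"] for clip in filter_clips(sample)])
```

Treating a missing or unrecognized tag as not permissive is a deliberately conservative choice in this sketch.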

For the training pipeline, I still need to address a few issues, such as training T3CondEnc. Currently it is frozen, and training it results in strange audio being generated at the start. The weather is getting better, but with indoor temperatures nearing 40 °C because of the GPUs, it's becoming difficult to enjoy model training. So expect new results later rather than sooner 😃.

Thanks for the detailed reply. Funnily enough, I just realized after checking your profile that we are from the same region.
I currently have only one RTX 4090, but if we can get training to fit in 24 GB of VRAM, I'm happy to perform some training runs.
