A few questions that come to mind

by SL1D - opened Jun 18

SL1D

Jun 18

First of all, a big thanks for training and sharing this version. It looks very promising, though it still lacks some generalization (I know it's 0.1)!
I have been waiting for a German version of Chatterbox since the beginning.

Have you thought about using an open-source dataset such as Emilia for training?
What about the license of the audio you used? Do you want to use only material that is allowed for commercial use?
How long did the training take on your 2x RTX 3090?

If I can contribute something to your project, I am happy to assist.

SebastianBodza

Owner Jun 19

I just used a subset of my own dataset: 600k out of 2.4M data points. Adding open-source data such as the Emilia-YODAS subset, Common Voice, etc. is possible. However, really high-quality German data is the missing piece. It should include vocal expressions ([laugh], [yawn]) and more diverse transcripts (including "..." for pauses, different writing styles, enumerations, ...).
I only scraped permissively licensed audio files (cc-by(-sa)), and I will not use any NC data.
Training takes around 8 hours for the 600k samples.
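For anyone building a similar dataset, the license rule above (keep cc-by and cc-by-sa, drop anything NC) could be sketched roughly like this. It is only an illustration and not the actual scraping pipeline: the license strings, the `normalize_license` helper, and the `ALLOWED_LICENSES` set are all assumptions made up for the example.

```python
# Minimal sketch of a license filter; NOT the pipeline used for this model.
# The license strings and helper names below are hypothetical.

ALLOWED_LICENSES = {"cc-by", "cc-by-sa"}  # permissive, no NC variants


def normalize_license(raw: str) -> str:
    """Turn strings like 'CC BY-SA 4.0' into 'cc-by-sa' (version dropped)."""
    tokens = raw.lower().replace("_", "-").replace(" ", "-").split("-")
    # Keep only purely alphabetic tokens, which drops version numbers.
    return "-".join(t for t in tokens if t.isalpha())


def is_permissive(raw: str) -> bool:
    """True only for licenses in the allow-list; NC and unknown ones fail."""
    return normalize_license(raw) in ALLOWED_LICENSES


# Hypothetical metadata records for scraped audio files.
records = [
    {"file": "a.wav", "license": "CC BY 4.0"},
    {"file": "b.wav", "license": "CC BY-NC 4.0"},
    {"file": "c.wav", "license": "cc-by-sa"},
]
kept = [r["file"] for r in records if is_permissive(r["license"])]
print(kept)
```

Using an allow-list rather than a "contains NC" check means unknown or unexpected license strings are dropped by default.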

For the training pipeline, I still need to address a few issues, such as training T3CondEnc. Currently it is frozen, and training it results in strange audio being generated at the start. The weather is getting better, but with indoor temperatures nearing 40 °C because of the GPUs, it's becoming difficult to enjoy model training. So expect new results later rather than sooner 😃.

SL1D

Jun 20

Thanks for the detailed reply. Funnily enough, I just realized after checking your profile that we are from the same region.
I currently have only one RTX 4090, but if we can manage to fit the training into 24 GB of VRAM, I am happy to perform some training runs.

martin-nguyen

Jul 1

How many hours of audio are in the 600k samples you used to train this model?

SebastianBodza

Owner Jul 2

The full 2.4M samples are around 7k hours, so the 600k subset should be around 1.75k hours.
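The estimate above is a simple proportion. As a rough sketch, assuming the 600k subset has the same average clip length as the full dataset (which the thread does not confirm), it works out like this; the function and variable names are only illustrative.

```python
# Back-of-the-envelope estimate; assumes the subset has the same average
# clip length as the full dataset (an assumption, not a measured value).

TOTAL_SAMPLES = 2_400_000   # full dataset size from the thread
TOTAL_HOURS = 7_000         # "around 7k hours"
SUBSET_SAMPLES = 600_000    # samples used for this training run


def estimate_subset_hours(total_hours, total_samples, subset_samples):
    """Scale total audio hours by the fraction of samples used."""
    return total_hours * subset_samples / total_samples


subset_hours = estimate_subset_hours(TOTAL_HOURS, TOTAL_SAMPLES, SUBSET_SAMPLES)
avg_clip_seconds = TOTAL_HOURS * 3600 / TOTAL_SAMPLES

print(f"subset: ~{subset_hours:.0f} h, average clip: ~{avg_clip_seconds:.1f} s")
```

With these rounded inputs the subset comes out at about 1750 hours and the average clip at about 10.5 seconds.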

oddadmix

Jul 2

Great work, @SebastianBodza! Would you be able to share the training scripts? I am trying to do something similar for Arabic.
