Inconsistent speaker voice selection
I tried generating output using only speaker ID 0, but I'm not getting a consistent voice; it seems to change across generations. Did you experience the same issue? Were you able to get the same voice every time, or did it vary for you too?
Hello! The reason you are getting random voices is that this model is the base model - speaker IDs just exist to ensure speaker consistency in conversation. In order to generate a consistent voice in the base model, you need to provide context. Try giving one or two samples of the voice you want to generate beforehand! That way, you can get a consistent voice.
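To illustrate the idea of providing context, here is a minimal sketch. It is not this model's actual API: `ContextSegment` and `build_generation_request` are hypothetical names made up for this example, and the audio is a placeholder list rather than a real waveform. It only shows the shape of the input, which is a couple of reference utterances from the target speaker bundled with the new text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextSegment:
    """One reference utterance (hypothetical type, not the model's real class)."""
    speaker: int
    text: str
    audio: List[float]  # placeholder for waveform samples


def build_generation_request(reference: List[ContextSegment], text: str, speaker: int) -> dict:
    """Bundle reference segments with the new text so the model can condition on the voice."""
    # Keep only references spoken by the speaker we want to reproduce.
    same_speaker = [seg for seg in reference if seg.speaker == speaker]
    if not same_speaker:
        raise ValueError("provide at least one reference sample for the target speaker")
    return {"context": same_speaker, "text": text, "speaker": speaker}


# One or two samples of the desired voice, all under the same speaker ID.
reference = [
    ContextSegment(speaker=0, text="Hello, this is my normal speaking voice.", audio=[0.0] * 16),
    ContextSegment(speaker=0, text="Here is a second sample of the same voice.", audio=[0.0] * 16),
]
request = build_generation_request(reference, "Now say this in the same voice.", speaker=0)
```

Reusing the same reference samples for every generation is what keeps the voice stable; the speaker ID alone does not pin it down in the base model.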
We have fine-tuned the model with 1,000 samples from a single speaker using a single speaker ID, yet we still can't get a consistent speaker voice from the model.
Try adjusting the learning rate and batch size, and make sure the model overfits the training samples a little! If that doesn't work, some rejection training on wrong samples may be required, preferably with offline RL methods such as DPO or KTO. We have DPO and KTO implementations in the GitHub repository if you want to try them out, but please note that they are very experimental!
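For reference, the standard DPO objective on a single preference pair can be sketched as below. This is not the repository's implementation, just the textbook loss in plain Python; the function name, argument names, and the default `beta=0.1` are assumptions for illustration. Here "chosen" would be a generation in the correct voice and "rejected" a generation in the wrong voice, each scored by its summed log-probability under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how much more likely each sample is under the policy
    # than under the frozen reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss drops below log(2) once the policy prefers the correct-voice sample
# more strongly than the reference model does.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

Minimizing this loss pushes the model toward the correct-voice samples and away from the wrong-voice ones, while the reference terms keep it from drifting too far from the starting checkpoint.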