Technical questions
Hello.
I'm in the process of trying to create an Italian model (building in public here: https://huggingface.co/alien79/f5-ita-test). This is my first attempt at training, so I'd love feedback from someone with more experience who has achieved this.
I'm currently using ~250 hours of audio and I rented 2 RTX 4090s on vast.ai. I have some doubts about the various settings to use.
I'm using the Gradio interface; these are my settings:
The batch size is not really straightforward to me; I had to do some trial and error, and I'm not sure if there is an "easy rule" to follow to set that value. Besides that, I've noticed my model losing the ability to clone the reference voice. Is that a sign of overtraining? Or is it the opposite, and I need to train it more?
I've run 50 epochs on 250 h of Italian (of course I've lost the English, because my dataset isn't mixed-language).
Also, is yours a fine-tune or a train from scratch? How many epochs on your dataset?
When I read people talking about steps it's confusing to me, because the step count changes with your memory and batch size. Wouldn't it be better to talk about hours of training audio × number of epochs?
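For what it's worth, here's the back-of-the-envelope math I've been using to relate the two. All the numbers are placeholders from my own run, and I'm assuming frame-based batching (batch size counted in mel frames, which is how the fine-tuning Gradio seems to expose it) at 24 kHz with hop length 256; adjust them to your setup:

```python
# Rough sketch: relate dataset hours, batch size, and epochs to optimizer steps.
# Assumptions (placeholders): 24 kHz audio, hop length 256, batch size given
# in mel frames per GPU, 2 GPUs, gradient accumulation of 1.

hours = 250
sample_rate = 24_000
hop_length = 256
frames_per_second = sample_rate / hop_length          # ~93.75 mel frames/s
total_frames = hours * 3600 * frames_per_second

batch_size_per_gpu = 38_400   # in frames; example value, not a recommendation
num_gpus = 2
grad_accum = 1
frames_per_step = batch_size_per_gpu * num_gpus * grad_accum

steps_per_epoch = total_frames / frames_per_step
epochs = 50
total_steps = steps_per_epoch * epochs
print(f"~{steps_per_epoch:,.0f} steps/epoch, ~{total_steps:,.0f} total steps")
```

So the step count is just a derived quantity: the same "hours × epochs" budget produces different step counts depending on the batch size your VRAM allows.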
Do you have any tips to share?
Thanks, and congrats on your work!
First of all, I did the fine-tuning using only the Gradio interface. I used a total of 40-50 hours of audio data for the Turkish training, and I let the interface adjust the fine-tuning settings automatically. I don't know how much usable audio remains once that much audio is segmented and transcribed, but it was automatically set to 70 epochs.
After finishing the fine-tuning, I took a look at the project and realized that there are some things that could be improved, e.g.:
- It uses Whisper large turbo for transcription. This increases the number of incorrect transcriptions.
- Transcripts are cased, but we don't speak in cased text in real life. That's why I think uncased text should be used for training and inference (see the normalization sketch after this list).
- If there are sounds other than speech in the audio dataset, they also cause the model to give bad results. For this reason, I rebuilt the dataset for the Turkish model and trained it a second time (see the VAD sketch below).
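To show what I mean by uncased text, here is a minimal sketch; the function name and the exact cleaning rules are my own illustration, not what the fine-tuning pipeline actually does:

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Lowercase and lightly clean a transcript for uncased training.

    Minimal illustration only; a real pipeline would also handle numbers,
    abbreviations, and language-specific punctuation.
    """
    text = unicodedata.normalize("NFKC", text)   # unify Unicode forms
    text = text.lower()                          # drop casing entirely
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

# Note: str.lower() maps "I" to "i", which is wrong for Turkish ("I" -> "ı");
# Turkish needs a locale-aware lowercasing step on top of this.
print(normalize_transcript("  Merhaba  DÜNYA! "))  # -> "merhaba dünya!"
```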
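And for filtering out non-speech audio, something like Silero VAD can measure how much of a clip is actually speech before it goes into the dataset. This is just a sketch of one possible approach; the file name and the 0.8 threshold are arbitrary examples:

```python
import torch

# Load the Silero VAD model and its helper functions from torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

# "clip.wav" is a placeholder; the helper resamples to 16 kHz mono here.
wav = read_audio("clip.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

# Keep the clip only if most of it is actually speech.
speech_samples = sum(seg["end"] - seg["start"] for seg in speech)
if speech_samples / len(wav) < 0.8:  # threshold is an arbitrary example
    print("clip.wav contains too much non-speech audio, dropping it")
```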
In a few weeks I'll be taking some vacation time for the new year, during which I plan to write a further development of this project, post it on GitHub, and train new TTS models with it.
Thank you very much for your interest.
Thank you for sharing these details! Your efforts to improve the Turkish model and your plans for further development sound incredibly exciting. We’re eagerly looking forward to seeing the updates on GitHub and learning more about your new TTS models.
If possible, could you also share more details about the training process? For example:
- What GPU did you use, and how long did the training take?
- Was the dataset composed of transcripts from a single speaker or multiple speakers?
- Any specific challenges you faced while preparing or processing the data?
Your insights would be very valuable for those of us interested in similar projects. Best of luck with your work, and enjoy your vacation when it comes!