[FAQ] Alternatives to Finetuning Kokoro

#19
by hexgrad

A very frequently asked question is how to finetune Kokoro, i.e. continue training the checkpoint uploaded in this repository. This is currently not feasible: for a number of reasons, the additional models required to facilitate this have not yet been open-sourced.

However, there are a few alternatives available to you in the meantime (staying in the realm of open source speech models). Please be aware that there are varying degrees of difficulty to each option, and some could be unsuitable depending on your technical abilities and/or requirements.

1. Use a Speech-to-Speech model like RVC

First, generate the speech using Kokoro. Then pipe the TTS output into RVC-Project/Retrieval-based-Voice-Conversion-WebUI or a similar speech-to-speech model (Beatrice v2, see also w-okada/voice-changer).

Pros: There are many pre-trained RVC models readily available (search "rvc models"), and you can also train your own RVC model.
Cons: You have to run a separate model after TTS, which adds latency and increases the inference-time compute footprint. Results will also probably fall short of a true base-model finetune.
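The two-stage pipeline above can be sketched as chained calls. The `tts` and `voice_convert` functions below are hypothetical stubs standing in for Kokoro inference and an RVC-style model respectively; only the glue (and the latency implication) is the point.

```python
import numpy as np

SR = 24_000  # Kokoro outputs 24 kHz mono audio

def tts(text: str) -> np.ndarray:
    """Hypothetical stub for Kokoro inference: returns float32 mono audio."""
    return np.zeros(int(0.5 * SR), dtype=np.float32)

def voice_convert(audio: np.ndarray, sr: int) -> np.ndarray:
    """Hypothetical stub for an RVC-style speech-to-speech model."""
    return audio.copy()

def tts_then_vc(text: str) -> np.ndarray:
    # The second stage only starts once the first finishes, so total
    # latency is the sum of both models' inference times.
    wav = tts(text)
    return voice_convert(wav, SR)

out = tts_then_vc("Hello from Kokoro")
print(out.dtype, round(len(out) / SR, 2))  # float32 0.5
```

In practice you would write the intermediate waveform to a file (or pass the array directly) into the RVC inference entry point of your chosen project.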

2. Train your own StyleTTS 2 Model

Kokoro v0.19 was trained on relatively little data and transparently uses a StyleTTS 2 architecture, so if you have proficiency in training models and the required compute, you can train your own from the public checkpoints. Here are some resources:

It is also possible to train StyleTTS 2 models in other languages, although this can be more difficult than English in terms of tokenization & g2p and/or data procurement:
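To make the tokenization point concrete, here is a minimal sketch of the per-language step a StyleTTS 2-style pipeline needs: mapping g2p output (an IPA string) to integer ids. The symbol inventory below is illustrative only, not Kokoro's actual vocabulary.

```python
# Illustrative phoneme inventory; a real one is derived from your
# g2p backend's output for the target language.
SYMBOLS = ["<pad>"] + list("abdehlmnoʊˈ ")
SYM_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

def tokenize(ipa: str) -> list[int]:
    # Unknown symbols are the usual failure mode when moving to a new
    # language: extend SYMBOLS rather than silently dropping them.
    missing = [ch for ch in ipa if ch not in SYM_TO_ID]
    if missing:
        raise KeyError(f"symbols not in inventory: {missing}")
    return [SYM_TO_ID[ch] for ch in ipa]

print(tokenize("ˈheloʊ"))  # [11, 5, 4, 6, 9, 10]
```

For a new language, most of the work is choosing a reliable g2p backend and auditing its output so the inventory covers everything it can emit.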

Pros: Full customizability and ownership over your trained model.
Cons: Requires compute, data, and technical skills.

3. Zero-shot or train a different TTS architecture

In no particular order, here are some links to other open-source TTS models (although not all are permissive):

Pros: Training may not be required at all (zero-shot), and base models trained on more data may need less data and compute to finetune.
Cons: Licenses, parameter counts, and resulting output quality vary.
