FreeSVC: Zero-shot Multilingual Singing Voice Conversion

FreeSVC is a state-of-the-art multilingual singing voice conversion model designed for zero-shot conversion. It converts singing voices across languages without extensive language-specific training. Code is available in the project's GitHub repository.

Supported Languages

Language	ID	Status	Speech Data	Singing Data
Chinese	0	✅ Full	255h	70h
Dutch	1	✅ Full	Part of CML	-
English	2	✅ Full	921h	47h
French	3	✅ Full	Part of CML	-
German	4	✅ Full	Part of CML	-
Italian	5	✅ Full	Part of CML	-
Japanese	6	✅ Full	30h	-
Other*	7	⚠️ Partial	-	10h
Polish	8	✅ Full	Part of CML	-
Portuguese	9	✅ Full	Part of CML	-
Spanish	10	✅ Full	Part of CML	-

*Note: The "Other" category covers recordings of vocal techniques that carry no linguistic content (no lyrics), such as the VocalSet data listed under Training Datasets.
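The ID column above can be captured as a small lookup table. The following is a minimal sketch, not part of the FreeSVC codebase: the names `LANGUAGE_IDS` and `language_id` are hypothetical, and only the name-to-ID pairs are taken from the table.

```python
# Language name -> ID, copied from the Supported Languages table.
# LANGUAGE_IDS and language_id are illustrative names, not FreeSVC APIs.
LANGUAGE_IDS = {
    "Chinese": 0,
    "Dutch": 1,
    "English": 2,
    "French": 3,
    "German": 4,
    "Italian": 5,
    "Japanese": 6,
    "Other": 7,  # vocal techniques without linguistic content
    "Polish": 8,
    "Portuguese": 9,
    "Spanish": 10,
}


def language_id(name: str) -> int:
    """Return the ID for a language name, ignoring case and outer spaces."""
    key = name.strip().capitalize()
    if key not in LANGUAGE_IDS:
        known = ", ".join(sorted(LANGUAGE_IDS))
        raise ValueError(f"Unknown language {name!r}; expected one of: {known}")
    return LANGUAGE_IDS[key]


if __name__ == "__main__":
    print(language_id("english"))
    print(language_id(" Spanish "))
```

Raising an error for unknown names, rather than silently falling back to "Other", avoids conditioning the model on the wrong language by accident.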

Model Overview

FreeSVC builds on an enhanced VITS architecture that integrates Speaker-invariant Clustering (SPIN) and the ECAPA2 speaker encoder. Together, these components separate speaker characteristics from linguistic content, which lets the model produce high-quality, natural-sounding conversions across multiple languages.
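The separation described above can be pictured as a data-flow sketch: content features come from the source singing, while the speaker embedding comes from a reference recording of the target voice. Everything below is hypothetical and purely illustrative. `ConversionInputs`, `prepare_inputs`, and the toy encoders are stand-ins invented for this example, not the real SPIN or ECAPA2 interfaces, and passing the language ID as a conditioning input is an assumption based on the ID column in the Supported Languages table.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class ConversionInputs:
    """Hypothetical bundle of what a conversion step would consume."""
    content: List[Vector]  # per-frame content features from the SOURCE singing
    speaker: Vector        # one embedding from the TARGET speaker reference
    language_id: int       # ID from the Supported Languages table (assumed input)


def prepare_inputs(
    source_audio: Sequence[float],
    reference_audio: Sequence[float],
    language_id: int,
    content_encoder: Callable[[Sequence[float]], List[Vector]],
    speaker_encoder: Callable[[Sequence[float]], Vector],
) -> ConversionInputs:
    """Take content from the source and speaker identity from the reference."""
    if not 0 <= language_id <= 10:
        raise ValueError(f"language_id must be in 0..10, got {language_id}")
    return ConversionInputs(
        content=content_encoder(source_audio),
        speaker=speaker_encoder(reference_audio),
        language_id=language_id,
    )


# Toy stand-ins so the sketch runs; real encoders are neural networks.
def toy_content_encoder(audio: Sequence[float]) -> List[Vector]:
    return [[sample] for sample in audio]  # one 1-D "feature" per sample


def toy_speaker_encoder(audio: Sequence[float]) -> Vector:
    return [sum(audio) / len(audio)]  # a single summary value


if __name__ == "__main__":
    inputs = prepare_inputs(
        [0.1, 0.2, 0.3], [1.0, 3.0], 2, toy_content_encoder, toy_speaker_encoder
    )
    print(inputs)
```

The point of the sketch is only the routing: in a zero-shot setting the reference speaker need not appear in the training data, because identity enters solely through the speaker embedding.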

Training Datasets

FreeSVC was trained on a diverse set of speech and singing datasets covering multiple languages:

Dataset	Hours	Language	Type
AISHELL-1	170h	Chinese	Speech
AISHELL-3	85h	Chinese	Speech
CML-TTS	~3,100h	7 Languages	Speech
HiFiTTS	292h	English	Speech
JVS	30h	Japanese	Speech
LibriTTS-R	585h	English	Speech
NUS (NHSS)	7h	English	Speech, Singing
OpenSinger	50h	Chinese	Singing
Opencpop	5h	Chinese	Singing
PopBuTFy	10h (Chinese), 40h (English)	Chinese, English	Singing
POPCS	5h	Chinese	Singing
VCTK	44h	English	Speech
VocalSet	10h	Other	Singing
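As a sanity check, the per-dataset hours above can be summed per language and compared with the Speech Data and Singing Data columns of the Supported Languages table. The sketch below makes two assumptions that are not stated in the source: the 7h of NUS (NHSS) are counted as singing only, and PopBuTFy is split as 10h Chinese and 40h English. Under those assumptions the totals match the language table. CML-TTS is left out because its per-language breakdown is not given here.

```python
from collections import defaultdict

# (dataset, language, type, hours), copied from the Training Datasets table.
# Assumptions: NUS (NHSS) counted as singing; PopBuTFy split 10h/40h.
DATASETS = [
    ("AISHELL-1", "Chinese", "speech", 170),
    ("AISHELL-3", "Chinese", "speech", 85),
    ("HiFiTTS", "English", "speech", 292),
    ("JVS", "Japanese", "speech", 30),
    ("LibriTTS-R", "English", "speech", 585),
    ("NUS (NHSS)", "English", "singing", 7),
    ("OpenSinger", "Chinese", "singing", 50),
    ("Opencpop", "Chinese", "singing", 5),
    ("PopBuTFy", "Chinese", "singing", 10),
    ("PopBuTFy", "English", "singing", 40),
    ("POPCS", "Chinese", "singing", 5),
    ("VCTK", "English", "speech", 44),
    ("VocalSet", "Other", "singing", 10),
]


def hours_by_language(rows):
    """Sum hours per (language, type) pair."""
    totals = defaultdict(int)
    for _dataset, language, kind, hours in rows:
        totals[(language, kind)] += hours
    return dict(totals)


if __name__ == "__main__":
    for (language, kind), hours in sorted(hours_by_language(DATASETS).items()):
        print(f"{language:10s} {kind:8s} {hours}h")
```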

Citation

@misc{}