FreeSVC: Zero-shot Multilingual Singing Voice Conversion
FreeSVC is a state-of-the-art multilingual singing voice conversion model designed for zero-shot conversion. It converts singing voices across languages without requiring extensive language-specific training. GitHub repository.
Supported Languages
Language | ID | Status | Speech Data | Singing Data |
---|---|---|---|---|
Chinese | 0 | ✅ Full | 255h | 70h |
Dutch | 1 | ✅ Full | Part of CML | - |
English | 2 | ✅ Full | 921h | 47h |
French | 3 | ✅ Full | Part of CML | - |
German | 4 | ✅ Full | Part of CML | - |
Italian | 5 | ✅ Full | Part of CML | - |
Japanese | 6 | ✅ Full | 30h | - |
Other* | 7 | ⚠️ Partial | - | 10h |
Polish | 8 | ✅ Full | Part of CML | - |
Portuguese | 9 | ✅ Full | Part of CML | - |
Spanish | 10 | ✅ Full | Part of CML | - |
*Note: The "Other" category is used for recordings of vocal techniques that carry no linguistic content.
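For scripting, the ID column above can be mirrored as a plain lookup. The following is a minimal sketch: `LANGUAGE_IDS` and `language_id` are hypothetical names chosen for illustration and are not part of the FreeSVC codebase. Only the ID values are taken from the table.

```python
# Language-to-ID mapping copied from the Supported Languages table.
# LANGUAGE_IDS and language_id are illustrative names, not FreeSVC APIs.
LANGUAGE_IDS = {
    "chinese": 0,
    "dutch": 1,
    "english": 2,
    "french": 3,
    "german": 4,
    "italian": 5,
    "japanese": 6,
    "other": 7,  # vocal techniques without linguistic content
    "polish": 8,
    "portuguese": 9,
    "spanish": 10,
}


def language_id(name: str) -> int:
    """Return the numeric ID for a language name (case-insensitive)."""
    key = name.strip().lower()
    if key not in LANGUAGE_IDS:
        supported = ", ".join(sorted(LANGUAGE_IDS))
        raise ValueError(f"Unsupported language {name!r}; expected one of: {supported}")
    return LANGUAGE_IDS[key]


if __name__ == "__main__":
    print(language_id("English"))
    print(language_id("Japanese"))
```

Raising on an unknown name, rather than silently falling back to "Other", keeps a typo from being mapped to the partially supported category by accident.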
Model Overview
FreeSVC leverages an enhanced VITS architecture integrated with Speaker-invariant Clustering (SPIN) and the ECAPA2 speaker encoder. This combination effectively separates speaker characteristics from linguistic content, ensuring high-quality and natural-sounding voice conversions across multiple languages.
Training Datasets
FreeSVC was trained on a diverse set of speech and singing datasets covering multiple languages:
Dataset | Hours | Language | Type |
---|---|---|---|
AISHELL-1 | 170h | Chinese | Speech |
AISHELL-3 | 85h | Chinese | Speech |
CML-TTS | 3.1kh | 7 Languages | Speech |
HiFiTTS | 292h | English | Speech |
JVS | 30h | Japanese | Speech |
LibriTTS-R | 585h | English | Speech |
NUS (NHSS) | 7h | English | Speech, Singing |
OpenSinger | 50h | Chinese | Singing |
Opencpop | 5h | Chinese | Singing |
PopBuTFy | 10h, 40h | Chinese, English | Singing |
POPCS | 5h | Chinese | Singing |
VCTK | 44h | English | Speech |
VocalSet | 10h | Other | Singing |
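The per-language hours in the Supported Languages table can be recovered by summing the rows above. The sketch below is a sanity check under three stated assumptions: PopBuTFy's "10h, 40h" is split as Chinese 10h and English 40h in the order listed, NUS (NHSS) is counted entirely under singing (the assignment that reproduces the 47h English singing figure), and CML-TTS is omitted because the table gives no per-language breakdown. The names `ROWS` and `totals_by_language` are illustrative only.

```python
from collections import defaultdict

# (dataset, hours, language, type) rows copied from the Training Datasets table.
# Assumptions: PopBuTFy split as 10h Chinese / 40h English; NUS (NHSS) counted
# as singing; CML-TTS left out (no per-language split is given).
ROWS = [
    ("AISHELL-1", 170, "Chinese", "Speech"),
    ("AISHELL-3", 85, "Chinese", "Speech"),
    ("HiFiTTS", 292, "English", "Speech"),
    ("JVS", 30, "Japanese", "Speech"),
    ("LibriTTS-R", 585, "English", "Speech"),
    ("NUS (NHSS)", 7, "English", "Singing"),
    ("OpenSinger", 50, "Chinese", "Singing"),
    ("Opencpop", 5, "Chinese", "Singing"),
    ("PopBuTFy", 10, "Chinese", "Singing"),
    ("PopBuTFy", 40, "English", "Singing"),
    ("POPCS", 5, "Chinese", "Singing"),
    ("VCTK", 44, "English", "Speech"),
    ("VocalSet", 10, "Other", "Singing"),
]


def totals_by_language(rows):
    """Sum hours per (language, data type) pair."""
    totals = defaultdict(int)
    for _dataset, hours, language, kind in rows:
        totals[(language, kind)] += hours
    return dict(totals)


totals = totals_by_language(ROWS)

if __name__ == "__main__":
    for (language, kind), hours in sorted(totals.items()):
        print(f"{language:<9} {kind:<8} {hours}h")
```

Under these assumptions the sums match the earlier table: 255h speech and 70h singing for Chinese, 921h speech and 47h singing for English, 30h speech for Japanese, and 10h singing for "Other".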
Citation
@misc{}