---
license: cc-by-4.0
datasets:
- speechcolab/gigaspeech
- parler-tts/mls_eng_10k
- reach-vb/jenny_tts_dataset
- MikhailT/hifi-tts
- ylacombe/expresso
- keithito/lj_speech
- collabora/ai4bharat-shrutilipi
language:
- en
- hi
base_model:
- openai-community/gpt2
pipeline_tag: text-to-speech
---

# Model Card for indri-0.1-125m-tts

Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model in the series (125M parameters), and it supports TTS in two languages:

1. English
2. Hindi

We have open-sourced our training scripts, inference, and other details.

- **Repository:** [GitHub](https://github.com/cmeraki/indri)
- **Demo:** [Website](https://www.indrivoice.ai/)
- **Implementation details**: [Release Blog](#TODO)

## Model Details

### Model Description

`indri-0.1-125m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.

### Key features

1. Based on the GPT-2 architecture. The methodology can be extended to any transformer-based architecture.
2. Supports voice cloning with small prompts (<5s).
3. Supports code-mixed text input in two languages: English and Hindi.
4. Ultra-fast. Can generate 5 seconds of audio per second on Ampere-generation NVIDIA GPUs, and up to 10 seconds of audio per second on Ada-generation NVIDIA GPUs.
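
To put the throughput figures above in perspective, the short sketch below converts them into an estimated generation time and a real-time factor. The dictionary keys and function names are illustrative only, and the numbers are taken directly from the feature list rather than from any new benchmark.

```python
# Seconds of audio generated per wall-clock second, as quoted in the feature list.
# The "ada" value is the quoted upper bound, so estimates for it are best-case.
AUDIO_SECONDS_PER_SECOND = {
    "ampere": 5.0,
    "ada": 10.0,
}


def generation_time(audio_seconds: float, gpu: str) -> float:
    """Estimate wall-clock seconds needed to synthesize `audio_seconds` of speech."""
    return audio_seconds / AUDIO_SECONDS_PER_SECOND[gpu]


def real_time_factor(gpu: str) -> float:
    """Wall-clock time divided by audio duration (values below 1 are faster than real time)."""
    return 1.0 / AUDIO_SECONDS_PER_SECOND[gpu]


# Example: estimated time to generate one minute of speech on each GPU generation.
for gpu in AUDIO_SECONDS_PER_SECOND:
    print(gpu, generation_time(60.0, gpu), real_time_factor(gpu))
```

Actual speed will depend on the specific GPU, batch size, and prompt length, so these values are rough planning estimates.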

### Details

1. Model Type: GPT-2 based language model
2. Size: 125M parameters
3. Language Support: English, Hindi
4. License: CC BY 4.0

## Technical details

Here's a brief overview of how the model works:

1. Converts input text into tokens
2. Runs autoregressive decoding on a GPT-2-based transformer model to generate audio tokens
3. Decodes the audio tokens (from the [Kyutai/mimi](https://huggingface.co/kyutai/mimi) codec) into audio
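
The first two steps above can be illustrated with a toy loop. This is a minimal sketch and not the Indri implementation: the byte-level tokenizer, the `STOP_TOKEN` value, and the `make_toy_model` stand-in are all hypothetical, chosen only to show how autoregressive decoding feeds each generated token back into the model. In the real system, the resulting audio tokens would be passed to the Mimi decoder (step 3).

```python
from typing import Callable, List

STOP_TOKEN = 0  # hypothetical end-of-audio marker, used only in this sketch


def tokenize(text: str) -> List[int]:
    """Step 1 stand-in: map text to integer tokens (UTF-8 bytes shifted by 1)."""
    return [b + 1 for b in text.encode("utf-8")]


def generate_audio_tokens(
    text_tokens: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = 64,
) -> List[int]:
    """Step 2: generate one token at a time, appending each to the model input."""
    sequence = list(text_tokens)
    audio_tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(sequence)
        if token == STOP_TOKEN:
            break
        audio_tokens.append(token)
        sequence.append(token)  # the new token becomes part of the next input
    return audio_tokens


def make_toy_model(prompt_len: int, n_audio: int) -> Callable[[List[int]], int]:
    """Stand-in for the transformer: emits `n_audio` deterministic tokens, then stops."""
    def next_token(sequence: List[int]) -> int:
        if len(sequence) - prompt_len >= n_audio:
            return STOP_TOKEN
        return (sum(sequence) % 1000) + 1  # always non-zero, so never a stop token
    return next_token


text_tokens = tokenize("hi")
model = make_toy_model(prompt_len=len(text_tokens), n_audio=5)
print(generate_audio_tokens(text_tokens, model))
```

A real model would replace `make_toy_model` with a forward pass followed by sampling from the predicted token distribution.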

Please read our blog [here](#TODO) for more technical details on how it was built.

## How to Get Started with the Model

Use the code below to get started with the model. The 🤗 Transformers `pipeline` API is the simplest way to run it.

```python
import torch
import torchaudio
from transformers import pipeline

task = 'indri-tts'
model_id = '11mlabs/indri-0.1-125m-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # Update this based on your hardware
    trust_remote_code=True
)

output = pipe(['Hi, my name is Indri and I like to talk.'])

torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
```
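
If `torchaudio` is not available, the waveform can also be written with Python's standard `wave` module. The sketch below assumes mono audio as float samples in the range [-1.0, 1.0] at the same 24 kHz rate used above. The `write_wav` helper and the sine-wave stand-in signal are illustrative and not part of the Indri codebase.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Sequence

SAMPLE_RATE = 24000  # same sample rate passed to torchaudio.save above


def write_wav(path: str, samples: Sequence[float], sample_rate: int = SAMPLE_RATE) -> None:
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = [int(round(s * 32767)) for s in clipped]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in signal: one second of a 440 Hz sine wave instead of model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "indri_tone_example.wav")
write_wav(out_path, tone)
```

With the pipeline output, the float list could be obtained with something like `output[0]['audio'][0].squeeze().cpu().tolist()`, assuming the tensor holds a single mono channel.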

## Credits

1. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
2. [nanoGPT](https://github.com/karpathy/nanoGPT)

## Citation

To cite our work:

```bibtex
@misc{indri-0.1-125m-tts,
  author       = {11mlabs},
  title        = {indri-0.1-125m-tts},
  year         = 2024,
  publisher    = {Hugging Face},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/cmeraki/indri}},
}
```