File size: 6,557 Bytes
5325fcc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
# AudioGen: Textually-guided audio generation

AudioCraft provides the code and a model re-implementing AudioGen, a [textually-guided audio generation][audiogen_arxiv]
model that performs text-to-sound generation.

The provided AudioGen reimplementation follows the LM model architecture introduced in [MusicGen][musicgen_arxiv]
and is a single stage auto-regressive Transformer model trained over a 16kHz
<a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> with 4 codebooks sampled at 50 Hz.
This model variant reaches similar audio quality than the original implementation introduced in the AudioGen publication
while providing faster generation speed given the smaller frame rate.

**Important note:** The provided models are NOT the original models used to report numbers in the
[AudioGen publication][audiogen_arxiv]. Refer to the model card to learn more about architectural changes.

Listen to samples from the **original AudioGen implementation** in our [sample page][audiogen_samples].


## Model Card

See [the model card](../model_cards/AUDIOGEN_MODEL_CARD.md).


## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).

AudioCraft requires a GPU with at least 16 GB of memory for running inference with the medium-sized models (~1.5B parameters).

## API and usage

We provide a simple API and 1 pre-trained models for AudioGen:

`facebook/audiogen-medium`: 1.5B model, text to sound - [🤗 Hub](https://huggingface.co/facebook/audiogen-medium)

You can play with AudioGen by running the jupyter notebook at [`demos/audiogen_demo.ipynb`](../demos/audiogen_demo.ipynb) locally (if you have a GPU).

See after a quick example for using the API.

```python
import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds.
descriptions = ['dog barking', 'sirene of an emergency vehicle', 'footsteps in a corridor']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```

## Training

The [AudioGenSolver](../audiocraft/solvers/audiogen.py) implements the AudioGen's training pipeline
used to develop the released model. Note that this may not fully reproduce the results presented in the paper.
Similarly to MusicGen, it defines an autoregressive language modeling task over multiple streams of
discrete tokens extracted from a pre-trained EnCodec model (see [EnCodec documentation](./ENCODEC.md)
for more details on how to train such model) with dataset-specific changes for environmental sound
processing.

Note that **we do NOT provide any of the datasets** used for training AudioGen.

### Example configurations and grids

We provide configurations to reproduce the released models and our research.
AudioGen solvers configuration are available in [config/solver/audiogen](../config/solver/audiogen).
The base training configuration used for the released models is the following:
[`solver=audiogen/audiogen_base_16khz`](../config/solver/audiogen/audiogen_base_16khz.yaml)

Please find some example grids to train AudioGen at
[audiocraft/grids/audiogen](../audiocraft/grids/audiogen/).

```shell
# text-to-sound
dora grid audiogen.audiogen_base_16khz
```

### Sound dataset and metadata

AudioGen's underlying dataset is an AudioDataset augmented with description metadata.
The AudioGen dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files or through specified external folder.
Learn more in the [datasets section](./DATASETS.md).

### Evaluation stage

By default, evaluation stage is also computing the cross-entropy and the perplexity over the
evaluation dataset. Indeed the objective metrics used for evaluation can be costly to run
or require some extra dependencies. Please refer to the [metrics documentation](./METRICS.md)
for more details on the requirements for each metric.

We provide an off-the-shelf configuration to enable running the objective metrics
for audio generation in
[config/solver/audiogen/evaluation/objective_eval](../config/solver/audiogen/evaluation/objective_eval.yaml).

One can then activate evaluation the following way:
```shell
# using the configuration
dora run solver=audiogen/debug solver/audiogen/evaluation=objective_eval
# specifying each of the fields, e.g. to activate KL computation
dora run solver=audiogen/debug evaluate.metrics.kld=true
```

See [an example evaluation grid](../audiocraft/grids/audiogen/audiogen_pretrained_16khz_eval.py).

### Generation stage

The generation stage allows to generate samples conditionally and/or unconditionally and to perform
audio continuation (from a prompt). We currently support greedy sampling (argmax), sampling
from softmax with a given temperature, top-K and top-P (nucleus) sampling. The number of samples
generated and the batch size used are controlled by the `dataset.generate` configuration
while the other generation parameters are defined in `generate.lm`.

```shell
# control sampling parameters
dora run solver=audiogen/debug generate.lm.gen_duration=5 generate.lm.use_sampling=true generate.lm.top_k=15
```

## More information

Refer to [MusicGen's instructions](./MUSICGEN.md).

### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).


## Citation

AudioGen
```
@article{kreuk2022audiogen,
    title={Audiogen: Textually guided audio generation},
    author={Kreuk, Felix and Synnaeve, Gabriel and Polyak, Adam and Singer, Uriel and D{\'e}fossez, Alexandre and Copet, Jade and Parikh, Devi and Taigman, Yaniv and Adi, Yossi},
    journal={arXiv preprint arXiv:2209.15352},
    year={2022}
}
```

MusicGen
```
@article{copet2023simple,
    title={Simple and Controllable Music Generation},
    author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
    year={2023},
    journal={arXiv preprint arXiv:2306.05284},
}
```

## License

See license information in the [model card](../model_cards/AUDIOGEN_MODEL_CARD.md).

[audiogen_arxiv]: https://arxiv.org/abs/2209.15352
[musicgen_arxiv]: https://arxiv.org/abs/2306.05284
[audiogen_samples]: https://felixkreuk.github.io/audiogen/