This repository converts the facebook/musicgen-stereo-medium safetensors weights to the PyTorch .bin format for easier use.
We further release a set of stereophonic-capable models. These were fine-tuned for 200k updates starting from the mono models. The training data is otherwise identical, and capabilities and limitations are shared with the base models. The stereo models work by getting two streams of tokens from the EnCodec model and interleaving them using the delay pattern.
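The interleaving of the two per-channel token streams can be sketched as follows. This is an illustrative simplification, not the exact MusicGen internals: `left` and `right` are assumed to be lists of per-codebook token sequences, and the combined layout simply alternates left and right codebooks before the delay pattern is applied.

```python
def interleave_stereo(left, right):
    """Interleave two channels' codebook streams into one stack.

    left, right: lists of per-codebook token lists, shape [codebooks][T].
    Returns a list of 2 * codebooks token lists, alternating L/R.
    (Illustrative layout; the real model's ordering may differ.)
    """
    combined = []
    for l_cb, r_cb in zip(left, right):
        combined.append(l_cb)  # left-channel codebook
        combined.append(r_cb)  # matching right-channel codebook
    return combined
```

With 4 codebooks per channel, this yields an 8-codebook stack that the delay pattern then offsets for autoregressive generation.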
Stereophonic sound, also known as stereo, is a technique used to reproduce sound with depth and direction. It uses two separate audio channels played through speakers (or headphones), which creates the impression of sound coming from multiple directions.
MusicGen is a text-to-music model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. It is a single-stage auto-regressive Transformer model trained over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods such as MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, we show we can predict them in parallel, thus having only 50 auto-regressive steps per second of audio.
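The delay between codebooks can be sketched as shifting codebook k right by k steps and padding with a special token, so that at each autoregressive step all codebooks are predicted in parallel. The padding value and layout below are illustrative assumptions, not the model's actual special-token handling:

```python
PAD = -1  # illustrative placeholder for the special padding token

def apply_delay_pattern(codes):
    """Shift codebook k right by k timesteps (the delay pattern).

    codes: list of K token lists, each of length T.
    Returns K lists of length T + K - 1, where codebook k is
    preceded by k PAD tokens and followed by K - 1 - k PAD tokens.
    """
    k = len(codes)
    return [[PAD] * i + stream + [PAD] * (k - 1 - i)
            for i, stream in enumerate(codes)]
```

Because every codebook advances one position per step, one second of 50 Hz audio still takes only 50 autoregressive steps, rather than 200 if the 4 codebooks were predicted sequentially.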