File size: 6,493 Bytes
0f75c5c e598fae 0f75c5c b2ca557 0263898 b8a4df8 0f75c5c bfdf50b 0f75c5c c0ab532 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c bfdf50b c0ab532 fd22a1e c0ab532 fd22a1e c0ab532 fd22a1e c0ab532 bfdf50b fd22a1e c0ab532 fd22a1e ee1e087 c0ab532 fd22a1e ee1e087 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c bfdf50b fd22a1e 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c bfdf50b 0f75c5c fd22a1e 0f75c5c fd22a1e 0f75c5c bfdf50b 0f75c5c bfdf50b 0f75c5c fd22a1e 0f75c5c e598fae |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 |
---
inference: false
tags:
- SeamlessM4T
- seamless_m4t
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-to-speech
---
# SeamlessM4T Large
SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different
linguistic communities to communicate effortlessly through speech and text.
This repository hosts π€ Hugging Face's [implementation](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t) of SeamlessM4T.
-------------------
**π SeamlessM4T v2, an improved version of this version with a novel architecture, has been released [here](https://huggingface.co/facebook/seamless-m4t-v2-large).
This new model improves over SeamlessM4T v1 in quality as well as inference speed in speech generation tasks.**
**SeamlessM4T v2 is also supported by π€ Transformers, more on it [in the model card of this new version](https://huggingface.co/facebook/seamless-m4t-v2-large#transformers-usage) or directly in [π€ Transformers docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).**
-------------------
SeamlessM4T Large covers:
- π₯ 101 languages for speech input
- β¨οΈ [96 Languages](https://huggingface.co/ylacombe/hf-seamless-m4t-large/blob/main/generation_config.json#L48-L145) for text input/output
- π£οΈ [35 languages](https://huggingface.co/ylacombe/hf-seamless-m4t-large/blob/main/generation_config.json#L149-L184) for speech output.
This is the "large" variant of the unified model, which enables multiple tasks without relying on multiple separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
You can perform all the above tasks from one single model, [`SeamlessM4TModel`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel), but each task also has its own dedicated sub-model.
## π€ Usage
First, load the processor and a checkpoint of the model:
```python
>>> from transformers import AutoProcessor, SeamlessM4TModel
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-large")
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-large")
```
You can seamlessly use this model on text or on audio, to generated either translated text or translated audio.
Here is how to use the processor to process text and audio:
```python
>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]
>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> # now, process some English test as well
>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```
### Speech
[`SeamlessM4TModel`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation:
```python
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```
With basically the same code, I've translated English text and Arabic speech to Russian speech samples.
### Text
Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate).
This time, let's translate to French.
```python
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)
>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)
```
### Tips
#### 1. Use dedicated models
[`SeamlessM4TModel`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code:
```python
>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-large")
```
Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task, you only have to remove `generate_speech=False`.
```python
>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-large")
```
Feel free to try out [`SeamlessM4TForSpeechToText`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TForSpeechToText) and [`SeamlessM4TForTextToSpeech`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TForTextToSpeech) as well.
#### 2. Change the speaker identity
You have the possibility to change the speaker used for speech synthesis with the `spkr_id` argument. Some `spkr_id` works better than other for some languages!
#### 3. Change the generation strategy
You can use different [generation strategies](https://huggingface.co/docs/transformers/v4.34.1/en/generation_strategies#text-generation-strategies) for speech and text generation, e.g `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)` which will successively perform beam-search decoding on the text model, and multinomial sampling on the speech model.
#### 4. Generate speech and text at the same time
Use `return_intermediate_token_ids=True` with [`SeamlessM4TModel`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel) to return both speech and text ! |