Seamless M4T audio input sample rate

#34
by wildcard00 - opened

I'm using the code provided (https://huggingface.co/spaces/facebook/seamless_m4t/blob/main/app.py) for Seamless M4T v1 to do translation for some audio files I have extracted from mp4 video recordings using ffmpeg (cmd used below for reference).

ffmpeg -i video_recording.mp4 -vn -acodec pcm_s16le -t 30 video_recording_0%d.wav

My understanding is that Seamless M4T v1 was trained on 16 kHz audio. I have a few questions.

  1. If the audio files I am providing have an original sample rate of 48 kHz and the code resamples them to 16 kHz, would that throw off the translations?

  2. If Seamless M4T v1 was trained on 16 kHz audio, can I pass it 48 kHz audio, or would that produce suboptimal translations?

  3. Can Seamless M4T v1 output the intermediate transcription of the audio before it performs the translation?

Good morning, I hope this answer helps with your task:

  1. Sampling Rate Differences: If your audio files have a different sampling rate (48 kHz) than the one the model was trained on (16 kHz), the input should be resampled to 16 kHz. Many pipelines do this automatically; if yours does not, doing it yourself is strongly recommended. Resampling changes the audio's frequency content (at 16 kHz, anything above 8 kHz is discarded), but it is standard practice for making audio compatible with the model. The model may perceive the resampled audio slightly differently, but speech processing models are generally designed to handle such differences. For best performance, match the model's training conditions as closely as possible: feeding audio at a different sampling rate may cause the model to misrecognize the speech or, at worst, distort its content.
  2. Model Suitability: Passing 48 kHz audio to a model trained on 16 kHz audio is manageable, provided you check that the audio is correctly resampled to 16 kHz first. This ensures the model works with audio at its training sampling rate and minimizes any loss in translation quality caused by mismatched rates.
  3. Printing Intermediate Transcription: To show the transcription before translation, you can try modifying the UI settings in the code. For the ASR task, set visible=True for input_text in the update_input_ui function (line 207 of https://huggingface.co/spaces/facebook/seamless_m4t/blob/main/app.py), which should let users see the transcription result before it is translated.
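
As a rough illustration of points 1 and 2, here is a minimal sketch for checking a WAV file's sample rate before handing it to the model, and for building an ffmpeg command that resamples it to 16 kHz. It uses only Python's standard library. The function names (wav_info, needs_resampling, ffmpeg_resample_cmd) and the file names are made up for this example and are not part of Seamless M4T or the Space's app.py; the 16 kHz target is taken from this thread rather than verified against the model.

```python
import wave

# Assumption from this thread: the model expects 16 kHz input.
TARGET_RATE = 16_000


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def needs_resampling(path, target_rate=TARGET_RATE):
    """True if the file's sample rate differs from the target rate."""
    rate, _, _ = wav_info(path)
    return rate != target_rate


def ffmpeg_resample_cmd(src, dst, target_rate=TARGET_RATE):
    """Build (but do not run) an ffmpeg command that drops video and
    writes 16-bit PCM audio at target_rate via the -ar option."""
    return [
        "ffmpeg", "-i", src,
        "-vn",                     # ignore any video stream
        "-ar", str(target_rate),   # output audio sample rate in Hz
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        dst,
    ]
```

If needs_resampling("clip.wav") returns True, the command could be run with subprocess.run(ffmpeg_resample_cmd("clip.wav", "clip_16k.wav"), check=True), assuming ffmpeg is installed and on the PATH.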

Thank you @guzzy! Regarding outputting the transcription before translation, is there a way to get that output from the model without involving the UI?

I'm not using the UI, so I'm curious whether there is a way to pass this flag to the model. I wasn't able to trace how setting visible=True gets propagated to the model, so I'd appreciate any guidance here.

Sorry for the late response. You might have a look at their GitHub repository, starting with this tutorial notebook: https://github.com/facebookresearch/seamless_communication/blob/main/Seamless_Tutorial.ipynb. It uses the same M4T model, but you can experiment with the 'Transcriber' class instead of the 'Translator' one: https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/inference/transcriber.py
