Whisper

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. More details are available in the Whisper paper cited at the end of this card.

whisper-v2-d3-e3 is a version of whisper-large-v2, fine-tuned by ivrit.ai to improve Hebrew ASR using crowd-sourced labeling.

Model details

This model comes as a single checkpoint, whisper-v2-d3-e3. It is a 1550M-parameter multilingual ASR model.

Usage

To transcribe audio samples, the model has to be used alongside a WhisperProcessor.

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

SAMPLING_RATE = 16000

has_cuda = torch.cuda.is_available()
model_path = 'ivrit-ai/whisper-v2-d3-e3'

model = WhisperForConditionalGeneration.from_pretrained(model_path)
if has_cuda:
    model.to('cuda:0')

processor = WhisperProcessor.from_pretrained(model_path)

# `entry` is assumed to be a record from an existing dataset that has an 'audio' column.
# Alternatively, the waveform can be loaded from an audio file.
audio_resample = librosa.resample(entry['audio']['array'], orig_sr=entry['audio']['sampling_rate'], target_sr=SAMPLING_RATE)

input_features = processor(audio_resample, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_features
if has_cuda:
    input_features = input_features.to('cuda:0')

predicted_ids = model.generate(input_features, language='he', num_beams=5)
transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(f'Transcript: {transcript[0]}')
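
The comment in the snippet above mentions loading audio from a file instead of a dataset entry. The following is a minimal sketch of that step using only the Python standard library. It assumes a 16-bit PCM mono WAV file; the helper name read_wav_mono16 is hypothetical and is not part of Transformers or of this model's repository.

```python
import array
import sys
import wave


def read_wav_mono16(path):
    """Read a 16-bit PCM mono WAV file (assumed format).

    Returns (samples, sampling_rate), where samples is a list of floats
    scaled to the range [-1.0, 1.0).
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    pcm = array.array("h")
    pcm.frombytes(frames)
    # WAV data is little-endian; swap bytes on big-endian hosts.
    if sys.byteorder == "big":
        pcm.byteswap()

    # Scale signed 16-bit integers to floats.
    return [sample / 32768.0 for sample in pcm], rate
```

If the returned sampling rate differs from SAMPLING_RATE, the samples would still need to be resampled (for example with librosa.resample, as in the snippet above) before being passed to the processor.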

Evaluation

You can use the evaluate_model.py reference on GitHub to evaluate the model's quality.
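
ASR quality is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below shows that computation in plain Python. Whether evaluate_model.py uses exactly this metric or any text normalization is an assumption here, and the function name word_error_rate is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

For example, word_error_rate("a b c d", "a x c") counts one substitution and one deletion against four reference words.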

Long-Form Transcription

The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can transcribe audio of arbitrary length. This is possible through the Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing return_timestamps=True:

>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...   "automatic-speech-recognition",
...   model="ivrit-ai/whisper-v2-d3-e3",
...   chunk_length_s=30,
...   device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # we can also return timestamps for the predictions
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]

Refer to the blog post ASR Chunking for more details on the chunking algorithm.
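
To make the idea of chunking concrete, the sketch below computes overlapping chunk boundaries for a long recording. It is a simplified illustration, not the pipeline's actual implementation: the function name chunk_bounds and the 5-second overlap on each side are assumptions chosen for the example.

```python
def chunk_bounds(num_samples, sampling_rate=16000,
                 chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping chunks.

    Consecutive chunks overlap by 2 * stride_length_s seconds, so each
    chunk has some context on both sides of the part that is kept.
    """
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk must be longer than twice the stride")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reaches the end of the audio
        start += step
    return bounds
```

With these values, a 70-second recording is covered by three chunks starting at 0 s, 20 s and 40 s. Each chunk can then be transcribed independently (and in batches), with the overlapping regions used to stitch the partial transcripts together.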

BibTeX entry and citation info

ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development

@misc{marmor2023ivritai,
      title={ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development}, 
      author={Yanir Marmor and Kinneret Misgav and Yair Lifshitz},
      year={2023},
      eprint={2307.08720},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}