metadata
language:
  - multilingual
  - ar
  - as
  - br
  - ca
  - cnh
  - cs
  - cv
  - cy
  - de
  - dv
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - hi
  - hsb
  - hu
  - ia
  - id
  - it
  - ja
  - ka
  - ky
  - lg
  - lt
  - lv
  - mn
  - mt
  - nl
  - or
  - pl
  - pt
  - ro
  - ru
  - sah
  - sl
  - ta
  - th
  - tr
  - tt
  - uk
  - vi
license: apache-2.0
tags:
  - audio
  - automatic-speech-recognition
  - hf-asr-leaderboard
  - robust-speech-event
  - speech
  - xlsr-fine-tuning-week
datasets:
  - common_voice
language_bcp47:
  - fy-NL
  - ga-IE
  - pa-IN
  - rm-sursilv
  - rm-vallader
  - sv-SE
  - zh-CN
  - zh-HK
  - zh-TW
model-index:
  - name: XLSR Wav2Vec2 for 56 languages by Voidful
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Common Voice
          type: common_voice
        metrics:
          - type: cer
            value: 23.21
            name: Test CER

Model Card for wav2vec2-xlsr-multilingual-56

Model Details

Model Description

  • Developed by: voidful
  • Shared by [Optional]: Hugging Face
  • Model type: automatic-speech-recognition
  • Language(s) (NLP): multilingual (56 languages, one multilingual ASR model)
  • License: Apache-2.0
  • Related Models:
    • Parent Model: facebook/wav2vec2-large-xlsr-53 (wav2vec 2.0 XLSR)
  • Resources for more information:

Uses

Direct Use

This model can be used for automatic speech recognition (ASR).

Downstream Use [Optional]

More information needed

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

Training Details

Training Data

Fine-tuned from facebook/wav2vec2-large-xlsr-53 on 56 languages using the Common Voice dataset. See the common_voice dataset card for details.

Training Procedure

Preprocessing

More information needed

Speeds, Sizes, Times

When using this model, make sure that your speech input is sampled at 16 kHz.
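
A quick way to check the sampling rate before running inference is to read it from the file header. The sketch below uses only Python's standard `wave` module, so it covers uncompressed WAV files only. The helper names `wav_sampling_rate` and `needs_resampling` are illustrative and are not part of this model's tooling.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sampling_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file must be resampled before it reaches the model."""
    return wav_sampling_rate(path) != target_rate
```

If `needs_resampling` returns True, resample the audio to 16 kHz before passing it to the processor.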

Evaluation

Testing Data, Factors & Metrics

Testing Data

More information needed

Factors

Metrics

More information needed

Results

| Common Voice language | Num. of data | Hours | WER (%) | CER (%) |
|---|---|---|---|---|
| ar | 21744 | 81.5 | 75.29 | 31.23 |
| as | 394 | 1.1 | 95.37 | 46.05 |
| br | 4777 | 7.4 | 93.79 | 41.16 |
| ca | 301308 | 692.8 | 24.80 | 10.39 |
| cnh | 1563 | 2.4 | 68.11 | 23.10 |
| cs | 9773 | 39.5 | 67.86 | 12.57 |
| cv | 1749 | 5.9 | 95.43 | 34.03 |
| cy | 11615 | 106.7 | 67.03 | 23.97 |
| de | 262113 | 822.8 | 27.03 | 6.50 |
| dv | 4757 | 18.6 | 92.16 | 30.15 |
| el | 3717 | 11.1 | 94.48 | 58.67 |
| en | 580501 | 1763.6 | 34.87 | 14.84 |
| eo | 28574 | 162.3 | 37.77 | 6.23 |
| es | 176902 | 337.7 | 19.63 | 5.41 |
| et | 5473 | 35.9 | 86.87 | 20.79 |
| eu | 12677 | 90.2 | 44.80 | 7.32 |
| fa | 12806 | 290.6 | 53.81 | 15.09 |
| fi | 875 | 2.6 | 93.78 | 27.57 |
| fr | 314745 | 664.1 | 33.16 | 13.94 |
| fy-NL | 6717 | 27.2 | 72.54 | 26.58 |
| ga-IE | 1038 | 3.5 | 92.57 | 51.02 |
| hi | 292 | 2.0 | 90.95 | 57.43 |
| hsb | 980 | 2.3 | 89.44 | 27.19 |
| hu | 4782 | 9.3 | 97.15 | 36.75 |
| ia | 5078 | 10.4 | 52.00 | 11.35 |
| id | 3965 | 9.9 | 82.50 | 22.82 |
| it | 70943 | 178.0 | 39.09 | 8.72 |
| ja | 1308 | 8.2 | 99.21 | 62.06 |
| ka | 1585 | 4.0 | 90.53 | 18.57 |
| ky | 3466 | 12.2 | 76.53 | 19.80 |
| lg | 1634 | 17.1 | 98.95 | 43.84 |
| lt | 1175 | 3.9 | 92.61 | 26.81 |
| lv | 4554 | 6.3 | 90.34 | 30.81 |
| mn | 4020 | 11.6 | 82.68 | 30.14 |
| mt | 3552 | 7.8 | 84.18 | 22.96 |
| nl | 14398 | 71.8 | 57.18 | 19.01 |
| or | 517 | 0.9 | 90.93 | 27.34 |
| pa-IN | 255 | 0.8 | 87.95 | 42.03 |
| pl | 12621 | 112.0 | 56.14 | 12.06 |
| pt | 11106 | 61.3 | 53.24 | 16.32 |
| rm-sursilv | 2589 | 5.9 | 78.17 | 23.31 |
| rm-vallader | 931 | 2.3 | 73.67 | 21.76 |
| ro | 4257 | 8.7 | 83.84 | 21.95 |
| ru | 23444 | 119.1 | 61.83 | 15.18 |
| sah | 1847 | 4.4 | 94.38 | 38.46 |
| sl | 2594 | 6.7 | 84.21 | 20.54 |
| sv-SE | 4350 | 20.8 | 83.68 | 30.79 |
| ta | 3788 | 18.4 | 84.19 | 21.60 |
| th | 4839 | 11.7 | 141.87 | 37.16 |
| tr | 3478 | 22.3 | 66.77 | 15.55 |
| tt | 13338 | 26.7 | 86.80 | 33.57 |
| uk | 7271 | 39.4 | 70.23 | 14.34 |
| vi | 421 | 1.7 | 96.06 | 66.25 |
| zh-CN | 27284 | 58.7 | 89.67 | 23.96 |
| zh-HK | 12678 | 92.1 | 81.77 | 18.82 |
| zh-TW | 6402 | 56.6 | 85.08 | 29.07 |
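
For readers unfamiliar with the two error metrics, the sketch below shows the standard definitions: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, counted over words for WER and over characters for CER. This card does not state which scoring script or text normalization produced the numbers above, so treat this as an illustration of the definitions rather than the exact evaluation code. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; assumes a non-empty reference."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; assumes a non-empty reference."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("hi", "hello there"))  # 200.0: one substitution plus one insertion over one reference word
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis contains more tokens than the reference, which is how a value such as the 141.87 in the th row can arise.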
Model Examination

More information needed

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: More information needed
  • Hours used: More information needed
  • Cloud Provider: More information needed
  • Compute Region: More information needed
  • Carbon Emitted: More information needed

Technical Specifications [optional]

Model Architecture and Objective

More information needed

Compute Infrastructure

More information needed

Hardware

More information needed

Software

More information needed

Citation

BibTeX:

More information needed

APA:

More information needed

Glossary [optional]

More information needed

More Information [optional]

More information needed

Model Card Authors [optional]

voidful in collaboration with Ezi Ozoani and the Hugging Face team

Model Card Contact

More information needed

How to Get Started with the Model

Use the code below to get started with the model.


Env setup:

!pip install torchaudio
!pip install datasets transformers
!pip install asrp
!wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk

Usage

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
model_name = "voidful/wav2vec2-xlsr-multilingual-56"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
processor_name = "voidful/wav2vec2-xlsr-multilingual-56"
 
import pickle
with open("lang_ids.pk", 'rb') as output:
    lang_ids = pickle.load(output)
    
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)
 
model.eval()
 
def load_file_to_data(file, sampling_rate=16_000):
    # `sampling_rate` is the sampling rate of the audio file itself;
    # audio at any other rate is resampled to the 16 kHz the model expects.
    batch = {}
    speech, _ = torchaudio.load(file)
    if sampling_rate != 16_000:
        resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)
        batch["speech"] = resampler(speech.squeeze(0)).numpy()
    else:
        batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    return batch
 
 
def predict(data):
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
        decoded_results = []
        for logit in logits:
            pred_ids = torch.argmax(logit, dim=-1)
            mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size())
            vocab_size = logit.size()[-1]
            voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1)
            comb_pred_ids = torch.argmax(voice_prob, dim=-1)
            decoded_results.append(processor.decode(comb_pred_ids))
 
    return decoded_results
 
def predict_lang_specific(data,lang_code):
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
        decoded_results = []
        for logit in logits:
            pred_ids = torch.argmax(logit, dim=-1)
            mask = ~pred_ids.eq(processor.tokenizer.pad_token_id).unsqueeze(-1).expand(logit.size())
            vocab_size = logit.size()[-1]
            voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1)
            filtered_input = pred_ids[pred_ids!=processor.tokenizer.pad_token_id].view(1,-1).to(device)
            if len(filtered_input[0]) == 0:
                decoded_results.append("")
            else:
                lang_mask = torch.empty(voice_prob.shape[-1]).fill_(0)
                lang_index = torch.tensor(sorted(lang_ids[lang_code]))
                lang_mask.index_fill_(0, lang_index, 1)
                lang_mask = lang_mask.to(device)
                comb_pred_ids = torch.argmax(lang_mask*voice_prob, dim=-1)
                decoded_results.append(processor.decode(comb_pred_ids))
                
    return decoded_results
 
 
predict(load_file_to_data('audio file path',sampling_rate=16_000)) # beware of the audio file sampling rate
 
predict_lang_specific(load_file_to_data('audio file path',sampling_rate=16_000),'en') # beware of the audio file sampling rate
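
The difference between `predict` and `predict_lang_specific` is the vocabulary mask: each frame's probabilities are multiplied by a 0/1 vector that keeps only the token ids listed for the chosen language, and the argmax is taken over what remains. The toy sketch below illustrates that single step in plain Python. The five-token vocabulary, `TOY_LANG_IDS`, and the probabilities are hypothetical stand-ins for `lang_ids.pk` and the model output, and the removal of padding frames done in the real function is omitted.

```python
def masked_argmax(frame_probs, allowed_ids):
    """Index of the most probable token among allowed_ids for one frame."""
    allowed = set(allowed_ids)
    # Zero out every token that does not belong to the chosen language.
    masked = [p if i in allowed else 0.0 for i, p in enumerate(frame_probs)]
    return max(range(len(masked)), key=masked.__getitem__)


# Hypothetical mapping from language code to permitted token ids.
TOY_LANG_IDS = {"en": [1, 2], "de": [3, 4]}

# Two frames of made-up token probabilities over a five-token vocabulary.
frames = [
    [0.05, 0.30, 0.10, 0.50, 0.05],
    [0.05, 0.10, 0.60, 0.05, 0.20],
]

unrestricted = [masked_argmax(f, range(5)) for f in frames]
english_only = [masked_argmax(f, TOY_LANG_IDS["en"]) for f in frames]
german_only = [masked_argmax(f, TOY_LANG_IDS["de"]) for f in frames]

print(unrestricted)  # [3, 2]
print(english_only)  # [1, 2]
print(german_only)   # [3, 4]
```

In the first frame the unrestricted winner is token 3, which the toy "en" list does not contain, so the restricted decode picks the best remaining candidate, token 1.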
 