Step by step: how to use a KenLM language model with the model

#1
by huseinzol05 - opened
Mesolitica org

It is actually very simple:

  1. Install the necessary libraries. I chose pyctcdecode:
pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

The versions matter: if you bump pyctcdecode above 0.1.0, the steps below no longer work.
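
Since the pins matter this much, a quick check before running anything can save some confusion. This is only a minimal sketch using the standard library; `PINS` and `find_mismatches` are hypothetical names, and the distribution names are assumed to match the ones in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
PINS = {"pyctcdecode": "0.1.0", "pypi-kenlm": "0.1.20210121"}


def find_mismatches(pins, lookup=version):
    """Return {package: installed version or None} for every unsatisfied pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches
```

Calling `find_mismatches(PINS)` in your environment should return an empty dict when both packages are installed at the pinned versions.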

  2. Download the language model:
wget https://huggingface.co/huseinzol05/language-model-bahasa-manglish-combined/resolve/main/model.klm

To create your own language model, see https://github.com/huseinzol05/malaya-speech/blob/master/pretrained-model/prepare-lm/build-lm-mixed-combined.ipynb.
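
If you prefer to fetch the file from Python instead of wget, something like the following should work. It is a sketch with the standard library only; `download_if_missing` is a hypothetical helper, and the only URL used is the one from the wget command above.

```python
import os
import urllib.request

# Same URL as in the wget command above.
MODEL_URL = (
    "https://huggingface.co/huseinzol05/"
    "language-model-bahasa-manglish-combined/resolve/main/model.klm"
)


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if dest was already there.
    """
    if os.path.exists(dest):
        return False
    tmp = dest + ".part"  # write to a temporary name first
    urllib.request.urlretrieve(url, tmp)
    os.replace(tmp, dest)  # rename only after the download has finished
    return True
```

Running `download_if_missing(MODEL_URL, "model.klm")` once should leave `model.klm` in the working directory; later calls skip the download.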

  3. Load the model and the language model, then decode:
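from transformers import AutoModelForCTC
from pyctcdecode import build_ctcdecoder
import kenlm

# Assumed to be defined beforehand (not shown in this post):
#   unique_vocab: list of the model's output labels, ordered by token id
#   tokenizer:    the tokenizer that matches the acoustic model
#   inputs:       a batch of audio input values as a tensor

# Load the KenLM model downloaded in the previous step.
kenlm_model = kenlm.Model('model.klm')
decoder = build_ctcdecoder(
    unique_vocab,
    kenlm_model,
    alpha=0.2,  # weight of the language model score
    beta=1.0,   # weight of the length score adjustment
    ctc_token_idx=tokenizer.pad_token_id,  # the pad token is used as the CTC blank
)

model = AutoModelForCTC.from_pretrained(
    'mesolitica/wav2vec2-xls-r-300m-mixed',
)

# Run the acoustic model and convert the logits to a NumPy array.
o_pt = model(inputs)
o_pt = o_pt.logits.detach().cpu().numpy()
# Beam-search decode the first item in the batch with the language model.
# Beams are returned best first; out[0][0] is the top transcript.
out = decoder.decode_beams(o_pt[0], prune_history=True)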
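
The snippet above uses `unique_vocab` without showing where it comes from. One common way to build such a label list is to sort the tokenizer vocabulary by token id. The sketch below rests on assumptions: the tokenizer exposes a `{token: id}` dict, the pad token doubles as the CTC blank (as `ctc_token_idx=tokenizer.pad_token_id` suggests), and words are separated by a delimiter token such as `|`. `labels_from_vocab` is a hypothetical helper, not part of pyctcdecode, so check the result against the model's actual vocabulary.

```python
def labels_from_vocab(vocab, pad_token="<pad>", word_delimiter="|"):
    """Order tokens by id and map special tokens to decoder-friendly labels.

    vocab is a {token: id} dict. The pad token (assumed to be the CTC blank)
    becomes "" and the word delimiter becomes a plain space.
    """
    labels = []
    for token, _ in sorted(vocab.items(), key=lambda item: item[1]):
        if token == pad_token:
            labels.append("")   # CTC blank
        elif token == word_delimiter:
            labels.append(" ")  # word boundary
        else:
            labels.append(token)
    return labels
```

With a Hugging Face CTC tokenizer this could be called as `unique_vocab = labels_from_vocab(tokenizer.get_vocab(), tokenizer.pad_token, tokenizer.word_delimiter_token)`, assuming the tokenizer defines those attributes.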
