asierhv commited on
Commit
18369b0
·
verified ·
1 Parent(s): ef62a40

Update README.md

  1. README.md +198 -0
README.md CHANGED
@@ -1,3 +1,201 @@
1
  ---
2
  license: cc-by-4.0
3
+ language:
4
+ - eu
5
+ library_name: nemo
6
+ datasets:
7
+ - mozilla-foundation/common_voice_16_1
8
+ - gttsehu/basque_parliament_1
9
+ - openslr
10
+ metrics:
11
+ - wer
12
+ pipeline_tag: automatic-speech-recognition
13
+ tags:
14
+ - automatic-speech-recognition
15
+ - speech
16
+ - audio
17
+ - CTC
18
+ - Conformer
19
+ - NeMo
20
+ - pytorch
21
+ - Transformer
22
+ model-index:
+ - name: stt_eu_conformer_ctc_large
+   results:
+   - task:
+       type: Automatic Speech Recognition
+       name: speech-recognition
+     dataset:
+       name: Mozilla Common Voice 16.1
+       type: mozilla-foundation/common_voice_16_1
+       config: eu
+       split: test
+       args:
+         language: eu
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 2.42
+   - task:
+       type: Automatic Speech Recognition
+       name: speech-recognition
+     dataset:
+       name: Basque Parliament
+       type: gttsehu/basque_parliament_1
+       config: eu
+       split: test
+       args:
+         language: eu
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 4.21
+   - task:
+       type: Automatic Speech Recognition
+       name: speech-recognition
+     dataset:
+       name: Basque Parliament
+       type: gttsehu/basque_parliament_1
+       config: eu
+       split: validation
+       args:
+         language: eu
+     metrics:
+     - name: Dev WER
+       type: wer
+       value: 4.3
67
  ---
68
+
69
+ # HiTZ/Aholab's Basque Speech-to-Text model
70
+ ## Model Description
71
+
72
+ <style>
73
+ img {
74
+ display: inline;
75
+ }
76
+ </style>
77
+
78
+ | [![Model architecture](https://img.shields.io/badge/Model_Arch-Conformer--CTC-lightgrey#model-badge)](#model-architecture)
79
+ | [![Model size](https://img.shields.io/badge/Params-121M-lightgrey#model-badge)](#model-architecture)
80
+ | [![Language](https://img.shields.io/badge/Language-eu-lightgrey#model-badge)](#datasets)
81
+
82
+ This model transcribes speech into the lowercase Basque alphabet, including spaces, and was trained on a composite dataset comprising 548 hours of Basque speech. The model was fine-tuned from the pre-trained Spanish [stt_es_conformer_ctc_large](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_ctc_large) model using the [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) toolkit. It is a non-autoregressive "large" variant of Conformer, with around 121 million parameters.
83
+ See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc) for complete architecture details.
84
+
85
+ ## Usage
86
+ To train, fine-tune or experiment with the model, you need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
87
+
88
+ ```bash
89
+ pip install "nemo_toolkit[all]"
90
+ ```
91
+
92
+ ### Transcribing using Python
93
+ Clone the repository to download the model:
94
+
95
+ ```bash
96
+ git clone https://huggingface.co/asierhv/stt_eu_conformer_ctc_large
97
+ ```
98
+
99
+ In the following snippet, `NEMO_MODEL_FILEPATH` is the path to the downloaded `stt_eu_conformer_ctc_large.nemo` file.
100
+
101
+ ```python
+ import nemo.collections.asr as nemo_asr
+
+ # Path to the downloaded checkpoint (assumed location after cloning; adjust as needed)
+ NEMO_MODEL_FILEPATH = "stt_eu_conformer_ctc_large/stt_eu_conformer_ctc_large.nemo"
+
+ # Load the model
+ asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(NEMO_MODEL_FILEPATH)
+
+ # List the audio files to transcribe (replace with your own paths)
+ audio = ["audio_1.wav","audio_2.wav", ..., "audio_n.wav"]
+
+ # Set the batch size to whatever suits your purpose
+ batch_size = 8
+
+ # Transcribe the audio files
+ transcriptions = asr_model.transcribe(audio=audio, batch_size=batch_size)
+
+ # Print the transcriptions
+ print(transcriptions)
+ ```
119
+ #### Change decoding strategy
120
+ Optionally, you can add a few lines before transcribing the audio to switch the decoding strategy to beam search with an n-gram language model. This requires the beam search decoders, which can be installed with the [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh) provided by the NeMo toolkit [3]. Here, `KENLM_MODEL_FILEPATH` is the path to the downloaded `kenlm_unigram_v256_model.bin` file.
121
+
122
+ ```python
+ from omegaconf import open_dict
+
+ # open_dict temporarily allows adding or changing keys in the model config
+ with open_dict(asr_model.cfg):
+     asr_model.cfg.decoding.strategy = "beam"
+     asr_model.cfg.decoding.beam.beam_size = 32 # Desired Beam Size
+     asr_model.cfg.decoding.beam.beam_alpha = 1 # Desired Beam Alpha
+     asr_model.cfg.decoding.beam.beam_beta = 1 # Desired Beam Beta
+     asr_model.cfg.decoding.beam.kenlm_path = KENLM_MODEL_FILEPATH
+     # Apply the updated decoding configuration to the loaded model
+     asr_model.change_decoding_strategy(asr_model.cfg.decoding)
+ ```
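
As a rough illustration of what `beam_alpha` and `beam_beta` control: in the usual shallow-fusion formulation of CTC beam search, each candidate is ranked by its acoustic log-probability plus `beam_alpha` times the language-model log-probability plus `beam_beta` times the candidate length. The sketch below is standalone and is not NeMo's implementation; `fused_score`, its arguments and the example log-probabilities are hypothetical.

```python
def fused_score(acoustic_logp: float, lm_logp: float, length: int,
                beam_alpha: float = 1.0, beam_beta: float = 1.0) -> float:
    """Illustrative shallow-fusion score: acoustic + alpha * LM + beta * length."""
    return acoustic_logp + beam_alpha * lm_logp + beam_beta * length


# Two hypothetical candidates with made-up log-probabilities.
candidates = {
    "kaixo mundua": fused_score(acoustic_logp=-4.0, lm_logp=-6.0, length=2),
    "kaixo mundu a": fused_score(acoustic_logp=-3.8, lm_logp=-9.5, length=3),
}

# The candidate with the highest fused score wins.
best = max(candidates, key=candidates.get)
print(best)
```

A larger `beam_alpha` gives the language model more influence over the ranking, while a larger `beam_beta` favours longer hypotheses.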
133
+
134
+ ## Input
135
+ This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
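
Before transcribing, it can help to check that a file matches this format. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; `is_16khz_mono` is a hypothetical helper name and is not part of NeMo.

```python
import wave


def is_16khz_mono(path: str) -> bool:
    """Return True if the PCM WAV file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.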
136
+
137
+ ## Output
138
+ This model provides transcribed speech as a string for a given audio sample.
139
+
140
+ ## Model Architecture
141
+ The Conformer-CTC model is a non-autoregressive variant of the Conformer model [1] for automatic speech recognition that uses CTC loss and decoding instead of a Transducer. More details on this model are available here: [Conformer-CTC Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).
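
To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC decoding over per-frame token IDs: consecutive repeats are merged and blank symbols are dropped. It is a generic illustration rather than NeMo's code; `ctc_greedy_collapse` is a hypothetical name, and the blank ID is passed explicitly because its value depends on the model's vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id):
    """Collapse per-frame argmax IDs into a token sequence (greedy CTC decoding)."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            tokens.append(token_id)
        previous = token_id
    return tokens


# Frames "a a _ a b b _" with "_" as the blank (ID 0), a = 1, b = 2.
print(ctc_greedy_collapse([1, 1, 0, 1, 2, 2, 0], blank_id=0))  # [1, 1, 2]
```

The blank between the second and third frames is what allows the same token to be emitted twice in a row.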
142
+
143
+ ## Training
144
+ ### Data preparation
145
+ This model has been trained on a composite dataset comprising 548 hours of Basque speech that contains:
146
+ - A processed subset of the `validated` split of the Basque portion of the public dataset [Mozilla Common Voice 16.1](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1): the `validated` split, which originally includes the `train`, `dev` and `test` splits, was filtered to remove every sentence that also appears in the `test` split, to avoid leakage.
+ - The `train_clean` split of the Basque portion of the public dataset [Basque Parliament](https://huggingface.co/datasets/gttsehu/basque_parliament_1).
+ - A processed subset of the Basque portion of the public dataset [OpenSLR](https://huggingface.co/datasets/openslr#slr76-crowdsourced-high-quality-basque-speech-data-set), cleaned of numerical characters and acronyms.
149
+
150
+ The composite training dataset has been carefully cleaned of any sentence that also appears in the `test` sets on which the WER metrics are computed.
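
The leakage filtering described above can be sketched as a set-membership check on normalized transcripts. This is an illustration under assumptions, not the authors' script: the row layout (`audio`, `text`), the normalization (lowercasing and whitespace collapsing) and the name `drop_test_overlap` are all hypothetical.

```python
def drop_test_overlap(train_rows, test_sentences):
    """Keep only training rows whose transcript does not appear in the test set."""
    def normalize(text):
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    held_out = {normalize(sentence) for sentence in test_sentences}
    return [row for row in train_rows if normalize(row["text"]) not in held_out]


train = [
    {"audio": "a.wav", "text": "Kaixo mundua"},
    {"audio": "b.wav", "text": "egun on"},
]
test = ["kaixo  mundua"]

filtered = drop_test_overlap(train, test)
print(filtered)  # only the "egun on" row remains
```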
151
+
152
+ ### Training procedure
153
+ This model was fine-tuned from the pre-trained Spanish model [stt_es_conformer_ctc_large](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_ctc_large) for several hundred epochs on a GPU, using the NeMo toolkit [3].
+ The tokenizer for this model was built from the text transcripts of the composite training dataset with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py), yielding a SentencePiece Unigram [2] vocabulary of 256 Basque tokens.
155
+
156
+ ## Performance
157
+ The following table reports the model's performance in terms of Word Error Rate (WER, %) with greedy decoding.
158
+ | Tokenizer | Vocabulary Size | MCV 16.1 Test | Basque Parliament Test | Basque Parliament Dev | Train Dataset |
159
+ |-----------------------|-----------------|---------------|------------------------|-----------------------|------------------------------|
160
+ | SentencePiece Unigram | 256 | 4.72 | 4.51 | 4.85 | Composite Dataset (548 h) |
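
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions and insertions) between the hypothesis and the reference, divided by the number of reference words. The function below is a minimal single-utterance sketch and is not the evaluation code used for the table; `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the processed ref prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


print(word_error_rate("kaixo mundua", "kaixo mundu"))  # 50.0
```

Over a whole test set, WER is normally computed by summing the errors and the reference words across all utterances before dividing, rather than by averaging per-utterance values.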
161
+
162
+ An n-gram language model was trained using the [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/asr_language_modeling/ngram_lm/train_kenlm.py) provided in the NeMo toolkit [3], on a corpus comprising 27 million Basque sentences from accessible open sources such as:
163
+ - Tatoeba, OpenSubtitles, TED, GlobalVoices, and other corpora from [OPUS](https://opus.nlpl.eu/)
164
+ - [Wikipedia dump (2023-09-20)](https://dumps.wikimedia.org/euwiki/)
165
+ - [EusCrawl 1.0](https://ixa.ehu.eus/euscrawl/)
166
+
167
+ The following table reports the model's performance in terms of Word Error Rate (WER, %) with beam-search decoding and the n-gram language model.
168
+
169
+ | N-gram Order | Beam Size | Beam Alpha | Beam Beta | MCV 16.1 Test | Basque Parliament Test | Basque Parliament Dev |
+ |--------------|-----------|------------|-----------|---------------|------------------------|-----------------------|
+ | 6            | 32        | 1          | 1         | 2.42          | 4.21                   | 4.3                   |
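
To put the two tables side by side, the short sketch below computes the relative WER reduction obtained by moving from greedy decoding to beam search with the n-gram language model. The WER values are copied from the tables in this section; `relative_reduction` is a hypothetical helper name.

```python
# WER (%) values copied from the greedy and beam-search tables above.
greedy = {"MCV 16.1 Test": 4.72, "Basque Parliament Test": 4.51, "Basque Parliament Dev": 4.85}
beam_lm = {"MCV 16.1 Test": 2.42, "Basque Parliament Test": 4.21, "Basque Parliament Dev": 4.3}


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before


for name in greedy:
    gain = relative_reduction(greedy[name], beam_lm[name])
    print(f"{name}: {gain:.1f}% relative WER reduction")
```

On these figures the reduction is about 48.7% on MCV 16.1 Test, 6.7% on Basque Parliament Test and 11.3% on Basque Parliament Dev.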
172
+
173
+ ## Limitations
174
+ Since this model was trained almost entirely on publicly available speech datasets, its performance might degrade on speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse on accented speech.
175
+
176
+ # Additional Information
177
+ ## Author
178
+ HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.
179
+
180
+ ## Copyright
181
+ Copyright (c) 2024 HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.
182
+
183
+ ## Licensing Information
184
+ [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
185
+
186
+ ## Funding
187
+ This project, with reference 2022/TL22/00215335, has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU [ILENIA](https://proyectoilenia.es/) and by the project [IkerGaitu](https://www.hitz.eus/iker-gaitu/), funded by the Basque Government.
188
+
189
+ ## References
190
+ - [1] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
191
+ - [2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
192
+ - [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
193
+
194
+ ## Disclaimer
195
+ <details>
196
+ <summary>Click to expand</summary>
197
+ The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
198
+
199
+ When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
200
+
201
+ In no event shall the owner and creator of the models (Aholab Signal Processing Laboratory from HiTZ: Basque Center for Language Technology at the UPV/EHU) be liable for any results arising from the use made by third parties of these models.
+ </details>