---
language:
- ur
library_name: nemo
datasets:
- mozilla-foundation/common_voice_12_0
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- Conformer
- pytorch
- NeMo
license: cc-by-4.0
widget:
- Title: Common Voice Urdu Sample
  src: https://cdn-media.huggingface.co/speech_samples/sample_urdu.flac
model-index:
- name: parakeet-rnnt-0.6b-urdu
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 12.0 (Urdu)
      type: mozilla-foundation/common_voice_12_0
      split: test
      args:
        language: ur
    metrics:
    - name: Test WER
      type: wer
      value: 25.513
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
# Fine-Tuned Parakeet RNNT 0.6B (Urdu)

This repository contains a fine-tuned version of the **Parakeet RNNT 0.6B** model for **Urdu** automatic speech recognition (ASR). The base model, developed by **NVIDIA NeMo** and **Suno.ai**, was fine-tuned on the Urdu subset of Mozilla's Common Voice 12.0 so that it can transcribe Urdu speech to text.

---
## Model Overview

**Parakeet RNNT** is an XL version of the FastConformer Transducer with about **600 million parameters**, built for ASR. The fine-tuned model transcribes Urdu speech and can be used for applications such as subtitling, speech analytics, and voice-assisted interfaces.

Details of the base model are available on 🤗 [Hugging Face](https://huggingface.co/nvidia/parakeet-rnnt-0.6b).

---
## Training Details

### Dataset

Fine-tuning used the **Urdu subset** of Mozilla's [Common Voice 12.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0). Common Voice is a crowd-sourced corpus of read speech, so the training data covers a range of Urdu speakers.

### Hardware

- **Google Colab Pro**
- **NVIDIA A100 GPU**

---
## Results

The model achieved a **Word Error Rate (WER)** of **25.513%** on the test split of the Common Voice Urdu dataset. Some transcriptions match the reference exactly, while others differ from it only in word spacing or spelling:

- **Reference**: کچھ بھی ہو سکتا ہے۔

  **Predicted**: کچھ بھی ہو سکتا ہے۔

- **Reference**: اورکوئی جمہوریت کو کوس رہا ہے۔

  **Predicted**: اور کوئ جمہوریت کو کو س رہا ہے۔

This WER is slightly higher than that of OpenAI's **Whisper** model, which is reported to achieve **23%** without fine-tuning ([reference](https://arxiv.org/html/2409.11252v1)). Further fine-tuning of the Parakeet RNNT may narrow this gap.
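
The card does not say which tool produced the WER figure, so the following is only a minimal sketch of the standard definition: the word-level edit distance (substitutions, insertions, deletions) summed over all utterances, divided by the total number of reference words. The function names `word_edit_distance` and `wer` are illustrative, and splitting on whitespace without any text normalization is an assumption. Under this definition a word that is split or joined differently from the reference, as in the second sample above, counts as one or more full word errors.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word edits / total reference words."""
    edits = sum(
        word_edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    total_words = sum(len(r.split()) for r in references)
    return edits / total_words


# Toy example: one substituted word out of four reference words.
print(wer(["a b c d"], ["a x c d"]))
```
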

---

## How to Use this Model

### Loading the Model

Load the fine-tuned model with NVIDIA NeMo:

```python
import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and restore the model.
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="hash2004/parakeet-fine-tuned-urdu")
```
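
Once the model is loaded, transcription is a call to `transcribe` with a list of audio file paths. The sketch below is not taken from the original card. It assumes 16 kHz mono WAV input, as the base model card specifies, and it assumes that the return type of `transcribe` varies between NeMo releases (plain strings, hypothesis objects with a `.text` attribute, or a `(best, all)` tuple), so it normalizes all three. The helper names and the file name `sample_urdu.wav` are hypothetical.

```python
from typing import Any, List


def extract_texts(result: Any) -> List[str]:
    """Normalize the return value of `transcribe` to a list of strings.

    Assumption: the result is a list of strings, a list of objects with a
    `.text` attribute, or a tuple whose first element is such a list.
    """
    if isinstance(result, tuple):
        result = result[0]  # keep only the best hypotheses
    return [item if isinstance(item, str) else item.text for item in result]


def transcribe_files(asr_model: Any, paths: List[str]) -> List[str]:
    """Transcribe audio files and return one string per file."""
    return extract_texts(asr_model.transcribe(paths))


def load_and_transcribe(paths: List[str]) -> List[str]:
    """Load the fine-tuned checkpoint and transcribe the given files."""
    # Imported lazily so the helpers above work without NeMo installed.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
        model_name="hash2004/parakeet-fine-tuned-urdu"
    )
    return transcribe_files(asr_model, paths)
```

For example, `load_and_transcribe(["sample_urdu.wav"])` would return a one-element list with the transcription of that (hypothetical) file.
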
## How to Fine-Tune this Model

All resources for fine-tuning the Parakeet RNNT (0.6B) model are available in [this GitHub repository](https://github.com/hash2004/conformer-fine-tuned-urdu).
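
NeMo ASR training typically reads its data from JSON-lines manifest files, where each line holds an `audio_filepath`, a `duration` in seconds, and the transcript `text`. The card does not describe how the data was prepared, so treat the following as a hedged sketch of that manifest format and defer to the linked repository for the actual procedure. The helper names and the example path are hypothetical.

```python
import json
from typing import Iterable, Tuple


def to_manifest_line(audio_filepath: str, duration: float, text: str) -> str:
    """Serialize one utterance as a single JSON line."""
    entry = {"audio_filepath": audio_filepath, "duration": duration, "text": text}
    # ensure_ascii=False keeps the Urdu text readable in the file.
    return json.dumps(entry, ensure_ascii=False)


def write_manifest(rows: Iterable[Tuple[str, float, str]], out_path: str) -> int:
    """Write (path, duration, text) rows to a manifest; return the row count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_filepath, duration, text in rows:
            f.write(to_manifest_line(audio_filepath, duration, text) + "\n")
            count += 1
    return count


# Hypothetical clip path and duration, paired with a sentence from the card.
print(to_manifest_line("clips/sample_0001.wav", 2.4, "کچھ بھی ہو سکتا ہے۔"))
```
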