---
language:
  - ur
library_name: nemo
datasets:
  - mozilla-foundation/common_voice_12_0
thumbnail: null
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - Transducer
  - FastConformer
  - Conformer
  - pytorch
  - NeMo
license: cc-by-4.0
widget:
  - example_title: Common Voice Urdu Sample
    src: https://cdn-media.huggingface.co/speech_samples/sample_urdu.flac
model-index:
  - name: parakeet-rnnt-0.6b-urdu
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Mozilla Common Voice 12.0 (Urdu)
          type: mozilla-foundation/common_voice_12_0
          split: test
          args:
            language: ur
        metrics:
          - name: Test WER
            type: wer
            value: 25.513
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
---

# Fine-Tuned Parakeet RNNT 0.6B (Urdu)

This repository contains a fine-tuned version of the Parakeet RNNT 0.6B model for Urdu automatic speech recognition (ASR). The base model, developed by NVIDIA NeMo and Suno.ai, was fine-tuned on the Urdu subset of Mozilla's Common Voice 12.0, adapting it to Urdu speech-to-text with improved accuracy on this domain.


## Model Overview

The Parakeet RNNT is an XL version of the FastConformer Transducer with 600 million parameters, optimized for ASR tasks. The fine-tuned model supports Urdu transcription, enabling applications such as subtitling, speech analytics, and voice-assisted interfaces.

Details of the base model are available on 🤗 Hugging Face: [nvidia/parakeet-rnnt-0.6b](https://huggingface.co/nvidia/parakeet-rnnt-0.6b).


## Training Details

### Dataset

Fine-tuning used the Urdu subset of Mozilla's Common Voice 12.0, which provides diverse, crowd-sourced Urdu speech samples.
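
If you want to inspect the data or reproduce the evaluation, the same Urdu split can be pulled with the 🤗 `datasets` library. The snippet below is a minimal sketch and is not part of the original training code; the dataset is gated, so you must accept its terms on the Hub and authenticate (e.g. via `huggingface-cli login`) before downloading.

```python
# Minimal sketch: load the gated Common Voice 12.0 Urdu test split and resample
# the audio to the 16 kHz expected by NeMo ASR models. Requires prior
# authentication with the Hugging Face Hub and acceptance of the dataset terms.
from datasets import Audio, load_dataset

cv_urdu_test = load_dataset("mozilla-foundation/common_voice_12_0", "ur", split="test")
cv_urdu_test = cv_urdu_test.cast_column("audio", Audio(sampling_rate=16_000))

print(cv_urdu_test[0]["sentence"])  # reference transcription of the first clip
```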

### Hardware

- Google Colab Pro
- NVIDIA A100 GPU

### Results

The model achieved a Word Error Rate (WER) of 25.513% on the test split of the Common Voice Urdu dataset. While this figure may seem high, the model transcribes many utterances accurately, as in the examples below:

- **Reference:** کچھ بھی ہو سکتا ہے۔  
  **Predicted:** کچھ بھی ہو سکتا ہے۔

- **Reference:** اورکوئی جمہوریت کو کوس رہا ہے۔  
  **Predicted:** اور کوئ جمہوریت کو کو س رہا ہے۔

This WER is slightly higher than that of OpenAI's Whisper model, which achieved 23% without fine-tuning (reference), but it demonstrates the potential of the Parakeet RNNT with further fine-tuning.
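
As a small illustration of how a WER figure like this is computed, the `jiwer` package can score the two example pairs above. This is not the original evaluation script, and the exact text normalization behind the reported 25.513% is not reproduced here.

```python
# Illustrative only: score the two reference/prediction pairs shown above with
# jiwer. WER = (substitutions + deletions + insertions) / reference words,
# computed over whitespace-separated tokens.
import jiwer

references = [
    "کچھ بھی ہو سکتا ہے۔",
    "اورکوئی جمہوریت کو کوس رہا ہے۔",
]
predictions = [
    "کچھ بھی ہو سکتا ہے۔",
    "اور کوئ جمہوریت کو کو س رہا ہے۔",
]

print(f"WER: {jiwer.wer(references, predictions):.3f}")
```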


## How to Use this Model

### Loading the Model

You can load the fine-tuned model using NVIDIA NeMo:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    model_name="hash2004/parakeet-fine-tuned-urdu"
)
```
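
Once loaded, you can transcribe audio files directly. The snippet below is a usage sketch: `audio.wav` is a placeholder for any 16 kHz mono recording, and the exact return format of `transcribe()` (plain strings, a tuple of best and alternative hypotheses for RNNT models, or hypothesis objects) varies between NeMo releases.

```python
# Usage sketch: transcribe a local file. "audio.wav" is a placeholder path;
# inspect the returned object on your NeMo version, since the output format
# of transcribe() differs between releases.
result = asr_model.transcribe(["audio.wav"])
print(result)
```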

## How to Fine Tune this Model

You can find all resources on fine-tuning the Parakeet RNNT (0.6B) model in this GitHub repository.
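
For orientation, the typical NeMo fine-tuning flow looks roughly like the sketch below. The tokenizer directory, manifest paths, and trainer settings are hypothetical placeholders, not the exact values used for this model; see the repository above for the full recipe.

```python
# Condensed sketch of a NeMo RNNT fine-tuning setup (paths and hyperparameters
# are hypothetical). Assumes an Urdu SentencePiece/BPE tokenizer and NeMo-style
# JSON manifests have already been prepared.
import pytorch_lightning as pl  # newer NeMo releases use `lightning.pytorch`
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Start from the English base checkpoint.
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    model_name="nvidia/parakeet-rnnt-0.6b"
)

# Swap the tokenizer for one trained on Urdu text (hypothetical path).
asr_model.change_vocabulary(
    new_tokenizer_dir="tokenizers/urdu_bpe",
    new_tokenizer_type="bpe",
)

# Point the model at Common Voice Urdu manifests (hypothetical paths).
asr_model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "manifests/cv12_ur_train.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}))
asr_model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "manifests/cv12_ur_dev.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
}))

# Train with PyTorch Lightning (illustrative settings).
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=50)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```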