---
language:
  - ur
library_name: nemo
datasets:
  - mozilla-foundation/common_voice_12_0
thumbnail: null
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - Transducer
  - FastConformer
  - Conformer
  - pytorch
  - NeMo
license: cc-by-4.0
widget:
  - example_title: Common Voice Urdu Sample
    src: https://cdn-media.huggingface.co/speech_samples/sample_urdu.flac
model-index:
  - name: parakeet-rnnt-0.6b-urdu
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Mozilla Common Voice 12.0 (Urdu)
          type: mozilla-foundation/common_voice_12_0
          split: test
          args:
            language: ur
        metrics:
          - name: Test WER
            type: wer
            value: 25.513
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
---

# Fine-Tuned Parakeet RNNT 0.6B (Urdu)

This repository contains a fine-tuned version of the Parakeet RNNT 0.6B model for Urdu automatic speech recognition (ASR). The base model, developed by NVIDIA NeMo and Suno.ai, was fine-tuned on the Urdu subset of Mozilla's Common Voice 12.0, adapting it to Urdu speech-to-text with improved accuracy on this domain.


## Model Overview

The Parakeet RNNT is an XL version of the FastConformer Transducer with 600 million parameters, optimized for ASR tasks. The fine-tuned model supports Urdu transcription, enabling applications such as subtitling, speech analytics, and voice-assisted interfaces.

Details of the base model are available on 🤗 Hugging Face: [nvidia/parakeet-rnnt-0.6b](https://huggingface.co/nvidia/parakeet-rnnt-0.6b).


## Training Details

### Dataset

Fine-tuning used the Urdu subset of Mozilla's Common Voice 12.0, which provides diverse, crowd-sourced Urdu speech samples.
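
If you want to inspect the data or reproduce the evaluation, the same Urdu split can be pulled with the 🤗 `datasets` library. The snippet below is a minimal sketch and is not part of the original training code; the dataset is gated, so you must accept its terms on the Hub and authenticate (e.g. via `huggingface-cli login`) before downloading.

```python
# Minimal sketch: load the gated Common Voice 12.0 Urdu test split and resample
# the audio to the 16 kHz expected by NeMo ASR models. Requires prior
# authentication with the Hugging Face Hub and acceptance of the dataset terms.
from datasets import Audio, load_dataset

cv_urdu_test = load_dataset("mozilla-foundation/common_voice_12_0", "ur", split="test")
cv_urdu_test = cv_urdu_test.cast_column("audio", Audio(sampling_rate=16_000))

print(cv_urdu_test[0]["sentence"])  # reference transcription of the first clip
```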

### Hardware

- Google Colab Pro
- NVIDIA A100 GPU

### Results

The model achieved a Word Error Rate (WER) of 25.513% on the test split of the Common Voice Urdu dataset. While this figure may seem high, the model transcribes many utterances accurately, as in the examples below:

- **Reference:** کچھ بھی ہو سکتا ہے۔  
  **Predicted:** کچھ بھی ہو سکتا ہے۔

- **Reference:** اورکوئی جمہوریت کو کوس رہا ہے۔  
  **Predicted:** اور کوئ جمہوریت کو کو س رہا ہے۔

This WER is slightly higher than that of OpenAI's Whisper model, which achieved 23% without fine-tuning (reference), but it demonstrates the potential of the Parakeet RNNT with further fine-tuning.
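
As a small illustration of how a WER figure like this is computed, the `jiwer` package can score the two example pairs above. This is not the original evaluation script, and the exact text normalization behind the reported 25.513% is not reproduced here.

```python
# Illustrative only: score the two reference/prediction pairs shown above with
# jiwer. WER = (substitutions + deletions + insertions) / reference words,
# computed over whitespace-separated tokens.
import jiwer

references = [
    "کچھ بھی ہو سکتا ہے۔",
    "اورکوئی جمہوریت کو کوس رہا ہے۔",
]
predictions = [
    "کچھ بھی ہو سکتا ہے۔",
    "اور کوئ جمہوریت کو کو س رہا ہے۔",
]

print(f"WER: {jiwer.wer(references, predictions):.3f}")
```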


## How to Use this Model

### Loading the Model

You can load the fine-tuned model using NVIDIA NeMo:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    model_name="hash2004/parakeet-fine-tuned-urdu"
)
```
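
Once loaded, you can transcribe audio files directly. The snippet below is a usage sketch: `audio.wav` is a placeholder for any 16 kHz mono recording, and the exact return format of `transcribe()` (plain strings, a tuple of best and alternative hypotheses for RNNT models, or hypothesis objects) varies between NeMo releases.

```python
# Usage sketch: transcribe a local file. "audio.wav" is a placeholder path;
# inspect the returned object on your NeMo version, since the output format
# of transcribe() differs between releases.
result = asr_model.transcribe(["audio.wav"])
print(result)
```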

## How to Fine Tune this Model

You can find all resources on fine-tuning the Parakeet RNNT (0.6B) model in this GitHub repository.
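
For orientation, the typical NeMo fine-tuning flow looks roughly like the sketch below. The tokenizer directory, manifest paths, and trainer settings are hypothetical placeholders, not the exact values used for this model; see the repository above for the full recipe.

```python
# Condensed sketch of a NeMo RNNT fine-tuning setup (paths and hyperparameters
# are hypothetical). Assumes an Urdu SentencePiece/BPE tokenizer and NeMo-style
# JSON manifests have already been prepared.
import pytorch_lightning as pl  # newer NeMo releases use `lightning.pytorch`
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Start from the English base checkpoint.
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    model_name="nvidia/parakeet-rnnt-0.6b"
)

# Swap the tokenizer for one trained on Urdu text (hypothetical path).
asr_model.change_vocabulary(
    new_tokenizer_dir="tokenizers/urdu_bpe",
    new_tokenizer_type="bpe",
)

# Point the model at Common Voice Urdu manifests (hypothetical paths).
asr_model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "manifests/cv12_ur_train.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}))
asr_model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "manifests/cv12_ur_dev.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
}))

# Train with PyTorch Lightning (illustrative settings).
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=50)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```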