---
language:
- ur
library_name: nemo
datasets:
- mozilla-foundation/common_voice_12_0
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- Conformer
- pytorch
- NeMo
license: cc-by-4.0
widget:
- example_title: Common Voice Urdu Sample
src: https://cdn-media.huggingface.co/speech_samples/sample_urdu.flac
model-index:
- name: parakeet-rnnt-0.6b-urdu
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 12.0 (Urdu)
type: mozilla-foundation/common_voice_12_0
split: test
args:
language: ur
metrics:
- name: Test WER
type: wer
value: 25.513
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
# Fine-Tuned Parakeet RNNT 0.6B (Urdu)
This repository contains a fine-tuned version of the **Parakeet RNNT 0.6B** model for **Urdu** Automatic Speech Recognition (ASR). The base model, developed by **NVIDIA NeMo** and **Suno.ai**, was fine-tuned on the Urdu subset of Mozilla's Common Voice 12.0. This fine-tuning adapts the model to Urdu speech-to-text and improves its transcription accuracy on Urdu audio.
---
## Model Overview
The **Parakeet RNNT** is an XL version of the FastConformer Transducer with **600 million parameters**, optimized for ASR tasks. The fine-tuned model supports Urdu transcription, enabling applications such as subtitling, speech analytics, and voice-assisted interfaces.
Base model details can be found on 🤗 [Hugging Face](https://huggingface.co/nvidia/parakeet-rnnt-0.6b).
---
## Training Details
### Dataset
The fine-tuning was performed using the **Urdu subset** of Mozilla's [Common Voice 12.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0), which provides crowd-sourced Urdu speech from a range of speakers.
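As a rough sketch of the data-loading step (assuming the 🤗 `datasets` library; NeMo training itself consumes JSON manifests, so this is only how the raw split can be inspected), the Urdu data can be loaded like this:

```python
from datasets import load_dataset, Audio

# Common Voice 12.0 is gated: accept the terms on the dataset page and
# log in (e.g. `huggingface-cli login`) before loading.
cv_urdu = load_dataset("mozilla-foundation/common_voice_12_0", "ur", split="train")

# Resample to 16 kHz, the sample rate expected by the ASR model
cv_urdu = cv_urdu.cast_column("audio", Audio(sampling_rate=16000))

print(cv_urdu[0]["sentence"])
```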
### Hardware
- **Google Colab Pro**
- **NVIDIA A100 GPU**
---
## Results
The model achieved a **Word Error Rate (WER)** of **25.513%** on the test split of the Common Voice Urdu dataset. While this may seem high, many transcriptions are accurate or nearly accurate, as the examples below show:
- **Reference**: کچھ بھی ہو سکتا ہے۔
**Predicted**: کچھ بھی ہو سکتا ہے۔
---
- **Reference**: اورکوئی جمہوریت کو کوس رہا ہے۔
**Predicted**: اور کوئ جمہوریت کو کو س رہا ہے۔
This WER is slightly higher than that of OpenAI's **Whisper** model, which achieved roughly **23%** on Urdu without fine-tuning ([reference](https://arxiv.org/html/2409.11252v1)), but it demonstrates the potential of the Parakeet RNNT for Urdu given further fine-tuning.
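For illustration, a WER of this kind can be computed with the `jiwer` package over reference/hypothesis pairs such as the examples above (an assumed dependency here, not necessarily what produced the reported score; NeMo's own `word_error_rate` helper is an alternative):

```python
from jiwer import wer

# Reference/hypothesis pairs taken from the examples above
references = ["کچھ بھی ہو سکتا ہے۔", "اورکوئی جمہوریت کو کوس رہا ہے۔"]
hypotheses = ["کچھ بھی ہو سکتا ہے۔", "اور کوئ جمہوریت کو کو س رہا ہے۔"]

print(f"WER: {wer(references, hypotheses) * 100:.2f}%")
```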
---
## How to Use this Model
### Loading the Model
You can load the fine-tuned model using NVIDIA NeMo:
```python
import nemo.collections.asr as nemo_asr
# Load the fine-tuned Urdu checkpoint from the Hugging Face Hub
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="hash2004/parakeet-fine-tuned-urdu")
```
```
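### Transcribing Audio
Once loaded, the model can transcribe 16 kHz mono audio files. The file path below is a placeholder, and the exact return type of `transcribe()` varies across NeMo versions:

```python
# Transcribe one or more audio files (path is a placeholder)
transcriptions = asr_model.transcribe(["sample_urdu.wav"])
print(transcriptions)
```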
## How to Fine-Tune this Model
You can find all resources on fine-tuning the Parakeet RNNT (0.6B) model on [this GitHub Repository](https://github.com/hash2004/conformer-fine-tuned-urdu).
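For orientation only, a minimal sketch of what a NeMo fine-tuning run can look like. The tokenizer directory, manifest paths, and hyperparameters below are placeholders, not the exact settings used for this checkpoint; see the repository above for the full recipe:

```python
import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl
from omegaconf import OmegaConf

# Start from the pretrained base checkpoint
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/parakeet-rnnt-0.6b")

# Swap in a BPE tokenizer trained on Urdu text (placeholder path)
asr_model.change_vocabulary(new_tokenizer_dir="tokenizers/urdu_bpe", new_tokenizer_type="bpe")

# Point the model at NeMo-style JSON manifests of the Common Voice Urdu split (placeholder path)
train_cfg = OmegaConf.create({
    "manifest_filepath": "manifests/cv12_ur_train.json",
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": True,
})
asr_model.setup_training_data(train_data_config=train_cfg)

# Fine-tune on a single GPU
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=10)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```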