Overview
This is a relatively small model trained on the deduplicated Sinhala subset of the OSCAR dataset. Because Sinhala is a low-resource language, only a handful of models have been trained for it, so this model is a useful starting point for fine-tuning on downstream tasks.
Model Specification
The model is a RoBERTa architecture with the following configuration:
- vocab_size=52000
- max_position_embeddings=514
- num_attention_heads=12
- num_hidden_layers=6
- type_vocab_size=1
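As a rough sketch, the values listed above map directly onto the keyword arguments of `RobertaConfig` in the `transformers` library. The card does not state the remaining hyperparameters (hidden size, intermediate size, dropout), so this example assumes they stay at the library defaults, which may not match the released checkpoint.

```python
from transformers import RobertaConfig

# Values taken from the specification list above.
# Anything not listed (e.g. hidden_size) is left at the library default,
# which is an assumption and not confirmed by the model card.
config = RobertaConfig(
    vocab_size=52000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

print(config.num_hidden_layers, config.num_attention_heads, config.vocab_size)
```

With 6 hidden layers instead of the 12 used by the base RoBERTa configuration, this gives a model with roughly half the transformer depth.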
How to Use
You can use this model directly with a pipeline for masked language modeling:
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the masked-language-model head and its tokenizer from the Hub
model = AutoModelForMaskedLM.from_pretrained("keshan/SinhalaBERTo")
tokenizer = AutoTokenizer.from_pretrained("keshan/SinhalaBERTo")

# Build a fill-mask pipeline and predict the token hidden behind <mask>
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
fill_mask("මම ගෙදර <mask>.")
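The fill-mask pipeline returns a list of candidate dictionaries, each carrying a `score`, a `token_str`, and the completed `sequence`. The helper below is a minimal sketch for turning that list into readable lines; `format_predictions` is a hypothetical name introduced here for illustration and is not part of `transformers`.

```python
def format_predictions(predictions, top_k=5):
    """Return the top_k fill-mask candidates as tab-separated text lines.

    `predictions` is expected to be the list of dicts produced by a
    fill-mask pipeline for a single masked input.
    """
    # Sort by score, highest first, in case the caller passes an unsorted list.
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [
        f"{p['score']:.4f}\t{p['token_str'].strip()}\t{p['sequence']}"
        for p in ranked[:top_k]
    ]

# Usage, assuming `fill_mask` was created as shown above:
# for line in format_predictions(fill_mask("මම ගෙදර <mask>.")):
#     print(line)
```

Each output line shows the probability, the predicted token, and the full sentence with the mask filled in.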