finewebedu_32000
About
🇬🇧 An English tokenizer, trained on the FineWeb-Edu dataset.
Description
This is a character-level, mainly English (en) tokenizer, trained on the CC-MAIN-2024-10 subset of FineWeb-Edu. It has a vocabulary size of 32,000 — a multiple of 128, which keeps the embedding and output layers of models that use it efficient on modern hardware.
Usage
import tokenizers
tokenizer = tokenizers.Tokenizer.from_pretrained("gvlassis/finewebedu_32000")
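Once loaded, the tokenizer follows the standard `tokenizers` API (`encode`, `decode`, `token_to_id`). As a minimal, self-contained sketch of that API — training a tiny in-memory BPE tokenizer rather than downloading the 32,000-entry vocabulary, and using made-up example text — usage looks like this:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative only: a tiny BPE tokenizer trained in memory, so the example
# runs offline. The real finewebedu_32000 tokenizer is loaded with
# Tokenizer.from_pretrained as shown above and has a 32,000-token vocabulary.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(
    ["an english tokenizer trained on the fineweb edu dataset"] * 20, trainer
)

# Encode a string: the Encoding object exposes the tokens and their ids.
encoding = tokenizer.encode("english tokenizer")
print(encoding.tokens)
print(encoding.ids)

# Decode ids back to text.
print(tokenizer.decode(encoding.ids))
```

The same `encode`/`decode` calls work unchanged on the pretrained tokenizer once it is fetched from the Hub.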