Tokenizer inconsistencies in GemmaTokenizerFast

#76
by sanderland - opened

The Hugging Face tokenizer gives different results from the SentencePiece tokenizer, probably due to a regex preprocessor.
Notable tokens affected include HTML tags, which seem to have been added manually to the vocabulary.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
token_ids = tokenizer.encode('What is <tbody>? "<tbody>" is an html tag')
[(i,tokenizer.decode([i])) for i in token_ids]

gives

[(2, '<bos>'),
 (1841, 'What'),
 (603, ' is'),
 (968, ' <'),
 (80309, 'tbody'),
 (93540, '>?'),
 (15114, ' "<'),
 (80309, 'tbody'),
 (28760, '>"'),
 (603, ' is'),
 (671, ' an'),
 (11060, ' html'),
 (5886, ' tag')]

Whereas using

import sentencepiece as spm

vocab = spm.SentencePieceProcessor()
vocab.Load("gemma_tokenizer.model")  # local copy of the SentencePiece model file
input_ids = vocab.EncodeAsIds('What is <tbody>? "<tbody>" is an html tag')
[(i, vocab.DecodeIds([i])) for i in input_ids]

gives

[(1841, 'What'),
 (603, ' is'),
 (235248, ' '),
 (172, '<tbody>'),
 (235336, '?'),
 (664, ' "'),
 (172, '<tbody>'),
 (235281, '"'),
 (603, ' is'),
 (671, ' an'),
 (11060, ' html'),
 (5886, ' tag')]
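To make the difference easier to see, here is a minimal sketch that compares the two ID sequences reported above and finds where they first diverge. The IDs are copied from the two outputs; the helper names `strip_bos` and `first_divergence` are made up for illustration and are not part of transformers or sentencepiece. The leading `<bos>` (ID 2) is dropped first, since only the Hugging Face output includes it.

```python
from itertools import zip_longest

# Token IDs copied from the two outputs above.
fast_ids = [2, 1841, 603, 968, 80309, 93540, 15114, 80309, 28760,
            603, 671, 11060, 5886]
spm_ids = [1841, 603, 235248, 172, 235336, 664, 172, 235281,
           603, 671, 11060, 5886]

BOS_ID = 2  # '<bos>' in the Hugging Face output above


def strip_bos(ids, bos_id=BOS_ID):
    """Drop a leading BOS token so both sequences start at the same place."""
    return ids[1:] if ids and ids[0] == bos_id else ids


def first_divergence(a, b):
    """Return the first index where a and b differ, or None if they are equal."""
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None


idx = first_divergence(strip_bos(fast_ids), spm_ids)
print(idx, strip_bos(fast_ids)[idx], spm_ids[idx])
```

The sequences agree on 'What' and ' is' and split at the first `<tbody>`, where the fast tokenizer emits ' <' (968) and SentencePiece emits ' ' (235248) followed by the single token 172.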

I think the problem arises when we use the AutoTokenizer class, which by default instantiates GemmaTokenizerFast. I think the slow GemmaTokenizer tokenizes text the same way as the original spm tokenizer.

Thanks @PedramR, indeed

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", use_fast=False)

does give token 172. I've updated the title.
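For anyone who wants to check other strings, a rough sketch of a comparison helper follows. The function name `find_mismatches` is hypothetical, and it only assumes that each encoder is a callable mapping a string to a list of IDs. With transformers, the `encode` methods of the fast and slow tokenizers could be passed in, as shown in the comment; that part needs access to the model and is not run here.

```python
def find_mismatches(encode_a, encode_b, texts):
    """Return (text, ids_a, ids_b) for every text the two encoders split differently."""
    mismatches = []
    for text in texts:
        ids_a, ids_b = list(encode_a(text)), list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


# Intended usage (requires `transformers` and access to google/gemma-7b):
#   from transformers import AutoTokenizer
#   fast = AutoTokenizer.from_pretrained("google/gemma-7b")
#   slow = AutoTokenizer.from_pretrained("google/gemma-7b", use_fast=False)
#   find_mismatches(fast.encode, slow.encode,
#                   ['What is <tbody>? "<tbody>" is an html tag'])
```

An empty result means the two encoders agree on every test string.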

sanderland changed discussion title from Tokenizer inconsistencies to Tokenizer inconsistencies in GemmaFastTokenizer
sanderland changed discussion title from Tokenizer inconsistencies in GemmaFastTokenizer to Tokenizer inconsistencies in GemmaTokenizerFast

@suryabhupa Can you fix this? Thanks!

We're looking into this now, thanks for raising! Should have an update soon.

If you pull the latest transformers code, the two tokenizers should produce the same sequence. FYI, the fix is in https://github.com/huggingface/transformers/pull/29473.

Yes, @minyiccp @sanderland that should be the fix here -- let us know if it doesn't work!
