model: fix tokenization issues with spm tokenizer #10081
Conversation
Force-pushed from 6eeaebd to e139256
if id := spm.vocab.Encode(string(left.runes) + string(right.runes)); id < 0 {
	continue
}
if string(left.runes) == "" || string(right.runes) == "" || len(string(left.runes))+len(string(right.runes)) != pair.size {
This tries to stay as close to the existing implementation as possible; in a follow-up change, .runes could be replaced with a string value instead.
I believe this line was the cause of the breakage: we were comparing the len of a rune slice (which counts code points) instead of the string length (which counts UTF-8 bytes).
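For context, a minimal standalone sketch of the discrepancy (not code from this PR): in Go, len on a []rune counts code points, while len on a string counts UTF-8 bytes, so the two disagree for any multibyte character.

package main

import "fmt"

func main() {
	s := "한" // one Korean syllable: 1 code point, 3 UTF-8 bytes
	runes := []rune(s)

	fmt.Println(len(runes))         // 1: rune count
	fmt.Println(len(s))             // 3: byte count
	fmt.Println(len(string(runes))) // 3: converting back restores the byte length
}

Since pair.size is measured against the string form, comparing it to the rune count silently fails for exactly these multibyte characters.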
lgtm, just a couple of small comments.
} | ||
} | ||
|
||
ids = append(ids, result...) |
You might want to guard here in case the result slice is empty because we didn't have the byte tokens in the vocab.
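A minimal sketch of the suggested guard, assuming this sits inside the merge loop and result holds the ids produced for the current fragment (names taken from the diff above):

// If the byte tokens were missing from the vocab, result may be
// empty; skip the append (or surface an error) rather than
// silently emitting nothing for this fragment.
if len(result) == 0 {
	continue
}
ids = append(ids, result...)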
This PR fixes inconsistencies in the SPM tokenizer for Gemma 3.
Note: while this fixes tokenizing certain UTF-8 characters (e.g. certain Korean characters), it doesn't fix de-tokenizing them yet.
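A sketch of a round-trip check that would surface the remaining de-tokenization gap; the TextProcessor interface and method names here are hypothetical stand-ins for illustration, not the repo's actual API:

package tokenizer

import "fmt"

// Hypothetical interface standing in for the real tokenizer API.
type TextProcessor interface {
	Encode(s string) ([]int32, error)
	Decode(ids []int32) (string, error)
}

// roundTrip reports whether encoding then decoding reproduces the input.
func roundTrip(tp TextProcessor, input string) error {
	ids, err := tp.Encode(input)
	if err != nil {
		return err
	}
	out, err := tp.Decode(ids)
	if err != nil {
		return err
	}
	if out != input {
		// With this PR, Encode handles these characters correctly,
		// but Decode may still mangle them until de-tokenization is fixed.
		return fmt.Errorf("round trip mismatch: %q -> %q", input, out)
	}
	return nil
}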