@jmorganca jmorganca commented Apr 2, 2025

This PR fixes inconsistencies in the SPM tokenizer for Gemma 3.

Note: while this fixes tokenizing certain UTF-8 characters (e.g. certain Korean characters), it doesn't fix detokenizing them yet.

@jmorganca jmorganca requested review from pdevine and jessegross April 2, 2025 03:47
@jmorganca jmorganca force-pushed the jmorganca/spm branch 2 times, most recently from 6eeaebd to e139256 Compare April 2, 2025 03:56
if id := spm.vocab.Encode(string(left.runes) + string(right.runes)); id < 0 {
continue
}
if string(left.runes) == "" || string(right.runes) == "" || len(string(left.runes))+len(string(right.runes)) != pair.size {
Member Author

This tries to stay as close to the existing implementation as possible, but in a follow-up change .runes could be replaced with a string value instead.

I believe this line was the cause of the breakage, since we were comparing the len of a rune slice (a count of code points) instead of the string length (a count of bytes).
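
To illustrate the mismatch described above, here is a minimal standalone sketch. It is not code from this PR: the helpers `runeLen` and `byteLen` are hypothetical names, and it assumes the size being compared against is measured in bytes. In Go, `len` on a `[]rune` counts code points, while `len` on a `string` counts bytes, so the two disagree for any multi-byte UTF-8 character.

```go
package main

import "fmt"

// runeLen is a hypothetical helper: the number of code points in s.
func runeLen(s string) int { return len([]rune(s)) }

// byteLen is a hypothetical helper: the number of UTF-8 bytes in s.
func byteLen(s string) int { return len(s) }

func main() {
	// Two Korean syllables; each is one rune but three bytes in UTF-8.
	left, right := []rune("한"), []rune("국")

	// Summing rune-slice lengths counts code points.
	fmt.Println("runes:", len(left)+len(right))

	// Summing string lengths counts bytes, which is what a
	// byte-based size field would hold.
	fmt.Println("bytes:", len(string(left))+len(string(right)))

	// For ASCII-only input the two counts agree, which is why a
	// bug like this can go unnoticed with English text.
	fmt.Println(runeLen("ab") == byteLen("ab"))
}
```

For ASCII input both counts match, so a check written against rune counts would only start rejecting valid merges once multi-byte characters appear.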

Contributor

@pdevine pdevine left a comment

lgtm, just a couple of small comments.

}
}

ids = append(ids, result...)
Contributor

you might want to guard here in case the result slice is empty because we didn't have the byte tokens in the vocab.
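
One possible shape for such a guard, as a rough sketch only: `byteFallback`, the plain `map[string]int32` vocabulary, and the `<0xNN>` byte-token spelling are all assumptions made for illustration and are not taken from this PR. Note that appending an empty slice in Go is already a no-op, so the guard mainly lets the caller notice that a piece produced no ids at all.

```go
package main

import "fmt"

// byteFallback is a hypothetical helper. It looks up one byte token per
// UTF-8 byte of piece, assuming tokens are spelled like "<0xED>".
// It reports ok=false if any byte token is missing or piece is empty,
// so the caller can decide what to do instead of silently adding nothing.
func byteFallback(vocab map[string]int32, piece string) ([]int32, bool) {
	result := make([]int32, 0, len(piece))
	for _, b := range []byte(piece) {
		id, found := vocab[fmt.Sprintf("<0x%02X>", b)]
		if !found {
			return nil, false
		}
		result = append(result, id)
	}
	return result, len(result) > 0
}

func main() {
	// Toy vocabulary holding only the three bytes of "한" (ED 95 9C).
	vocab := map[string]int32{"<0xED>": 10, "<0x95>": 11, "<0x9C>": 12}

	var ids []int32
	for _, piece := range []string{"한", "국"} {
		result, ok := byteFallback(vocab, piece)
		if !ok {
			// Guard: no usable byte tokens for this piece.
			fmt.Printf("no byte tokens for %q, skipping\n", piece)
			continue
		}
		ids = append(ids, result...)
	}
	fmt.Println(ids)
}
```

Whether a missing byte token should be skipped, logged, or mapped to an unknown-token id is a design choice this sketch leaves open.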

@jmorganca jmorganca merged commit b51e0f3 into main Apr 2, 2025
8 checks passed
@jmorganca jmorganca deleted the jmorganca/spm branch April 2, 2025 20:22
halfcrazy pushed a commit to halfcrazy/ollama that referenced this pull request Jun 19, 2025
4 participants