model: fix tokenization issues with spm tokenizer #10081
Conversation
Force-pushed from 6eeaebd to e139256
if id := spm.vocab.Encode(string(left.runes) + string(right.runes)); id < 0 {
	continue
}
if string(left.runes) == "" || string(right.runes) == "" || len(string(left.runes))+len(string(right.runes)) != pair.size {
This tries to stay as close to the existing implementation as possible; in a follow-up change, .runes could be replaced with a string value instead.
I believe this line was the cause of the breakage: we were comparing the len of a rune slice (which counts code points) instead of the string length (which counts UTF-8 bytes).
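For context, a minimal standalone sketch of the discrepancy (not code from this PR): in Go, len on a []rune counts code points, while len on a string counts UTF-8 bytes, so the two disagree for any multibyte character.

package main

import "fmt"

func main() {
	s := "한" // one Korean syllable: 1 code point, 3 UTF-8 bytes
	runes := []rune(s)

	fmt.Println(len(runes))         // 1: rune count
	fmt.Println(len(s))             // 3: byte count
	fmt.Println(len(string(runes))) // 3: converting back restores the byte length
}

Since pair.size is measured against the string form, comparing it to the rune count silently fails for exactly these multibyte characters.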
lgtm, just a couple of small comments.
} | ||
} | ||
|
||
ids = append(ids, result...) |
You might want to guard here in case the result slice is empty because we didn't have the byte tokens in the vocab.
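A minimal sketch of the suggested guard, assuming this sits inside the merge loop and result holds the ids produced for the current fragment (names taken from the diff above):

// If the byte tokens were missing from the vocab, result may be
// empty; skip the append (or surface an error) rather than
// silently emitting nothing for this fragment.
if len(result) == 0 {
	continue
}
ids = append(ids, result...)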
This PR fixes inconsistencies in the SPM tokenizer for Gemma 3.
Note: while this fixes tokenizing certain UTF-8 characters (e.g. certain Korean characters), it doesn't fix de-tokenizing them yet.
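A sketch of a round-trip check that would surface the remaining de-tokenization gap; the TextProcessor interface and method names here are hypothetical stand-ins for illustration, not the repo's actual API:

package tokenizer

import "fmt"

// Hypothetical interface standing in for the real tokenizer API.
type TextProcessor interface {
	Encode(s string) ([]int32, error)
	Decode(ids []int32) (string, error)
}

// roundTrip reports whether encoding then decoding reproduces the input.
func roundTrip(tp TextProcessor, input string) error {
	ids, err := tp.Encode(input)
	if err != nil {
		return err
	}
	out, err := tp.Decode(ids)
	if err != nil {
		return err
	}
	if out != input {
		// With this PR, Encode handles these characters correctly,
		// but Decode may still mangle them until de-tokenization is fixed.
		return fmt.Errorf("round trip mismatch: %q -> %q", input, out)
	}
	return nil
}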