@JojiiOfficial JojiiOfficial commented Jun 26, 2025

Depends on #6762

This PR removes pinyin conversion (e.g. 中国 => ["zhong", "guo"]) in our new multilingual tokenizer implementation. This fixes Chinese stopword filtering: the tokens were in pinyin, but our stopword list is written in Chinese characters, so the check was effectively assert!(["是", "上去", ...].contains("shì")), which can never succeed.
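A minimal sketch of the mismatch described above (the names here are illustrative, not the actual tokenizer code): a stopword list containing Chinese characters can never match a token that was converted to pinyin.

```rust
fn main() {
    // Stopword list is written in Chinese characters ...
    let stopwords = ["是", "上去"];
    // ... but the tokenizer emitted the pinyin form instead.
    let token = "shì";

    // The membership check can never succeed, so no Chinese
    // stopword was ever filtered out.
    assert!(!stopwords.contains(&token));
}
```

Keeping the original Chinese characters as tokens makes the lookup consistent with the stopword list.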

Since pinyin is just the romanized phonetic form of a Chinese word, we can use the original Chinese word without losing any information. We even improve precision, because multiple Chinese words can map to the same pinyin, e.g. 忘记 and 旺季 both become ["wang", "ji"] (https://www.quora.com/Is-pinyin-romanization-a-bijective-map).
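The precision argument can be sketched as follows. The `to_pinyin` function below is a hypothetical stand-in covering only the two example words, to show that the mapping is lossy (not injective) while the original character strings stay distinct:

```rust
// Hypothetical pinyin mapping, hard-coded for the two example
// words only; a real converter would cover the full lexicon.
fn to_pinyin(word: &str) -> Vec<&'static str> {
    match word {
        "忘记" | "旺季" => vec!["wang", "ji"],
        _ => vec![],
    }
}

fn main() {
    // Two different words collapse to the same pinyin tokens,
    // so a pinyin-based index cannot tell them apart.
    assert_eq!(to_pinyin("忘记"), to_pinyin("旺季"));

    // The original character strings remain distinct, so indexing
    // them directly preserves the distinction.
    assert_ne!("忘记", "旺季");
}
```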

@generall generall merged commit 606b840 into implement_new_multilingual_tokenizer Jun 26, 2025
17 checks passed
@generall generall deleted the fix_chinese_stopwords branch June 26, 2025 12:46
generall added a commit that referenced this pull request Jun 26, 2025
* Implement new multilingual tokenizer

* Remove unnecessary clones

* Codespell

* filter stopwords before stemmer

* Fix Chinese stopwords (#6765)

* Fix Chinese stopwords

* remove todo

---------

Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
generall added a commit that referenced this pull request Jul 17, 2025
(same commit message as above)