@JojiiOfficial JojiiOfficial commented Jun 26, 2025

Depends on #6762

This PR removes pinyin conversion (e.g. 中国 => ["zhong", "guo"]) in our new multilingual tokenizer implementation. This fixes Chinese stopword filtering: the tokens were in pinyin, but our stopword list is written in Chinese characters, so the check was effectively assert!(["是", "上去", ...].contains("shì")), which can never succeed.
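A minimal sketch of the mismatch described above (the names here are illustrative, not the actual tokenizer code): a stopword list containing Chinese characters can never match a token that was converted to pinyin.

```rust
fn main() {
    // Stopword list is written in Chinese characters ...
    let stopwords = ["是", "上去"];
    // ... but the tokenizer emitted the pinyin form instead.
    let token = "shì";

    // The membership check can never succeed, so no Chinese
    // stopword was ever filtered out.
    assert!(!stopwords.contains(&token));
}
```

Keeping the original Chinese characters as tokens makes the lookup consistent with the stopword list.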

Since pinyin is just the romanized phonetic form of a Chinese word, we can use the original Chinese word without losing any information. We even improve precision, because multiple Chinese words can map to the same pinyin, e.g. 忘记 and 旺季 both become ["wang", "ji"] (https://www.quora.com/Is-pinyin-romanization-a-bijective-map).
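The precision argument can be sketched as follows. The `to_pinyin` function below is a hypothetical stand-in covering only the two example words, to show that the mapping is lossy (not injective) while the original character strings stay distinct:

```rust
// Hypothetical pinyin mapping, hard-coded for the two example
// words only; a real converter would cover the full lexicon.
fn to_pinyin(word: &str) -> Vec<&'static str> {
    match word {
        "忘记" | "旺季" => vec!["wang", "ji"],
        _ => vec![],
    }
}

fn main() {
    // Two different words collapse to the same pinyin tokens,
    // so a pinyin-based index cannot tell them apart.
    assert_eq!(to_pinyin("忘记"), to_pinyin("旺季"));

    // The original character strings remain distinct, so indexing
    // them directly preserves the distinction.
    assert_ne!("忘记", "旺季");
}
```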

@generall generall merged commit 606b840 into implement_new_multilingual_tokenizer Jun 26, 2025
17 checks passed
@generall generall deleted the fix_chinese_stopwords branch June 26, 2025 12:46
generall added a commit that referenced this pull request Jun 26, 2025
* Implement new multilingual tokenizer

* Remove unnecessary clones

* Codespell

* filter stopwords before stemmer

* Fix Chinese stopwords (#6765)

* Fix Chinese stopwords

* remove todo

---------

Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
generall added a commit that referenced this pull request Jul 17, 2025
(same commit message as above)