feat: Append ascii name if any 8bit UTF8 chars #9173

Merged (2 commits) on Jul 23, 2025

Conversation

richsalz
Collaborator

@richsalz richsalz commented Jul 19, 2025

Inspired by Peter Yee's earlier work.

Fixes: #7167
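
Going only by the PR title, the behavior can be sketched as follows. This is a guess at the shape of the change, not the code that was merged: the helper name `format_author` and the parenthesized output format are assumptions made for illustration.

```python
def format_author(name: str, ascii_name: str) -> str:
    """Hypothetical helper: append the ASCII name when `name` has non-ASCII characters."""
    # str.isascii() is True only if every code point is below 128.
    if name.isascii() or not ascii_name:
        return name
    # Assumed display format; the real PR may render this differently.
    return f"{name} ({ascii_name})"
```

A pure-ASCII name is returned unchanged, while any name containing a non-ASCII character gets the ASCII form appended. That includes names with a single accented Latin letter, which is the limitation raised later in the review.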

codecov bot commented Jul 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.74%. Comparing base (f380b1a) to head (43e4aba).
Report is 19 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #9173    +/-   ##
========================================
  Coverage   88.74%   88.74%            
========================================
  Files         321      320     -1     
  Lines       41853    41649   -204     
========================================
- Hits        37144    36963   -181     
+ Misses       4709     4686    -23     

@bkmgit
Contributor

bkmgit commented Jul 19, 2025

It may be helpful to later add the option of using Meng Sheng Pinyin fonts or Hanzi Pinyin fonts, since romanization to ASCII can lose information in many tonal languages. For example:

  • "mother" (mā, 妈)
  • "ant" (má, 蚂)
  • "horse" (mǎ, 马)
  • "scold" (mà, 骂)

There are libraries that will also do this, for example xpinyin. Support for other languages could be added as need arises.

@richsalz
Collaborator Author

Hiya @bkmgit, could you create a new issue with your comment? That greatly expands the scope of this rather simple approach.

@richsalz richsalz closed this Jul 20, 2025
@richsalz richsalz deleted the fix-7167 branch July 20, 2025 00:50
@richsalz richsalz restored the fix-7167 branch July 20, 2025 00:54
@richsalz
Collaborator Author

Hit wrong GH button, re-opening

Member

@jennifer-richards jennifer-richards left a comment

I think we'll likely want to allow some additional Latin characters (e.g., "É" and other accented characters generally recognizable to readers of US-ASCII) without adding the ascii name, but this is a step forward.

@richsalz
Collaborator Author

According to https://www.lookuptables.com/text/extended-ascii-table, it looks like all the accented characters are in the decimal range 128-154 in case we want to make an exception for them.

thanks for the review.

@bkmgit
Contributor

bkmgit commented Jul 20, 2025

If the language is known, pyicu has options for transliteration; there is an example in the cheatsheet.

However, it may be easier to do an NFD (canonical) decomposition of each character and check whether it contains an ASCII letter. If every character's decomposition contains an ASCII letter, keep the name; otherwise use the ASCII name. This could be done using unicodedata.
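
The per-character check described above can be sketched with the standard-library `unicodedata` module. This is a minimal illustration under stated assumptions, not code from this PR: the function names `is_ascii_recognizable` and `choose_name` are hypothetical, and the sketch uses NFD, which is the canonical decomposition form (NFC is the composed form).

```python
import unicodedata


def is_ascii_recognizable(name: str) -> bool:
    """Return True if every character is ASCII or decomposes to include an ASCII letter."""
    for ch in name:
        if ch.isascii():
            continue
        # NFD splits e.g. U+00E9 into "e" plus a combining acute accent.
        decomposed = unicodedata.normalize("NFD", ch)
        if not any(c.isascii() and c.isalpha() for c in decomposed):
            return False
    return True


def choose_name(name: str, ascii_name: str) -> str:
    """Hypothetical helper: keep `name` if recognizable, else fall back to `ascii_name`."""
    if is_ascii_recognizable(name) or not ascii_name:
        return name
    return ascii_name
```

Letters such as "Ø" and "ß" have no canonical decomposition, so this sketch treats them as unrecognizable even though many readers of US-ASCII would recognize them. They would need an explicit allow-list or a different rule.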

Ideally each person would be able to update this field since the readme of Unidecode indicates there will be many corner cases that will be difficult to cover with existing software.

@richsalz
Collaborator Author

richsalz commented Jul 20, 2025 via email

@jennifer-richards
Member

Thanks for the insights, Benson. I had looked briefly and more naively at the unicodedata module, and I agree it will be useful. I think Rich's test as implemented will tell us a lot about where we get bitten by pointless extra text in practice.

(And, perfect being enemy of the good and all, merging this will fix the entirely non-Latin text cases that inspired the issue this addresses; follow-ups that deal with additional subtleties will be welcome)

@rjsparks rjsparks merged commit 5a862b2 into ietf-tools:main Jul 23, 2025
17 checks passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 27, 2025
Development

Successfully merging this pull request may close these issues.

UTF8 only in Authors list
4 participants