feat: Append ascii name if any 8bit UTF8 chars #9173

richsalz · 2025-07-19T09:52:38Z

Inspired by Peter Yee's earlier work.

Fixes: #7167


codecov · 2025-07-19T10:19:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.74%. Comparing base (f380b1a) to head (43e4aba).
Report is 19 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #9173    +/-   ##
========================================
  Coverage   88.74%   88.74%            
========================================
  Files         321      320     -1     
  Lines       41853    41649   -204     
========================================
- Hits        37144    36963   -181     
+ Misses       4709     4686    -23


bkmgit · 2025-07-19T18:37:22Z

It may be helpful to later add the option of using Meng Sheng Pinyin fonts or Hanzi Pinyin fonts, because romanization to ASCII can lose information for many tonal languages. As an example:

"mother" (mā, 妈)
"ant" (má, 蚂)
"horse" (mǎ, 马)
"scold" (mà, 骂)

There are libraries that will also do this, for example xpinyin. Support for other languages could be added as need arises.
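To illustrate the information loss described above, here is a minimal sketch using only the standard library. It strips combining marks after canonical decomposition, which is one common way of reducing text to ASCII; this is an assumption for illustration and not necessarily how the datatracker derives its ascii name. The helper name `strip_marks` is hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# The four tonal syllables from the example above.
syllables = ["mā", "má", "mǎ", "mà"]

# Once the tone marks are gone, all four collapse to the same string.
flattened = {strip_marks(s) for s in syllables}
print(flattened)
```

Four distinct words end up with a single ASCII spelling, which is the loss the comment is pointing at.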

richsalz · 2025-07-20T00:46:42Z

Hiya @bkmgit, could you create a new issue with your comment? That greatly expands the scope of this rather simple approach.

richsalz · 2025-07-20T00:55:02Z

Hit wrong GH button, re-opening

jennifer-richards

I think we'll likely want to allow some additional Latin characters (e.g., "É" and other accented characters generally recognizable to readers of US-ASCII) without adding the ascii name, but this is a step forward.

richsalz · 2025-07-20T16:12:36Z

According to https://www.lookuptables.com/text/extended-ascii-table, it looks like all the accented characters are in the decimal range 128-154 in case we want to make an exception for them.

Thanks for the review.
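A note on the range mentioned above: 128-154 matches the layout of a legacy single-byte code page such as CP437 (an assumption about which table the linked page shows). In Unicode the same accented letters live in the Latin-1 Supplement block (U+00C0 to U+00FF), and in UTF-8 each is encoded as two bytes that both have the 0x80 bit set. The sketch below shows the difference and one hypothetical way an exception could be written; `is_latin1_letter` and `needs_ascii_name` are illustrative names, not code from this PR.

```python
def is_latin1_letter(ch: str) -> bool:
    """True for letters in the Latin-1 Supplement block (U+00C0..U+00FF).

    The multiplication and division signs in that range are not letters,
    so str.isalpha() filters them out.
    """
    return 0xC0 <= ord(ch) <= 0xFF and ch.isalpha()


def needs_ascii_name(name: str) -> bool:
    """True if name has a non-ASCII character outside the exempt range."""
    return any(not ch.isascii() and not is_latin1_letter(ch) for ch in name)


ch = "É"
print(ord(ch))             # Unicode code point, as an integer
print(ch.encode("cp437"))  # single byte in the legacy code page
print(ch.encode("utf-8"))  # two bytes in UTF-8
```

With this exception, a name such as "José" would be left alone while an entirely non-Latin name would still get the ascii name appended.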

bkmgit · 2025-07-20T19:39:30Z

If the language is known, pyicu has options for transliteration; there is an example in the cheatsheet.

However, it may be easier to do an NFC decomposition of each character and check whether it contains an ASCII letter. If all of the decompositions contain ASCII characters, keep the name; otherwise use the ascii name. This could be done using unicodedata.

Ideally each person would be able to update this field since the readme of Unidecode indicates there will be many corner cases that will be difficult to cover with existing software.
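The decomposition check suggested above might look roughly like the following. One assumption to flag: in `unicodedata`, the form that actually decomposes characters is "NFD" (NFC recomposes them), so the sketch uses NFD. The function name is hypothetical.

```python
import unicodedata


def decomposes_to_ascii_letters(name: str) -> bool:
    """True if every non-ASCII character decomposes to include an ASCII letter."""
    for ch in name:
        if ch.isascii():
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # e.g. "é" becomes "e" followed by a combining acute accent
        if not any(c.isascii() and c.isalpha() for c in decomposed):
            return False
    return True


for sample in ["José", "Søren", "妈"]:
    print(sample, decomposes_to_ascii_letters(sample))
```

Characters such as "ø" have no canonical decomposition, so they fail this check even though they are Latin letters; that is one of the corner cases a follow-up would need to decide on.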

richsalz · 2025-07-20T20:52:39Z

After checking with John Levin and John Klensin, the current test (see if any byte has the 0x80 bit set) was said to be good enough.
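For readers who have not opened the diff, the test described above can be sketched as follows. This is not the merged code: the function names and the "name (ascii)" output format are assumptions made for illustration.

```python
def has_high_bit(name: str) -> bool:
    """Return True if any byte of the UTF-8 encoding has the 0x80 bit set."""
    return any(byte & 0x80 for byte in name.encode("utf-8"))


def display_name(name: str, ascii_name: str) -> str:
    """Append the ASCII name in parentheses when name is not pure ASCII."""
    if ascii_name and ascii_name != name and has_high_bit(name):
        return f"{name} ({ascii_name})"
    return name


print(display_name("José", "Jose"))
print(display_name("Rich Salz", "Rich Salz"))
```

Because every non-ASCII code point is encoded in UTF-8 using only bytes with the high bit set, this test is equivalent to asking whether the name contains any non-ASCII character at all, which is why accented Latin names also trigger it.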

jennifer-richards · 2025-07-20T22:20:30Z

Thanks for the insights, Benson. I had looked briefly and more naively at the unicodedata module, and I agree it will be useful. I think Rich's test as implemented will tell us a lot about where we get bitten by pointless extra text in practice.

(And, perfect being enemy of the good and all, merging this will fix the entirely non-Latin text cases that inspired the issue this addresses; follow-ups that deal with additional subtleties will be welcome)

feat: Append ascii name if any 8bit UTF8 chars

c594e4c

Fixes: 7167

rjsparks requested a review from jennifer-richards July 19, 2025 10:54

Merge branch 'main' into fix-7167

43e4aba

richsalz closed this Jul 20, 2025

richsalz deleted the fix-7167 branch July 20, 2025 00:50

richsalz restored the fix-7167 branch July 20, 2025 00:54

richsalz reopened this Jul 20, 2025

bkmgit mentioned this pull request Jul 20, 2025

feat: Include tones in Mandarin name translations displayed in Pinyin #9197

Open

1 task

jennifer-richards approved these changes Jul 20, 2025


rjsparks merged commit 5a862b2 into ietf-tools:main Jul 23, 2025
17 checks passed

github-actions bot locked as resolved and limited conversation to collaborators Jul 27, 2025
