rolandwalker
Contributor

Synced up to a newer version of Markus Kuhn's wcwidth().

  • several more width-2 characters
  • many more width-0 characters
  • change control characters to width-0
  • don't change NUL but make it explicit with notes

Example improvements

  • "PRESENTATION FORM FOR VERTICAL QUESTION MARK" was width 1, now 2
  • ט֑ Tet composed with "HEBREW ACCENT ETNAHTA" was width 2, now 1
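
For readers unfamiliar with this style of width lookup, here is a minimal sketch of the general technique: sorted interval tables searched by binary search. It is not the code in this patch. The names `sketch_width`, `in_table`, `zero_width` and `double_width` are hypothetical, the tables are tiny excerpts, and returning 0 for NUL follows Markus Kuhn's reference wcwidth() rather than anything this PR states.

```c
#include <stddef.h>
#include <stdint.h>

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Inclusive range of code points. */
struct interval {
	uint32_t first;
	uint32_t last;
};

/* Illustrative excerpts only; real tables are far longer.
 * Both arrays must stay sorted for the binary search below. */
static const struct interval zero_width[] = {
	{ 0x0300, 0x036F },	/* combining diacritical marks */
	{ 0x0591, 0x05BD },	/* Hebrew accents and points, incl. U+0591 ETNAHTA */
	{ 0x200B, 0x200F },	/* zero width space through right-to-left mark */
};

static const struct interval double_width[] = {
	{ 0x1100, 0x115F },	/* Hangul Jamo initial consonants */
	{ 0xFE10, 0xFE19 },	/* vertical forms, incl. U+FE16 */
	{ 0xFF00, 0xFF60 },	/* fullwidth forms */
};

/* Binary search: return 1 if c falls inside any interval of t. */
static int in_table(uint32_t c, const struct interval *t, size_t n)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (c > t[mid].last)
			lo = mid + 1;
		else if (c < t[mid].first)
			hi = mid;
		else
			return 1;
	}
	return 0;
}

/* Hypothetical width function: guess how many terminal cells c occupies. */
int sketch_width(uint32_t c)
{
	if (c == 0)
		return 0;	/* NUL kept as an explicit case (value is an assumption) */
	if (c < 0x20 || (c >= 0x7F && c < 0xA0))
		return 0;	/* control characters treated as width 0 */
	if (in_table(c, zero_width, COUNT(zero_width)))
		return 0;
	if (in_table(c, double_width, COUNT(double_width)))
		return 2;
	return 1;
}
```

With these excerpts, `sketch_width(0xFE16)` gives 2, and U+05D8 (Tet) followed by U+0591 sums to 1, which matches the two example improvements listed above.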

@rolandwalker force-pushed the unicode-width-update branch from 10b6b24 to 42173c7 on July 19, 2017 02:14
@jonas
Owner

jonas commented Jul 19, 2017

It looks like this also "further" fixes the emoji test. (OK, I didn't QA that one very well).

BTW, I looked into switching to https://github.com/JuliaLang/utf8proc at some point. Do you know that one? While it is quite large/heavy it would also improve support for islower and isupper etc.

@rolandwalker
Contributor Author

Odd that I can't duplicate the Travis failure on OS X. Force-pushing a hack now where control characters are left as before.

I haven't used utf8proc. libicu is very complete, which is nice because Unicode is entirely made of edge cases. But libicu is mostly not UTF-8-oriented; you have to convert to/from UTF-16.

@rolandwalker force-pushed the unicode-width-update branch 2 times, most recently from c7c9091 to e38be72 on July 19, 2017 03:38
@rolandwalker changed the title from "make unicode_width() understand more Unicode characters" to "WIP make unicode_width() understand more Unicode characters" on Jul 19, 2017
 * several more width-2 characters
 * many more width-0 characters
 * change control characters to width-0
 * don't change NUL but make it explicit with notes
 * doc some apparent bugs
@rolandwalker force-pushed the unicode-width-update branch from e38be72 to 9c80109 on July 19, 2017 15:35
@rolandwalker changed the title from "WIP make unicode_width() understand more Unicode characters" to "make unicode_width() understand more Unicode characters" on Jul 19, 2017
@rolandwalker
Contributor Author

Your emoji test is great because it catches the issue which is now worked around and commented BUG. It may be a difficult bug to solve but it shouldn't be hard to narrow it down to a TODO test. I have noted some platforms, but it could easily be related to libiconv version, locale environment vars, etc.

In the meantime this patch should only improve correctness, where correctness is guessing what the terminal is going to do.

@jonas merged commit a090093 into jonas:master on Jul 20, 2017
@jonas
Owner

jonas commented Jul 20, 2017

This is amazing. Most of the Unicode/UTF-8 code was copied from ELinks with very minimal changes. Very nice to have this improved.
