-
Notifications
You must be signed in to change notification settings - Fork 3.4k
[C++] Implement standalone Unicode encoding and decoding handling #3398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@parrt FYI Will hopefuly be ready to review later this week. To make life easier for everyone, we are contributing a port of CEL's C++ UTF-8 decoding and encoding logic. |
@mike-lischke FYI, not ready for review yet. Hopefully later this week. |
1555652
to
aa5b233
Compare
033ec8f
to
fdf6b9f
Compare
@mike-lischke This is ready now. I added unit tests as well that use googletest and hooked up the scripts to run them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love it! Just fix the small casing issues please.
@jcking Not sure if that would show up in the patch: did you also remove the utf8cpp submodule from git? |
There are some problems with tests. Do they related to these changes? |
@mike-lischke I do not see any submodule utf8cpp present. What path is the submodule at? |
@jcking You are right, there's no submodule in git. I didn't check first, but answered from memory. So from this side we should be fine here. @KvanTTT The tests are odd. All those marked as failure actually seem to hang. It's difficult to say what's wrong there. Also Swift tests failed, which are not affected by this patch. The CircleCI tests succeed, however, I think the failures have nothing to do with this patch. @parrt Even though some tests fail, I'm pretty sure this patch is ok. So I'd say we can merge it. |
Historically ANTLRv4 C++ runtime has relied on deprecated facilities provided by the standard library to convert between UTF-8, UTF-16 (Windows only) and UTF-32. More recently https://github.com/nemtrif/utfcpp was added as an option. ANTLR has historically kept external dependencies for its runtimes to a minimum, basically nothing except the languages standard library. This change restores that by implementing full UTF-8 and UTF-32 handling based on a port off CEL's C++ implementation which is itself based on Go's
unicode/utf8
package. UTF-16 support (indirectly viaws2s
ands2ws
) has been dropped as it was only relevant for Windows. On Windows users should useMultiByteToWideChar
andWideCharToMultiByte
.With this change ANTLRv4 C++ is now dependency free, has full UTF-8 and UTF-32 handling, and
ANTLRInputStream
can support invalid UTF-8 strings by replacing illegal bytes with the Unicode replacement character.