-
Notifications
You must be signed in to change notification settings - Fork 164
Rewrite make_unicode_tables.awk in Java. #104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9004e97
to
d5101b0
Compare
re2j @ HEAD:
re2j with new tables:
No changes outside the margin of error of measurement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you summarize (in the CL description) any significant differences in the generated tables?
import com.google.common.collect.TreeMultimap; | ||
import com.google.googlejavaformat.java.Formatter; | ||
import com.google.googlejavaformat.java.FormatterException; | ||
import com.ibm.icu.impl.UPropertyAliases; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the icu dependency necessary? What does Character.getType not provide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any case folding information provided in java.lang.Character, but maybe I'm wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Character.toUpperCaseEx will case-fold a single code point. (Of course, it doesn't properly normalize because that's a function from strings to strings.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, one learns something every day!
I also couldn't find a source for unicode category names (e.g. 15 -> "Cc") in Java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See Character.UnicodeBlock:
https://docs.oracle.com/javase/8/docs/api/java/lang/Character.UnicodeBlock.html
It provides this method Character.UnicodeBlock of(int codepoint)
, which is odd because a code point may belong to many blocks. But I think the API is sufficient for your needs: RE2/J should provide the list of block names from its spec, and call UnicodeBlock.fromName(name)
on each one to build the range table.
Two advantages of this approach are that (1) it is likely to be more consistent with java.util.regexp, and (2), freed from external dependencies, you could generate the table dynamically on first use instead of at build time. But I would first measure how long it takes before you do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generating the tables at startup would be excellent, it would solve the problem of mismatched RE2/J <-> JDK data.
I'll modify the impl to remove the ICU4J dependency. Thanks for the pointer.
UnicodeTablesGenerator uses Unicode data from ICU4J to generate Unicode tables for consumption by RE2/J. Output is google-java-formatted before it is written. No new runtime dependencies are added to RE2/J. The generator uses ICU4J 4.8.2 which bundles Unicode 6.0.0. This keeps it compatible with Java 8, which RE2/J targets. Consideration should be given for how we might upgrade to later Unicode versions without introducing inconsistencies (e.g. RE2/J matches something that shouldn't match according to java.lang.Character data). There are some differences in the generated tables: * the new tables do not contain binary property character ranges (e.g. ASCII_Hex_digit), as those tables are currently unused in RE2/J. * Cc (control) char class now contains NUL (u+0000), this is correct and was also the subject of google#26. See https://github.com/google/re2j/files/4725343/diff.txt for a full list of differences between the old tables and the new.
77bb164
to
b00ee05
Compare
Done, please see the updated change description and the full diff: diff.txt |
Thanks. I looked over the list and it seems to be in all cases a more faithful approximation to the set of \p{...} Unicode classes documented here: https://github.com/google/re2/wiki/Syntax |
I'm going to check this in as-is. I will work on removing the ICU4J dependency, as well as improving performance. Ideally, this could all be moved to an init step inside RE2/J, which would mean Unicode tables that always matched the running JDK. |
make_unicode_tables.awk is now UnicodeTablesGenerator
UnicodeTablesGenerator uses Unicode data from ICU4J to generate Unicode
tables for consumption by RE2/J. Output is google-java-formatted before
it is written.
No new runtime dependencies are added to RE2/J.
The generator uses ICU4J 4.8.2 which bundles Unicode 6.0.0. This keeps
it compatible with Java 8, which RE2/J targets. Consideration should be
given for how we might upgrade to later Unicode versions without
introducing inconsistencies (e.g. RE2/J matches something that shouldn't
match according to java.lang.Character data).
There are some differences in the generated tables:
the new tables do not contain binary property character ranges (e.g.
ASCII_Hex_digit), as those tables are currently unused in RE2/J.
Cc (control) char class now contains NUL (u+0000), this is correct
and was also the subject of Fix range of make_Cc #26.
See https://github.com/google/re2j/files/4725343/diff.txt for a full
list of differences between the old tables and the new.