Rewrite make_unicode_tables.awk in Java. #104

sjamesr · 2020-06-02T04:10:28Z

make_unicode_tables.awk is now UnicodeTablesGenerator

UnicodeTablesGenerator uses Unicode data from ICU4J to generate Unicode
tables for consumption by RE2/J. Output is google-java-formatted before
it is written.

No new runtime dependencies are added to RE2/J.

The generator uses ICU4J 4.8.2 which bundles Unicode 6.0.0. This keeps
it compatible with Java 8, which RE2/J targets. Consideration should be
given for how we might upgrade to later Unicode versions without
introducing inconsistencies (e.g. RE2/J matches something that shouldn't
match according to java.lang.Character data).

There are some differences in the generated tables:

the new tables do not contain binary property character ranges (e.g.
ASCII_Hex_digit), as those tables are currently unused in RE2/J.
Cc (control) char class now contains NUL (u+0000), this is correct
and was also the subject of Fix range of make_Cc #26.

See https://github.com/google/re2j/files/4725343/diff.txt for a full
list of differences between the old tables and the new.

sjamesr · 2020-06-02T16:39:41Z

./gradlew :benchmarks:run -- --args='-f 1 -wi 5 -w 1 -r 1 -bm avgt -p impl=RE2J'

re2j @ HEAD:

Benchmark                           (impl)  (regex)  (repeats)  Mode  Cnt     Score     Error  Units
BenchmarkBacktrack.matched            RE2J      N/A          5  avgt    5     0.862 ±   0.008  us/op
BenchmarkBacktrack.matched            RE2J      N/A         10  avgt    5     2.642 ±   0.026  us/op
BenchmarkBacktrack.matched            RE2J      N/A         15  avgt    5     5.723 ±   0.227  us/op
BenchmarkBacktrack.matched            RE2J      N/A         20  avgt    5     9.737 ±   0.150  us/op
BenchmarkCompile.compile              RE2J     DATE        N/A  avgt    5  2728.880 ±  34.549  ns/op
BenchmarkCompile.compile              RE2J    EMAIL        N/A  avgt    5   954.103 ±  62.838  ns/op
BenchmarkCompile.compile              RE2J    PHONE        N/A  avgt    5  1505.841 ±  54.906  ns/op
BenchmarkCompile.compile              RE2J   RANDOM        N/A  avgt    5  5537.989 ±  97.365  ns/op
BenchmarkCompile.compile              RE2J   SOCIAL        N/A  avgt    5  1075.517 ±  32.970  ns/op
BenchmarkCompile.compile              RE2J   STATES        N/A  avgt    5  7749.248 ± 121.829  ns/op
BenchmarkFullMatch.matched            RE2J      N/A        N/A  avgt    5   463.097 ±  12.807  ns/op
BenchmarkFullMatch.notMatched         RE2J      N/A        N/A  avgt    5   398.083 ±   4.268  ns/op
BenchmarkSubMatch.findPhoneNumbers    RE2J      N/A        N/A  avgt    5    12.263 ±   0.149  ms/op

re2j with new tables:

Benchmark                           (impl)  (regex)  (repeats)  Mode  Cnt     Score     Error  Units
BenchmarkBacktrack.matched            RE2J      N/A          5  avgt    5     0.864 ±   0.040  us/op
BenchmarkBacktrack.matched            RE2J      N/A         10  avgt    5     2.616 ±   0.025  us/op
BenchmarkBacktrack.matched            RE2J      N/A         15  avgt    5     5.730 ±   0.217  us/op
BenchmarkBacktrack.matched            RE2J      N/A         20  avgt    5     9.055 ±   0.432  us/op
BenchmarkCompile.compile              RE2J     DATE        N/A  avgt    5  2750.478 ± 106.574  ns/op
BenchmarkCompile.compile              RE2J    EMAIL        N/A  avgt    5   986.257 ±  19.007  ns/op
BenchmarkCompile.compile              RE2J    PHONE        N/A  avgt    5  1533.456 ±  91.302  ns/op
BenchmarkCompile.compile              RE2J   RANDOM        N/A  avgt    5  5601.138 ±  56.878  ns/op
BenchmarkCompile.compile              RE2J   SOCIAL        N/A  avgt    5  1093.514 ±  16.875  ns/op
BenchmarkCompile.compile              RE2J   STATES        N/A  avgt    5  7603.482 ± 153.502  ns/op
BenchmarkFullMatch.matched            RE2J      N/A        N/A  avgt    5   460.424 ±   8.249  ns/op
BenchmarkFullMatch.notMatched         RE2J      N/A        N/A  avgt    5   395.796 ±   5.899  ns/op
BenchmarkSubMatch.findPhoneNumbers    RE2J      N/A        N/A  avgt    5    12.211 ±   0.150  ms/op

No changes outside the margin of error of measurement.

alandonovan

Can you summarize (in the CL description) any significant differences in the generated tables?

alandonovan · 2020-06-03T15:02:12Z

unicode/src/main/java/com/google/re2j/UnicodeTablesGenerator.java

+import com.google.common.collect.TreeMultimap;
+import com.google.googlejavaformat.java.Formatter;
+import com.google.googlejavaformat.java.FormatterException;
+import com.ibm.icu.impl.UPropertyAliases;


Is the icu dependency necessary? What does Character.getType not provide?

I don't think there is any case folding information provided in java.lang.Character, but maybe I'm wrong.

Character.toUpperCaseEx will case-fold a single code point. (Of course, it doesn't properly normalize because that's a function from strings to strings.)

Thank you, one learns something every day!

I also couldn't find a source for unicode category names (e.g. 15 -> "Cc") in Java.

See Character.UnicodeBlock:
https://docs.oracle.com/javase/8/docs/api/java/lang/Character.UnicodeBlock.html
It provides this method Character.UnicodeBlock of(int codepoint), which is odd because a code point may belong to many blocks. But I think the API is sufficient for your needs: RE2/J should provide the list of block names from its spec, and call UnicodeBlock.fromName(name) on each one to build the range table.

Two advantages of this approach are that (1) it is likely to be more consistent with java.util.regexp, and (2), freed from external dependencies, you could generate the table dynamically on first use instead of at build time. But I would first measure how long it takes before you do that.

Generating the tables at startup would be excellent, it would solve the problem of mismatched RE2/J <-> JDK data.

I'll modify the impl to remove the ICU4J dependency. Thanks for the pointer.

UnicodeTablesGenerator uses Unicode data from ICU4J to generate Unicode tables for consumption by RE2/J. Output is google-java-formatted before it is written. No new runtime dependencies are added to RE2/J. The generator uses ICU4J 4.8.2 which bundles Unicode 6.0.0. This keeps it compatible with Java 8, which RE2/J targets. Consideration should be given for how we might upgrade to later Unicode versions without introducing inconsistencies (e.g. RE2/J matches something that shouldn't match according to java.lang.Character data). There are some differences in the generated tables: * the new tables do not contain binary property character ranges (e.g. ASCII_Hex_digit), as those tables are currently unused in RE2/J. * Cc (control) char class now contains NUL (u+0000), this is correct and was also the subject of google#26. See https://github.com/google/re2j/files/4725343/diff.txt for a full list of differences between the old tables and the new.

sjamesr · 2020-06-03T18:15:53Z

Done, please see the updated change description and the full diff: diff.txt

alandonovan · 2020-06-03T18:31:21Z

Done, please see the updated change description and the full diff: diff.txt

Thanks. I looked over the list and it seems to be in all cases a more faithful approximation to the set of \p{...} Unicode classes documented here: https://github.com/google/re2/wiki/Syntax
(which suggests the Go implementation is too lenient).

sjamesr · 2020-06-03T22:15:57Z

I'm going to check this in as-is. I will work on removing the ICU4J dependency, as well as improving performance. Ideally, this could all be moved to an init step inside RE2/J, which would mean Unicode tables that always matched the running JDK.

googlebot added the cla: yes label Jun 2, 2020

sjamesr force-pushed the rewrite_generate_tables branch 5 times, most recently from 9004e97 to d5101b0 Compare June 2, 2020 16:29

alandonovan reviewed Jun 3, 2020

View reviewed changes

sjamesr added 2 commits June 3, 2020 11:09

regenerate tables

b00ee05

sjamesr force-pushed the rewrite_generate_tables branch from 77bb164 to b00ee05 Compare June 3, 2020 18:15

Merge branch 'master' into rewrite_generate_tables

79defcd

sjamesr merged commit aba633e into google:master Jun 3, 2020

mszabo-wikia mentioned this pull request Jul 19, 2022

Reduce the incidence of infinite loops while case folding #160

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rewrite make_unicode_tables.awk in Java. #104

Rewrite make_unicode_tables.awk in Java. #104

Uh oh!

sjamesr commented Jun 2, 2020 •

edited

Loading

Uh oh!

sjamesr commented Jun 2, 2020

Uh oh!

alandonovan left a comment

Uh oh!

alandonovan Jun 3, 2020

Uh oh!

sjamesr Jun 3, 2020

Uh oh!

alandonovan Jun 3, 2020

Uh oh!

sjamesr Jun 3, 2020

Uh oh!

alandonovan Jun 3, 2020 •

edited

Loading

Uh oh!

sjamesr Jun 3, 2020

Uh oh!

sjamesr commented Jun 3, 2020

Uh oh!

alandonovan commented Jun 3, 2020

Uh oh!

sjamesr commented Jun 3, 2020

Uh oh!

Uh oh!

Rewrite make_unicode_tables.awk in Java. #104

Rewrite make_unicode_tables.awk in Java. #104

Uh oh!

Conversation

sjamesr commented Jun 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sjamesr commented Jun 2, 2020

Uh oh!

alandonovan left a comment

Choose a reason for hiding this comment

Uh oh!

alandonovan Jun 3, 2020

Choose a reason for hiding this comment

Uh oh!

sjamesr Jun 3, 2020

Choose a reason for hiding this comment

Uh oh!

alandonovan Jun 3, 2020

Choose a reason for hiding this comment

Uh oh!

sjamesr Jun 3, 2020

Choose a reason for hiding this comment

Uh oh!

alandonovan Jun 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sjamesr Jun 3, 2020

Choose a reason for hiding this comment

Uh oh!

sjamesr commented Jun 3, 2020

Uh oh!

alandonovan commented Jun 3, 2020

Uh oh!

sjamesr commented Jun 3, 2020

Uh oh!

Uh oh!

sjamesr commented Jun 2, 2020 •

edited

Loading

alandonovan Jun 3, 2020 •

edited

Loading