Skip to content

Conversation

sjamesr
Copy link
Contributor

@sjamesr sjamesr commented Jun 2, 2020

make_unicode_tables.awk is now UnicodeTablesGenerator

UnicodeTablesGenerator uses Unicode data from ICU4J to generate Unicode
tables for consumption by RE2/J. Output is google-java-formatted before
it is written.

No new runtime dependencies are added to RE2/J.

The generator uses ICU4J 4.8.2 which bundles Unicode 6.0.0. This keeps
it compatible with Java 8, which RE2/J targets. Consideration should be
given for how we might upgrade to later Unicode versions without
introducing inconsistencies (e.g. RE2/J matches something that shouldn't
match according to java.lang.Character data).

There are some differences in the generated tables:

  • the new tables do not contain binary property character ranges (e.g.
    ASCII_Hex_digit), as those tables are currently unused in RE2/J.

  • Cc (control) char class now contains NUL (u+0000), this is correct
    and was also the subject of Fix range of make_Cc #26.

See https://github.com/google/re2j/files/4725343/diff.txt for a full
list of differences between the old tables and the new.

@sjamesr sjamesr force-pushed the rewrite_generate_tables branch 5 times, most recently from 9004e97 to d5101b0 Compare June 2, 2020 16:29
@sjamesr
Copy link
Contributor Author

sjamesr commented Jun 2, 2020

./gradlew :benchmarks:run -- --args='-f 1 -wi 5 -w 1 -r 1 -bm avgt -p impl=RE2J'

re2j @ HEAD:

Benchmark                           (impl)  (regex)  (repeats)  Mode  Cnt     Score     Error  Units
BenchmarkBacktrack.matched            RE2J      N/A          5  avgt    5     0.862 ±   0.008  us/op
BenchmarkBacktrack.matched            RE2J      N/A         10  avgt    5     2.642 ±   0.026  us/op
BenchmarkBacktrack.matched            RE2J      N/A         15  avgt    5     5.723 ±   0.227  us/op
BenchmarkBacktrack.matched            RE2J      N/A         20  avgt    5     9.737 ±   0.150  us/op
BenchmarkCompile.compile              RE2J     DATE        N/A  avgt    5  2728.880 ±  34.549  ns/op
BenchmarkCompile.compile              RE2J    EMAIL        N/A  avgt    5   954.103 ±  62.838  ns/op
BenchmarkCompile.compile              RE2J    PHONE        N/A  avgt    5  1505.841 ±  54.906  ns/op
BenchmarkCompile.compile              RE2J   RANDOM        N/A  avgt    5  5537.989 ±  97.365  ns/op
BenchmarkCompile.compile              RE2J   SOCIAL        N/A  avgt    5  1075.517 ±  32.970  ns/op
BenchmarkCompile.compile              RE2J   STATES        N/A  avgt    5  7749.248 ± 121.829  ns/op
BenchmarkFullMatch.matched            RE2J      N/A        N/A  avgt    5   463.097 ±  12.807  ns/op
BenchmarkFullMatch.notMatched         RE2J      N/A        N/A  avgt    5   398.083 ±   4.268  ns/op
BenchmarkSubMatch.findPhoneNumbers    RE2J      N/A        N/A  avgt    5    12.263 ±   0.149  ms/op

re2j with new tables:

Benchmark                           (impl)  (regex)  (repeats)  Mode  Cnt     Score     Error  Units
BenchmarkBacktrack.matched            RE2J      N/A          5  avgt    5     0.864 ±   0.040  us/op
BenchmarkBacktrack.matched            RE2J      N/A         10  avgt    5     2.616 ±   0.025  us/op
BenchmarkBacktrack.matched            RE2J      N/A         15  avgt    5     5.730 ±   0.217  us/op
BenchmarkBacktrack.matched            RE2J      N/A         20  avgt    5     9.055 ±   0.432  us/op
BenchmarkCompile.compile              RE2J     DATE        N/A  avgt    5  2750.478 ± 106.574  ns/op
BenchmarkCompile.compile              RE2J    EMAIL        N/A  avgt    5   986.257 ±  19.007  ns/op
BenchmarkCompile.compile              RE2J    PHONE        N/A  avgt    5  1533.456 ±  91.302  ns/op
BenchmarkCompile.compile              RE2J   RANDOM        N/A  avgt    5  5601.138 ±  56.878  ns/op
BenchmarkCompile.compile              RE2J   SOCIAL        N/A  avgt    5  1093.514 ±  16.875  ns/op
BenchmarkCompile.compile              RE2J   STATES        N/A  avgt    5  7603.482 ± 153.502  ns/op
BenchmarkFullMatch.matched            RE2J      N/A        N/A  avgt    5   460.424 ±   8.249  ns/op
BenchmarkFullMatch.notMatched         RE2J      N/A        N/A  avgt    5   395.796 ±   5.899  ns/op
BenchmarkSubMatch.findPhoneNumbers    RE2J      N/A        N/A  avgt    5    12.211 ±   0.150  ms/op

No changes outside the margin of error of measurement.

Copy link
Contributor

@alandonovan alandonovan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you summarize (in the CL description) any significant differences in the generated tables?

import com.google.common.collect.TreeMultimap;
import com.google.googlejavaformat.java.Formatter;
import com.google.googlejavaformat.java.FormatterException;
import com.ibm.icu.impl.UPropertyAliases;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the icu dependency necessary? What does Character.getType not provide?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any case folding information provided in java.lang.Character, but maybe I'm wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Character.toUpperCaseEx will case-fold a single code point. (Of course, it doesn't properly normalize because that's a function from strings to strings.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, one learns something every day!

I also couldn't find a source for unicode category names (e.g. 15 -> "Cc") in Java.

Copy link
Contributor

@alandonovan alandonovan Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Character.UnicodeBlock:
https://docs.oracle.com/javase/8/docs/api/java/lang/Character.UnicodeBlock.html
It provides this method Character.UnicodeBlock of(int codepoint), which is odd because a code point may belong to many blocks. But I think the API is sufficient for your needs: RE2/J should provide the list of block names from its spec, and call UnicodeBlock.fromName(name) on each one to build the range table.

Two advantages of this approach are that (1) it is likely to be more consistent with java.util.regexp, and (2), freed from external dependencies, you could generate the table dynamically on first use instead of at build time. But I would first measure how long it takes before you do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generating the tables at startup would be excellent, it would solve the problem of mismatched RE2/J <-> JDK data.

I'll modify the impl to remove the ICU4J dependency. Thanks for the pointer.

sjamesr added 2 commits June 3, 2020 11:09
UnicodeTablesGenerator uses Unicode data from ICU4J to generate Unicode
tables for consumption by RE2/J. Output is google-java-formatted before
it is written.

No new runtime dependencies are added to RE2/J.

The generator uses ICU4J 4.8.2 which bundles Unicode 6.0.0. This keeps
it compatible with Java 8, which RE2/J targets. Consideration should be
given for how we might upgrade to later Unicode versions without
introducing inconsistencies (e.g. RE2/J matches something that shouldn't
match according to java.lang.Character data).

There are some differences in the generated tables:
  * the new tables do not contain binary property character ranges (e.g.
    ASCII_Hex_digit), as those tables are currently unused in RE2/J.

  * Cc (control) char class now contains NUL (u+0000), this is correct
    and was also the subject of google#26.

See https://github.com/google/re2j/files/4725343/diff.txt for a full
list of differences between the old tables and the new.
@sjamesr sjamesr force-pushed the rewrite_generate_tables branch from 77bb164 to b00ee05 Compare June 3, 2020 18:15
@sjamesr
Copy link
Contributor Author

sjamesr commented Jun 3, 2020

Done, please see the updated change description and the full diff: diff.txt

@alandonovan
Copy link
Contributor

Done, please see the updated change description and the full diff: diff.txt

Thanks. I looked over the list and it seems to be in all cases a more faithful approximation to the set of \p{...} Unicode classes documented here: https://github.com/google/re2/wiki/Syntax
(which suggests the Go implementation is too lenient).

@sjamesr
Copy link
Contributor Author

sjamesr commented Jun 3, 2020

I'm going to check this in as-is. I will work on removing the ICU4J dependency, as well as improving performance. Ideally, this could all be moved to an init step inside RE2/J, which would mean Unicode tables that always matched the running JDK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants