-
Notifications
You must be signed in to change notification settings - Fork 383
Match regex implementation of re2j version 1.3 #2407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* This PR ports [re2j](https://github.com/google/re2j/) PR #[79](google/re2j#85) "Add Regexp.hashCode and tests.", dated 2018-10-23, to Scala Native. My thanks and appreciation to the re2j developer @sjamesr for the re2j PR. * Regexp#equals existed in re2j was originally ported to SN but Regexp#hashCode did not. SN refused to compile with this mismatch, so the former was dropped from the port. Now that the pair is available in re2j, both routines has been ported & implemented. * The original scalanative.regex code used the default reference equality in Regexp. equals(). Changing Regexp#equals to use content equality necessitated a change in Simplify.scala to use reference equality in two critical places. * The original re2j code supported only the Perl syntax for named capture groups "(?P<foo?)". Java supports only a slightly different syntax "(?<foo?)". The original port to scalanative changed changed the code to support only the Java syntax. This PR changes Parser.scala to support both idioms. This allows two re2j test cases for equals() and hashCode() to execute and pass. The change was documented, see below. Two java.util.regex Suites were edited to allow the now supported Perl syntax. * The unit-test Suite RegexpHashcodeEqualsTest.scala was added. Documentation: * The standard changelog entry is requested. * A section was added to `docs/lib/javalib.rst` to describe some RE2 Perl expressions now available in `java.util.regex`. These are not strictly Java 8 conformant but are likely to be useful to developers porting Perl regexen. Both the Java and Perl regular expressions for named capture groups are available. Testing: * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on X86_64 only . All tests pass. * A manual trial `sphinx-build -b html ` of javalib.rst was done and the result inspected in a browser.
* This PR ports [re2j](https://github.com/google/re2j/) PR scala-native#49, "Named group support", dated 2018-03-04, commit d0ec5a7cfec67a08735b720a956c92d1440b3789 to Scala Native. My thanks and appreciation to the re2j developer @sjamesr for the re2j PR. * Some named group support was activated/implemented when re2j was originally ported to Scala Native. This PR removes that code in favor to the quote "new" re2j named group support. I believe there is no major functional change. As more fully described below, a few Exceptions and their messages have been changed to conform to Java 8. My hope is that using some close relative of the re2j code will help future devos, myself included, identify and track relevant changes in the re2j repository. Requiescat in Pace, sn.regex named group support, we hardly knew ye! * Deviations of note from re2j code: + Adjusted Exception thrown and message to match Java 8. The re2j message was better in the sense that it gave the name not found, but SN is trying to hew close to Java 8. + Converted from namedGroups from Map[String, Integer] to Map[String, Int] to avoid boxing values which are known to be Ints. + Re-worked the named group methods start(name), end(name), and group(name) to use a common subroutine rather than open coding. This makes it easier to get the lookup & exception handling right and consistent. * unit-test points were ported/added: + sn.regex.MatcherSuite + sn.regex.ParserSuite + java.util.regex.MatcherSuite + java.util.regex.PatternSuite The majority of the testing of both Perl and Java syntax named group handling is done in sn.regex to stay close to the code which actually does the work. Documentation: * The standard changelog entry is requested. Testing: * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on X86_64 only . All tests pass. Add a probabilistic fix to "stack safe" test. This is a micro-edit point fix. A better comprehensive, deterministic fix is being exercised in work-in-progress for a future PR. Prior to this commit, the "stack safe" test could intermittently fail with duplicate groups. This fix increases the size of the string to drastically reduce the chance of a duplicate name.
…e V1.3 * This PR addresses a number of PRs in the parent code [re2j] (https://github.com/google/re2j/). It brings sn.regex up to date with all relevant PRs in the re2j repository dated 2019-08-21, commit 86028a5d722956800c6d6257e713d539e1bdd24d. In particular, it incorporates all relevant PRs included in re2j Release V1.3, dated 2019-07-22, commit 222899bc5e058380935311eba419a722f00b3d69 + This PR is most closely related to re2j PR "Add named capture group support to appendReplacement " dated 2018-03-05, commit 0e87e2a2bb403edb4196698b6707eaa2c524cf96 It looks like some of that work had been done as part of the original port to sn.regex. This PR more closely aligns the work, adding test cases and such. + re2j PR "simplify Matcher.appendReplacement(StringBuffer, String)" dated 2019-03-05 is not relevant because the subject code was eliminated during the port to sn.regex because it was never used. + Other re2j PRs up to and including the top of the stack dated 2019-08-21 were considered and found to be not relevant. * My thanks and appreciation to the re2j developers for the re2j PRs. * Deviations of note from re2j code: + Some Exceptions were changed from IllegalArgumentException to the IllegalStateException used by Java 8. The corresponding messages were changed to match Java 8 as closely as feasible. In many case the re2j message was better because it gave the name of the duplicated group or such. For sn.regex, Java 8 alignment is more important. Some of the indices given by "near index X" are only true but the value of X may not be as good an approximation as on JVM. Similarly, sn.regex may echo a failing group in printStack() as only the near context which failed. Java 8 tends to give the entire string, making it easier to see a duplicate or such. Improvements for the next generation. * unit-test points were ported from re2j PR and additional test points created, particularly for Exception type and messages. Documentation: * The standard changelog entry is requested. Testing: * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on X86_64 only . All tests pass.
Before, the test could spuriously fail if two generated names ended up being duplicates. Now we check if the names repeat with the use of a Set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you rebasing this PRs, I've left some comments, mainly about some minor issues
private def getOrThrow( | ||
_map: Map[String, Int], | ||
key: String, | ||
msg: String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All usages of this method use the same msg: No match found
we can inline it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also all of them use namedGroups
value as _map
argument, we probably can use it directly here. Then we should rename the method to getNamedGroupOrThrow
or something similiar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
msg is being used once differently with "No match available" (to match jvm), so I can't inline that, but I did adjust the rest
@@ -57,6 +59,8 @@ final class Matcher private (private var _pattern: Pattern) { | |||
// By convention a pair (-1, -1) indicates no or null match. | |||
private var _groups: Array[Int] = createGroups(_groupCount) | |||
|
|||
private val namedGroups: Map[String, Int] = _pattern.re2.namedGroups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
namedGroups
in RE2 is a var
. We should define it here as def
to make it's usage safe
@@ -793,6 +782,7 @@ object RE2 { | |||
if (!re2.prefix.isEmpty) { | |||
re2.prefixRune = re2.prefix.codePointAt(0) | |||
} | |||
re2.namedGroups = re.namedGroups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the only place where we set namedGroups
. We can potentially pass re.namedGroups
to the constructor of RE2
and make this field a val
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree, but considering clearly shown effort to stay close to the source (for easier future maintenance) would it be the right choice? This is exactly how it's being done in re2j
@@ -33,13 +34,15 @@ object Simplify { | |||
while (i < re.subs.length) { | |||
val sub = re.subs(i) | |||
val nsub = simplify(sub) | |||
if (nre == re && nsub != sub) { | |||
|
|||
if (nre.eq(re) && nsub != sub) { // object equality (.eq()) required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that reference equality
is the correct term for eq
operator
// Start a copy. | ||
nre = new Regexp(re) // shallow copy | ||
nre.runes = null | ||
nre.subs = Parser.subarray(re.subs, 0, re.subs.length) // clone | ||
} | ||
if (nre != re) { | ||
|
||
if (!nre.eq(re)) { // object inequality (! .eq()) required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of !(_ eq _)
we can use _ ne _
@@ -37,6 +37,19 @@ object BinaryIncompatibilities { | |||
|
|||
final val NativeLib = Seq( | |||
// Internal usage | |||
exclude[DirectMissingMethodProblem]( | |||
"scala.scalanative.regex.RE2.namedCaps" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you use `"scala.scalanative.regex.RE2.namedCaps*" to target both accessors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we can just use exclude[DirectMissingMethodProblem]("scala.scalanative.regex.*")
since probably all methods defined there should be treated as internal
This PR is a rebased set of #1698, #1699 and #1700 PR’s by @LeeTibbert. Collectively, they serve as an update to SN re2 implementation matching re2j version 1.3. This update includes:
Original PR messages mention a fix to a test file that could not be found in the committed sources. Because of that, I added additional commit adding a fix like that and adjusted previous commit messages. Like before, I think it would be best if the commit messages were left intact in the merge to master, preserving their original intent.
Changes to the named groups implementation have also required disabling some mima compatibility checks.