Skip to content

named captures in sn.regex appendReplacement; parity with re2j release V1.3 #1700

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

LeeTibbert
Copy link
Contributor

This PR requires earlier PR #1699 & earlier. Travis CI will fail until those are merged.

  • This PR addresses a number of PRs in the parent code [re2j]
    (https://github.com/google/re2j/). It brings sn.regex up to date
    with all relevant PRs in the re2j repository dated 2019-08-21,
    commit 86028a5d722956800c6d6257e713d539e1bdd24d.

    In particular, it incorporates all relevant PRs included in
    re2j Release V1.3, dated 2019-07-22, commit
    222899bc5e058380935311eba419a722f00b3d69

    • This PR is most closely related to re2j PR "Add named capture group
      support to appendReplacement " dated 2018-03-05,
      commit 0e87e2a2bb403edb4196698b6707eaa2c524cf96

    • It looks like some of that work had been done as part of the
      original port to sn.regex. This PR more closely aligns
      the work, adding test cases and such.

    • re2j PR "simplify Matcher.appendReplacement(StringBuffer, String)"
      dated 2019-03-05 is not relevant because the subject code
      was eliminated during the port to sn.regex because it was never
      used.

    • Other re2j PRs up to and including the top of the stack dated
      2019-08-21 were considered and found to be not relevant.

  • My thanks and appreciation to the re2j developers for the
    re2j PRs.

  • Deviations of note from re2j code:

    • Some Exceptions were changed from IllegalArgumentException
      to the IllegalStateException used by Java 8. The corresponding
      messages were changed to match Java 8 as closely as feasible.
      In many case the re2j message was better because it gave the
      name of the duplicated group or such. For sn.regex, Java 8
      alignment is more important.

      Some of the indices given by "near index X" are only true
      but the value of X may not be as good an approximation
      as on JVM. Similarly, sn.regex may echo a failing group
      in printStack() as only the near context which failed.
      Java 8 tends to give the entire string, making it easier to
      see a duplicate or such. Improvements for the next generation.

  • unit-test points were ported from re2j PR and additional test points
    created, particularly for Exception type and messages.

  • The code used by the "stack safe" test in NamedGroupSuite.scala
    was made more robust & more correct. A probabilistic fix had been added
    to PR Improved sn.regex named group support. #1699 to deal with one of two Heisenbugs. This PR makes
    a deterministic fixes to both. See file for details. Heisenbugs,
    especially in automated test code, are a massive pain.

    Normally in a test Suite, Random would be initialized with a seed
    so that the results would be reproducible, run over run. This
    was not done in NamedGroupSuite so that more group names would
    be tested over the long run. Yes, this can lead to Heisenbugs.
    A trade-off.

    I chose this approach because I want to know if there is a bad
    case lurking.

Documentation:

  • The standard changelog entry is requested.

Testing:

  • Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on
    X86_64 only . All tests pass.

…e V1.3

#### This PR requires earlier PR scala-native#1699 & earlier. Travis CI will fail until
those are merged.

  * This PR addresses a number of PRs in the parent code [re2j]
    (https://github.com/google/re2j/). It brings sn.regex up to date
    with all relevant PRs in the re2j repository dated 2019-08-21,
    commit 86028a5d722956800c6d6257e713d539e1bdd24d.

    In particular, it incorporates all relevant PRs included in
    re2j Release V1.3, dated 2019-07-22, commit
    222899bc5e058380935311eba419a722f00b3d69

    + This PR is most closely related to re2j PR "Add named capture group
      support to appendReplacement " dated 2018-03-05,
      commit 0e87e2a2bb403edb4196698b6707eaa2c524cf96

      It looks like some of that work had been done as part of the
      original port to sn.regex.  This PR more closely aligns
      the work, adding test cases and such.

   + re2j PR "simplify Matcher.appendReplacement(StringBuffer, String)"
     dated 2019-03-05 is not relevant because the subject code
     was eliminated during the port to sn.regex because it was never
     used.

  + Other re2j PRs up to and including the top of the stack dated
    2019-08-21 were considered and found to be not relevant.

  * My thanks and appreciation to the re2j developers for the
    re2j PRs.

 * Deviations of note from re2j code:

      + Some Exceptions were changed from IllegalArgumentException
        to the IllegalStateException used by Java 8.  The corresponding
	messages were changed to match Java 8 as closely as feasible.
	In many case the re2j message was better because it gave the
	name of the duplicated group or such. For sn.regex, Java 8
	alignment is more important.

        Some of the indices given by "near index X" are only true
	but the value of X may not be as good an approximation
	as on JVM. Similarly, sn.regex may echo a failing group
	in printStack() as only the near context which failed.
	Java 8 tends to give the entire string, making it easier to
	see a duplicate or such. Improvements for the next generation.

  * unit-test points were ported from re2j PR and additional test points
    created, particularly for Exception type and messages.

  * The code used by the "stack safe" test in NamedGroupSuite.scala
    was made more robust & more correct.  A probabilistic fix had been added
    to PR scala-native#1699 to deal with one of two Heisenbugs. This PR makes
    a deterministic fixes to both. See file for details. Heisenbugs,
    especially in automated test code, are a massive pain.

    Normally in a test Suite, Random would be initialized with a seed
    so that the results would be reproducible, run over run.  This
    was not done in NamedGroupSuite so that more group names would
    be tested over the long run.  Yes, this can lead to Heisenbugs.
    A trade-off.

    I chose this approach because I want to know if there is a bad
    case lurking.

Documentation:

  * The standard changelog entry is requested.

Testing:

  * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on
    X86_64 only . All tests pass.
LeeTibbert added a commit to LeeTibbert/scala-native-fork that referenced this pull request Sep 3, 2019
##### This PR requires earlier PR scala-native#1700. That PR has its own set of
    prerequisites. Travis CI will fail until those are merged.

  * This PR addresses open issue 93, "Incorrect match found when capturing
    groups are not used ", dated 2019-08-13, in the
    [re2j](https://github.com/google/re2j/) GitHub repository.

    "a.*?c|a.*?b" is an example which evokes the reported failure.
    See the re2j issue for before/after details. MatcherSuite.scala
    has a test point using that string giving the exact after details.

    My thanks and appreciation to the @EricEdens for reporting the
    re2j Issue and supplying a URL to the fix in the upstream Go
    code.

    Thanks also to golang/go developers @junyer and @rsc for the
    [Go](https://github.com/golang/go/) PR "regexp/syntax: fix factoring
    of common prefixes in alternations", dated 2016-01-07,
    commit 5ccaf0255b75063a9c685009e77cee24e26a509e.

    This PR ports the Go code and fixes the reported issue reported
    in re2j.

 * I believe there are no additional intellectual properties or license issues
   introduced by porting directly from Go. sn.regex code already carries
   the license of the upstream Go code.

 * unit-test sn.regex MatcherSuite.scala notes:

      + The code for this PR allowed several tests previously commented out in
        re2j and carried over to sn to execute and pass.  One test
        remains commented out because it continues to fail for reasons
        needing to be investigated.

      + I synchronized the section marked "RE2 prefix_tests" with
        the latest [Go tests]
 (https://github.com/golang/go/blob/master/src/regexp/syntax/parse_test.go)

      + I deleted a commented out test that was commented out in re2j and
        does not exist in current Go.

      + It looks like sn.regex and re2j are both missing a fix to
        tested in the "Valid repetitions" section of the latest Go tests.

Documentation:

  * The standard changelog entry is requested.

Testing:

  * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on
    X86_64 only . All tests pass.
LeeTibbert added a commit to LeeTibbert/scala-native-fork that referenced this pull request Sep 3, 2019
…own set of prerequisites. Travis CI will fail until those are merged.

  * This PR addresses open issue 93, "Incorrect match found when capturing
    groups are not used ", dated 2019-08-13, in the
    [re2j](https://github.com/google/re2j/) GitHub repository.

    "a.*?c|a.*?b" is an example which evokes the reported failure.
    See the re2j issue for before/after details. MatcherSuite.scala
    has a test point using that string giving the exact after details.

    My thanks and appreciation to the @EricEdens for reporting the
    re2j Issue and supplying a URL to the fix in the upstream Go
    code.

    Thanks also to golang/go developers @junyer and @rsc for the
    [Go](https://github.com/golang/go/) PR "regexp/syntax: fix factoring
    of common prefixes in alternations", dated 2016-01-07,
    commit 5ccaf0255b75063a9c685009e77cee24e26a509e.

    This PR ports the Go code and fixes the reported issue reported
    in re2j.

 * I believe there are no additional intellectual properties or license issues
   introduced by porting directly from Go. sn.regex code already carries
   the license of the upstream Go code.

 * unit-test sn.regex MatcherSuite.scala notes:

      + The code for this PR allowed several tests previously commented out in
        re2j and carried over to sn to execute and pass.  One test
        remains commented out because it continues to fail for reasons
        needing to be investigated.

      + I synchronized the section marked "RE2 prefix_tests" with
        the latest [Go tests](https://github.com/golang/go/blob/master/src/regexp/syntax/parse_test.go)

      + I deleted a commented out test that was commented out in re2j and
        does not exist in current Go.

      + It looks like sn.regex and re2j are both missing a fix  tested in the
       "Valid repetitions" section of the latest Go tests. That defect is unrelated
       to this PR beyond the fact that it was discovered whilst preparing this PR.

Documentation:

  * The standard changelog entry is requested.

Testing:

  * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on
    X86_64 only . All tests pass.
LeeTibbert added a commit to LeeTibbert/scala-native-fork that referenced this pull request Sep 3, 2019
##### This PR requires earlier PR scala-native#1700. That PR has its own set of prerequisites. Travis CI will fail until those are merged.

  * This PR addresses open issue 93, "Incorrect match found when capturing
    groups are not used ", dated 2019-08-13, in the
    [re2j](https://github.com/google/re2j/) GitHub repository.

    "a.*?c|a.*?b" is an example which evokes the reported failure.
    See the re2j issue for before/after details. MatcherSuite.scala
    has a test point using that string giving the exact after details.

    My thanks and appreciation to the @EricEdens for reporting the
    re2j Issue and supplying a URL to the fix in the upstream Go
    code.

    Thanks also to golang/go developers @junyer and @rsc for the
    [Go](https://github.com/golang/go/) PR "regexp/syntax: fix factoring
    of common prefixes in alternations", dated 2016-01-07,
    commit 5ccaf0255b75063a9c685009e77cee24e26a509e.

    This PR ports the Go code and fixes the reported issue reported
    in re2j.

 * I believe there are no additional intellectual properties or license issues
   introduced by porting directly from Go. sn.regex code already carries
   the license of the upstream Go code.

 * unit-test sn.regex MatcherSuite.scala notes:

      + The code for this PR allowed several tests previously commented out in
        re2j and carried over to sn to execute and pass.  One test
        remains commented out because it continues to fail for reasons
        needing to be investigated.

      + I synchronized the section marked "RE2 prefix_tests" with
        the latest [Go tests](https://github.com/golang/go/blob/master/src/regexp/syntax/parse_test.go)

      + I deleted a commented out test that was commented out in re2j and
        does not exist in current Go.

      + It looks like sn.regex and re2j are both missing a fix  tested in the
       "Valid repetitions" section of the latest Go tests. That defect is unrelated
       to this PR beyond the fact that it was discovered whilst preparing this PR.

Documentation:

  * The standard changelog entry is requested.

Testing:

  * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on
    X86_64 only . All tests pass.
@WojciechMazur
Copy link
Contributor

Rebased and merged in #2407, thank you for providing the original solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants