Skip to content

Update RE2 regex implementation #2402

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 5, 2021

Conversation

jchyb
Copy link
Contributor

@jchyb jchyb commented Oct 5, 2021

This PR is a rebased set of 3 of @LeeTibbert's PR's: #1690 #1691 and #1692. This includes:

  • Fixing detection of an invalid unique specification "\p"
  • Fixing the ANY_CHARACTER idiom
  • Mutual exclusion improvements
  • General updates to keep our implementation compatible with the original re2j one, for simplified maintenance

Original commit/PR messages were left untouched (mostly, as issue #1091 was fixed after the original commits, leading to some roundabout changes no longer being necessary) to preserve their intent and I think it would be best to merge with those messages intact.

  * This PR ports [re2j](https://github.com/google/re2j/) PR
    #[74](https://github.com/google/re2j/)
    "Detect invalid Unicode class specification", dated 2018-10-23, to
    Scala Native.

    Prior to this PR, compiling the pattern "\\p" would lead to an
    unexpected java.lang.StringIndexOutOfBoundsException.
    As of this PR the PatternSyntaxException used by the JVM is thrown.

    The text of the exception is slightly different. JVM reports the unknown
    property name, or {?}. Doing that in SN regex is harder that time allowed.

  * A new ERR_UNKNOWN_CHARACTER_PROPERTY_NAME message was added and
    and the initial word of several messages converted to Titlecase.
    Messages on the JVM start with an initial capital letter.

    Message texts from SN regex (née re2s) still do not match JVM
    exactly but they are moving closer.

  * The message text for the invalid specification "\\p{" was
    also changed.

Documentation:

    * The standard changelog entry is requested.

Testing:

  * Built and tested ("test-all") in debug mode, Java 8, using sbt 1.2.8 on
    X86_64 only . All tests pass.
@jchyb jchyb changed the title RE2 regex updates Update RE2 regex implementation Oct 5, 2021
LeeTibbert and others added 2 commits October 5, 2021 10:49
…m now works

  * This PR was motivated by Issue scala-native#1673 " regex RE2 wrongly ignores
    newline in OR groups".

    The defect was a botched rewrite of the Parser.scala method
    matchRune() in an attempt to remove multiple `return` statements.

    The issue is now fixed for the two patterns which would evoke
    the presenting problem.  The patterns "\\n|." and ".|\\n"
    are both used to indicate a match to ANY_CHARACTER, including
    newline. Because regex is defined to have a preference for the
    left term, they execute slightly different code paths.

  * Two unit-test cases were added to ParserSuite.scala. Each tests
    one of the two patterns which evoked Issue scala-native#1673.

  * Details for future maintainers.

    + The FLAGS argument to testParseDump() in NOMATCHNL_TESTS as 0.
    That is, all flags are clear. This mean MATCH_NL is clear, which
    matches Java 8 DOTALL being clear: dot does not match newline.

    Most of rest of Suite uses FLAGS which match PERL:
   `TEST_FLAGS = MATCH_NL | PERL_X | UNICODE_GROUPS`
    This sets dot to match newline and does not evoke Issue scala-native#1673.

    + Before this PR, Parser.scala with MATCH_NL clear compiled uses of
      the patterns to "dnl{}"; that is, ROP.ANY_CHAR_NOT_NL.
      As of this PR, the patterns get compiled to the correct and
      expected "dnl{}", ROP.ANY_CHAR. See NONMATCHNL_TESTS .

Documentation:

    * The standard changelog entry is requested.

Testing:

  * Built and tested ("test-all") in debug mode using sbt 1.2.8 on
    X86_64 only . All tests pass.
… & more

  * This PR ports [re2j](https://github.com/google/re2j/) PR
    #[85](google/re2j#85)
    "Don't create Machine while holding this' monitor", dated 2019-02-22,
    to Scala Native.

    My thanks and appreciation to the re2j developer @sjamesr for the
    re2j PR.

  * Since ScalaNative is currently single threaded, this PR may seem
    to be not worth the overhead.  At the very least, it keeps the
    scalanative.regex/re2s implementation close to the re2j parent.

    In the longer run, I do believe that someday, Real Soon Now, a
    Big Switch will be thrown and not having synchronization issues
    that could & should have been avoided sprinkled all over the
    code base will be seen to be a Good Thing.

  * I do not understand why the upstream re2j changed the data structure
    from ArrayList to ArrayDeque, then uses it as a Queue. Seems like a
    slight pessimization, if ArrayDeque is implemented in terms of
    ArrayList, as it now is in Scala Native.

    It seemed kinder to future devos, including myself in three months,
    to follow the re2j change that to maintain lots of comments
    explaining why re2s/sn.regex and re2j differed. This is especially
    important because of rounds of re2j PRs I hope to port in the near
    future.

  * No unit-tests were added. The re2j PR had no changes associated
    with testing. Focus testing the behavior on a single threaded
    ScalaNative would be hard.

    While it can not be shown that this PR introduced the positive
    change it describes, it can be shown, to a reasonable degree of
    certitude, that it did not introduce obvious negative changes,
    a.k.a. bugs: `test-all` in release-fast mode succeeds with no new
    performance issues.

Documentation:

  * The standard changelog entry is requested.

Testing:

  * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on
    X86_64 only . All tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants