-
Notifications
You must be signed in to change notification settings - Fork 383
Update RE2 regex implementation #2402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
* This PR ports [re2j](https://github.com/google/re2j/) PR #[74](https://github.com/google/re2j/) "Detect invalid Unicode class specification", dated 2018-10-23, to Scala Native. Prior to this PR, compiling the pattern "\\p" would lead to an unexpected java.lang.StringIndexOutOfBoundsException. As of this PR the PatternSyntaxException used by the JVM is thrown. The text of the exception is slightly different. JVM reports the unknown property name, or {?}. Doing that in SN regex is harder that time allowed. * A new ERR_UNKNOWN_CHARACTER_PROPERTY_NAME message was added and and the initial word of several messages converted to Titlecase. Messages on the JVM start with an initial capital letter. Message texts from SN regex (née re2s) still do not match JVM exactly but they are moving closer. * The message text for the invalid specification "\\p{" was also changed. Documentation: * The standard changelog entry is requested. Testing: * Built and tested ("test-all") in debug mode, Java 8, using sbt 1.2.8 on X86_64 only . All tests pass.
WojciechMazur
approved these changes
Oct 5, 2021
unit-tests/native/src/test/scala/scala/scalanative/regex/ParserTest.scala
Outdated
Show resolved
Hide resolved
…m now works * This PR was motivated by Issue scala-native#1673 " regex RE2 wrongly ignores newline in OR groups". The defect was a botched rewrite of the Parser.scala method matchRune() in an attempt to remove multiple `return` statements. The issue is now fixed for the two patterns which would evoke the presenting problem. The patterns "\\n|." and ".|\\n" are both used to indicate a match to ANY_CHARACTER, including newline. Because regex is defined to have a preference for the left term, they execute slightly different code paths. * Two unit-test cases were added to ParserSuite.scala. Each tests one of the two patterns which evoked Issue scala-native#1673. * Details for future maintainers. + The FLAGS argument to testParseDump() in NOMATCHNL_TESTS as 0. That is, all flags are clear. This mean MATCH_NL is clear, which matches Java 8 DOTALL being clear: dot does not match newline. Most of rest of Suite uses FLAGS which match PERL: `TEST_FLAGS = MATCH_NL | PERL_X | UNICODE_GROUPS` This sets dot to match newline and does not evoke Issue scala-native#1673. + Before this PR, Parser.scala with MATCH_NL clear compiled uses of the patterns to "dnl{}"; that is, ROP.ANY_CHAR_NOT_NL. As of this PR, the patterns get compiled to the correct and expected "dnl{}", ROP.ANY_CHAR. See NONMATCHNL_TESTS . Documentation: * The standard changelog entry is requested. Testing: * Built and tested ("test-all") in debug mode using sbt 1.2.8 on X86_64 only . All tests pass.
… & more * This PR ports [re2j](https://github.com/google/re2j/) PR #[85](google/re2j#85) "Don't create Machine while holding this' monitor", dated 2019-02-22, to Scala Native. My thanks and appreciation to the re2j developer @sjamesr for the re2j PR. * Since ScalaNative is currently single threaded, this PR may seem to be not worth the overhead. At the very least, it keeps the scalanative.regex/re2s implementation close to the re2j parent. In the longer run, I do believe that someday, Real Soon Now, a Big Switch will be thrown and not having synchronization issues that could & should have been avoided sprinkled all over the code base will be seen to be a Good Thing. * I do not understand why the upstream re2j changed the data structure from ArrayList to ArrayDeque, then uses it as a Queue. Seems like a slight pessimization, if ArrayDeque is implemented in terms of ArrayList, as it now is in Scala Native. It seemed kinder to future devos, including myself in three months, to follow the re2j change that to maintain lots of comments explaining why re2s/sn.regex and re2j differed. This is especially important because of rounds of re2j PRs I hope to port in the near future. * No unit-tests were added. The re2j PR had no changes associated with testing. Focus testing the behavior on a single threaded ScalaNative would be hard. While it can not be shown that this PR introduced the positive change it describes, it can be shown, to a reasonable degree of certitude, that it did not introduce obvious negative changes, a.k.a. bugs: `test-all` in release-fast mode succeeds with no new performance issues. Documentation: * The standard changelog entry is requested. Testing: * Built and tested ("test-all") in release-fast mode using sbt 1.2.8 on X86_64 only . All tests pass.
This was referenced Oct 5, 2021
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR is a rebased set of 3 of @LeeTibbert's PR's: #1690 #1691 and #1692. This includes:
Original commit/PR messages were left untouched (mostly, as issue #1091 was fixed after the original commits, leading to some roundabout changes no longer being necessary) to preserve their intent and I think it would be best to merge with those messages intact.