Skip to content

feat: remove unnecessary character class from 933151 #4135

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 17, 2025

Conversation

TimDiam0nd
Copy link
Contributor

For the regex used in 933151, the regex in the first chained rule contains a character class at the end:
\b([^\s]+)\s*[(]
However, removing the character class is better for performance (and arguably more readable):
\b([^\s]+)\s*\(

Copy link
Contributor

github-actions bot commented May 16, 2025

📊 Quantitative test results for language: eng, year: 2023, size: 10K, paranoia level: 1:
🚀 Quantitative testing did not detect new false positives

EsadCetiner
EsadCetiner previously approved these changes May 16, 2025
@theseion
Copy link
Contributor

I'm curious why you think this would have an impact on performance. Did you measure any difference?

@airween
Copy link
Contributor

airween commented May 17, 2025

I'm curious why you think this would have an impact on performance. Did you measure any difference?

I'm not sure it's measurable, I mean the difference is not significant at all. I tried that with msc_retest on these ways:

$ cat r933151_old.txt 
\b([^\s]+)\s*[(]

$ cat r933151_new.txt 
\b([^\s]+)\s*\(

Please note that there is no EOL at the end of lines!

Then here are the commands:

echo -n "array_diff (" | src/pcre4msc2 -j -n 100000 r933151_old.txt 
echo -n "array_diff (" | src/pcre4msc2 -j -n 100000 r933151_new.txt 
echo -n "array_diff (" | src/pcre4msc3 -n 100000 r933151_old.txt 
echo -n "array_diff (" | src/pcre4msc3 -n 100000 r933151_new.txt 

These runs test with v2's and v3's regex engine with old and new patterns against the mentioned rule's 1st regression test. In case of v2 the regex engine (PCRE2 is the default) uses JIT. v3's regex code uses JIT by default.

The results are below:

v2/old pattern

Num of values: 100000
         Mean: 000.000000352
       Median: 000.000000324
          Min: 000.000000171
          Max: 000.000132326
        Range: 000.000132155
Std deviation: 000.000000519

v2/new pattern

Num of values: 100000
         Mean: 000.000000435
       Median: 000.000000347
          Min: 000.000000169
          Max: 000.000058875
        Range: 000.000058706
Std deviation: 000.000000507

v3/old pattern

Num of values: 100000
         Mean: 000.000000230
       Median: 000.000000218
          Min: 000.000000111
          Max: 000.000051658
        Range: 000.000051547
Std deviation: 000.000000281

v3/new pattern

Num of values: 100000
         Mean: 000.000000227
       Median: 000.000000213
          Min: 000.000000112
          Max: 000.000026295
        Range: 000.000026183
Std deviation: 000.000000194

I just leave this comment here to help anyone who wants to compare the regex patterns' performance, use msc_retest.

@TimDiam0nd
Copy link
Contributor Author

Hey, apologies i have a couple more prs coming in relation to the same regex (and 933151), which overall improves our performance between 48 to 112 times (i dont think the difference will be as pronounced for you guys, especially with pcre, however in rust regex not extracting from capture groups and just matching increases performance massively).
Specifically just this change improved our regex compile times very slightly, though not a noticeable amount.

@theseion
Copy link
Contributor

Interesting, thanks for the explanation. Although I would have expected the engine to automatically optimise [(] to \( ;)

@theseion theseion added this pull request to the merge queue May 17, 2025
Merged via the queue into coreruleset:main with commit 46f1aae May 17, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants