Optimize Regexp_matches to LIKE statements when possible #7264

Tmonster · 2023-04-26T14:10:35Z

As mentioned in the title, this PR optimises regexp_matches statements into LIKE statements. The LIKE statements are then optimised into prefix, suffix, or contains statements by the EmptyNeedleRemovalRule.
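To make the second step concrete, here is a minimal sketch of how a LIKE pattern could be classified as a prefix, suffix, or contains search. The names `LikeRewrite`, `RewriteResult`, and `ClassifyLikePattern` are hypothetical and are not DuckDB's actual classes; the sketch also assumes the pattern has no escape character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration; not DuckDB's actual classes.
enum class LikeRewrite { NONE, PREFIX, SUFFIX, CONTAINS };

struct RewriteResult {
    LikeRewrite kind;
    std::string needle;
};

// Classify a LIKE pattern (assumed to have no escape character).
RewriteResult ClassifyLikePattern(const std::string &pattern) {
    // '_' matches exactly one character, so it cannot become a plain string search.
    if (pattern.find('_') != std::string::npos) {
        return {LikeRewrite::NONE, ""};
    }
    // Strip the runs of '%' at both ends; what remains is the candidate needle.
    std::size_t begin = 0;
    std::size_t end = pattern.size();
    while (begin < end && pattern[begin] == '%') {
        begin++;
    }
    while (end > begin && pattern[end - 1] == '%') {
        end--;
    }
    std::string needle = pattern.substr(begin, end - begin);
    // An empty needle or a '%' in the middle is left to the generic LIKE code.
    if (needle.empty() || needle.find('%') != std::string::npos) {
        return {LikeRewrite::NONE, ""};
    }
    bool leading = begin > 0;
    bool trailing = end < pattern.size();
    if (leading && trailing) {
        return {LikeRewrite::CONTAINS, needle};
    }
    if (trailing) {
        return {LikeRewrite::PREFIX, needle};
    }
    if (leading) {
        return {LikeRewrite::SUFFIX, needle};
    }
    // No wildcard at all: this is plain equality, which is not handled here.
    return {LikeRewrite::NONE, ""};
}
```

Under these assumptions, 'abc%' classifies as a prefix search, '%abc' as a suffix search, and '%abc%' as a contains search, each with the needle 'abc'.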

Of course, I imagine most DuckDB users will use LIKE by default, but this should help with performance for those who don't. Most optimizations require passing the 's' regexp option, however. Maybe in the future we can enable this by default?

If a user doesn't pass 's', the following transformations won't take place:

regex . -> LIKE _
regex .* -> LIKE %
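The two transformations above can be sketched for a tiny regex subset (alphanumeric literals, `.`, `.*`, and `^`/`$` anchors at the ends). This is an illustration under stated assumptions, not the code in this PR: `LikePattern` and `RegexToLike` are hypothetical names, regexp_matches is assumed to succeed when the pattern matches anywhere in the string, and the 's' requirement is presumably there because LIKE wildcards also match newlines while `.` does not unless 's' is passed.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type: ok == false means "keep regexp_matches as is".
struct LikePattern {
    bool ok;
    std::string pattern;
};

// Translate a very small regex subset into a LIKE pattern.
// dot_matches_newline corresponds to the 's' regexp option.
LikePattern RegexToLike(const std::string &regex, bool dot_matches_newline) {
    std::size_t begin = 0;
    std::size_t end = regex.size();
    bool anchored_start = false;
    bool anchored_end = false;
    if (begin < end && regex[begin] == '^') {
        anchored_start = true;
        begin++;
    }
    if (end > begin && regex[end - 1] == '$') {
        anchored_end = true;
        end--;
    }
    std::string like;
    // Append '%' unless the pattern already ends in one.
    auto add_wildcard = [&like]() {
        if (like.empty() || like.back() != '%') {
            like += '%';
        }
    };
    if (!anchored_start) {
        add_wildcard(); // assumed partial-match semantics: anything may precede
    }
    for (std::size_t i = begin; i < end; i++) {
        char c = regex[i];
        if (c == '.') {
            if (!dot_matches_newline) {
                return {false, ""}; // '_' and '%' would also match newlines
            }
            if (i + 1 < end && regex[i + 1] == '*') {
                add_wildcard(); // regex .* -> LIKE %
                i++;
            } else {
                like += '_'; // regex . -> LIKE _
            }
        } else if (std::isalnum(static_cast<unsigned char>(c))) {
            like += c;
        } else {
            return {false, ""}; // anything else is outside this sketch
        }
    }
    if (!anchored_end) {
        add_wildcard();
    }
    return {true, like};
}
```

With the 's' flag set, `a.c` becomes `%a_c%` and `^a.*` becomes `a%` in this sketch; without it, any pattern containing `.` is left untouched.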

There are some corner cases that took some time to think about. The expression is optimised directly to contains if the regexp argument is only a literal or a string. If the regexp argument is a concatenation that includes regex special characters, more attention is paid to whether or not the optimisation should take place.

If the string or character contains any control characters, the optimisation doesn't take place.
If the regexp has to match any LIKE special characters (% or _), the expression is optimised to like_escape with an escape character of \.
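The two rules above can be sketched as a single pass over the literal. `LikeLiteral` and `EscapeForLike` are hypothetical names used only for illustration, and treating the escape character itself as something that must be escaped is an assumption based on the review diff further down.

```cpp
#include <cctype>
#include <string>

// Hypothetical result type for illustration.
struct LikeLiteral {
    bool ok;             // false: a control character was found, skip the optimisation
    bool escaped;        // true: the pattern needs an escape character of '\'
    std::string pattern; // the literal with LIKE special characters escaped
};

LikeLiteral EscapeForLike(const std::string &literal) {
    LikeLiteral result{true, false, ""};
    for (char c : literal) {
        // Control characters: give up and leave the regex alone.
        if (std::iscntrl(static_cast<unsigned char>(c))) {
            return {false, false, ""};
        }
        // LIKE wildcards and the escape character itself must be escaped.
        if (c == '%' || c == '_' || c == '\\') {
            result.pattern += '\\';
            result.escaped = true;
        }
        result.pattern += c;
    }
    return result;
}
```

In this sketch a literal such as `50%` yields the pattern `50\%` with `escaped` set, while a plain literal such as `abc` comes back unchanged and can use the unescaped LIKE path.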

Another round of improvements can be made after this merges. We can also optimise case-insensitive matches to ILIKE and ILIKE_ESCAPE, but to avoid a messy, large PR, I'll leave that for next week.

src/optimizer/rule/regex_optimizations.cpp

Tishj

LGTM

Mytherin

Thanks for the PR! Looks good - some comments:

Mytherin · 2023-05-01T08:20:31Z

src/optimizer/rule/regex_optimizations.cpp

@@ -12,17 +14,76 @@ namespace duckdb {
 RegexOptimizationRule::RegexOptimizationRule(ExpressionRewriter &rewriter) : Rule(rewriter) {
 	auto func = make_uniq<FunctionExpressionMatcher>();
 	func->function = make_uniq<SpecificFunctionMatcher>("regexp_matches");
-	func->policy = SetMatcher::Policy::ORDERED;
+	func->policy = SetMatcher::Policy::SOME;


By changing this to SOME instead of ORDERED, the order of the matches no longer matters, meaning we could have [Constant, Expression] instead of [Expression, Constant]. This likely causes some issues with the optimizer below. Maybe we should introduce a PARTIAL_ORDERED policy to accommodate the case here?

src/function/scalar/string/like.cpp

Tmonster · 2023-05-01T15:36:24Z

Forgot that I need to escape LIKE wildcards in literal strings. Will implement that tomorrow, so no need to merge if the previous commit is green.

Mytherin

Thanks for the fixes! Some more comments:

Mytherin · 2023-05-05T12:19:44Z

src/include/duckdb/optimizer/matcher/set_matcher.hpp

@@ -79,6 +81,18 @@ class SetMatcher {
 				}
 			}
 			return true;
+		} else if (policy == Policy::PARTIAL_ORDERED) {
+			// partial ordered policy, if too many entries are provided, return false
+			if (matchers.size() < entries.size()) {


I would imagine this should work the other way around - i.e. we need to match all matchers, but if there are extra expressions they do not count. Otherwise the bindings provided to a function are off as they may contain fewer entries than expected.

I thought that too, but then how do we match the extra expressions when they are provided? Maybe the FunctionExpressionMatcher can have a minimum_matches variable? And the match function has an extra check like

if (entries.size() < minimum_matches) { return false; }

Ah, I think the way the optimizer works changed since my last review - you do want partial matching on the matchers' side. Perhaps call it SOME_ORDERED instead of PARTIAL_ORDERED for clarity, as you essentially want SOME but ordered?

Yeah. For the regex optimiser to work, I need to match 2 or 3 arguments. In both cases the arguments have to be ordered correctly.
I will change the name to SOME_ORDERED. Is the minimum_matches variable still a good idea?

Fine by me - but then it should apply to SOME as well.

Ah, I think it is unnecessary logic though since the function binding stage will make sure the arguments are correct. I'll just change the if statement to if (entries.size() < matchers.size()) return false;
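The policy the thread converges on can be sketched as follows. `Expr`, `Matcher`, and `MatchSomeOrdered` are simplified, hypothetical stand-ins for the real expression and matcher classes; the only behaviour taken from the discussion is the size check and the requirement that matchers apply in order.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real expression and matcher classes.
struct Expr {
    bool is_constant;
    std::string text;
};
using Matcher = std::function<bool(const Expr &)>;

// Every matcher must match the entry at the same position. Entries beyond
// the last matcher are allowed and are not added to the bindings.
bool MatchSomeOrdered(const std::vector<Matcher> &matchers, const std::vector<Expr> &entries,
                      std::vector<const Expr *> &bindings) {
    if (entries.size() < matchers.size()) {
        return false; // too few entries to satisfy every matcher
    }
    std::vector<const Expr *> matched;
    for (std::size_t i = 0; i < matchers.size(); i++) {
        if (!matchers[i](entries[i])) {
            return false;
        }
        matched.push_back(&entries[i]);
    }
    // Only publish bindings once the whole match has succeeded.
    bindings.insert(bindings.end(), matched.begin(), matched.end());
    return true;
}
```

With matchers for [Expression, Constant], this accepts both the two-argument call and the three-argument call that carries an options string, and rejects [Constant, Expression].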

Mytherin · 2023-05-05T12:22:53Z

src/optimizer/rule/regex_optimizations.cpp

+		char chr = toascii(rune);
+		// if a character is equal to the escaped character return that there is no escaped like string.
+		if (!contains && (chr == '%' || chr == '_' || chr == ret.escaped_character[0])) {
+			ret.escaped = true;


Maybe for simplicity we should just skip the % and _ characters from this optimization for now? It seems like it adds a number of edge cases that complicate this code significantly and I don't think these characters are very common anyway.

Mytherin · 2023-05-05T12:24:12Z

test/optimizer/regex_optimizer.test

+Binder Error
+
+# we escape like special character when we convert to a like string
+query II


We are only looking at the plans here but not at the actual results - if we want to leave the escaping code in, could we add a bunch of tests that execute the queries and verify the correct results are returned?

Mytherin · 2023-06-05T07:44:00Z

Thanks!

Tmonster requested a review from Tishj April 26, 2023 14:10

Tmonster mentioned this pull request Apr 26, 2023

Add documentation or regexp_matches examples. duckdb/duckdb-web#719

Merged

Tishj reviewed Apr 26, 2023

Tishj approved these changes Apr 28, 2023

Mytherin reviewed May 1, 2023

Tmonster requested a review from Mytherin May 2, 2023 11:17

Mytherin reviewed May 5, 2023

Tmonster requested a review from Mytherin May 5, 2023 16:18

Mytherin changed the base branch from master to feature May 11, 2023 16:43

Tmonster mentioned this pull request May 15, 2023

Revert prefix suffix optimizations, revert year extraction tidyverse/duckplyr#7

Merged

Tmonster added 19 commits May 31, 2023 10:11

I have an idea of how to add the prefix operator, none yet for suffix…

8353b2a

… since the suffix regexp is private

tmp

cc2b0cf

regexp_matches converts to prefix if only a prefix regex is asked for

b7a479d

have suffix working now as well. I know how to optimize it to like as…

b43a6af

… well

temporary commit, still need to identify suffix and prefix extensions…

10cc6c9

… together and create a conjuction expression

can convert regex statements into like, need to add tests and clean u…

4e42acf

…p code

added some tests, can now check the physical plan

13dc0e4

make format-fix

e0e0e80

all tests are passing. Need to look at the any byte thing again

27e7ff5

remove mode skip/unskip

ec8f366

fixed tests, going to open pull request and if stuff builds

43bd7b1

prevent fall through case

b67da7b

REVERT ME

9bf154e

Revert "REVERT ME"

f636f02

This reverts commit 17204fe.

regex optimizer test updates

6cdc9cd

figured out how to pass options to the regex optimizer, now in the op…

a4c4e0c

…timizer we can match newlines for . characters

more tests for newline

f1903c5

only remove regex options if a like string is found, if not keep them…

239130d

… so the operator still has access to them

small improvement

251866e

Tmonster added 11 commits May 31, 2023 10:11

make format-fix

9ffa872

add partial ordered for function matching. add getlikefunction to the…

c5a7120

… likefun::registerFunction

getting closer, still need to figure out like_escape

71cc61a

have now added like escape functionality, and have better contains fu…

d62de31

…nctionality ( i think)

also test that the escape character can be matched if escaped

96b0c3a

check for control characters in the literal case as well

2124516

fix the last tests

63db005

remove compile warning->error

b6f802e

clang fixes

7f26652

make format-fix

0825984

more PR fixes

de27937

Tmonster force-pushed the regex_prefix_suffix_optimizations_2 branch from 1cf3106 to de27937 Compare May 31, 2023 08:13

Tmonster added 2 commits May 31, 2023 10:23

dont initialize variable again

49003ab

fix comment

c53264b

Mytherin merged commit 6b84f2a into duckdb:feature Jun 5, 2023
