Optimize Regexp_matches to LIKE statements when possible #7264
Conversation
LGTM
Thanks for the PR! Looks good - some comments:
@@ -12,17 +14,76 @@ namespace duckdb {
 RegexOptimizationRule::RegexOptimizationRule(ExpressionRewriter &rewriter) : Rule(rewriter) {
 	auto func = make_uniq<FunctionExpressionMatcher>();
 	func->function = make_uniq<SpecificFunctionMatcher>("regexp_matches");
-	func->policy = SetMatcher::Policy::ORDERED;
+	func->policy = SetMatcher::Policy::SOME;
By changing this to Some instead of Ordered, the order of the matches no longer matters, meaning we could have [Constant, Expression] instead of [Expression, Constant]. This likely causes some issues with the optimizer below. Maybe we should introduce a PARTIAL_ORDERED to accommodate the case here?
Forgot that I need to escape LIKE wildcards from literal strings. Will implement that tomorrow, so no need to merge if the previous commit is green.
Thanks for the fixes! Some more comments:
@@ -79,6 +81,18 @@ class SetMatcher {
 			}
 		}
 		return true;
+	} else if (policy == Policy::PARTIAL_ORDERED) {
+		// partial ordered policy, if too many entries are provided, return false
+		if (matchers.size() < entries.size()) {
I would imagine this should work the other way around - i.e. we need to match all matchers, but if there are extra expressions they do not count. Otherwise the bindings provided to a function are off, as they may contain fewer entries than expected.
I thought that too, but then how do we match the extra expressions when they are provided? Maybe the FunctionExpressionMatcher can have a minimum_matches variable? And the match function has an extra check like
if (entries.size() < minimum_matches) {
return false;
}
Ah, I think the way the optimizer works changed since my last review - you do want partial matching on the matchers' side. Perhaps call it SOME_ORDERED instead of PARTIAL_ORDERED for clarity, as you essentially want SOME but ordered?
Yea. For the regex optimiser to work, I need to match 2 or 3 arguments. In both cases the arguments have to be ordered correctly.
I will change the name to PARTIAL_ORDERED. Is the minimum_matches variable still a good idea?
Fine by me - but then it should apply to some as well
Ah, I think it is unnecessary logic though since the function binding stage will make sure the arguments are correct. I'll just change the if statement to if (entries.size() < matchers.size()) return false;
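To make the agreed-upon behaviour concrete, here is a minimal, self-contained C++ sketch of the policy being discussed (illustrative types and names, not DuckDB's actual SetMatcher interface): each matcher is compared in order against the entry at the same position, extra trailing entries are allowed, and too few entries fail the match.

#include <functional>
#include <string>
#include <vector>

// Illustrative stand-ins for DuckDB's expression matchers and expressions.
struct Expr {
	std::string type; // e.g. "COLUMN_REF" or "CONSTANT"
};
using Matcher = std::function<bool(const Expr &)>;

// Sketch of a partially ordered policy: matchers are compared in order against
// the leading entries; extra entries are ignored, too few entries fail.
bool MatchPartialOrdered(const std::vector<Matcher> &matchers, const std::vector<Expr> &entries) {
	if (entries.size() < matchers.size()) {
		return false;
	}
	for (size_t i = 0; i < matchers.size(); i++) {
		if (!matchers[i](entries[i])) {
			return false;
		}
	}
	return true;
}

int main() {
	std::vector<Matcher> matchers = {
	    [](const Expr &e) { return e.type == "COLUMN_REF"; }, // the searched string
	    [](const Expr &e) { return e.type == "CONSTANT"; }    // the constant pattern
	};
	// regexp_matches(col, 'pattern', 's'): a third, unmatched argument is allowed
	std::vector<Expr> entries = {{"COLUMN_REF"}, {"CONSTANT"}, {"CONSTANT"}};
	return MatchPartialOrdered(matchers, entries) ? 0 : 1;
}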
char chr = toascii(rune);
// if the character is a LIKE special character or the escape character itself, the pattern needs escaping
if (!contains && (chr == '%' || chr == '_' || chr == ret.escaped_character[0])) {
	ret.escaped = true;
Maybe for simplicity we should just skip the % and _ characters from this optimization for now? It seems like it adds a number of edge cases that complicate this code significantly and I don't think these characters are very common anyway.
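For reference, a small standalone sketch of the kind of escaping under discussion (an assumption of how it could be done, not the PR's code; the helper name is made up): when a literal piece of the regex is copied into a LIKE pattern, any % or _ (and the escape character itself) gets prefixed with a backslash, and the query then has to use LIKE ... ESCAPE '\'.

#include <string>

// Hypothetical helper: copy a literal string into a LIKE pattern, escaping
// LIKE wildcards and the escape character with '\'. 'needs_escape' reports
// whether the result must be used together with ESCAPE '\'.
std::string EscapeLikePattern(const std::string &literal, bool &needs_escape) {
	needs_escape = false;
	std::string result;
	for (char c : literal) {
		if (c == '%' || c == '_' || c == '\\') {
			result += '\\';
			needs_escape = true;
		}
		result += c;
	}
	return result;
}

int main() {
	bool needs_escape;
	// "100%" contains a LIKE wildcard; escaped, it matches the literal text:
	// col LIKE '%100\%%' ESCAPE '\'
	std::string pattern = "%" + EscapeLikePattern("100%", needs_escape) + "%";
	return 0;
}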
test/optimizer/regex_optimizer.test
Binder Error

# we escape like special character when we convert to a like string
query II
We are only looking at the plans here but not at the actual results - if we want to leave the escaping code in, could we add a bunch of tests that execute the queries and verify the correct results are returned?
… since the suffix regexp is private
… together and create a conjuction expression
This reverts commit 17204fe.
…timizer we can match newlines for . characters
… so the operator still has access to them
… likefun::registerFunction
…nctionality ( i think)
Thanks!
As mentioned in the title, this PR optimises regexp_matches statements into LIKE statements. LIKE statements will then be optimised into prefix, suffix, or contains statements by the EmptyNeedleRemovalRule.

Of course, I imagine most duckdb users will use LIKE by default, but this should help with performance for those who don't. Most optimizations require passing the 's' regexp option, however. Maybe in the future we enable this by default? If a user doesn't pass 's', the following transformations won't take place:
regex . -> LIKE _
regex .* -> LIKE %

There are some corner cases that took some time to think about. The LIKE is directly optimised to contains if only a literal string is provided as the regexp argument. If the regexp argument is a concatenation of regex special characters, more attention is paid to whether or not the optimisation should take place. If the string or character has any control characters, the optimisation doesn't take place. If the regexp has to match any special LIKE characters, then an optimisation to like_escaped happens with an escape character of \.

Another round of improvements can be made after this merges. We can also optimise case-insensitive matches to ILIKE and ILIKE_ESCAPE. But to avoid a messy large PR, I'll leave that for next week.
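As a rough illustration of the transformation described above, here is a simplified standalone sketch under my own assumptions (not the optimizer's actual implementation; the function name and the set of rejected metacharacters are illustrative, and the 's'-option caveat is ignored for brevity):

#include <string>

// Simplified sketch: convert a restricted regex (literal text plus '.' and
// '.*') into a LIKE pattern. Returns false when the regex uses anything else,
// in which case no rewrite should happen. 'needs_escape' is set when a LIKE
// wildcard from the literal text had to be escaped with '\'.
static bool RegexToLike(const std::string &regex, std::string &like, bool &needs_escape) {
	like.clear();
	needs_escape = false;
	for (size_t i = 0; i < regex.size(); i++) {
		char c = regex[i];
		if (c == '.') {
			if (i + 1 < regex.size() && regex[i + 1] == '*') {
				like += '%'; // ".*" matches any sequence -> LIKE '%'
				i++;
			} else {
				like += '_'; // "." matches any single character -> LIKE '_'
			}
		} else if (c == '%' || c == '_' || c == '\\') {
			like += '\\'; // literal LIKE wildcards must be escaped
			like += c;
			needs_escape = true;
		} else if (std::string("()[]{}+?|^$").find(c) != std::string::npos) {
			return false; // other regex metacharacters: leave the expression alone
		} else {
			like += c;
		}
	}
	return true;
}

int main() {
	std::string like;
	bool needs_escape;
	// regexp_matches(s, 'duck.*db') could become s LIKE '%duck%db%'
	// (regexp_matches is unanchored, hence the surrounding '%').
	if (RegexToLike("duck.*db", like, needs_escape)) {
		like = "%" + like + "%";
	}
	return 0;
}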