Skip to content

Conversation

Tmonster
Copy link
Contributor

@Tmonster Tmonster commented Apr 26, 2023

As mentioned in the title, this PR is to optimise regexp_matches statements into LIKE statements. LIKE statements will then be optimised into prefix, suffix, or contains statements by the EmptyNeedleRemovalRule.

Of course, I imagine most duckdb users will use LIKE by default, but this should help with performance for those who don't. Most optimizations require passing the 's' regexp option, however. Maybe in the future we enable this by default?

If a user doesn't pass 's', the following transformations won't take place

regex . -> LIKE _
regex .* -> LIKE %

There are some corner cases that took some time to think about. Like is directly optimised to contains if only a literal or a string is provided in the regexp argument. If the regexp argument is a concatenation of regex special characters, more attention is paid into whether or not the optimisation should take place.

If the string or character has any control characters, the optimisation doesn't take place.
If the regexp has to match any special like characters, then an optimisation to like_escaped happens with an escape character of \.

Another round of improvements can be made after this merges. We can also optimise case insensitive matches to ILIKE and ILIKE_ESCAPE. But to avoid a messy large PR, I'll leave that for next week

Copy link
Contributor

@Tishj Tishj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good - some comments:

@@ -12,17 +14,76 @@ namespace duckdb {
RegexOptimizationRule::RegexOptimizationRule(ExpressionRewriter &rewriter) : Rule(rewriter) {
auto func = make_uniq<FunctionExpressionMatcher>();
func->function = make_uniq<SpecificFunctionMatcher>("regexp_matches");
func->policy = SetMatcher::Policy::ORDERED;
func->policy = SetMatcher::Policy::SOME;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changing this to Some instead of Ordered the order of the matches no longer matters, meaning we could have [Constant, Expression] instead of [Expression, Constant]. This likely causes some issues with the optimizer below. Maybe we should introduce a PARTIAL_ORDERED to accommodate the case here?

@Tmonster
Copy link
Contributor Author

Tmonster commented May 1, 2023

Forgot that I need to escape LIKE wildcards from literal strings. Will implement that tomorrow, so no need to merge if the previous commit is green

@Tmonster Tmonster requested a review from Mytherin May 2, 2023 11:17
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes! Some more comments:

@@ -79,6 +81,18 @@ class SetMatcher {
}
}
return true;
} else if (policy == Policy::PARTIAL_ORDERED) {
// partial ordered policy, if too many entries are provided, return false
if (matchers.size() < entries.size()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would imagine this should works the other way around - i.e. we need to match all matchers, but if there are extra expressions they do not count. Otherwise the bindings provided to a function are off as they may contain fewer entries than expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that too, but then how do we match the extra expressions when they are provided? Maybe the FunctionExpressionMatcher can have a minimum_matches variable? And the match function has an extra check like

if (entries.size() < minimum_matches) {
    return false;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think the way the optimizer works changed since my last review - you do want partial matching on the matchers' side. Perhaps call it SOME_ORDERED instead of PARTIAL_ORDERED for clarity, as you essentially want SOME but ordered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea. For the regex optimiser to work, I need to match 2 or 3 arguments. In both cases the arguments have to be ordered correctly.
I will change the name to PARTIAL_ORDERED. Is the minimum_matches variable still a good idea?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me - but then it should apply to some as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think it is unnecessary logic though since the function binding stage will make sure the arguments are correct. I'll just change the if statement to if (entries.size() < matchers.size()) return false;

char chr = toascii(rune);
// if a character is equal to the escaped character return that there is no escaped like string.
if (!contains && (chr == '%' || chr == '_' || chr == ret.escaped_character[0])) {
ret.escaped = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for simplicity we should just skip the % and _ characters from this optimization for now? It seems like it adds a number of edge cases that complicate this code significantly and I don't think these characters are very common anyway.

Binder Error

# we escape like special character when we convert to a like string
query II
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only looking at the plans here but not at the actual results - if we want to leave the escaping code in could we add a bunch of tests that execute the queries and verify the correct results are returned?

@Tmonster Tmonster requested a review from Mytherin May 5, 2023 16:18
@Mytherin Mytherin changed the base branch from master to feature May 11, 2023 16:43
@Tmonster Tmonster force-pushed the regex_prefix_suffix_optimizations_2 branch from 1cf3106 to de27937 Compare May 31, 2023 08:13
@Mytherin Mytherin merged commit 6b84f2a into duckdb:feature Jun 5, 2023
@Mytherin
Copy link
Collaborator

Mytherin commented Jun 5, 2023

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants