Skip to content

Conversation

Tmonster
Copy link
Contributor

@Tmonster Tmonster commented Jul 9, 2024

Continuation of https://github.com/duckdb/duckdb/pull/12828/files
closes #11042

The discussion on PR 12828, led to making a new optimizer to prevent mark to semi join conversions. Although the chance is very small, I was worried people could turn off the NO_MARK_TO_SEMI_OPTIMIZER but not the filter pushdown optimizer. This would mean the bug could persist if another optimizer was added to fix this.

Now in optimizer.cpp, once the filter pushdown object is created, filter_pushdown.CheckMarkToSemi(*plan, top_bindings); is called. This is only needs to be called at the root of the plan, keeping the code clean and avoiding extra tree traversals.

@Tmonster Tmonster requested review from lnkuiper and Mytherin July 9, 2024 13:20
@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 9, 2024 13:44
@Tmonster Tmonster marked this pull request as ready for review July 9, 2024 13:44
@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 9, 2024 14:03
@Tmonster Tmonster marked this pull request as ready for review July 9, 2024 14:03
@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 10, 2024 12:49
@Tmonster Tmonster marked this pull request as ready for review July 11, 2024 08:41
Copy link
Contributor

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I have one comment:

auto &expr = proj.expressions.at(col_index);
vector<ColumnBinding> bindings_to_keep;
ExpressionIterator::EnumerateExpression(expr, [&](Expression &child) {
if (expr->expression_class == ExpressionClass::BOUND_COLUMN_REF) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work if the top-level expression is not a column ref, but it refers to one or more column refs?

For example:

SELECT col0 AND col1 FROM (...);

The mark join colref boolean can be used in the remainder of the plan like so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EnumerateExpression should also enumerate into the children, so the mark join colref should still be picked up here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I was worried that referring to columns multiple times would mess things up in queries like this:

SELECT col0, col0 AND col1 FROM (...);

but they are deduplicated by the unordered_set, so that should be fine

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we perhaps in a follow-up PR rename EnumerateExpression to EnumerateAllExpressions? The difference between EnumerateExpression and EnumerateChildren is otherwise easy to miss when reading.

Copy link
Contributor

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to go from my side

@Mytherin Mytherin merged commit 8f35a62 into duckdb:main Jul 11, 2024
@Mytherin
Copy link
Collaborator

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jul 17, 2024
Merge pull request duckdb/duckdb#12916 from Tmonster/label_mark_joins_with_convert_to_semi
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

INTERNAL Error: Failed to bind column reference "xx" [21.0] (bindings: [0.0 0.1 1.0])
3 participants