Join Order Optimizer has duplicate enumerations and lost some neighbors #7358
Hi Lokax, thanks for finding this bug! I can confirm that this is an issue, but I'm not convinced this is the correct fix. I need to go back and review the paper, but I'm pretty sure we only consider neighbors with lower ids so that the algorithm doesn't emit the same pairs multiple times. If we consistently consider all neighbors, we will end up emitting the same join pairs more than once and the algorithm can become an order of magnitude less efficient. DuckDB won't be slower, since we stop emitting pairs at 10,000 and fall back to a greedy join ordering algorithm. I think the problem is somewhere else actually, in the function and the following edges.
You can verify this with a debugger: on line 114 of query_graph.cpp, I can step through the execution and I see the following calls and output.
The fact that no neighbors are returned for the last two calls makes me believe that we aren't enumerating our graph correctly. I can dig deeper into this, but maybe you want to have a go at it? Let me know if you need more information for debugging. I would also like to add a test for this. |
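As a rough illustration of that restriction, here is a minimal sketch (assumed names and data structures, not DuckDB's implementation): enumeration started from a relation only grows towards neighbors with lower ids, so every connected subgraph is discovered from exactly one starting relation and no pair is emitted twice.

```cpp
#include <cstdint>
#include <set>
#include <vector>

using NodeSet = std::set<uint64_t>;

// Neighbors of `subgraph` that are not in the subgraph and not excluded.
NodeSet Neighbors(const std::vector<NodeSet> &adjacency, const NodeSet &subgraph,
                  const NodeSet &exclusion_set) {
	NodeSet result;
	for (auto node : subgraph) {
		for (auto neighbor : adjacency[node]) {
			if (!subgraph.count(neighbor) && !exclusion_set.count(neighbor)) {
				result.insert(neighbor);
			}
		}
	}
	return result;
}

// When enumeration starts from `start_node`, exclude every relation with an id
// >= start_node: a connected subgraph is then only ever grown from its
// highest-id relation, so it is emitted exactly once. Dropping this restriction
// re-emits the same subgraphs (and join pairs) many times.
NodeSet InitialExclusionSet(uint64_t start_node, uint64_t total_nodes) {
	NodeSet exclusion_set;
	for (uint64_t id = start_node; id < total_nodes; id++) {
		exclusion_set.insert(id);
	}
	return exclusion_set;
}
```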
Thanks. |
Hmm, you may be right that it doesn't emit duplicate pairs; however, I still think the issue is in EnumerateNeighbors. I looked into it again and found that by changing a
This is fixed in my branch. We see if hypernode {1, 3} exists, and when it doesn't we move on to making nodes starting with {3}. Instead we should continue and inspect if there is a hypernode {1, 4} and if the hypernode has neighbors. |
Yup, this is my first fix: because there is no HyperNode {1, 3} in this example, we never get to the neighbors of HyperNode {1, 4}. Are there unit tests for the Join Order Optimizer? It looks like sqllogictest is not easy to use here because of the other optimizers. |
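A self-contained toy of the behaviour described in this thread (names and data layout are assumptions, not the actual query_graph.cpp code): if the lookup loop gives up as soon as the hypernode {1, 3} is missing, the neighbor {2} contributed by {1, 4} is never collected; scanning the remaining candidates instead finds it.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
	// Hypernodes keyed by a label, mapped to their neighbor relation ids;
	// {1, 3} is deliberately absent, as in the example above.
	std::map<std::string, std::vector<int>> hypernode_neighbors {
		{"{1}", {3, 4}},
		{"{1, 4}", {2}},
	};
	std::vector<std::string> candidates {"{1, 3}", "{1, 4}"};

	std::vector<int> neighbors;
	for (auto &candidate : candidates) {
		auto entry = hypernode_neighbors.find(candidate);
		if (entry == hypernode_neighbors.end()) {
			// Using `break` here reproduces the reported behaviour: the
			// neighbor {2} of {1, 4} is never collected. `continue` keeps scanning.
			continue;
		}
		neighbors.insert(neighbors.end(), entry->second.begin(), entry->second.end());
	}
	for (auto n : neighbors) {
		std::cout << "neighbor: " << n << "\n"; // prints 2 with `continue`, nothing with `break`
	}
	return 0;
}
```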
I've updated my PR to run a test case with all optimisers enabled. If you run EXPLAIN ANALYZE then you see the join plan, and you see that there are no cross products. If you can produce a different query that fails, I'd be happy to look into this more, but if my fix passes the tests I'm going to open it up against master |
I changed it to disable the other optimizers:

SELECT * FROM test0, test1, test2, test3, test4
WHERE test1.range + test4.range = test2.range
  AND test1.range + test4.range = test3.range
  AND test1.range = test4.range
  AND test1.range = test0.range
  AND test1.range + test3.range = test0.range;

...
node = [1, 3, 4]
exclusion_set = 3, 4, 1, 0,
neighbors = [], should be [2]
...

In this example, it still doesn't look up HyperNode {1, 4}'s neighbors. Even though the other optimizers are disabled and the neighbors of HyperNode {1, 4} are not found, we still don't generate a CrossProduct.
Simply changing
I agree. My fix is still in this function. But I should note that this PR contains two fixes. |
Yes, you're right, my implementation was wrong. I didn't disable all the other optimisers, and other edges were optimised in, which still made my tests pass.
DuckDB doesn't have unit tests for many of our functions. We prefer SQLLogic tests because they test the behaviour of the whole system for many different features. They also don't require recompiling code, and if anybody refactors someone else's code, the SQLLogic tests remain valid tests even if functions change.
I would still appreciate more information on this. Can you link a specific page in the document? Also, can you add more D_ASSERT(s) to check and make sure we aren't emitting extra pairs? Also, please add the unit test file that I have in my PR. It may not fully test the functionality for your second test, but the first query does error on master in debug mode if you have |
@@ -474,6 +490,16 @@ bool JoinOrderOptimizer::EnumerateCSGRecursive(JoinRelationSet &node, unordered_
	for (idx_t i = 0; i < neighbors.size(); i++) {
		// Reset the exclusion set so that the algorithm considers all combinations
		// of the exclusion_set with a subset of neighbors.

		// FIXME(lokax): This looks like there is a problem with duplicated enumeration?
You are fixing this in this PR, right? Do you still need the comment then?
No. Maybe I should remove this comment. When I have time, I will try to validate my thoughts. :)
@Tmonster I was wrong here.
EXPLAIN ANALYZE SELECT * FROM t1, t2, t3, t4, t5, t6, t7;
before: … after: … |
JoinOrderOptimizer::EnumerateCSGRecursive(...) {
	// FIXME(lokax): This looks like there is a problem with duplicated enumeration?
	// e.g. S1 = {R0}, neighbors = {R1, R2}
	// First, S1 = {R0} ==> {R0, R1} ==> {R0, R1, R2}
	// Then, S1 = {R0} ==> {R0, R2} ==> {R0, R1, R2}
	// S1 = {R0, R1, R2} will be enumerated twice.
	// Although this is necessary for correctness, since {R0, R1, R2} may be updated, we do have duplicate
	// enumeration. Maybe we should get all subsets of the neighbors and traverse from small to large subsets,
	// and new_exclusion_set would be (exclusion_set U all neighbors),
	// e.g. S1 = {R0} ==> {R0, R1} ==> {R0, R2} ==> {R0, R1, R2}
}

When n is 7, the perfect pair number for a clique join graph is 966 instead of 4016. I tried to fix it locally; now the pair number is 966. I will push it later. |
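To make the FIXME concrete, here is a small self-contained sketch (an illustration of the proposed enumeration order, not the actual DuckDB change): growing S1 = {R0} by every non-empty subset of its neighbor set {R1, R2}, small subsets first, produces {R0, R1, R2} exactly once instead of once per recursion path.

```cpp
#include <algorithm>
#include <bitset>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
	// S1 = {R0}; its neighbors are R1 and R2, as in the FIXME above.
	std::vector<int> neighbors {1, 2};
	const uint64_t count = neighbors.size();

	// Every non-empty subset of the neighbor set, visited small-to-large.
	std::vector<uint64_t> masks;
	for (uint64_t mask = 1; mask < (uint64_t(1) << count); mask++) {
		masks.push_back(mask);
	}
	std::stable_sort(masks.begin(), masks.end(), [](uint64_t a, uint64_t b) {
		return std::bitset<64>(a).count() < std::bitset<64>(b).count();
	});

	for (auto mask : masks) {
		std::cout << "S1 = { R0";
		for (uint64_t i = 0; i < count; i++) {
			if (mask & (uint64_t(1) << i)) {
				std::cout << ", R" << neighbors[i];
			}
		}
		// each union of {R0} with a neighbor subset is produced exactly once
		std::cout << " }\n";
	}
	return 0;
}
```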
Came back from vacation and started looking at this again. One of the imdb queries is regressing and local testing on the imdb dataset can reproduce this.
I believe the presence of the bug is the reason a better join tree is selected on master: one of the previously missing/not-considered neighbors looks like a better plan on paper, but after execution it clearly is not.
There are a number of improvements to the imdb plans. They don't show up on the CI because the sum of all hash join cardinalities is too low to cause a noticeable difference in the timing of the query execution. I'm working on a PR to refactor the join ordering code to make different plan selection easier to debug.
@Mytherin Once you change the cardinality_is_higher() function in scripts/plan_cost_runner.py to compare exact cardinality counts instead of anything over 20%, the output looks like this (https://github.com/duckdb/duckdb/blob/master/scripts/plan_cost_runner.py#L101). The bug that required the 20% fix seems to have been fixed, but I'm not sure when/where.
====================================================
=========== IMPROVEMENTS DETECTED ==========
====================================================
Query: 22a
Old cost: 742913
New cost: 737733
Query: 22b
Old cost: 370230
New cost: 369964
Query: 22c
Old cost: 3570607
New cost: 3526186
Query: 22d
Old cost: 7242514
New cost: 6714659
Query: 23a
Old cost: 1515746
New cost: 1509989
Query: 23c
Old cost: 1625297
New cost: 1603453
Query: 27a
Old cost: 13647
New cost: 13627
Query: 30c
Old cost: 1351374
New cost: 1349011
====================================================
=========== REGRESSIONS DETECTED ===========
====================================================
Query: 19d
Old cost: 15005406
New cost: 17461235
Query: 23b
Old cost: 2536
New cost: 2546
Query: 27b
Old cost: 12374
New cost: 12431
Query: 27c
Old cost: 13393
New cost: 13764
Once my comments are addressed I'm OK with approving & merging.
@@ -65,4 +65,53 @@ class JoinRelationSetManager {
	JoinRelationTreeNode root;
};

class NeighborSubset {
Hmm, I feel like this doesn't need to be its own class. I think the JoinRelationSetManager class can have a &GetAllNeighborsSubset(relations) function? Then you can use it in EnumerateCmpRecursive, EnumerateCSGRecursive, and UpdateDPTree(). GetAllNeighborSets() in join_order_optimizer.cpp has similar logic in terms of creating all the subsets, but is a static function.
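For illustration, a rough sketch of what such a shared helper could look like; the name GetAllNeighborsSubset, the signature, and the representation of neighbors as a plain vector of relation ids are taken from the comment above and are assumptions, not DuckDB's actual API. EnumerateCSGRecursive, EnumerateCmpRecursive and UpdateDPTree could then all call the same function instead of duplicating the subset loop.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using idx_t = uint64_t;

// All non-empty subsets of a neighbor list. The result is exponential in
// neighbors.size(), so callers would want to cap the neighbor count.
static std::vector<std::vector<idx_t>> GetAllNeighborsSubset(const std::vector<idx_t> &neighbors) {
	std::vector<std::vector<idx_t>> subsets;
	const idx_t count = neighbors.size();
	for (idx_t mask = 1; mask < (idx_t(1) << count); mask++) {
		std::vector<idx_t> subset;
		for (idx_t i = 0; i < count; i++) {
			if (mask & (idx_t(1) << i)) {
				subset.push_back(neighbors[i]);
			}
		}
		subsets.push_back(std::move(subset));
	}
	return subsets;
}
```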
	union_sets.reserve(neighbors.size());
	for (idx_t i = 0; i < neighbors.size(); i++) {
		auto &neighbor = set_manager.GetJoinRelation(neighbors[i]);
	union_sets.reserve(all_subset.Size());
Why do we need all the subsets of the neighbors of the right relation? If I'm not mistaken, it is not guaranteed that all subsets have a connection with node. I feel like we will end up generating all of these subsets, but many of them are ignored because the check if (plans.find(&new_set) != plans.end()) will be false.
This looks good to me. The regression for imdb isn't ideal, but there are 8 queries that improve. I have created a PR to remove the 20% check for join regressions here #7989; this should then show that there are more improvements than regressions. I don't think the regression is a big deal: it's another case where our selectivity estimation is incorrect, but hopefully that is improved in another 3 months. In addition, a number of other queries are improved, and with larger datasets this may mean better join orders where the saved execution time is more significant. |
Could you just have a look at the failing code coverage CI? |
It seems that because of the previous bug fix, left_set and right_set no longer intersect. |
Could you either delete the code and replace it with an internal exception/assertion (if it is no longer required) or add a new test that triggers the behavior again? |
Thanks! |
We need to iterate over the neighbors of all subsets, because otherwise we lose some HyperNode neighbors.
Disable all other optimizers: this avoids predicate deduction during predicate pushdown.
before:
S1 = {test0}, S2 = {test1} ==> {test1, test4} ==> {test1, test2, test4} ==> we can't get {test1, test2, test3, test4},
so we need to generate a cross product.
after:
S1 = {test0}, S2 = {test1} ==> {test1, test4} ==> {test1, test2, test4} ==> because the subset {test1, test4} has the neighbor {test3} ==> {test1, test2, test3, test4}.
We can get the final plan JoinNode {S1, S2}.
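To make that concrete, here is a small self-contained toy (assumed data structures, not DuckDB code): hyperedges connect a set of relations to another set, so the neighbor {test3} of {test1, test2, test4} can only be discovered by also checking the subset {test1, test4} against the hyperedge sources.

```cpp
#include <iostream>
#include <map>
#include <set>

using RelSet = std::set<int>;

int main() {
	// Toy numbering: test0..test4 = 0..4. The predicates
	// test1.range + test4.range = test2.range and
	// test1.range + test4.range = test3.range give the hyperedge {1, 4} -> {2, 3};
	// only this hyperedge is modelled here to keep the example small.
	std::map<RelSet, RelSet> hyperedges {
		{{1, 4}, {2, 3}},
	};
	RelSet node {1, 2, 4};

	// Collect neighbors contributed by every subset of `node` that keys a hyperedge.
	RelSet result;
	for (auto &edge : hyperedges) {
		bool source_is_subset = true;
		for (auto relation : edge.first) {
			if (!node.count(relation)) {
				source_is_subset = false;
				break;
			}
		}
		if (!source_is_subset) {
			continue;
		}
		for (auto relation : edge.second) {
			if (!node.count(relation)) {
				result.insert(relation);
			}
		}
	}
	for (auto relation : result) {
		// prints "neighbor: test3"; a lookup keyed only by the full set {1, 2, 4}
		// would find no hyperedge at all and miss test3.
		std::cout << "neighbor: test" << relation << "\n";
	}
	return 0;
}
```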
The DPhyp paper is missing some pseudocode in EmitCSG, so a cyclic query will be enumerated more than once.
So we need to update exclusion_set. Example below:
All tables are connected. S2 = {R0}, neighbors {R1}, {R2}, {R3}