
Conversation

@lokax lokax commented May 4, 2023

We need to iterate over the neighbors of all subsets of a node, so that we do not lose any HyperNode neighbors.

SELECT * FROM test0, test1, test2, test3, test4 WHERE test1.x + test4.x = test2.x AND test1.x + test4.x = test3.x AND test1.x = test4.x AND test1.x = test0.x;

Disable all other optimizers, to avoid predicate deduction during predicate pushdown.

before:
S1 = {test0}, S2 = {test1} ==> {test1, test4} ==> {test1, test2, test4} ==> we can't reach {test1, test2, test3, test4},
so we need to generate a cross product.
after:
S1 = {test0}, S2 = {test1} ==> {test1, test4} ==> {test1, test2, test4} ==> because the subset {test1, test4} has the neighbor {test3} ==> {test1, test2, test3, test4}.
We can get the final plan JoinNode {S1, S2}.
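
For illustration, a minimal self-contained sketch of the idea (a toy model; the real implementation stores relation sets in a trie inside the relation set manager):

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using RelationSet = std::set<uint64_t>;
// Edges keyed by the exact relation set they hang off, e.g. the predicate
// test1.x + test4.x = test2.x produces a hyperedge {1, 4} -> {2}.
using EdgeMap = std::map<RelationSet, std::vector<RelationSet>>;

// Collect the neighbors of every non-empty subset of `node`, skipping
// relations in `exclusion_set`. An edge attached to the hypernode {1, 4}
// is only found if the subset {1, 4} itself is probed, not just the
// singletons {1} and {4}.
std::set<uint64_t> GetAllSubsetNeighbors(const std::vector<uint64_t> &node, const EdgeMap &edges,
                                         const std::set<uint64_t> &exclusion_set) {
	std::set<uint64_t> neighbors;
	const uint64_t subset_count = uint64_t(1) << node.size();
	for (uint64_t mask = 1; mask < subset_count; mask++) { // every non-empty subset
		RelationSet subset;
		for (size_t i = 0; i < node.size(); i++) {
			if (mask & (uint64_t(1) << i)) {
				subset.insert(node[i]);
			}
		}
		auto it = edges.find(subset);
		if (it == edges.end()) {
			continue;
		}
		for (auto &target : it->second) {
			for (auto rel : target) {
				if (!exclusion_set.count(rel)) {
					neighbors.insert(rel);
				}
			}
		}
	}
	return neighbors;
}
```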

The DPhyp paper is missing some pseudocode in EmitCSG, so cyclic queries lead to duplicated enumeration.
We need to update the exclusion_set to prevent this. Example below:

All tables are connected. S2 = {R0}, neighbors {R1}, {R2}, {R3}

  1. S2 = {R0, R3}, exclusion_set = {R0, R1, R2} ==> End.
  2. S2 = {R0, R2}, exclusion_set = {R0, R1} ==> {R0, R2, R3}
  3. S2 = {R0, R1}, exclusion_set = {R0} ==> neighbors = {R2}, {R3} ==> {R0, R1, R2}, {R0, R1, R3}
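
A compact sketch of that exclusion_set update (simplified to growing the set by one neighbor at a time; the real code works on neighbor subsets inside the CSG/CMP enumeration):

```cpp
#include <cstdint>
#include <set>
#include <vector>

using RelationSet = std::set<uint64_t>;

// Grow S by each neighbor in ascending order; once a neighbor has been
// handled, it joins the exclusion set of the remaining branches. In the
// clique example this yields exactly the three branches listed above:
// R1 with X = {R0}, R2 with X = {R0, R1}, R3 with X = {R0, R1, R2},
// so {R0, R1, R2} is only ever reached through the R1 branch.
void EnumerateGrowth(const RelationSet &S, RelationSet X,
                     const std::vector<uint64_t> &neighbors) {
	for (auto n : neighbors) {
		if (X.count(n) || S.count(n)) {
			continue;
		}
		RelationSet grown = S;
		grown.insert(n);
		// EmitCsg(grown) / recurse on grown with exclusion set X here ...
		X.insert(n); // the exclusion_set update under discussion
	}
}
```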

@Mytherin Mytherin requested a review from Tmonster May 4, 2023 05:01

Tmonster commented May 4, 2023

Hi Lokax,

Thanks for finding this bug! I can confirm that this is an issue, but I'm not convinced this is the correct fix. I need to go back and review the paper, but I'm pretty sure we only consider neighbors with lower ids so that the algorithm doesn't emit the same pairs multiple times. If we consistently consider all neighbors, we will end up emitting the same join pairs more than once, and the algorithm can become an order of magnitude less efficient. DuckDB won't be slower, since we stop emitting pairs at 10,000 and fall back to a greedy join ordering algorithm.

I think the problem is actually somewhere else, in the function GetNeighbors and subsequently EnumerateNeighbors. I did some digging myself, and I found the following.
When we create the join graph we have the following relations:

[0], [1], [2], [3], [4]

and the following edges.

[0] <-> [1]
[4] <-> [1]
[3] <-> [1, 4]
[2] <-> [1, 4]

You can verify these with a Print() or ToString().

On line 114 of query_graph.cpp, I can step through the execution and I see the following calls and output:

node = [1, 4], exclusion_set = [4, 1, 0]
neighbors returned = [2, 3]

node = [1, 4], exclusion_set = [4, 0]
neighbors returned = [1, 3, 2]

node = [1, 3, 4], exclusion_set = [4, 3, 1, 0]
neighbors returned = [] <- *should be [2]*

node = [1, 2, 4], exclusion_set = [4, 2, 1, 0]
neighbors returned = [] <- *should be [3]*

The fact that no neighbors are returned for the last two calls makes me believe that we aren't enumerating our graph correctly. I can dig deeper into this, but maybe you want to have a go at it? Let me know if you need more information for debugging.

I would also like to add a test for this.
I know you can add a pragma debug_force_no_cross_product=true; to catch if a cross product is added when not needed, but with the example given, the other optimizers optimize away the problem. I think if we populate the tables this won't be the case.


lokax commented May 4, 2023

> If we consistently consider all neighbors, we will end up emitting the same join pairs more than once, and the algorithm can become an order of magnitude less efficient. DuckDB won't be slower, since we stop emitting pairs at 10,000 and fall back to a greedy join ordering algorithm.

Thanks.
The RelationSetMgr::UNION(...) always sorts relation ids and stores them in tries.
What I want to say is that we need to look for all of a HyperNode's neighbors. In the current implementation, we do not look for some HyperNodes' neighbors (e.g. HyperNode{1, 4}), because the nested loop can't enumerate all subsets.
e.g. node = [1, 3, 4], exclusion_set = [4, 3, 1, 0]
We will look for the neighbors of these subsets: SimpleNode{1}, SimpleNode{3}, SimpleNode{4}, HyperNode{1, 3}, HyperNode{1, 3, 4}, HyperNode{3, 4}. But HyperNode{1, 4} is missing.
I don't think this will emit the same pair multiple times, because we always use the smallest relation id and use a hash table for deduplication.
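
The shape of the bug fits in a few lines. This is a toy model of the nested loop, not the actual trie walk in EnumerateNeighbors:

```cpp
#include <cstdint>
#include <vector>

// A nested loop over the sorted relation ids only produces *contiguous* runs:
// for node = [1, 3, 4] it yields {1}, {1,3}, {1,3,4}, {3}, {3,4}, {4}, the
// same six subsets listed above, and can never form the subset {1, 4}.
std::vector<std::vector<uint64_t>> ContiguousSubsets(const std::vector<uint64_t> &node) {
	std::vector<std::vector<uint64_t>> result;
	for (size_t start = 0; start < node.size(); start++) {
		std::vector<uint64_t> subset;
		for (size_t end = start; end < node.size(); end++) {
			subset.push_back(node[end]);
			result.push_back(subset);
		}
	}
	return result;
}
```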


Tmonster commented May 4, 2023

Hmm, you may be right that it doesn't emit duplicate pairs. However, I still think the issue is in EnumerateNeighbors. I looked into it again and found that changing a break to a continue solves your issue.
If you take a look at my branch/PR here you can see my solution, along with a test case:
Tmonster#51

> We will look for the neighbors of these subsets: SimpleNode{1}, SimpleNode{3}, SimpleNode{4}, HyperNode{1, 3}, HyperNode{1, 3, 4}, HyperNode{3, 4}. But HyperNode{1, 4} is missing.

This is fixed in my branch. Currently we check whether hypernode {1, 3} exists, and when it doesn't, we move on to building nodes starting with {3}. Instead we should continue, and inspect whether there is a hypernode {1, 4} and whether that hypernode has neighbors.
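
Schematically, the proposed change behaves like this (a self-contained toy; existing_sets stands in for the trie in the relation set manager):

```cpp
#include <cstdint>
#include <set>
#include <vector>

int main() {
	// Relation sets that exist in the manager; {1, 3} is absent, {1, 4} exists.
	std::set<std::set<uint64_t>> existing_sets {{1}, {3}, {4}, {1, 4}};
	std::set<uint64_t> candidate {1};
	std::vector<std::set<uint64_t>> probed; // hypernodes whose neighbors we inspect
	for (uint64_t next_id : {3, 4}) {
		candidate.insert(next_id);
		if (!existing_sets.count(candidate)) {
			candidate.erase(next_id);
			continue; // was: break, which abandoned {1, 4} as soon as {1, 3} was missing
		}
		probed.push_back(candidate);
	}
	// With continue, probed == {{1, 4}}; with the original break, probed is empty.
	return 0;
}
```

Note that once a set is found, the candidate keeps growing: if {1, 3} did exist, the loop would extend it to {1, 3, 4} and still never probe {1, 4} (see lokax's counter-example below).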


lokax commented May 5, 2023

Yup, this was my first fix, because there is no HyperNode{1, 3} in this example. But when HyperNode{1, 3} exists, we still can't get the neighbors of HyperNode{1, 4}.

Are there unit tests for the Join Order Optimizer? It looks like sqllogictest is not easy to use here because of the other optimizers.


Tmonster commented May 5, 2023

I've updated my PR to run a test case with all optimisers enabled. If you run EXPLAIN ANALYZE you can see the join plan, and you can see that there are no cross products.

If you can produce a different query that fails, I'd be happy to look into this more, but if my fix passes the tests I'm going to open it against master.


lokax commented May 6, 2023

I changed the break to a continue. A new test shows that's not the right fix.

With the other optimizers disabled:

SELECT * FROM test0, test1, test2, test3, test4 WHERE test1.range + test4.range = test2.range AND test1.range + test4.range = test3.range AND test1.range = test4.range AND test1.range = test0.range AND test1.range + test3.range = test0.range; 
...
node = [1, 3, 4]
exclusion_set = 3, 4, 1, 0,
neighbors = [], should be [2]
...

> When HyperNode{1, 3} exists, we still can't get the neighbors of HyperNode{1, 4}.

In this example, it still doesn't look up HyperNode{1, 4}'s neighbors.

> Are there unit tests for the Join Order Optimizer? It looks like sqllogictest is not easy to use here because of the other optimizers.

Even though the other optimizers are disabled and the neighbors of HyperNode{1, 4} are not found, we still don't generate a cross product, so pragma debug_force_no_cross_product=true; is not enough to test this. Maybe we should add some unit tests for GetNeighbors(...).
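
A unit test along those lines might look like the sketch below. This is only an illustration: ToyGraph models the behaviour we want from GetNeighbors, while a real test would have to construct the optimizer's actual QueryGraph; the Catch2 macros are what DuckDB's unittest binary uses.

```cpp
#include "catch.hpp"

#include <cstdint>
#include <set>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy stand-in for the query graph: undirected hyperedges between relation sets.
struct ToyGraph {
	std::vector<std::pair<std::set<uint64_t>, std::set<uint64_t>>> edges;

	std::vector<uint64_t> GetNeighbors(const std::set<uint64_t> &node,
	                                   const std::unordered_set<uint64_t> &exclusion_set) const {
		auto contained = [&](const std::set<uint64_t> &s) {
			for (auto r : s) {
				if (!node.count(r)) {
					return false;
				}
			}
			return true;
		};
		std::set<uint64_t> out;
		for (auto &edge : edges) {
			// An edge contributes its far side when its near side lies inside `node`.
			if (contained(edge.first)) {
				for (auto r : edge.second) {
					if (!exclusion_set.count(r)) {
						out.insert(r);
					}
				}
			}
			if (contained(edge.second)) {
				for (auto r : edge.first) {
					if (!exclusion_set.count(r)) {
						out.insert(r);
					}
				}
			}
		}
		return {out.begin(), out.end()};
	}
};

TEST_CASE("GetNeighbors sees edges attached to hypernodes", "[join-order]") {
	// The graph from the repro: [0]<->[1], [4]<->[1], [3]<->[1,4], [2]<->[1,4]
	ToyGraph graph {{{{0}, {1}}, {{4}, {1}}, {{3}, {1, 4}}, {{2}, {1, 4}}}};
	// node = [1, 3, 4], exclusion_set = [0, 1, 3, 4]: relation 2 is reachable
	// only through the hypernode {1, 4}, so it must still be reported.
	auto neighbors = graph.GetNeighbors({1, 3, 4}, {0, 1, 3, 4});
	REQUIRE(neighbors == std::vector<uint64_t> {2});
}
```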

> DuckDB won't be slower, since we stop emitting pairs at 10,000 and fall back to a greedy join ordering algorithm.

Simply changing the break to a continue makes it possible for the join optimizer not to find the optimal join tree. Not getting an optimal join tree is not a big deal in itself, but here it happens before we switch the optimizer from DPhyp to GOO. I think it's a bug we have to deal with.

> I think the problem is actually somewhere else, in the function GetNeighbors and subsequently EnumerateNeighbors.

I agree; my fix is also in this function. But I should point out that this PR contains two fixes:

  1. A fix for repeatedly enumerating the same node, as opposed to repeatedly calling EmitPair. The DPhyp paper is missing some code in EmitCSG; the Query Compiler document fixes this: https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf (see issue #3475 and PR "optimize CSG & CMP enumeration of join order optimizer" #3652).
  2. A fix for the HyperNode neighbors we didn't look for.
  3. It should be noted that I never said that the current implementation emits the same pair multiple times, only that we call the same function for the same set of nodes multiple times.


Tmonster commented May 8, 2023

> I changed the break to a continue. A new test shows that's not the right fix.

Yes, you're right, my implementation was wrong. I didn't disable all the other optimisers, and other edges got optimised in a way that still made my tests pass.

> Even though the other optimizers are disabled and the neighbors of HyperNode{1, 4} are not found, we still don't generate a cross product, so pragma debug_force_no_cross_product=true; is not enough to test this. Maybe we should add some unit tests for GetNeighbors(...).

DuckDB doesn't have unit tests for many of our functions. We prefer SQLLogic tests because they test the behaviour of the whole system across many different features. They also don't require recompiling code, and if anybody refactors someone else's code, the SQLLogic tests remain valid tests even if the functions change.

> 1. A fix for repeatedly enumerating the same node, as opposed to repeatedly calling EmitPair. The DPhyp paper is missing some code in EmitCSG; the Query Compiler document fixes this: https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf (see issue #3475 and PR "optimize CSG & CMP enumeration of join order optimizer" #3652).

I would still appreciate more information on this. Can you link a specific page in the document? Also, can you add more D_ASSERTs to check and make sure we aren't emitting extra pairs?

Also, please add the unit test file that I have in my PR. It may not fully test your second fix, but the first query does error on master in debug mode if you have pragma debug_force_no_cross_product=true;


lokax commented May 12, 2023

@Tmonster https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf?#page=389

@Tmonster Tmonster left a comment

```
@@ -474,6 +490,16 @@ bool JoinOrderOptimizer::EnumerateCSGRecursive(JoinRelationSet &node, unordered_
	for (idx_t i = 0; i < neighbors.size(); i++) {
		// Reset the exclusion set so that the algorithm considers all combinations
		// of the exclusion_set with a subset of neighbors.

		// FIXME(lokax): This looks like there is a problem with duplicated enumeration?
```
@Tmonster: You are fixing this in this PR, right? Do you still need the comment then?

@lokax: No. Maybe I should remove this comment. When I have time, I will try to validate my thoughts. :)

@lokax lokax marked this pull request as draft May 19, 2023 14:51

lokax commented May 19, 2023

> A fix for repeatedly enumerating the same node, as opposed to repeatedly calling EmitPair.
> It should be noted that I never said that the current implementation emits the same pair multiple times, only that we call the same function for the same set of nodes multiple times.

@Tmonster I was wrong here.

EXPLAIN ANALYZE SELECT * FROM t1, t2, t3, t4, t5, t6, t7;

before:
PairNum = 21295
Join Order: 0.0326s (GOO code removed, only DPhyp; after fixing the bug we no longer used GOO)

after:
PairNum = 4016
Join Order: 0.0060s

@lokax lokax marked this pull request as ready for review May 19, 2023 15:15
@lokax lokax marked this pull request as draft May 19, 2023 16:18

lokax commented May 19, 2023

```cpp
JoinOrderOptimizer::EnumerateCSGRecursive(...) {
	// FIXME(lokax): This looks like there is a problem with duplicated enumeration?
	// e.g. S1 = {R0}, neighbors = {R1, R2}
	// First, S1 = {R0} ==> {R0, R1} ==> {R0, R1, R2}
	// Then, S1 = {R0} ==> {R0, R2} ==> {R0, R1, R2}
	// S1 = {R0, R1, R2} is enumerated twice.
	// The re-visit is necessary for correctness, since the plan for {R0, R1, R2} may be
	// updated, but it is still duplicated enumeration. Maybe we should get all subsets
	// of the neighbors and traverse them from small to large, with new_exclusion_set =
	// (exclusion_set U all neighbors), e.g. S1 = {R0} ==> {R0, R1} ==> {R0, R2} ==> {R0, R1, R2}
}
```

When n is 7, the perfect PairNumber for a clique join graph is 966 instead of 4016. I tried to fix this locally, and now the PairNumber is 966. I will push it later.
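
(As a sanity check: for a clique of n relations, the number of csg-cmp pairs is (3^n - 2^(n+1) + 1) / 2, which for n = 7 gives (2187 - 256 + 1) / 2 = 966, matching the number above.) A sketch of the subset-based traversal described in the FIXME (each neighbor subset is visited exactly once instead of being regrown along every path):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Visit every non-empty subset of `neighbors` exactly once by counting a
// bitmask. Growing S through single neighbors reaches {R0, R1, R2} twice
// (via {R0, R1} and via {R0, R2}); enumerating subsets directly does not.
void ForEachNeighborSubset(const std::vector<uint64_t> &neighbors,
                           const std::function<void(const std::vector<uint64_t> &)> &fn) {
	const uint64_t count = uint64_t(1) << neighbors.size();
	for (uint64_t mask = 1; mask < count; mask++) {
		std::vector<uint64_t> subset;
		for (size_t i = 0; i < neighbors.size(); i++) {
			if (mask & (uint64_t(1) << i)) {
				subset.push_back(neighbors[i]);
			}
		}
		// e.g. emit / recurse on S U subset, with the exclusion set extended
		// by all of `neighbors` so deeper levels cannot rediscover them.
		fn(subset);
	}
}
```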

@lokax lokax marked this pull request as ready for review May 21, 2023 05:36

lokax commented May 21, 2023

Now:
PairNum = 966
Join Order: 0.0016s

@lokax lokax requested a review from Tmonster May 21, 2023 05:40
@Tmonster Tmonster left a comment

Came back from vacation and started looking at this again. One of the imdb queries is regressing and local testing on the imdb dataset can reproduce this.

I believe the presence of the bug is the reason a better join tree is selected on master: one of the previously missing/not considered neighbors produces a plan that is better on paper, but after execution it clearly is not.

There are a number of improvements to the imdb plans. They don't show up on the CI because the sum of all hash join cardinalities is too low to cause a noticeable difference in the timing of the query execution. I'm working on a PR to refactor the join ordering code to make differences in plan selection easier to debug.

@Mytherin
Once you change the cardinality_is_higher() function in scripts/plan_cost_runner.py to compare exact cardinality counts instead of only flagging differences over 20%, the output looks like this (https://github.com/duckdb/duckdb/blob/master/scripts/plan_cost_runner.py#L101). The bug that required the 20% tolerance seems to have been fixed, but I'm not sure when/where.

====================================================
===========     IMPROVEMENTS DETECTED     ==========
====================================================

Query: 22a
Old cost: 742913
New cost: 737733

Query: 22b
Old cost: 370230
New cost: 369964

Query: 22c
Old cost: 3570607
New cost: 3526186

Query: 22d
Old cost: 7242514
New cost: 6714659

Query: 23a
Old cost: 1515746
New cost: 1509989

Query: 23c
Old cost: 1625297
New cost: 1603453

Query: 27a
Old cost: 13647
New cost: 13627

Query: 30c
Old cost: 1351374
New cost: 1349011

====================================================
===========     REGRESSIONS DETECTED     ===========
====================================================

Query: 19d
Old cost: 15005406
New cost: 17461235

Query: 23b
Old cost: 2536
New cost: 2546

Query: 27b
Old cost: 12374
New cost: 12431

Query: 27c
Old cost: 13393
New cost: 13764

Once my comments are addressed I'm OK with approving & merging.

```
@@ -65,4 +65,53 @@ class JoinRelationSetManager {
	JoinRelationTreeNode root;
};

class NeighborSubset {
```

@Tmonster: Hmm, I feel like this doesn't need to be its own class. I think the JoinRelationSetManager class can have a &GetAllNeighborsSubset(relations) function? Then you can use it in EnumerateCmpRecursive, EnumerateCSGRecursive, and UpdateDPTree().

GetAllNeighborSets() in join_order_optimizer.cpp has similar logic in terms of creating all the subsets, but is a static function.
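
For concreteness, the suggested helper might be declared roughly like this (a hypothetical shape only, mirroring the existing static GetAllNeighborSets rather than DuckDB's actual headers):

```cpp
#include <functional>
#include <vector>

struct JoinRelationSet; // stands in for duckdb's JoinRelationSet

class JoinRelationSetManager {
public:
	// Return every non-empty subset of `relations` as managed relation sets,
	// so EnumerateCmpRecursive, EnumerateCSGRecursive and UpdateDPTree can
	// share one implementation instead of a separate NeighborSubset class.
	std::vector<std::reference_wrapper<JoinRelationSet>> GetAllNeighborsSubset(JoinRelationSet &relations);
};
```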

```
union_sets.reserve(neighbors.size());
for (idx_t i = 0; i < neighbors.size(); i++) {
	auto &neighbor = set_manager.GetJoinRelation(neighbors[i]);
union_sets.reserve(all_subset.Size());
```
@Tmonster: Why do we need all the subsets of the neighbors of the right relation? If I'm not mistaken, it is not guaranteed that all subsets have a connection with node. I feel like we will end up generating all of these subsets, but many of them will be ignored because the check if (plans.find(&new_set) != plans.end()) will be false.

@Tmonster

This looks good to me. The regression for imdb isn't ideal, but there are 8 queries that improve. I have created a PR to remove the 20% check for join regressions here: #7989. This should then show that there are more improvements than regressions.

I don't think the regression is a big deal. It's another case where our selectivity estimation is incorrect, but hopefully that will be improved in another 3 months. In addition, a number of other queries are improved, and with larger datasets this may mean better join orders where the saved execution time is more significant.

@Mytherin Mytherin changed the base branch from master to feature June 19, 2023 14:32
@Mytherin

Could you just have a look at the failing code coverage CI?


lokax commented Jun 19, 2023

> Could you just have a look at the failing code coverage CI?

It seems that, because of the bug that was fixed, left_set and right_set no longer intersect, so that code path is never hit.


Mytherin commented Jun 19, 2023

Could you either delete the code and replace it with an internal exception/assertion (if it is no longer required) or add a new test that triggers the behavior again?
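
A minimal sketch of that suggestion (D_ASSERT and InternalException are DuckDB's existing mechanisms; the intersection helper and set names here are illustrative, not a real API):

```cpp
// If the fix guarantees the sets are disjoint, replace the dead branch with a guard:
D_ASSERT(!RelationSetsIntersect(left_set, right_set)); // checked in debug builds
// or, if it should also be caught in release builds:
if (RelationSetsIntersect(left_set, right_set)) {
	throw InternalException("left_set and right_set should be disjoint");
}
```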


lokax commented Jun 23, 2023

@Mytherin

@Mytherin Mytherin merged commit aa4a965 into duckdb:feature Jun 23, 2023
@Mytherin

Thanks!
