
Conversation

@lokax lokax commented May 4, 2023

We need to iterate over the neighbors of all subsets of a node, so that we do not lose any HyperNode neighbors.

SELECT * FROM test0, test1, test2, test3, test4 WHERE test1.x + test4.x = test2.x AND test1.x + test4.x = test3.x AND test1.x = test4.x AND test1.x = test0.x;

Disable all other optimizers, to avoid predicate deduction during predicate pushdown.

before:
S1 = {test0}, S2 = {test1} ==> {test1, test4} ==> {test1, test2, test4} ==> we can't reach {test1, test2, test3, test4},
so we need to generate a cross product.
after:
S1 = {test0}, S2 = {test1} ==> {test1, test4} ==> {test1, test2, test4} ==> because the subset {test1, test4} has the neighbor {test3} ==> {test1, test2, test3, test4}.
We can get the final plan JoinNode {S1, S2}.
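
For illustration, a minimal self-contained sketch of the idea (a toy model; the real implementation stores relation sets in a trie inside the relation set manager):

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using RelationSet = std::set<uint64_t>;
// Edges keyed by the exact relation set they hang off, e.g. the predicate
// test1.x + test4.x = test2.x produces a hyperedge {1, 4} -> {2}.
using EdgeMap = std::map<RelationSet, std::vector<RelationSet>>;

// Collect the neighbors of every non-empty subset of `node`, skipping
// relations in `exclusion_set`. An edge attached to the hypernode {1, 4}
// is only found if the subset {1, 4} itself is probed, not just the
// singletons {1} and {4}.
std::set<uint64_t> GetAllSubsetNeighbors(const std::vector<uint64_t> &node, const EdgeMap &edges,
                                         const std::set<uint64_t> &exclusion_set) {
	std::set<uint64_t> neighbors;
	const uint64_t subset_count = uint64_t(1) << node.size();
	for (uint64_t mask = 1; mask < subset_count; mask++) { // every non-empty subset
		RelationSet subset;
		for (size_t i = 0; i < node.size(); i++) {
			if (mask & (uint64_t(1) << i)) {
				subset.insert(node[i]);
			}
		}
		auto it = edges.find(subset);
		if (it == edges.end()) {
			continue;
		}
		for (auto &target : it->second) {
			for (auto rel : target) {
				if (!exclusion_set.count(rel)) {
					neighbors.insert(rel);
				}
			}
		}
	}
	return neighbors;
}
```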

The DPhyp paper is missing some pseudocode in EmitCSG, so cyclic queries lead to duplicated enumeration.
We need to update the exclusion_set to prevent this. Example below:

All tables are connected. S2 = {R0}, neighbors {R1}, {R2}, {R3}

  1. S2 = {R0, R3}, exclusion_set = {R0, R1, R2} ==> End.
  2. S2 = {R0, R2}, exclusion_set = {R0, R1} ==> {R0, R2, R3}
  3. S2 = {R0, R1}, exclusion_set = {R0} ==> neighbors = {R2}, {R3} ==> {R0, R1, R2}, {R0, R1, R3}
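
A compact sketch of that exclusion_set update (simplified to growing the set by one neighbor at a time; the real code works on neighbor subsets inside the CSG/CMP enumeration):

```cpp
#include <cstdint>
#include <set>
#include <vector>

using RelationSet = std::set<uint64_t>;

// Grow S by each neighbor in ascending order; once a neighbor has been
// handled, it joins the exclusion set of the remaining branches. In the
// clique example this yields exactly the three branches listed above:
// R1 with X = {R0}, R2 with X = {R0, R1}, R3 with X = {R0, R1, R2},
// so {R0, R1, R2} is only ever reached through the R1 branch.
void EnumerateGrowth(const RelationSet &S, RelationSet X,
                     const std::vector<uint64_t> &neighbors) {
	for (auto n : neighbors) {
		if (X.count(n) || S.count(n)) {
			continue;
		}
		RelationSet grown = S;
		grown.insert(n);
		// EmitCsg(grown) / recurse on grown with exclusion set X here ...
		X.insert(n); // the exclusion_set update under discussion
	}
}
```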

@Mytherin Mytherin requested a review from Tmonster May 4, 2023 05:01

Tmonster commented May 4, 2023

Hi Lokax,

Thanks for finding this bug! I can confirm that this is an issue, but I'm not convinced this is the correct fix. I need to go back and review the paper, but I'm pretty sure we only consider neighbors with lower ids so that the algorithm doesn't emit the same pairs multiple times. If we consistently consider all neighbors, we will end up emitting the same join pairs more than once, and the algorithm can become an order of magnitude less efficient. DuckDB won't be slower, since we stop emitting pairs at 10,000 and fall back to a greedy join ordering algorithm.

I think the problem is actually somewhere else, in the function GetNeighbors and subsequently EnumerateNeighbors. I did some digging myself, and I found the following.
When we create the join graph we have the following relations:

[0], [1], [2], [3], [4]

and the following edges.

[0] <-> [1]
[4] <-> [1]
[3] <-> [1, 4]
[2] <-> [1, 4]

You can verify these with a Print() or ToString().

On line 114 of query_graph.cpp, I can step through the execution and I see the following calls and output:

node = [1, 4], exclusion_set = [4, 1, 0]
neighbors returned = [2, 3]

node = [1, 4], exclusion_set = [4, 0]
neighbors returned = [1, 3, 2]

node = [1, 3, 4], exclusion_set = [4, 3, 1, 0]
neighbors returned = [] <- *should be [2]*

node = [1, 2, 4], exclusion_set = [4, 2, 1, 0]
neighbors returned = [] <- *should be [3]*

The fact that no neighbors are returned for the last two calls makes me believe that we aren't enumerating our graph correctly. I can dig deeper into this, but maybe you want to have a go at it? Let me know if you need more information for debugging.

I would also like to add a test for this.
I know you can add a pragma debug_force_no_cross_product=true; to catch if a cross product is added when not needed, but with the example given, the other optimizers optimize away the problem. I think if we populate the tables this won't be the case.


lokax commented May 4, 2023

> If we consistently consider all neighbors, we will end up emitting the same join pairs more than once, and the algorithm can become an order of magnitude less efficient. DuckDB won't be slower, since we stop emitting pairs at 10,000 and fall back to a greedy join ordering algorithm.

Thanks.
The RelationSetMgr::UNION(...) always sorts relation ids and stores them in tries.
What I want to say is that we need to look for all of a HyperNode's neighbors. In the current implementation, we do not look for some HyperNodes' neighbors (e.g. HyperNode{1, 4}), because the nested loop can't enumerate all subsets.
e.g. node = [1, 3, 4], exclusion_set = [4, 3, 1, 0]
We will look for the neighbors of these subsets: SimpleNode{1}, SimpleNode{3}, SimpleNode{4}, HyperNode{1, 3}, HyperNode{1, 3, 4}, HyperNode{3, 4}. But HyperNode{1, 4} is missing.
I don't think this will emit the same pair multiple times, because we always use the smallest relation id and use a hash table for deduplication.
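
The shape of the bug fits in a few lines. This is a toy model of the nested loop, not the actual trie walk in EnumerateNeighbors:

```cpp
#include <cstdint>
#include <vector>

// A nested loop over the sorted relation ids only produces *contiguous* runs:
// for node = [1, 3, 4] it yields {1}, {1,3}, {1,3,4}, {3}, {3,4}, {4}, the
// same six subsets listed above, and can never form the subset {1, 4}.
std::vector<std::vector<uint64_t>> ContiguousSubsets(const std::vector<uint64_t> &node) {
	std::vector<std::vector<uint64_t>> result;
	for (size_t start = 0; start < node.size(); start++) {
		std::vector<uint64_t> subset;
		for (size_t end = start; end < node.size(); end++) {
			subset.push_back(node[end]);
			result.push_back(subset);
		}
	}
	return result;
}
```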


Tmonster commented May 4, 2023

Hmm, you may be right that it doesn't emit duplicate pairs. However, I still think the issue is in EnumerateNeighbors. I looked into it again and found that changing a break to a continue solves your issue.
If you take a look at my branch/PR here you can see my solution, along with a test case:
Tmonster#51

> We will look for the neighbors of these subsets: SimpleNode{1}, SimpleNode{3}, SimpleNode{4}, HyperNode{1, 3}, HyperNode{1, 3, 4}, HyperNode{3, 4}. But HyperNode{1, 4} is missing.

This is fixed in my branch. Currently we check whether hypernode {1, 3} exists, and when it doesn't, we move on to building nodes starting with {3}. Instead we should continue, and inspect whether there is a hypernode {1, 4} and whether that hypernode has neighbors.
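
Schematically, the proposed change behaves like this (a self-contained toy; existing_sets stands in for the trie in the relation set manager):

```cpp
#include <cstdint>
#include <set>
#include <vector>

int main() {
	// Relation sets that exist in the manager; {1, 3} is absent, {1, 4} exists.
	std::set<std::set<uint64_t>> existing_sets {{1}, {3}, {4}, {1, 4}};
	std::set<uint64_t> candidate {1};
	std::vector<std::set<uint64_t>> probed; // hypernodes whose neighbors we inspect
	for (uint64_t next_id : {3, 4}) {
		candidate.insert(next_id);
		if (!existing_sets.count(candidate)) {
			candidate.erase(next_id);
			continue; // was: break, which abandoned {1, 4} as soon as {1, 3} was missing
		}
		probed.push_back(candidate);
	}
	// With continue, probed == {{1, 4}}; with the original break, probed is empty.
	return 0;
}
```

Note that once a set is found, the candidate keeps growing: if {1, 3} did exist, the loop would extend it to {1, 3, 4} and still never probe {1, 4} (see lokax's counter-example below).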


lokax commented May 5, 2023

Yup, this was my first fix, because there is no HyperNode{1, 3} in this example. But when HyperNode{1, 3} exists, we still can't get the neighbors of HyperNode{1, 4}.

Are there unit tests for the Join Order Optimizer? It looks like sqllogictest is not easy to use here because of the other optimizers.


Tmonster commented May 5, 2023

I've updated my PR to run a test case with all optimisers enabled. If you run EXPLAIN ANALYZE you can see the join plan, and you can see that there are no cross products.

If you can produce a different query that fails, I'd be happy to look into this more, but if my fix passes the tests I'm going to open it against master.


lokax commented May 6, 2023

I changed the break to a continue. A new test shows that's not the right fix.

With the other optimizers disabled:

SELECT * FROM test0, test1, test2, test3, test4 WHERE test1.range + test4.range = test2.range AND test1.range + test4.range = test3.range AND test1.range = test4.range AND test1.range = test0.range AND test1.range + test3.range = test0.range; 
...
node = [1, 3, 4]
exclusion_set = 3, 4, 1, 0,
neighbors = [], should be [2]
...

> When HyperNode{1, 3} exists, we still can't get the neighbors of HyperNode{1, 4}.

In this example, it still doesn't look up HyperNode{1, 4}'s neighbors.

> Are there unit tests for the Join Order Optimizer? It looks like sqllogictest is not easy to use here because of the other optimizers.

Even though the other optimizers are disabled and the neighbors of HyperNode{1, 4} are not found, we still don't generate a cross product, so pragma debug_force_no_cross_product=true; is not enough to test this. Maybe we should add some unit tests for GetNeighbors(...).
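
A unit test along those lines might look like the sketch below. This is only an illustration: ToyGraph models the behaviour we want from GetNeighbors, while a real test would have to construct the optimizer's actual QueryGraph; the Catch2 macros are what DuckDB's unittest binary uses.

```cpp
#include "catch.hpp"

#include <cstdint>
#include <set>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy stand-in for the query graph: undirected hyperedges between relation sets.
struct ToyGraph {
	std::vector<std::pair<std::set<uint64_t>, std::set<uint64_t>>> edges;

	std::vector<uint64_t> GetNeighbors(const std::set<uint64_t> &node,
	                                   const std::unordered_set<uint64_t> &exclusion_set) const {
		auto contained = [&](const std::set<uint64_t> &s) {
			for (auto r : s) {
				if (!node.count(r)) {
					return false;
				}
			}
			return true;
		};
		std::set<uint64_t> out;
		for (auto &edge : edges) {
			// An edge contributes its far side when its near side lies inside `node`.
			if (contained(edge.first)) {
				for (auto r : edge.second) {
					if (!exclusion_set.count(r)) {
						out.insert(r);
					}
				}
			}
			if (contained(edge.second)) {
				for (auto r : edge.first) {
					if (!exclusion_set.count(r)) {
						out.insert(r);
					}
				}
			}
		}
		return {out.begin(), out.end()};
	}
};

TEST_CASE("GetNeighbors sees edges attached to hypernodes", "[join-order]") {
	// The graph from the repro: [0]<->[1], [4]<->[1], [3]<->[1,4], [2]<->[1,4]
	ToyGraph graph {{{{0}, {1}}, {{4}, {1}}, {{3}, {1, 4}}, {{2}, {1, 4}}}};
	// node = [1, 3, 4], exclusion_set = [0, 1, 3, 4]: relation 2 is reachable
	// only through the hypernode {1, 4}, so it must still be reported.
	auto neighbors = graph.GetNeighbors({1, 3, 4}, {0, 1, 3, 4});
	REQUIRE(neighbors == std::vector<uint64_t> {2});
}
```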

> DuckDB won't be slower, since we stop emitting pairs at 10,000 and fall back to a greedy join ordering algorithm.

Simply changing the break to a continue makes it possible for the join optimizer not to find the optimal join tree. Not getting an optimal join tree is not a big deal in itself, but here it happens before we switch the optimizer from DPhyp to GOO. I think it's a bug we have to deal with.

> I think the problem is actually somewhere else, in the function GetNeighbors and subsequently EnumerateNeighbors.

I agree; my fix is also in this function. But I should point out that this PR contains two fixes:

  1. A fix for repeatedly enumerating the same node, as opposed to repeatedly calling EmitPair. The DPhyp paper is missing some code in EmitCSG; the Query Compiler document fixes this: https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf (see issue #3475 and PR "optimize CSG & CMP enumeration of join order optimizer" #3652).
  2. A fix for the HyperNode neighbors we didn't look for.
  3. It should be noted that I never said that the current implementation emits the same pair multiple times, only that we call the same function for the same set of nodes multiple times.


Tmonster commented May 8, 2023

> I changed the break to a continue. A new test shows that's not the right fix.

Yes, you're right, my implementation was wrong. I didn't disable all the other optimisers, and other edges got optimised in a way that still made my tests pass.

> Even though the other optimizers are disabled and the neighbors of HyperNode{1, 4} are not found, we still don't generate a cross product, so pragma debug_force_no_cross_product=true; is not enough to test this. Maybe we should add some unit tests for GetNeighbors(...).

DuckDB doesn't have unit tests for many of our functions. We prefer SQLLogic tests because they test the behaviour of the whole system across many different features. They also don't require recompiling code, and if anybody refactors someone else's code, the SQLLogic tests remain valid tests even if the functions change.

> 1. A fix for repeatedly enumerating the same node, as opposed to repeatedly calling EmitPair. The DPhyp paper is missing some code in EmitCSG; the Query Compiler document fixes this: https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf (see issue #3475 and PR "optimize CSG & CMP enumeration of join order optimizer" #3652).

I would still appreciate more information on this. Can you link a specific page in the document? Also, can you add more D_ASSERTs to check and make sure we aren't emitting extra pairs?

Also, please add the unit test file that I have in my PR. It may not fully test your second fix, but the first query does error on master in debug mode if you have pragma debug_force_no_cross_product=true;


lokax commented May 12, 2023

@Tmonster https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pdf?#page=389

@Tmonster Tmonster left a comment

```
@@ -474,6 +490,16 @@ bool JoinOrderOptimizer::EnumerateCSGRecursive(JoinRelationSet &node, unordered_
	for (idx_t i = 0; i < neighbors.size(); i++) {
		// Reset the exclusion set so that the algorithm considers all combinations
		// of the exclusion_set with a subset of neighbors.

		// FIXME(lokax): This looks like there is a problem with duplicated enumeration?
```
@Tmonster: You are fixing this in this PR, right? Do you still need the comment then?

@lokax: No. Maybe I should remove this comment. When I have time, I will try to validate my thoughts. :)

@lokax lokax marked this pull request as draft May 19, 2023 14:51

lokax commented May 19, 2023

> A fix for repeatedly enumerating the same node, as opposed to repeatedly calling EmitPair.
> It should be noted that I never said that the current implementation emits the same pair multiple times, only that we call the same function for the same set of nodes multiple times.

@Tmonster I was wrong here.

EXPLAIN ANALYZE SELECT * FROM t1, t2, t3, t4, t5, t6, t7;

before:
PairNum = 21295
Join Order: 0.0326s (GOO code removed, only DPhyp; after fixing the bug we no longer used GOO)

after:
PairNum = 4016
Join Order: 0.0060s

@lokax lokax marked this pull request as ready for review May 19, 2023 15:15
@lokax lokax marked this pull request as draft May 19, 2023 16:18

lokax commented May 19, 2023

```cpp
JoinOrderOptimizer::EnumerateCSGRecursive(...) {
	// FIXME(lokax): This looks like there is a problem with duplicated enumeration?
	// e.g. S1 = {R0}, neighbors = {R1, R2}
	// First, S1 = {R0} ==> {R0, R1} ==> {R0, R1, R2}
	// Then, S1 = {R0} ==> {R0, R2} ==> {R0, R1, R2}
	// S1 = {R0, R1, R2} is enumerated twice.
	// The re-visit is necessary for correctness, since the plan for {R0, R1, R2} may be
	// updated, but it is still duplicated enumeration. Maybe we should get all subsets
	// of the neighbors and traverse them from small to large, with new_exclusion_set =
	// (exclusion_set U all neighbors), e.g. S1 = {R0} ==> {R0, R1} ==> {R0, R2} ==> {R0, R1, R2}
}
```

When n is 7, the perfect PairNumber for a clique join graph is 966 instead of 4016. I tried to fix this locally, and now the PairNumber is 966. I will push it later.
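
(As a sanity check: for a clique of n relations, the number of csg-cmp pairs is (3^n - 2^(n+1) + 1) / 2, which for n = 7 gives (2187 - 256 + 1) / 2 = 966, matching the number above.) A sketch of the subset-based traversal described in the FIXME (each neighbor subset is visited exactly once instead of being regrown along every path):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Visit every non-empty subset of `neighbors` exactly once by counting a
// bitmask. Growing S through single neighbors reaches {R0, R1, R2} twice
// (via {R0, R1} and via {R0, R2}); enumerating subsets directly does not.
void ForEachNeighborSubset(const std::vector<uint64_t> &neighbors,
                           const std::function<void(const std::vector<uint64_t> &)> &fn) {
	const uint64_t count = uint64_t(1) << neighbors.size();
	for (uint64_t mask = 1; mask < count; mask++) {
		std::vector<uint64_t> subset;
		for (size_t i = 0; i < neighbors.size(); i++) {
			if (mask & (uint64_t(1) << i)) {
				subset.push_back(neighbors[i]);
			}
		}
		// e.g. emit / recurse on S U subset, with the exclusion set extended
		// by all of `neighbors` so deeper levels cannot rediscover them.
		fn(subset);
	}
}
```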

@lokax lokax marked this pull request as ready for review May 21, 2023 05:36

lokax commented May 21, 2023

Now:
PairNum = 966
Join Order: 0.0016s

@lokax lokax requested a review from Tmonster May 21, 2023 05:40
@Tmonster Tmonster left a comment

Came back from vacation and started looking at this again. One of the imdb queries is regressing and local testing on the imdb dataset can reproduce this.

I believe the presence of the bug is the reason a better join tree is selected on master: one of the previously missing/not considered neighbors produces a plan that is better on paper, but after execution it clearly is not.

There are a number of improvements to the imdb plans. They don't show up on the CI because the sum of all hash join cardinalities is too low to cause a noticeable difference in the timing of the query execution. I'm working on a PR to refactor the join ordering code to make differences in plan selection easier to debug.

@Mytherin
Once you change the cardinality_is_higher() function in scripts/plan_cost_runner.py to compare exact cardinality counts instead of only flagging differences over 20%, the output looks like this (https://github.com/duckdb/duckdb/blob/master/scripts/plan_cost_runner.py#L101). The bug that required the 20% tolerance seems to have been fixed, but I'm not sure when/where.

====================================================
===========     IMPROVEMENTS DETECTED     ==========
====================================================

Query: 22a
Old cost: 742913
New cost: 737733

Query: 22b
Old cost: 370230
New cost: 369964

Query: 22c
Old cost: 3570607
New cost: 3526186

Query: 22d
Old cost: 7242514
New cost: 6714659

Query: 23a
Old cost: 1515746
New cost: 1509989

Query: 23c
Old cost: 1625297
New cost: 1603453

Query: 27a
Old cost: 13647
New cost: 13627

Query: 30c
Old cost: 1351374
New cost: 1349011

====================================================
===========     REGRESSIONS DETECTED     ===========
====================================================

Query: 19d
Old cost: 15005406
New cost: 17461235

Query: 23b
Old cost: 2536
New cost: 2546

Query: 27b
Old cost: 12374
New cost: 12431

Query: 27c
Old cost: 13393
New cost: 13764

Once my comments are addressed I'm OK with approving & merging.

```
@@ -65,4 +65,53 @@ class JoinRelationSetManager {
	JoinRelationTreeNode root;
};

class NeighborSubset {
```

@Tmonster: Hmm, I feel like this doesn't need to be its own class. I think the JoinRelationSetManager class can have a &GetAllNeighborsSubset(relations) function? Then you can use it in EnumerateCmpRecursive, EnumerateCSGRecursive, and UpdateDPTree().

GetAllNeighborSets() in join_order_optimizer.cpp has similar logic in terms of creating all the subsets, but is a static function.
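
For concreteness, the suggested helper might be declared roughly like this (a hypothetical shape only, mirroring the existing static GetAllNeighborSets rather than DuckDB's actual headers):

```cpp
#include <functional>
#include <vector>

struct JoinRelationSet; // stands in for duckdb's JoinRelationSet

class JoinRelationSetManager {
public:
	// Return every non-empty subset of `relations` as managed relation sets,
	// so EnumerateCmpRecursive, EnumerateCSGRecursive and UpdateDPTree can
	// share one implementation instead of a separate NeighborSubset class.
	std::vector<std::reference_wrapper<JoinRelationSet>> GetAllNeighborsSubset(JoinRelationSet &relations);
};
```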

```
union_sets.reserve(neighbors.size());
for (idx_t i = 0; i < neighbors.size(); i++) {
	auto &neighbor = set_manager.GetJoinRelation(neighbors[i]);
union_sets.reserve(all_subset.Size());
```
@Tmonster: Why do we need all the subsets of the neighbors of the right relation? If I'm not mistaken, it is not guaranteed that all subsets have a connection with node. I feel like we will end up generating all of these subsets, but many of them will be ignored because the check if (plans.find(&new_set) != plans.end()) will be false.

@Tmonster

This looks good to me. The regression for imdb isn't ideal, but there are 8 queries that improve. I have created a PR to remove the 20% check for join regressions here: #7989. This should then show that there are more improvements than regressions.

I don't think the regression is a big deal. It's another case where our selectivity estimation is incorrect, but hopefully that will be improved in another 3 months. In addition, a number of other queries are improved, and with larger datasets this may mean better join orders where the saved execution time is more significant.

@Mytherin Mytherin changed the base branch from master to feature June 19, 2023 14:32
@Mytherin

Could you just have a look at the failing code coverage CI?


lokax commented Jun 19, 2023

> Could you just have a look at the failing code coverage CI?

It seems that, because of the bug that was fixed, left_set and right_set no longer intersect, so that code path is never hit.


Mytherin commented Jun 19, 2023

Could you either delete the code and replace it with an internal exception/assertion (if it is no longer required) or add a new test that triggers the behavior again?
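
A minimal sketch of that suggestion (D_ASSERT and InternalException are DuckDB's existing mechanisms; the intersection helper and set names here are illustrative, not a real API):

```cpp
// If the fix guarantees the sets are disjoint, replace the dead branch with a guard:
D_ASSERT(!RelationSetsIntersect(left_set, right_set)); // checked in debug builds
// or, if it should also be caught in release builds:
if (RelationSetsIntersect(left_set, right_set)) {
	throw InternalException("left_set and right_set should be disjoint");
}
```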


lokax commented Jun 23, 2023

@Mytherin

@Mytherin Mytherin merged commit aa4a965 into duckdb:feature Jun 23, 2023
@Mytherin

Thanks!
