Fix EXPORT/IMPORT DATABASE `schema.sql` order #8619

Tishj · 2023-08-18T14:32:36Z

This PR addresses #8496

What is the problem and what is the proposed solution?

Previously we used an ad-hoc solution to approximately order the catalog entries. It has some outliers, which resulted in issues like the one linked above.

What this PR aims to do is replace this ad-hoc solution with a hopefully foolproof solution which relies on the DependencyManager to guide this ordering.

CatalogEntryRetriever

After adding this DependencyManager-driven ordering of catalog entries, it became apparent that the DependencyManager wasn't being utilized as well as it should be.
Dependencies for Indexes, Types, Macros, Generated Columns and Views were not being registered.

In the existing code path we already Bind these entries, to verify that they are correctly formed and their dependencies are present at the time of creation.
It does this by retrieving the required catalog entries, which is essentially listing out all the dependencies of the entry.

To make use of this fact I've added a CatalogEntryRetriever. This class has an optional callback, which is called for every successfully retrieved catalog entry.
We use this callback to populate a DependencyList in CreateInfo.
StandardEntry has a new virtual method, InherentDependencies which returns an empty DependencyList by default.

The relevant subclasses of this have been updated to hold a DependencyList, taken from the CreateInfo, and override this InherentDependencies method to return this list of dependencies.
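To illustrate the callback idea described above, here is a minimal sketch. It is not the code from this PR: `ToyCatalog`, `ToyEntry`, `ToyEntryRetriever`, `ToyDependencyList` and `CollectDependencies` are hypothetical stand-ins, and the real classes take more parameters (catalog type, schema, `OnEntryNotFound`, and so on).

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy stand-ins; none of these are DuckDB's real classes.
struct ToyEntry {
	std::string name;
};

using ToyDependencyList = std::vector<const ToyEntry *>;

class ToyCatalog {
public:
	void Add(const std::string &name) {
		entries[name] = ToyEntry {name};
	}
	const ToyEntry *Find(const std::string &name) const {
		auto it = entries.find(name);
		return it == entries.end() ? nullptr : &it->second;
	}

private:
	std::unordered_map<std::string, ToyEntry> entries;
};

// Wraps catalog lookups and reports every successful lookup to an optional callback.
class ToyEntryRetriever {
public:
	using callback_t = std::function<void(const ToyEntry &)>;

	explicit ToyEntryRetriever(const ToyCatalog &catalog) : catalog(catalog) {
	}
	void SetCallback(callback_t cb) {
		callback = std::move(cb);
	}
	const ToyEntry *GetEntry(const std::string &name) const {
		auto entry = catalog.Find(name);
		if (entry && callback) {
			callback(*entry); // only fires for entries that were actually found
		}
		return entry;
	}

private:
	const ToyCatalog &catalog;
	callback_t callback;
};

// "Binds" something that references the given names and returns the entries seen on the way.
ToyDependencyList CollectDependencies(const ToyCatalog &catalog, const std::vector<std::string> &referenced) {
	ToyDependencyList dependencies;
	ToyEntryRetriever retriever(catalog);
	retriever.SetCallback([&](const ToyEntry &entry) { dependencies.push_back(&entry); });
	for (auto &name : referenced) {
		retriever.GetEntry(name);
	}
	return dependencies;
}
```

The point of the sketch is that the dependency list falls out of lookups the binder performs anyway, so no separate dependency-extraction pass is needed.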

CREATE TYPE + EXPORT DATABASE

It seems exporting a TypeCatalogEntry created by the CREATE TYPE statement was not properly tested: when we EXPORT DATABASE, the ToSQL method is called, and this previously only supported ENUM. I've updated this to support every type we have.

Footnote

Since only the DuckCatalog has a DependencyManager, the old method still exists and will still need to be used by other Catalog types.


Tishj · 2023-08-18T14:35:02Z

src/catalog/dependency_manager.cpp

 	}
 	// erase the dependents and dependencies for this object
 	dependents_map.erase(object);
 	dependencies_map.erase(object);
 }

+void DependencyManager::PrintDependencyMap() {


These are for debugging purposes.
What would be the best way to keep them around so they can be useful in the future without causing unnecessary friction?

Would it be desirable to expose this (in some form) to SQL?

The idea is that it could then be tested, and it would be available there for debugging purposes.

That's not a bad idea, maybe we could make two table functions that output a column for every catalog entry, containing a VARCHAR[] containing the names of the dependents/dependencies

Maybe take an optional catalog VARCHAR parameter

Edit:

I thought about it a little more, this might be a better structure
Columns:

Name (of the entry)

Connections (the list of dependencies/dependents)

Type ("DEPENDENCIES" or "DEPENDENTS")

Catalog (name of catalog the entry belongs to)

Schema

Then we could fit it into a single table function

dependency_manager()

Btw I just found that we have duckdb_dependencies() table function, but it seems kind of lacking

Tishj · 2023-08-18T14:35:13Z

src/catalog/dependency_manager.cpp

+	catalog_entry_set_t entries;
+	catalog_entry_vector_t export_order;
+
+	PrintDependencyMap();


These need to be removed

Tishj · 2023-08-18T14:40:02Z

src/planner/binder/statement/bind_create.cpp

-			type = catalog->GetType(context, schema, user_type_name, OnEntryNotFound::RETURN_NULL);
+			auto entry = entry_retriever.GetEntry(CatalogType::TYPE_ENTRY, *catalog, schema, user_type_name,
+			                                      OnEntryNotFound::RETURN_NULL);
+			if (!entry) {


These might be a little too terse, but I didn't want to duplicate all of the "helper" functions that Catalog has.

In the future it might be better to move all of the catalog entry look up responsibility to CatalogEntryRetriever, that would also make the Catalog class a little lighter.

Tishj · 2023-08-18T14:42:59Z

src/include/duckdb/planner/binder.hpp

@@ -174,8 +179,8 @@ class Binder : public std::enable_shared_from_this<Binder> {
 	void BindOnConflictClause(LogicalInsert &insert, TableCatalogEntry &table, InsertStatement &stmt);

 	static void BindSchemaOrCatalog(ClientContext &context, string &catalog, string &schema);
-	static void BindLogicalType(ClientContext &context, LogicalType &type, optional_ptr<Catalog> catalog = nullptr,
-	                            const string &schema = INVALID_SCHEMA);
+	void BindLogicalType(LogicalType &type, optional_ptr<Catalog> catalog = nullptr,


This is no longer a static method because the binding of USER types performs a CatalogEntry lookup, and we want to run the CatalogEntryRetriever's callback on that lookup.

In the places that previously used this as a static method without being inside a Binder or having access to a binder I've created a binder on the spot.
In a lot of these places a binder was already created later, so I don't think any harm was done here.

Tishj · 2023-08-18T14:48:08Z

src/include/duckdb/catalog/dependency_manager.hpp

@@ -50,5 +53,13 @@ class DependencyManager {
 	void DropObject(CatalogTransaction transaction, CatalogEntry &object, bool cascade);
 	void AlterObject(CatalogTransaction transaction, CatalogEntry &old_obj, CatalogEntry &new_obj);
 	void EraseObjectInternal(CatalogEntry &object);
+
+	dependency_set_t &GetEntriesThatDependOnObject(CatalogEntry &object);


These methods were made in an attempt to make it easier to work with the dependencies_map and dependents_map, because it's logically a little hard to understand what connections they represent.

Tishj · 2023-08-18T14:51:00Z

src/execution/operator/persistent/physical_export.cpp

+		auto &dependency_manager = duck_catalog.GetDependencyManager();
+		catalog_entries = GetDependencyDrivenExportOrder(context, dependency_manager, *info, exported_tables);
+	} else {
+		catalog_entries = GetNaiveExportOrder(context, *info, exported_tables);


Sadly we can't get rid of this error-prone method entirely. I haven't seen any other classes that derive from Catalog, but this could likely be a point of interest for extensions later on.

Maybe GetExportOrder should be a virtual method, so it can be overridden by an extension catalog ?

Tishj · 2023-08-18T14:55:20Z

There are some methods related to dependency extraction in the binder code, such as
ExtractExpressionDependencies

Maybe these could also be removed in favor of the CatalogEntryRetriever callback method?
I've left them in for now, but they probably perform the same task.

carlopi · 2023-08-18T16:38:11Z

src/catalog/dependency_manager.cpp

+	queue<reference<CatalogEntry>> backlog;
+	// Populate the backlog with every entry in the dependencies map
+	for (auto &entry : dependencies_map) {
+		backlog.push(entry.first);
+	}
+
+	// First populate our backlog with every entry in dependencies_map
+	while (!backlog.empty()) {
+		auto &object = backlog.front();
+		backlog.pop();
+		if (entries.count(object)) {
+			// This entry has already been written
+			continue;
+		}
+		auto entry = dependencies_map.find(object);
+		if (entry == dependencies_map.end() || AllExportDependenciesWritten(object, entry->second, entries)) {
+			// All dependencies written, we can write this now
+			auto insert_result = entries.insert(object);
+			D_ASSERT(insert_result.second);
+			export_order.push_back(object);
+		} else {
+			for (auto &dependency : entry->second) {
+				backlog.emplace(dependency);
+			}
+			backlog.emplace(object);
+		}
+	}


This implementation is potentially O(n^2), not sure we care enough and it's not immediate to me what an alternative implementation would be.


Yea it's pretty bad, I figured EXPORT DATABASE isn't very performance sensitive and it's better to focus on correctness

But it could definitely be improved

If we turn this into a stack instead and push every dependency to the front, this should be slightly more efficient, as we won't keep checking the same object if the dependencies of the object also have dependencies

now:
obj has dependencies:

state of the queue:

dep1 dep2 dep3 obj

dep1 has dependencies, push these, state of the queue:

dep2 dep3 obj dep1.1 dep1.2 dep1.3 dep1

Once we reach obj again, we will push dep1 and the rest of its dependencies again (dep2, dep3)

With a stack:
obj has dependencies:

dep3 dep2 dep1 obj

dep1 has dependencies, push these:

dep1.3 dep1.2 dep1.1 dep1 obj

We will naturally deal with all dependencies first before checking an object whose dependencies have not been processed yet

to get the order don't you just do a depth first search over the entries added to the backlog? my c++ is rusty but something like:

	void DependencyManager::DepthFirstSearch(catalog_entry_set_t &explored, catalog_entry_vector_t &order, the_type &entry) {
		if (explored.count(entry) > 0) {
			return;
		}
		explored.insert(entry);
		auto deps = dependencies_map.find(entry);
		if (deps != dependencies_map.end()) {
			// visit every dependency before emitting the entry itself
			for (auto &dependency : deps->second) {
				DepthFirstSearch(explored, order, dependency);
			}
		}
		order.push_back(entry);
	}

	catalog_entry_vector_t DependencyManager::GetExportOrder() {
		catalog_entry_set_t explored;
		catalog_entry_vector_t export_order;
		for (auto &entry : dependencies_map) {
			DepthFirstSearch(explored, export_order, entry.first);
		}
		return export_order;
	}

I wanted to avoid recursion, to not run the risk of a stack overflow.
But maybe I was being too cautious for no good reason here

I think my proposed change to use a stack would mimic your depth first search approach though

i don't quite follow the explanation with the stack / queue described. For a non-recursive DFS there is an implementation on Wikipedia, https://en.wikipedia.org/wiki/Depth-first_search#Pseudocode, that uses a stack of iterators. That would work to give a topological sort -- appending the element to the output vector at the last pop in the pseudo-code.

here's my attempt, i was bored.
https://github.com/robert-manchester/duckdb/blob/7de8d5dd272737d6397a30affebc20ceae32fc87/src/catalog/dependency_manager.cpp#L197
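For reference, the non-recursive depth-first search discussed in this thread could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the implementation linked above: entries are plain strings, `ToyDependencyMap` and `ToyExportOrder` are hypothetical names, a `std::map` is used only to make the iteration order deterministic, and the dependency graph is assumed to be acyclic.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Entry name -> names of the entries it depends on (toy stand-in for dependencies_map)
using ToyDependencyMap = std::map<std::string, std::vector<std::string>>;

// Returns every entry after all of its dependencies. Assumes the graph has no cycles.
std::vector<std::string> ToyExportOrder(const ToyDependencyMap &dependencies_map) {
	static const std::vector<std::string> no_dependencies;
	std::vector<std::string> order;
	std::unordered_set<std::string> explored;
	// Explicit stack instead of recursion: (entry, index of its next unvisited dependency)
	std::vector<std::pair<std::string, std::size_t>> stack;

	for (auto &root : dependencies_map) {
		if (!explored.insert(root.first).second) {
			continue; // already written as a dependency of an earlier entry
		}
		stack.emplace_back(root.first, 0);
		while (!stack.empty()) {
			auto &frame = stack.back();
			auto found = dependencies_map.find(frame.first);
			auto &deps = found == dependencies_map.end() ? no_dependencies : found->second;
			if (frame.second < deps.size()) {
				const std::string &dependency = deps[frame.second++];
				if (explored.insert(dependency).second) {
					// 'frame' may dangle after this push; it is not used again in this iteration
					stack.emplace_back(dependency, 0);
				}
			} else {
				// All dependencies are written, so this entry can be written now
				order.push_back(frame.first);
				stack.pop_back();
			}
		}
	}
	return order;
}
```

Each entry is pushed once and each dependency edge is looked at once, so nothing is re-queued the way it is in the backlog version, and the stack lives on the heap rather than in call frames.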

Tishj · 2023-08-19T16:23:00Z

Another thing to note, I am introducing dependencies to a view, it is now dependent on the table.
This changes some behavior that we have tests for; with this change we are no longer following SQLite behavior.

Though one could argue that was already no longer the case, because create view v1 as select * from tbl succeeds in SQLite even when tbl doesn't exist.

SQLite only checks the existence of tbl when v1 is used. We, however, bind the query on view creation and throw an error if tbl doesn't exist yet.

EDIT:
PostgreSQL has this same behavior:

➜  duckdb git:(macro_function_dependencies) ✗ psql postgres
psql (14.8 (Homebrew))
Type "help" for help.

postgres=# create view v1 as select * from tbl;
ERROR:  relation "tbl" does not exist
LINE 1: create view v1 as select * from tbl;
                                        ^
postgres=# create table tbl (a integer);
CREATE TABLE
postgres=# create view v1 as select * from tbl;
CREATE VIEW
postgres=# drop table tbl;
ERROR:  cannot drop table tbl because other objects depend on it
DETAIL:  view v1 depends on table tbl
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

So this change is probably a welcome one


Tishj · 2023-10-20T11:48:12Z

This is very strange, and not something I can reproduce locally, even after merging with feature?
https://github.com/duckdb/duckdb/actions/runs/6586650717/job/17895372799?pr=8619

Error: unable to open database "test/sql/storage_version/storage_version.db": Serialization Error: Failed to deserialize: expected end of object, but found field id: 107

It does make sense in the context of this change:

	  {
		"id": 107,
		"name": "dependencies",
		"type": "LogicalDependencyList",
		"default": "LogicalDependencyList()"
	  }

EDIT:
Actually had a quick talk with Max, we both think this is an expected failure.
If we add a new field, then older versions can't read databases written with it anymore

But we can still read old databases if we mark the field as having a default (which I've done here)

Maxxen · 2023-10-20T11:57:18Z

The storage incompatibility is expected, old versions can't read new fields (and can't ignore them either), so you have to break forwards-compatibility (which is fine).
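A toy illustration of the asymmetry described in the two comments above, assuming a simplified format in which an object is just a map from field id to value. `ToyObject`, `ToyCreateInfo`, `ReadNew`, `ReadOld` and field id 100 are hypothetical; only field id 107 and the error text come from this thread.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// A serialized object: field id -> raw value. Toy stand-in for the real storage format.
using ToyObject = std::map<int, std::string>;

struct ToyCreateInfo {
	std::string name;
	std::string dependencies; // empty means "no dependencies"
};

// Reader that knows about field 107 and falls back to a default when it is absent.
ToyCreateInfo ReadNew(const ToyObject &object) {
	ToyCreateInfo info;
	info.name = object.at(100); // hypothetical id for the name field
	auto it = object.find(107);
	info.dependencies = it == object.end() ? std::string() : it->second;
	return info;
}

// Reader from before field 107 existed: any field it does not know is an error.
ToyCreateInfo ReadOld(const ToyObject &object) {
	ToyCreateInfo info;
	info.name = object.at(100);
	for (auto &field : object) {
		if (field.first != 100) {
			throw std::runtime_error("expected end of object, but found field id: " +
			                         std::to_string(field.first));
		}
	}
	return info;
}
```

In this sketch the new reader accepts both old and new objects, while the old reader rejects any object that carries field 107, which matches the failure seen in the storage_version test.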

Mytherin · 2023-10-20T11:10:01Z

src/catalog/catalog_entry/index_catalog_entry.cpp

    : StandardEntry(CatalogType::INDEX_ENTRY, schema, catalog, info.index_name), index(nullptr), sql(info.sql) {
 	this->temporary = info.temporary;
+	this->dependencies = info.dependencies.GetPhysical(catalog, context);


Can we not do the change where we push the ClientContext into the constructor of every single entry, and instead do the resolution of the dependency list (from logical -> physical) during the binding or e.g. in the schema catalog entry?

Mytherin · 2023-10-20T11:15:52Z

src/catalog/catalog_entry/view_catalog_entry.cpp

 	query = std::move(info.query);
 	this->aliases = info.aliases;
 	this->types = info.types;
 	this->temporary = info.temporary;
 	this->sql = info.sql;
 	this->internal = info.internal;
+	this->dependencies = info.dependencies.GetPhysical(catalog, context);


Big change request, but would it be possible (this should likely be an entirely separate PR or change) to remove the physical dependencies altogether, from the dependency manager and elsewhere, and instead always work with logical dependencies (e.g. a tuple of CatalogType, schema_name, entry_name)? This should both solve the original problem of not being able to serialize dependencies, as well as simplify the bookkeeping around dependencies in general and make it less error-prone.
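As a rough picture of what such a logical dependency could look like, here is a small sketch of a value-type key. The names `ToyCatalogType`, `ToyLogicalDependency`, `ToyLogicalDependencyHash` and `ToyLogicalDependencySet` are made up for illustration and the hash combination is arbitrary; this is not the LogicalDependencyList from this PR.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

enum class ToyCatalogType { TABLE_ENTRY, VIEW_ENTRY, TYPE_ENTRY, MACRO_ENTRY };

// Identifies a catalog entry by value instead of by a pointer to the entry itself
struct ToyLogicalDependency {
	ToyCatalogType type;
	std::string schema;
	std::string name;

	bool operator==(const ToyLogicalDependency &other) const {
		return type == other.type && schema == other.schema && name == other.name;
	}
};

struct ToyLogicalDependencyHash {
	std::size_t operator()(const ToyLogicalDependency &dep) const {
		// Simple hash combination; any reasonable mix works for the sketch
		std::size_t h = std::hash<int>()(static_cast<int>(dep.type));
		h = h * 31 + std::hash<std::string>()(dep.schema);
		h = h * 31 + std::hash<std::string>()(dep.name);
		return h;
	}
};

using ToyLogicalDependencySet = std::unordered_set<ToyLogicalDependency, ToyLogicalDependencyHash>;
```

Because such a key is made only of plain values, it can be written to and read from storage directly, and it stays valid when the entry object it names is replaced.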

Tishj added 11 commits August 17, 2023 16:24

use the DependencyManager to determine what the right order is to write the catalog entries

800873a

use the Bind to get dependencies for macro functions

4a1d973

add dependency tracking for Views

5f354dc

register dependencies for indexes

028b23b

add test for exporting indexes with dependencies

a0f6820

update test to properly check that the index was succesfully imported

e2fbe57

make BindLogicalType a non-static method, so we can make use of the Binders CatalogEntryRetriever, add tests with created types and dependencies created by them

ebbac57

add a dependency to 'export_generated_columns.test' and make sure the dependencies of the generated columns are registered as dependencies of the table

2d7076c

verify that the macro can not be dropped because the generated column depends on it

ca34bf2

add extra coverage for different types aliased with 'CREATE TYPE'

efc5797

forgot one

d9ebc96

Tishj commented Aug 18, 2023


Tishj added 2 commits August 18, 2023 17:10

fix up some compilation issue

dd50e43

remove calls to debug functions

1657016

github-actions bot marked this pull request as draft August 18, 2023 15:11

carlopi reviewed Aug 18, 2023


Merge branch 'master' into macro_function_dependencies

edabfb3

Tishj added 6 commits August 19, 2023 18:56

dont populate dependencies when binding the query for CREATE TABLE AS, don't make any cross-catalog dependencies, detect and properly error when CREATE OR REPLACE statements depend on itself

0f3f02f

fix broken tests related to views

dbc192a

set the right catalog in bind_export, improve the attach_export_import test

509c055

fix failing slow tests

6bf88ef

fix unused variable error

5adeb14

cascade the deletes in python test fixture

e2c5ff7

Tishj added 2 commits September 18, 2023 13:09

Merge remote-tracking branch 'upstream/main' into macro_function_dependencies

ce8eacc

Merge remote-tracking branch 'upstream/main' into macro_function_dependencies

47176cc

Tishj mentioned this pull request Sep 21, 2023

Serialize Create Table, Insert , Delete and Update operator using create info #9026

Merged

Merge remote-tracking branch 'upstream/main' into macro_function_dependencies

6a20af8

Tishj commented Sep 25, 2023

src/include/duckdb/storage/serialization/create_info.json (outdated)

Tishj commented Sep 25, 2023

src/storage/checkpoint_manager.cpp

Tishj commented Sep 25, 2023

src/storage/checkpoint_manager.cpp (outdated)

Tishj mentioned this pull request Sep 25, 2023

[Dev] Change the checkpoint_manager.cpp catalog entry serialization. #9091

Merged

Tishj added 5 commits September 26, 2023 10:55

make LogicalDependencyList a default value

4b39a1e

Merge remote-tracking branch 'upstream/main' into macro_function_dependencies

6219482

Merge remote-tracking branch 'upstream/main' into macro_function_dependencies

95bef0a

update serialization code

a782f63

Merge remote-tracking branch 'upstream/main' into macro_function_dependencies

692daad

Tishj changed the base branch from main to feature October 2, 2023 11:10

Tishj added 3 commits October 5, 2023 10:21

Merge remote-tracking branch 'upstream/feature' into macro_function_dependencies

38e6e33

make the == operator const, fixes compilation error

2236b89

Merge remote-tracking branch 'upstream/feature' into macro_function_dependencies

3c6925a

Mytherin marked this pull request as ready for review October 20, 2023 10:39

Merge remote-tracking branch 'upstream/feature' into macro_function_dependencies

b2bcd57

github-actions bot marked this pull request as draft October 20, 2023 11:46

Mytherin reviewed Oct 20, 2023


Tishj mentioned this pull request Nov 2, 2023

Dependency manager powered export Tishj/duckdb#82

Draft

Mytherin changed the base branch from feature to main November 20, 2023 12:45

Mytherin mentioned this pull request Nov 22, 2023

Add support for COPY FROM DATABASE statement #9765

Merged

Tishj mentioned this pull request Nov 30, 2023

Exported database view macro ordering bug #9861

Closed

1 task

Tishj mentioned this pull request Feb 15, 2024

Fix ordering issues in EXPORT DATABASE #10677

Closed

github-actions bot added the stale label Feb 19, 2024

Tishj closed this Feb 21, 2024
