-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Fix EXPORT/IMPORT DATABASE schema.sql
order
#8619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…te the catalog entries
…inders CatalogEntryRetriever, add tests with created types and dependencies created by them
… dependencies of the generated columns are registered as dependencies of the table
src/catalog/dependency_manager.cpp
Outdated
} | ||
// erase the dependents and dependencies for this object | ||
dependents_map.erase(object); | ||
dependencies_map.erase(object); | ||
} | ||
|
||
void DependencyManager::PrintDependencyMap() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are for debugging purposes.
What would be the best way to keep them around so they can be useful in the future without causing unnecessary friction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be desirable to expose this (in some form) to SQL?
Idea is that then it can be tested / and being present there for debug purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not a bad idea, maybe we could make two table functions that output a column for every catalog entry, containing a VARCHAR[] containing the names of the dependents/dependencies
Maybe take an optional catalog VARCHAR
parameter
Edit:
I thought about it a little more, this might be a better structure
Columns:
- Name (of the entry)
- Connections (the list of dependencies/dependents)
- Type ("DEPENDENCIES" or "DEPENDENTS")
- Catalog (name of catalog the entry belongs to)
- Schema
Then we could fit it into a single table function
dependency_manager()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw I just found that we have duckdb_dependencies()
table function, but it seems kind of lacking
src/catalog/dependency_manager.cpp
Outdated
catalog_entry_set_t entries; | ||
catalog_entry_vector_t export_order; | ||
|
||
PrintDependencyMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need to be removed
type = catalog->GetType(context, schema, user_type_name, OnEntryNotFound::RETURN_NULL); | ||
auto entry = entry_retriever.GetEntry(CatalogType::TYPE_ENTRY, *catalog, schema, user_type_name, | ||
OnEntryNotFound::RETURN_NULL); | ||
if (!entry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These might be a little too terse, but I didn't want to duplicate all of the "helper" functions that Catalog has.
In the future it might be better to move all of the catalog entry look up responsibility to CatalogEntryRetriever, that would also make the Catalog class a little lighter.
@@ -174,8 +179,8 @@ class Binder : public std::enable_shared_from_this<Binder> { | |||
void BindOnConflictClause(LogicalInsert &insert, TableCatalogEntry &table, InsertStatement &stmt); | |||
|
|||
static void BindSchemaOrCatalog(ClientContext &context, string &catalog, string &schema); | |||
static void BindLogicalType(ClientContext &context, LogicalType &type, optional_ptr<Catalog> catalog = nullptr, | |||
const string &schema = INVALID_SCHEMA); | |||
void BindLogicalType(LogicalType &type, optional_ptr<Catalog> catalog = nullptr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer a static method because the binding of USER types perform a CatalogEntry lookup which we want to perform the CatalogEntryRetrievers callback on.
In the places that previously used this as a static method without being inside a Binder or having access to a binder I've created a binder on the spot.
In a lot of these places a binder was already created later, so I don't think any harm was done here.
@@ -50,5 +53,13 @@ class DependencyManager { | |||
void DropObject(CatalogTransaction transaction, CatalogEntry &object, bool cascade); | |||
void AlterObject(CatalogTransaction transaction, CatalogEntry &old_obj, CatalogEntry &new_obj); | |||
void EraseObjectInternal(CatalogEntry &object); | |||
|
|||
dependency_set_t &GetEntriesThatDependOnObject(CatalogEntry &object); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods were made in an attempt to make it easier to work with the dependency_map
and dependents_map
, because it's logically a little hard to understand what connections they represent.
auto &dependency_manager = duck_catalog.GetDependencyManager(); | ||
catalog_entries = GetDependencyDrivenExportOrder(context, dependency_manager, *info, exported_tables); | ||
} else { | ||
catalog_entries = GetNaiveExportOrder(context, *info, exported_tables); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly we can't get rid of this error-prone method entirely, though I haven't seen any other classes that derive from Catalog, but this could likely be a point of interest for extensions later on
Maybe GetExportOrder should be a virtual method, so it can be overridden by an extension catalog ?
There are some methods related to dependency extraction in the binder code, such as Maybe these could also be removed in favor of the CatalogEntryRetriever callback method? |
src/catalog/dependency_manager.cpp
Outdated
queue<reference<CatalogEntry>> backlog; | ||
// Populate the backlog with every entry in the dependencies map | ||
for (auto &entry : dependencies_map) { | ||
backlog.push(entry.first); | ||
} | ||
|
||
// First populate our backlog with every entry in dependencies_map | ||
while (!backlog.empty()) { | ||
auto &object = backlog.front(); | ||
backlog.pop(); | ||
if (entries.count(object)) { | ||
// This entry has already been written | ||
continue; | ||
} | ||
auto entry = dependencies_map.find(object); | ||
if (entry == dependencies_map.end() || AllExportDependenciesWritten(object, entry->second, entries)) { | ||
// All dependencies written, we can write this now | ||
auto insert_result = entries.insert(object); | ||
D_ASSERT(insert_result.second); | ||
export_order.push_back(object); | ||
} else { | ||
for (auto &dependency : entry->second) { | ||
backlog.emplace(dependency); | ||
} | ||
backlog.emplace(object); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is potentially O(n^2), not sure we care enough and it's not immediate to me what an alternative implementation would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is potentially O(n^2), not sure we care enough and it's not immediate to me what an alternative implementation would be.
Yea it's pretty bad, I figured EXPORT DATABASE isn't very performance sensitive and it's better to focus on correctness
But it could definitely be improved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we turn this into a stack
instead and push every dependency to the front, this should be slightly more efficient, as we won't keep checking the same object if the dependencies of the object also have dependencies
now:
obj has dependencies:
state of the queue:
dep1
dep2
dep3
obj
dep1 has dependencies, push these, state of the queue:
dep2
dep3
obj
dep1.1
dep1.2
dep1.3
dep1
Once we reach obj
again, we will push dep1
and the rest of its dependencies again (dep2, dep3)
With a stack:
obj has dependencies:
dep3
dep2
dep1
obj
dep1 has dependencies, push these:
dep1.3
dep1.2
dep1.1
dep1
obj
We will naturally deal with all dependencies first before checking an object whos dependencies have not been processed yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to get the order don't you just do a depth first search over the entries added to the backlog? my c++ is rusty but something like:
void DependencyManager::DepthFirstSearch(catalog_entry_set_t &explored, catalog_entry_vector_t &order, the_type &entry)
{
if (explored.count(entry) > 0) {
return;
}
explored.insert(entry);
auto deps = dependencies_map.find(entry);
if (deps != dependencies_map.end() {
for (auto &dependency : deps->second) {
DepthFirstSearch(explored, order, dependency);
}
}
order.push_back(entry);
}
catalog_entry_vector_t DependencyManager::GetExportOrder() {
catalog_entry_set_t explored;
catalog_entry_vector_t export_order;
for (auto &entry : dependencies_map) {
DepthFirstSearch(explored, export_order, entry.first);
}
return export_order;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to avoid recursion, to not run the risk of a stack overflow.
But maybe I was being too cautious for no good reason here
I think my proposed change to use a stack would mimic your depth first search approach though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't quite follow the explanation with the stack / queue described. For a non-recursive DFS there is an implementation on Wikipedia, https://en.wikipedia.org/wiki/Depth-first_search#Pseudocode, that uses a stack of iterators. That would work to give a topological sort -- appending the element to the output vector at the last pop in the pseudo-code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing to note, I am introducing dependencies to a view, it is now dependent on the table. Though one could argue that was already no longer the case because Only when v1 is used is the existence of EDIT:
So this change is probably a welcome one |
…, don't make any cross-catalog dependencies, detect and properly error when CREATE OR REPLACE statements depend on itself
This is very strange, and not something I can reproduce locally, even after merging with feature?
It does make sense in the context of this change: {
"id": 107,
"name": "dependencies",
"type": "LogicalDependencyList",
"default": "LogicalDependencyList()"
} EDIT: But we can still read old databases if we mark the field as having a default (which I've done here) |
The storage incompatibility is expected, old versions can't read new fields (and can't ignore them either), so you have to break forwards-compatibility (which is fine). |
: StandardEntry(CatalogType::INDEX_ENTRY, schema, catalog, info.index_name), index(nullptr), sql(info.sql) { | ||
this->temporary = info.temporary; | ||
this->dependencies = info.dependencies.GetPhysical(catalog, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not do the change where we push the ClientContext
into the constructor of every single entry, and instead do the resolution of the dependency list (from logical -> physical) during the binding or e.g. in the schema catalog entry?
query = std::move(info.query); | ||
this->aliases = info.aliases; | ||
this->types = info.types; | ||
this->temporary = info.temporary; | ||
this->sql = info.sql; | ||
this->internal = info.internal; | ||
this->dependencies = info.dependencies.GetPhysical(catalog, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big change request, but would it be possible to perhaps (this should likely be an entirely separate PR or change) remove the physical dependencies altogether - from the dependency manager and elsewhere - and instead always work with logical dependencies (e.g. a pair of CatalogType, schema_name, entry_name
)? This should both solve the original problem of not being able to serialize dependencies, as well as simplify the bookkeeping around dependencies in general and make it less error prone.
This PR addresses #8496
What is the problem and what is the proposed solution?
Previously we used an ad-hoc solution to approximately order the catalog entries, this has some outliers and that resulted in issues like the one that's linked.
What this PR aims to do is replace this ad-hoc solution with a hopefully foolproof solution which relies on the
DependencyManager
to guide this ordering.CatalogEntryRetriever
After adding this
DependencyManager
-driven ordering of catalog entries, it became apparent that theDependencyManager
wasn't being utilized as well as it should be.Dependencies for Indexes, Types, Macros, Generated Columns and Views were not being registered.
In the existing code path we already Bind these entries, to verify that they are correctly formed and their dependencies are present at the time of creation.
It does this by retrieving the required catalog entries, which is essentially listing out all the dependencies of the entry.
To make use of this fact I've added a
CatalogEntryRetriever
, this class has an optional callback, which is called for every successfully retrieved catalog entry.We use this callback to populate a
DependencyList
inCreateInfo
.StandardEntry
has a new virtual method,InherentDependencies
which returns an emptyDependencyList
by default.The relevant subclasses of this have been updated to hold a
DependencyList
, taken from theCreateInfo
, and override thisInherentDependencies
method to return this list of dependencies.CREATE TYPE + EXPORT DATABASE
It seems exporting
TypeCatalogEntry
, created by theCREATE TYPE
statement was not properly tested, because when we EXPORT DATABASE, the ToSQL method is called and this previously only supported ENUM, I've updated this to support every type we have.Footnote
Since only the
DuckCatalog
has aDependencyManager
the old method still exists and will still need to be utilized by other Catalog types.