Skip to content

Conversation

Tishj
Copy link
Contributor

@Tishj Tishj commented Aug 18, 2023

This PR addresses #8496

What is the problem and what is the proposed solution?

Previously we used an ad-hoc solution to approximately order the catalog entries, this has some outliers and that resulted in issues like the one that's linked.

What this PR aims to do is replace this ad-hoc solution with a hopefully foolproof solution which relies on the DependencyManager to guide this ordering.

CatalogEntryRetriever

After adding this DependencyManager-driven ordering of catalog entries, it became apparent that the DependencyManager wasn't being utilized as well as it should be.
Dependencies for Indexes, Types, Macros, Generated Columns and Views were not being registered.

In the existing code path we already Bind these entries, to verify that they are correctly formed and their dependencies are present at the time of creation.
It does this by retrieving the required catalog entries, which is essentially listing out all the dependencies of the entry.

To make use of this fact I've added a CatalogEntryRetriever, this class has an optional callback, which is called for every successfully retrieved catalog entry.
We use this callback to populate a DependencyList in CreateInfo.
StandardEntry has a new virtual method, InherentDependencies which returns an empty DependencyList by default.

The relevant subclasses of this have been updated to hold a DependencyList, taken from the CreateInfo, and override this InherentDependencies method to return this list of dependencies.

CREATE TYPE + EXPORT DATABASE

It seems exporting TypeCatalogEntry, created by the CREATE TYPE statement was not properly tested, because when we EXPORT DATABASE, the ToSQL method is called and this previously only supported ENUM, I've updated this to support every type we have.

Footnote

Since only the DuckCatalog has a DependencyManager the old method still exists and will still need to be utilized by other Catalog types.

}
// erase the dependents and dependencies for this object
dependents_map.erase(object);
dependencies_map.erase(object);
}

void DependencyManager::PrintDependencyMap() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are for debugging purposes.
What would be the best way to keep them around so they can be useful in the future without causing unnecessary friction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be desirable to expose this (in some form) to SQL?

Idea is that then it can be tested / and being present there for debug purposes.

Copy link
Contributor Author

@Tishj Tishj Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a bad idea, maybe we could make two table functions that output a column for every catalog entry, containing a VARCHAR[] containing the names of the dependents/dependencies

Maybe take an optional catalog VARCHAR parameter

Edit:

I thought about it a little more, this might be a better structure
Columns:

  • Name (of the entry)
  • Connections (the list of dependencies/dependents)
  • Type ("DEPENDENCIES" or "DEPENDENTS")
  • Catalog (name of catalog the entry belongs to)
  • Schema

Then we could fit it into a single table function

dependency_manager()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw I just found that we have duckdb_dependencies() table function, but it seems kind of lacking

catalog_entry_set_t entries;
catalog_entry_vector_t export_order;

PrintDependencyMap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need to be removed

type = catalog->GetType(context, schema, user_type_name, OnEntryNotFound::RETURN_NULL);
auto entry = entry_retriever.GetEntry(CatalogType::TYPE_ENTRY, *catalog, schema, user_type_name,
OnEntryNotFound::RETURN_NULL);
if (!entry) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These might be a little too terse, but I didn't want to duplicate all of the "helper" functions that Catalog has.

In the future it might be better to move all of the catalog entry look up responsibility to CatalogEntryRetriever, that would also make the Catalog class a little lighter.

@@ -174,8 +179,8 @@ class Binder : public std::enable_shared_from_this<Binder> {
void BindOnConflictClause(LogicalInsert &insert, TableCatalogEntry &table, InsertStatement &stmt);

static void BindSchemaOrCatalog(ClientContext &context, string &catalog, string &schema);
static void BindLogicalType(ClientContext &context, LogicalType &type, optional_ptr<Catalog> catalog = nullptr,
const string &schema = INVALID_SCHEMA);
void BindLogicalType(LogicalType &type, optional_ptr<Catalog> catalog = nullptr,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer a static method because the binding of USER types perform a CatalogEntry lookup which we want to perform the CatalogEntryRetrievers callback on.

In the places that previously used this as a static method without being inside a Binder or having access to a binder I've created a binder on the spot.
In a lot of these places a binder was already created later, so I don't think any harm was done here.

@@ -50,5 +53,13 @@ class DependencyManager {
void DropObject(CatalogTransaction transaction, CatalogEntry &object, bool cascade);
void AlterObject(CatalogTransaction transaction, CatalogEntry &old_obj, CatalogEntry &new_obj);
void EraseObjectInternal(CatalogEntry &object);

dependency_set_t &GetEntriesThatDependOnObject(CatalogEntry &object);
Copy link
Contributor Author

@Tishj Tishj Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods were made in an attempt to make it easier to work with the dependency_map and dependents_map, because it's logically a little hard to understand what connections they represent.

auto &dependency_manager = duck_catalog.GetDependencyManager();
catalog_entries = GetDependencyDrivenExportOrder(context, dependency_manager, *info, exported_tables);
} else {
catalog_entries = GetNaiveExportOrder(context, *info, exported_tables);
Copy link
Contributor Author

@Tishj Tishj Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly we can't get rid of this error-prone method entirely, though I haven't seen any other classes that derive from Catalog, but this could likely be a point of interest for extensions later on

Maybe GetExportOrder should be a virtual method, so it can be overridden by an extension catalog ?

@Tishj
Copy link
Contributor Author

Tishj commented Aug 18, 2023

There are some methods related to dependency extraction in the binder code, such as
ExtractExpressionDependencies

Maybe these could also be removed in favor of the CatalogEntryRetriever callback method?
I've left them in for now, but they probably perform the same task.

@github-actions github-actions bot marked this pull request as draft August 18, 2023 15:11
Comment on lines 206 to 232
queue<reference<CatalogEntry>> backlog;
// Populate the backlog with every entry in the dependencies map
for (auto &entry : dependencies_map) {
backlog.push(entry.first);
}

// First populate our backlog with every entry in dependencies_map
while (!backlog.empty()) {
auto &object = backlog.front();
backlog.pop();
if (entries.count(object)) {
// This entry has already been written
continue;
}
auto entry = dependencies_map.find(object);
if (entry == dependencies_map.end() || AllExportDependenciesWritten(object, entry->second, entries)) {
// All dependencies written, we can write this now
auto insert_result = entries.insert(object);
D_ASSERT(insert_result.second);
export_order.push_back(object);
} else {
for (auto &dependency : entry->second) {
backlog.emplace(dependency);
}
backlog.emplace(object);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is potentially O(n^2), not sure we care enough and it's not immediate to me what an alternative implementation would be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is potentially O(n^2), not sure we care enough and it's not immediate to me what an alternative implementation would be.

Yea it's pretty bad, I figured EXPORT DATABASE isn't very performance sensitive and it's better to focus on correctness

But it could definitely be improved

Copy link
Contributor Author

@Tishj Tishj Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we turn this into a stack instead and push every dependency to the front, this should be slightly more efficient, as we won't keep checking the same object if the dependencies of the object also have dependencies

now:
obj has dependencies:

state of the queue:

dep1
dep2
dep3
obj

dep1 has dependencies, push these, state of the queue:

dep2
dep3
obj
dep1.1
dep1.2
dep1.3
dep1

Once we reach obj again, we will push dep1 and the rest of its dependencies again (dep2, dep3)

With a stack:
obj has dependencies:

dep3
dep2
dep1
obj

dep1 has dependencies, push these:

dep1.3
dep1.2
dep1.1
dep1
obj

We will naturally deal with all dependencies first before checking an object whos dependencies have not been processed yet

Copy link

@ghost ghost Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to get the order don't you just do a depth first search over the entries added to the backlog? my c++ is rusty but something like:

void DependencyManager::DepthFirstSearch(catalog_entry_set_t &explored, catalog_entry_vector_t &order, the_type &entry)
{
        if (explored.count(entry) > 0) {
                return;
        }
        explored.insert(entry);
        auto deps = dependencies_map.find(entry);
        if (deps != dependencies_map.end() {
                for (auto &dependency : deps->second) {
                        DepthFirstSearch(explored, order, dependency);
                }
        }   
        order.push_back(entry);
}

catalog_entry_vector_t DependencyManager::GetExportOrder() {
        catalog_entry_set_t explored;
        catalog_entry_vector_t export_order;

        for (auto &entry : dependencies_map) {
                DepthFirstSearch(explored, export_order, entry.first);
        }
        return export_order;
}

Copy link
Contributor Author

@Tishj Tishj Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid recursion, to not run the risk of a stack overflow.
But maybe I was being too cautious for no good reason here

I think my proposed change to use a stack would mimic your depth first search approach though

Copy link

@ghost ghost Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't quite follow the explanation with the stack / queue described. For a non-recursive DFS there is an implementation on Wikipedia, https://en.wikipedia.org/wiki/Depth-first_search#Pseudocode, that uses a stack of iterators. That would work to give a topological sort -- appending the element to the output vector at the last pop in the pseudo-code.

Copy link

@ghost ghost Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tishj
Copy link
Contributor Author

Tishj commented Aug 19, 2023

Another thing to note, I am introducing dependencies to a view, it is now dependent on the table.
That is changing some behavior that we have tests for, with this change we are no longer following sqlite behavior.

Though one could argue that was already no longer the case because create view v1 as select * from tbl succeeds in SQLite even when tbl doesn't exist

Only when v1 is used is the existence of tbl checked in SQLite, we however bind the query on view creation and subsequently throw an error if tbl doesn't exist yet

EDIT:
PostgreSQL has this same behavior:

➜  duckdb git:(macro_function_dependencies) ✗ psql postgres
psql (14.8 (Homebrew))
Type "help" for help.

postgres=# create view v1 as select * from tbl;
ERROR:  relation "tbl" does not exist
LINE 1: create view v1 as select * from tbl;
                                        ^
postgres=# create table tbl (a integer);
CREATE TABLE
postgres=# create view v1 as select * from tbl;
CREATE VIEW
postgres=# drop table tbl;
ERROR:  cannot drop table tbl because other objects depend on it
DETAIL:  view v1 depends on table tbl
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

So this change is probably a welcome one

@Tishj Tishj changed the base branch from main to feature October 2, 2023 11:10
@Mytherin Mytherin marked this pull request as ready for review October 20, 2023 10:39
@github-actions github-actions bot marked this pull request as draft October 20, 2023 11:46
@Tishj
Copy link
Contributor Author

Tishj commented Oct 20, 2023

This is very strange, and not something I can reproduce locally, even after merging with feature?
https://github.com/duckdb/duckdb/actions/runs/6586650717/job/17895372799?pr=8619

Error: unable to open database "test/sql/storage_version/storage_version.db": Serialization Error: Failed to deserialize: expected end of object, but found field id: 107

It does make sense in the context of this change:

	  {
		"id": 107,
		"name": "dependencies",
		"type": "LogicalDependencyList",
		"default": "LogicalDependencyList()"
	  }

EDIT:
Actually had a quick talk with Max, we both think this is an expected failure.
If we add a new field, then the old database can't read it anymore

But we can still read old databases if we mark the field as having a default (which I've done here)

@Maxxen
Copy link
Member

Maxxen commented Oct 20, 2023

The storage incompatibility is expected, old versions can't read new fields (and can't ignore them either), so you have to break forwards-compatibility (which is fine).

: StandardEntry(CatalogType::INDEX_ENTRY, schema, catalog, info.index_name), index(nullptr), sql(info.sql) {
this->temporary = info.temporary;
this->dependencies = info.dependencies.GetPhysical(catalog, context);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not do the change where we push the ClientContext into the constructor of every single entry, and instead do the resolution of the dependency list (from logical -> physical) during the binding or e.g. in the schema catalog entry?

query = std::move(info.query);
this->aliases = info.aliases;
this->types = info.types;
this->temporary = info.temporary;
this->sql = info.sql;
this->internal = info.internal;
this->dependencies = info.dependencies.GetPhysical(catalog, context);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big change request, but would it be possible to perhaps (this should likely be an entirely separate PR or change) remove the physical dependencies altogether - from the dependency manager and elsewhere - and instead always work with logical dependencies (e.g. a pair of CatalogType, schema_name, entry_name)? This should both solve the original problem of not being able to serialize dependencies, as well as simplify the bookkeeping around dependencies in general and make it less error prone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants