Skip to content

Conversation

rustyconover
Copy link
Contributor

@rustyconover rustyconover commented Nov 1, 2024

This PR introduces support for configuring the rowid pseudo-column to use a type other than BIGINT. In particular, it enables rowid to be defined as either a BLOB or VARCHAR. I think this is useful when working with external data sources, and especially Apache Arrow provided data sources. This PR is one step on the road for enabling DELETE/UPDATE/INSERT to tables via Apache Arrow Flight but could be used by other data sources.

Changes

To support this, I’ve updated the TableCatalogEntry to specify the type of rowid as a LogicalType, so this change can be used by other catalog implementations.

@pdet, In the Arrow integration, ArrowTableFunction::ArrowToDuckDB now accepts an optional rowid_column_index parameter to indicate the position of the rowid column within the Arrow table. This change ensures that rowid can appear at any position within an Arrow array.

To fully support this functionality, I also adjusted BindContext::AddBaseTable, and the LogicalGet, and Binding classes, ensuring that the rowid column type is dynamic but defaults to LogicalTypeId::ROW_TYPE if unspecified.

Discussion

I realize this PR is a big bigger than the size I typically send, and will require more energy to review. I tried to make it as small as possible, but if there is a better way to accomplish what I'm trying to do I'm happy to rework the approach. I’d also be happy to discuss any details or adjustments to make this approach work smoothly for others. Thanks for reviewing this.

@rustyconover rustyconover force-pushed the feature_dynamic_rowid_type branch from abce24e to fb0d220 Compare November 1, 2024 18:13
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 1, 2024 18:13
@rustyconover rustyconover marked this pull request as ready for review November 1, 2024 18:13
@rustyconover rustyconover marked this pull request as draft November 1, 2024 18:37
@rustyconover rustyconover force-pushed the feature_dynamic_rowid_type branch 2 times, most recently from 0a0e593 to 946c23b Compare November 1, 2024 19:11
@rustyconover rustyconover marked this pull request as ready for review November 1, 2024 19:26
@rustyconover rustyconover force-pushed the feature_dynamic_rowid_type branch from 946c23b to 3039ad0 Compare November 5, 2024 18:25
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 5, 2024 18:25
@rustyconover rustyconover marked this pull request as ready for review November 5, 2024 18:26
@rustyconover rustyconover force-pushed the feature_dynamic_rowid_type branch 2 times, most recently from efff9da to dcc95c2 Compare November 27, 2024 02:33
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 27, 2024 02:33
@rustyconover rustyconover marked this pull request as ready for review November 27, 2024 02:33
@rustyconover rustyconover force-pushed the feature_dynamic_rowid_type branch from dcc95c2 to 52ab6e6 Compare December 2, 2024 22:33
@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 2, 2024 22:34
@rustyconover rustyconover marked this pull request as ready for review December 2, 2024 22:34
@rustyconover rustyconover force-pushed the feature_dynamic_rowid_type branch from 52ab6e6 to 6f7de1c Compare December 3, 2024 13:05
@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 3, 2024 13:05
@rustyconover rustyconover marked this pull request as ready for review December 4, 2024 10:26
@rustyconover rustyconover force-pushed the feature_dynamic_rowid_type branch from 6f7de1c to ef2f0a5 Compare December 5, 2024 13:14
@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 5, 2024 13:15
@szarnyasg szarnyasg marked this pull request as ready for review December 9, 2024 12:42
@szarnyasg szarnyasg requested a review from Mytherin December 9, 2024 12:42
@rustyconover
Copy link
Contributor Author

@samansmink Hi Sam, could you peek at this too? Thank you.

@samansmink
Copy link
Contributor

@rustyconover I think @Mytherin would be more suitable for reviewing this one!

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I have been thinking about reworking the rowid type and generated columns in general to be less hard-coded, and while I think this is a good change (making row-id type configurable) this changeset moves in the opposite direction and introduces even more hard-coding for the row-id column, which will make reworking it later on more difficult.

My ideal version of this would be that table functions can provide a set of columns that are queryable, but not included/shown by default, i.e. similar to how SELECT * FROM tbl does not show the rowid column. Then scanners could expose multiple columns in a more easy manner. For example, we could easily expose SELECT row_group_id FROM tbl as well. Similarly, many of the columns that are currently emitted by e.g. Parquet files would no longer need to be options. We could simply query SELECT filename FROM file.parquet instead of having to do SELECT filename FROM read_parquet('file.parquet', filename=True).

That also solves this problem because scanners could register their own rowid column with whichever type they want using that generic interface.

I'm not opposed to merging this with some changes made to it - but there's a high chance this will be completely overhauled/reworked in the (near) future.

Nevertheless, if you want to continue with this changeset, there's some comments below:

@@ -114,10 +115,18 @@ class TableCatalogEntry : public StandardEntry {
//! Returns true, if the table has a primary key, else false.
bool HasPrimaryKey() const;

//! Returns the rowid type of this table
LogicalType GetRowIdType() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a virtual method instead that returns LogicalType::ROW_TYPE? Then we don't need to keep this as a member or take this in the constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

@@ -101,7 +104,7 @@ struct TableBinding : public Binding {
public:
TableBinding(const string &alias, vector<LogicalType> types, vector<string> names,
vector<ColumnIndex> &bound_column_ids, optional_ptr<StandardEntry> entry, idx_t index,
bool add_row_id = false);
bool add_row_id = false, LogicalType rowid_type = LogicalType(LogicalType::ROW_TYPE));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableBinding already takes a StandardEntry (which should be a TableCatalogEntry if add_row_id is true). Can we just call entry->Cast<TableCatalogEntry>().GetRowIdType() instead to avoid having to pull this through many layers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this change and it simplified things quite a bit.

@rustyconover rustyconover force-pushed the feature_dynamic_rowid_type branch from ef2f0a5 to 2481584 Compare December 17, 2024 04:47
@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 17, 2024 04:48
@rustyconover rustyconover marked this pull request as ready for review December 17, 2024 04:49
@rustyconover
Copy link
Contributor Author

Thank you for the thoughtful review and for being open to this idea!

I have been thinking about reworking the rowid type and generated columns in general to be less hard-coded, and while I think this is a good change (making row-id type configurable) this changeset moves in the opposite direction and introduces even more hard-coding for the row-id column, which will make reworking it later on more difficult.

I understand your concerns here. While I hope this doesn’t add too much rigidity, I’m happy for this interface to be replaced in the future if something more flexible and powerful is introduced.

My ideal version of this would be that table functions can provide a set of columns that are queryable, but not included/shown by default, i.e. similar to how SELECT * FROM tbl does not show the rowid column. Then scanners could expose multiple columns in a more easy manner. For example, we could easily expose SELECT row_group_id FROM tbl as well. Similarly, many of the columns that are currently emitted by e.g. Parquet files would no longer need to be options. We could simply query SELECT filename FROM file.parquet instead of having to do SELECT filename FROM read_parquet('file.parquet', filename=True).

That sounds like a fantastic approach. It would be particularly helpful for cases like Delta Lake row tracking, where _metadata columns are heavily used. I think that naming conflicts between these additional columns and user-defined columns would need to be addressed, but that seems manageable. Another challenge would be ensuring users can easily discover these scanner-provided columns.

That also solves this problem because scanners could register their own rowid column with whichever type they want using that generic interface.

Absolutely—this design would be a great fit for that!

I'm not opposed to merging this with some changes made to it - but there's a high chance this will be completely overhauled/reworked in the (near) future.

That works for me. I fully understand that the C++ API comes with no stability guarantees, so I’m okay with the possibility of this being replaced or reworked down the line.

Nevertheless, if you want to continue with this changeset, there's some comments below:

I’ve addressed the feedback you provided, and the updates have reduced the size of the change significantly, especially in terms of the number of files modified. Could you please take another look? I appreciate your time.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 17, 2024 05:44
@rustyconover rustyconover marked this pull request as ready for review December 17, 2024 05:45
@rustyconover
Copy link
Contributor Author

@Mytherin no rush, but just wanted to let you know this PR is looking good now.

Copy link
Contributor

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me here!

@rustyconover
Copy link
Contributor Author

Hi @samansmink,

Changes look good to me here!

Great to hear that, can you remove the changes requested label?

Thank you!

Rusty

@rustyconover
Copy link
Contributor Author

Just bumping this PR, since it still pending.

@Mytherin Mytherin merged commit 13ba13c into duckdb:main Jan 6, 2025
47 checks passed
@Mytherin
Copy link
Collaborator

Mytherin commented Jan 6, 2025

Thanks! The changes make this a lot simpler indeed

@rustyconover
Copy link
Contributor Author

rustyconover commented Jan 6, 2025 via email

@szarnyasg szarnyasg added the Needs Documentation Use for issues or PRs that require changes in the documentation label Jan 6, 2025
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jan 6, 2025
Allow a variable type `rowid` pseudocolumn in tables (duckdb/duckdb#14674)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Jan 6, 2025
Allow a variable type `rowid` pseudocolumn in tables (duckdb/duckdb#14674)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Mytherin added a commit that referenced this pull request Feb 17, 2025
…ename` a virtual column in the Parquet/CSV/JSON readers (#16248)

This PR generalizes the `rowid` column to the broader concept of virtual
columns. Previously, `rowid` was a special column that existed within
DuckDB tables. This column was special in the sense that it could be
queried, but would not show up as an actual column of the table when
inspecting the table or when running `SELECT * FROM tbl`:

```sql
D create table tbl(i int);
D insert into tbl values (42);
D select * from tbl;
┌───────┐
│   i   │
│ int32 │
├───────┤
│  42   │
└───────┘
D describe tbl;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ i           │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
-- but we can query the column if explicitly mentioned
D select rowid, i from tbl;
┌───────┬───────┐
│ rowid │   i   │
│ int64 │ int32 │
├───────┼───────┤
│   0   │  42   │
└───────┴───────┘
```

The `rowid` column is also used internally for several purposes, such as
handling deletes/updates of rows, or performing late materialization.

In the current implementation - only the `rowid` virtual column exists,
and only tables can implement this virtual column. Some previous work
was done on making the `rowid` column type more flexible - e.g.
#14674 - but still only this
virtual column was hard-coded in the system.

This PR generalizes the concept of virtual columns by adding a new
callback to the `TableFunction`, that allows table functions to return
the set of virtual columns that they support:

```cpp
struct TableColumn {
	string name;
	LogicalType type;
};

using virtual_column_map_t = unordered_map<column_t, TableColumn>;

typedef virtual_column_map_t (*table_function_get_virtual_columns_t)(ClientContext &context,
                                                                     optional_ptr<FunctionData> bind_data);
```

As an example, if we wanted to support the `rowid` column as before we
would set it up like so:

```cpp
virtual_column_map_t virtual_columns;
virtual_columns.insert(make_pair(COLUMN_IDENTIFIER_ROW_ID, TableColumn("rowid", LogicalType::ROW_TYPE)));
```

However, we can now support multiple virtual columns and can fully
customize the virtual columns that we emit. This PR also introduces two
new virtual columns:

* `COLUMN_IDENTIFIER_FILENAME`, emitted by the Parquet/CSV/JSON readers 
* `COLUMN_IDENTIFIER_EMPTY ` - this is a non-queryable virtual column.
When this column is present, it is added when running `COUNT(*)` over
the table function.


#### Filename

With the addition of the `filename` virtual column, the following
snippet now works:

```sql
D copy (select 42 i) to tbl.parquet;
D select * from tbl.parquet;
┌───────┐
│   i   │
│ int32 │
├───────┤
│  42   │
└───────┘
D select *, filename from tbl.parquet;
┌───────┬─────────────┐
│   i   │  filename   │
│ int32 │   varchar   │
├───────┼─────────────┤
│  42   │ tbl.parquet │
└───────┴─────────────┘
```

Previously, we could emit the filename only by using the `filename`
option. This is now no longer required - we can directly query the
`filename` if desired.

This PR is primarily adding support for virtual columns - we're planning
to extend this support further in the future.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Documentation Use for issues or PRs that require changes in the documentation needs maintainer approval
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants