Allow a variable type `rowid` pseudocolumn in tables #14674

rustyconover · 2024-11-01T17:12:12Z

This PR introduces support for configuring the rowid pseudo-column to use a type other than BIGINT. In particular, it enables rowid to be defined as either a BLOB or VARCHAR. I think this is useful when working with external data sources, and especially Apache Arrow provided data sources. This PR is one step on the road for enabling DELETE/UPDATE/INSERT to tables via Apache Arrow Flight but could be used by other data sources.

Changes

To support this, I’ve updated the TableCatalogEntry to specify the type of rowid as a LogicalType, so this change can be used by other catalog implementations.

@pdet, In the Arrow integration, ArrowTableFunction::ArrowToDuckDB now accepts an optional rowid_column_index parameter to indicate the position of the rowid column within the Arrow table. This change ensures that rowid can appear at any position within an Arrow array.

To fully support this functionality, I also adjusted BindContext::AddBaseTable, and the LogicalGet, and Binding classes, ensuring that the rowid column type is dynamic but defaults to LogicalTypeId::ROW_TYPE if unspecified.

Discussion

I realize this PR is a big bigger than the size I typically send, and will require more energy to review. I tried to make it as small as possible, but if there is a better way to accomplish what I'm trying to do I'm happy to rework the approach. I’d also be happy to discuss any details or adjustments to make this approach work smoothly for others. Thanks for reviewing this.

rustyconover · 2024-12-12T05:03:59Z

@samansmink Hi Sam, could you peek at this too? Thank you.

samansmink · 2024-12-12T13:57:14Z

@rustyconover I think @Mytherin would be more suitable for reviewing this one!

Mytherin

Thanks for the PR!

I have been thinking about reworking the rowid type and generated columns in general to be less hard-coded, and while I think this is a good change (making row-id type configurable) this changeset moves in the opposite direction and introduces even more hard-coding for the row-id column, which will make reworking it later on more difficult.

My ideal version of this would be that table functions can provide a set of columns that are queryable, but not included/shown by default, i.e. similar to how SELECT * FROM tbl does not show the rowid column. Then scanners could expose multiple columns in a more easy manner. For example, we could easily expose SELECT row_group_id FROM tbl as well. Similarly, many of the columns that are currently emitted by e.g. Parquet files would no longer need to be options. We could simply query SELECT filename FROM file.parquet instead of having to do SELECT filename FROM read_parquet('file.parquet', filename=True).

That also solves this problem because scanners could register their own rowid column with whichever type they want using that generic interface.

I'm not opposed to merging this with some changes made to it - but there's a high chance this will be completely overhauled/reworked in the (near) future.

Nevertheless, if you want to continue with this changeset, there's some comments below:

src/execution/physical_plan/plan_get.cpp

src/function/table/arrow_conversion.cpp

Mytherin · 2024-12-12T14:35:37Z

src/include/duckdb/catalog/catalog_entry/table_catalog_entry.hpp

@@ -114,10 +115,18 @@ class TableCatalogEntry : public StandardEntry {
 	//! Returns true, if the table has a primary key, else false.
 	bool HasPrimaryKey() const;

+	//! Returns the rowid type of this table
+	LogicalType GetRowIdType() {


Can we make this a virtual method instead that returns LogicalType::ROW_TYPE? Then we don't need to keep this as a member or take this in the constructor.

Addressed in latest commit

Mytherin · 2024-12-12T14:43:39Z

src/include/duckdb/planner/table_binding.hpp

@@ -101,7 +104,7 @@ struct TableBinding : public Binding {
 public:
 	TableBinding(const string &alias, vector<LogicalType> types, vector<string> names,
 	             vector<ColumnIndex> &bound_column_ids, optional_ptr<StandardEntry> entry, idx_t index,
-	             bool add_row_id = false);
+	             bool add_row_id = false, LogicalType rowid_type = LogicalType(LogicalType::ROW_TYPE));


TableBinding already takes a StandardEntry (which should be a TableCatalogEntry if add_row_id is true). Can we just call entry->Cast<TableCatalogEntry>().GetRowIdType() instead to avoid having to pull this through many layers?

I've made this change and it simplified things quite a bit.

rustyconover · 2024-12-17T05:00:29Z

Thank you for the thoughtful review and for being open to this idea!

I have been thinking about reworking the rowid type and generated columns in general to be less hard-coded, and while I think this is a good change (making row-id type configurable) this changeset moves in the opposite direction and introduces even more hard-coding for the row-id column, which will make reworking it later on more difficult.

I understand your concerns here. While I hope this doesn’t add too much rigidity, I’m happy for this interface to be replaced in the future if something more flexible and powerful is introduced.

My ideal version of this would be that table functions can provide a set of columns that are queryable, but not included/shown by default, i.e. similar to how SELECT * FROM tbl does not show the rowid column. Then scanners could expose multiple columns in a more easy manner. For example, we could easily expose SELECT row_group_id FROM tbl as well. Similarly, many of the columns that are currently emitted by e.g. Parquet files would no longer need to be options. We could simply query SELECT filename FROM file.parquet instead of having to do SELECT filename FROM read_parquet('file.parquet', filename=True).

That sounds like a fantastic approach. It would be particularly helpful for cases like Delta Lake row tracking, where _metadata columns are heavily used. I think that naming conflicts between these additional columns and user-defined columns would need to be addressed, but that seems manageable. Another challenge would be ensuring users can easily discover these scanner-provided columns.

That also solves this problem because scanners could register their own rowid column with whichever type they want using that generic interface.

Absolutely—this design would be a great fit for that!

I'm not opposed to merging this with some changes made to it - but there's a high chance this will be completely overhauled/reworked in the (near) future.

That works for me. I fully understand that the C++ API comes with no stability guarantees, so I’m okay with the possibility of this being replaced or reworked down the line.

Nevertheless, if you want to continue with this changeset, there's some comments below:

I’ve addressed the feedback you provided, and the updates have reduced the size of the change significantly, especially in terms of the number of files modified. Could you please take another look? I appreciate your time.

rustyconover · 2024-12-17T08:04:28Z

@Mytherin no rush, but just wanted to let you know this PR is looking good now.

samansmink

Changes look good to me here!

rustyconover · 2024-12-17T22:32:50Z

Hi @samansmink,

Changes look good to me here!

Great to hear that, can you remove the changes requested label?

Thank you!

Rusty

rustyconover · 2025-01-01T01:43:55Z

Just bumping this PR, since it still pending.

Mytherin · 2025-01-06T14:48:44Z

Thanks! The changes make this a lot simpler indeed

rustyconover · 2025-01-06T15:18:22Z

Thank you Mark! And everybody else too who helped.

…

On Mon, Jan 6, 2025 at 09:49 Mark ***@***.***> wrote: Thanks! The changes make this a lot simpler indeed — Reply to this email directly, view it on GitHub <#14674 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAFSWJIVAPIA4HX6TGTQ3OL2JKJWFAVCNFSM6AAAAABRAUEUMGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNZTGI3DOMBUG4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Allow a variable type `rowid` pseudocolumn in tables (duckdb/duckdb#14674)

Allow a variable type `rowid` pseudocolumn in tables (duckdb/duckdb#14674) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>

…ename` a virtual column in the Parquet/CSV/JSON readers (#16248) This PR generalizes the `rowid` column to the broader concept of virtual columns. Previously, `rowid` was a special column that existed within DuckDB tables. This column was special in the sense that it could be queried, but would not show up as an actual column of the table when inspecting the table or when running `SELECT * FROM tbl`: ```sql D create table tbl(i int); D insert into tbl values (42); D select * from tbl; ┌───────┐ │ i │ │ int32 │ ├───────┤ │ 42 │ └───────┘ D describe tbl; ┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐ │ column_name │ column_type │ null │ key │ default │ extra │ │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ ├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤ │ i │ INTEGER │ YES │ NULL │ NULL │ NULL │ └─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘ -- but we can query the column if explicitly mentioned D select rowid, i from tbl; ┌───────┬───────┐ │ rowid │ i │ │ int64 │ int32 │ ├───────┼───────┤ │ 0 │ 42 │ └───────┴───────┘ ``` The `rowid` column is also used internally for several purposes, such as handling deletes/updates of rows, or performing late materialization. In the current implementation - only the `rowid` virtual column exists, and only tables can implement this virtual column. Some previous work was done on making the `rowid` column type more flexible - e.g. #14674 - but still only this virtual column was hard-coded in the system. This PR generalizes the concept of virtual columns by adding a new callback to the `TableFunction`, that allows table functions to return the set of virtual columns that they support: ```cpp struct TableColumn { string name; LogicalType type; }; using virtual_column_map_t = unordered_map<column_t, TableColumn>; typedef virtual_column_map_t (*table_function_get_virtual_columns_t)(ClientContext &context, optional_ptr<FunctionData> bind_data); ``` As an example, if we wanted to support the `rowid` column as before we would set it up like so: ```cpp virtual_column_map_t virtual_columns; virtual_columns.insert(make_pair(COLUMN_IDENTIFIER_ROW_ID, TableColumn("rowid", LogicalType::ROW_TYPE))); ``` However, we can now support multiple virtual columns and can fully customize the virtual columns that we emit. This PR also introduces two new virtual columns: * `COLUMN_IDENTIFIER_FILENAME`, emitted by the Parquet/CSV/JSON readers * `COLUMN_IDENTIFIER_EMPTY ` - this is a non-queryable virtual column. When this column is present, it is added when running `COUNT(*)` over the table function. #### Filename With the addition of the `filename` virtual column, the following snippet now works: ```sql D copy (select 42 i) to tbl.parquet; D select * from tbl.parquet; ┌───────┐ │ i │ │ int32 │ ├───────┤ │ 42 │ └───────┘ D select *, filename from tbl.parquet; ┌───────┬─────────────┐ │ i │ filename │ │ int32 │ varchar │ ├───────┼─────────────┤ │ 42 │ tbl.parquet │ └───────┴─────────────┘ ``` Previously, we could emit the filename only by using the `filename` option. This is now no longer required - we can directly query the `filename` if desired. This PR is primarily adding support for virtual columns - we're planning to extend this support further in the future.

rustyconover force-pushed the feature_dynamic_rowid_type branch from abce24e to fb0d220 Compare November 1, 2024 18:13

duckdb-draftbot marked this pull request as draft November 1, 2024 18:13

rustyconover marked this pull request as ready for review November 1, 2024 18:13

rustyconover marked this pull request as draft November 1, 2024 18:37

rustyconover force-pushed the feature_dynamic_rowid_type branch 2 times, most recently from 0a0e593 to 946c23b Compare November 1, 2024 19:11

rustyconover marked this pull request as ready for review November 1, 2024 19:26

rustyconover force-pushed the feature_dynamic_rowid_type branch from 946c23b to 3039ad0 Compare November 5, 2024 18:25

duckdb-draftbot marked this pull request as draft November 5, 2024 18:25

rustyconover marked this pull request as ready for review November 5, 2024 18:26

rustyconover force-pushed the feature_dynamic_rowid_type branch 2 times, most recently from efff9da to dcc95c2 Compare November 27, 2024 02:33

duckdb-draftbot marked this pull request as draft November 27, 2024 02:33

rustyconover marked this pull request as ready for review November 27, 2024 02:33

szarnyasg added the needs maintainer approval label Dec 2, 2024

rustyconover force-pushed the feature_dynamic_rowid_type branch from dcc95c2 to 52ab6e6 Compare December 2, 2024 22:33

duckdb-draftbot marked this pull request as draft December 2, 2024 22:34

rustyconover marked this pull request as ready for review December 2, 2024 22:34

rustyconover force-pushed the feature_dynamic_rowid_type branch from 52ab6e6 to 6f7de1c Compare December 3, 2024 13:05

duckdb-draftbot marked this pull request as draft December 3, 2024 13:05

rustyconover marked this pull request as ready for review December 4, 2024 10:26

rustyconover force-pushed the feature_dynamic_rowid_type branch from 6f7de1c to ef2f0a5 Compare December 5, 2024 13:14

duckdb-draftbot marked this pull request as draft December 5, 2024 13:15

szarnyasg marked this pull request as ready for review December 9, 2024 12:42

szarnyasg requested a review from Mytherin December 9, 2024 12:42

Mytherin reviewed Dec 12, 2024

View reviewed changes

Mytherin added the Changes Requested label Dec 12, 2024

Changes for a variable type rowid pseudocolumn in tables

6392b1b

Address PR comments

2481584

rustyconover force-pushed the feature_dynamic_rowid_type branch from ef2f0a5 to 2481584 Compare December 17, 2024 04:47

duckdb-draftbot marked this pull request as draft December 17, 2024 04:48

rustyconover marked this pull request as ready for review December 17, 2024 04:49

Fix test failure

55c39c8

duckdb-draftbot marked this pull request as draft December 17, 2024 05:44

rustyconover marked this pull request as ready for review December 17, 2024 05:45

samansmink approved these changes Dec 17, 2024

View reviewed changes

samansmink removed the Changes Requested label Dec 18, 2024

szarnyasg requested a review from hannes December 19, 2024 13:04

Mytherin merged commit 13ba13c into duckdb:main Jan 6, 2025
47 checks passed

szarnyasg added the Needs Documentation Use for issues or PRs that require changes in the documentation label Jan 6, 2025

duckdblabs-bot mentioned this pull request Jan 6, 2025

[duckdb/#14674] - Allow a variable type rowid pseudocolumn in tables needs documentation duckdb/duckdb-web#4482

Closed

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jan 6, 2025

vendor: Update vendored sources to duckdb/duckdb@13ba13c

1a7f133

Allow a variable type `rowid` pseudocolumn in tables (duckdb/duckdb#14674)

github-actions bot mentioned this pull request Jan 6, 2025

vendor: Update vendored sources to duckdb/duckdb@13ba13c121acfb3f4c48c16337297ac705779c19 duckdb/duckdb-r#958

Merged

Mytherin mentioned this pull request Feb 14, 2025

Generalize rowid into the concept of virtual columns, and make filename a virtual column in the Parquet/CSV/JSON readers #16248

Merged

Allow a variable type rowid pseudocolumn in tables #14674

Allow a variable type rowid pseudocolumn in tables #14674

Uh oh!

Conversation

rustyconover commented Nov 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Discussion

Uh oh!

rustyconover commented Dec 12, 2024

Uh oh!

samansmink commented Dec 12, 2024

Uh oh!

Mytherin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Mytherin Dec 12, 2024

Choose a reason for hiding this comment

Uh oh!

rustyconover Dec 17, 2024

Choose a reason for hiding this comment

Uh oh!

rustyconover Dec 17, 2024

Choose a reason for hiding this comment

Uh oh!

Mytherin Dec 12, 2024

Choose a reason for hiding this comment

Uh oh!

rustyconover Dec 17, 2024

Choose a reason for hiding this comment

Uh oh!

rustyconover commented Dec 17, 2024

Uh oh!

rustyconover commented Dec 17, 2024

Uh oh!

samansmink left a comment

Choose a reason for hiding this comment

Uh oh!

rustyconover commented Dec 17, 2024

Uh oh!

rustyconover commented Jan 1, 2025

Uh oh!

Uh oh!

Mytherin commented Jan 6, 2025

Uh oh!

rustyconover commented Jan 6, 2025 via email

Uh oh!

Uh oh!

Allow a variable type `rowid` pseudocolumn in tables #14674

Allow a variable type `rowid` pseudocolumn in tables #14674

rustyconover commented Nov 1, 2024 •

edited

Loading