-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Allow a variable type rowid
pseudocolumn in tables
#14674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
abce24e
to
fb0d220
Compare
0a0e593
to
946c23b
Compare
946c23b
to
3039ad0
Compare
efff9da
to
dcc95c2
Compare
dcc95c2
to
52ab6e6
Compare
52ab6e6
to
6f7de1c
Compare
6f7de1c
to
ef2f0a5
Compare
@samansmink Hi Sam, could you peek at this too? Thank you. |
@rustyconover I think @Mytherin would be more suitable for reviewing this one! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
I have been thinking about reworking the rowid
type and generated columns in general to be less hard-coded, and while I think this is a good change (making row-id type configurable) this changeset moves in the opposite direction and introduces even more hard-coding for the row-id column, which will make reworking it later on more difficult.
My ideal version of this would be that table functions can provide a set of columns that are queryable, but not included/shown by default, i.e. similar to how SELECT * FROM tbl
does not show the rowid column. Then scanners could expose multiple columns in a more easy manner. For example, we could easily expose SELECT row_group_id FROM tbl
as well. Similarly, many of the columns that are currently emitted by e.g. Parquet files would no longer need to be options. We could simply query SELECT filename FROM file.parquet
instead of having to do SELECT filename FROM read_parquet('file.parquet', filename=True)
.
That also solves this problem because scanners could register their own rowid
column with whichever type they want using that generic interface.
I'm not opposed to merging this with some changes made to it - but there's a high chance this will be completely overhauled/reworked in the (near) future.
Nevertheless, if you want to continue with this changeset, there's some comments below:
@@ -114,10 +115,18 @@ class TableCatalogEntry : public StandardEntry { | |||
//! Returns true, if the table has a primary key, else false. | |||
bool HasPrimaryKey() const; | |||
|
|||
//! Returns the rowid type of this table | |||
LogicalType GetRowIdType() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this a virtual method instead that returns LogicalType::ROW_TYPE
? Then we don't need to keep this as a member or take this in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
@@ -101,7 +104,7 @@ struct TableBinding : public Binding { | |||
public: | |||
TableBinding(const string &alias, vector<LogicalType> types, vector<string> names, | |||
vector<ColumnIndex> &bound_column_ids, optional_ptr<StandardEntry> entry, idx_t index, | |||
bool add_row_id = false); | |||
bool add_row_id = false, LogicalType rowid_type = LogicalType(LogicalType::ROW_TYPE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TableBinding
already takes a StandardEntry (which should be a TableCatalogEntry
if add_row_id
is true). Can we just call entry->Cast<TableCatalogEntry>().GetRowIdType()
instead to avoid having to pull this through many layers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this change and it simplified things quite a bit.
ef2f0a5
to
2481584
Compare
Thank you for the thoughtful review and for being open to this idea!
I understand your concerns here. While I hope this doesn’t add too much rigidity, I’m happy for this interface to be replaced in the future if something more flexible and powerful is introduced.
That sounds like a fantastic approach. It would be particularly helpful for cases like Delta Lake row tracking, where
Absolutely—this design would be a great fit for that!
That works for me. I fully understand that the C++ API comes with no stability guarantees, so I’m okay with the possibility of this being replaced or reworked down the line.
I’ve addressed the feedback you provided, and the updates have reduced the size of the change significantly, especially in terms of the number of files modified. Could you please take another look? I appreciate your time. |
@Mytherin no rush, but just wanted to let you know this PR is looking good now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me here!
Hi @samansmink,
Great to hear that, can you remove the changes requested label? Thank you! Rusty |
Just bumping this PR, since it still pending. |
Thanks! The changes make this a lot simpler indeed |
Thank you Mark! And everybody else too who helped.
…On Mon, Jan 6, 2025 at 09:49 Mark ***@***.***> wrote:
Thanks! The changes make this a lot simpler indeed
—
Reply to this email directly, view it on GitHub
<#14674 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFSWJIVAPIA4HX6TGTQ3OL2JKJWFAVCNFSM6AAAAABRAUEUMGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNZTGI3DOMBUG4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Allow a variable type `rowid` pseudocolumn in tables (duckdb/duckdb#14674)
Allow a variable type `rowid` pseudocolumn in tables (duckdb/duckdb#14674) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
…ename` a virtual column in the Parquet/CSV/JSON readers (#16248) This PR generalizes the `rowid` column to the broader concept of virtual columns. Previously, `rowid` was a special column that existed within DuckDB tables. This column was special in the sense that it could be queried, but would not show up as an actual column of the table when inspecting the table or when running `SELECT * FROM tbl`: ```sql D create table tbl(i int); D insert into tbl values (42); D select * from tbl; ┌───────┐ │ i │ │ int32 │ ├───────┤ │ 42 │ └───────┘ D describe tbl; ┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐ │ column_name │ column_type │ null │ key │ default │ extra │ │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ ├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤ │ i │ INTEGER │ YES │ NULL │ NULL │ NULL │ └─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘ -- but we can query the column if explicitly mentioned D select rowid, i from tbl; ┌───────┬───────┐ │ rowid │ i │ │ int64 │ int32 │ ├───────┼───────┤ │ 0 │ 42 │ └───────┴───────┘ ``` The `rowid` column is also used internally for several purposes, such as handling deletes/updates of rows, or performing late materialization. In the current implementation - only the `rowid` virtual column exists, and only tables can implement this virtual column. Some previous work was done on making the `rowid` column type more flexible - e.g. #14674 - but still only this virtual column was hard-coded in the system. This PR generalizes the concept of virtual columns by adding a new callback to the `TableFunction`, that allows table functions to return the set of virtual columns that they support: ```cpp struct TableColumn { string name; LogicalType type; }; using virtual_column_map_t = unordered_map<column_t, TableColumn>; typedef virtual_column_map_t (*table_function_get_virtual_columns_t)(ClientContext &context, optional_ptr<FunctionData> bind_data); ``` As an example, if we wanted to support the `rowid` column as before we would set it up like so: ```cpp virtual_column_map_t virtual_columns; virtual_columns.insert(make_pair(COLUMN_IDENTIFIER_ROW_ID, TableColumn("rowid", LogicalType::ROW_TYPE))); ``` However, we can now support multiple virtual columns and can fully customize the virtual columns that we emit. This PR also introduces two new virtual columns: * `COLUMN_IDENTIFIER_FILENAME`, emitted by the Parquet/CSV/JSON readers * `COLUMN_IDENTIFIER_EMPTY ` - this is a non-queryable virtual column. When this column is present, it is added when running `COUNT(*)` over the table function. #### Filename With the addition of the `filename` virtual column, the following snippet now works: ```sql D copy (select 42 i) to tbl.parquet; D select * from tbl.parquet; ┌───────┐ │ i │ │ int32 │ ├───────┤ │ 42 │ └───────┘ D select *, filename from tbl.parquet; ┌───────┬─────────────┐ │ i │ filename │ │ int32 │ varchar │ ├───────┼─────────────┤ │ 42 │ tbl.parquet │ └───────┴─────────────┘ ``` Previously, we could emit the filename only by using the `filename` option. This is now no longer required - we can directly query the `filename` if desired. This PR is primarily adding support for virtual columns - we're planning to extend this support further in the future.
This PR introduces support for configuring the
rowid
pseudo-column to use a type other thanBIGINT
. In particular, it enablesrowid
to be defined as either aBLOB
orVARCHAR
. I think this is useful when working with external data sources, and especially Apache Arrow provided data sources. This PR is one step on the road for enablingDELETE
/UPDATE
/INSERT
to tables via Apache Arrow Flight but could be used by other data sources.Changes
To support this, I’ve updated the
TableCatalogEntry
to specify the type ofrowid
as aLogicalType
, so this change can be used by other catalog implementations.@pdet, In the Arrow integration,
ArrowTableFunction::ArrowToDuckDB
now accepts an optionalrowid_column_index
parameter to indicate the position of therowid
column within the Arrow table. This change ensures thatrowid
can appear at any position within an Arrow array.To fully support this functionality, I also adjusted
BindContext::AddBaseTable
, and theLogicalGet
, andBinding
classes, ensuring that therowid
column type is dynamic but defaults toLogicalTypeId::ROW_TYPE
if unspecified.Discussion
I realize this PR is a big bigger than the size I typically send, and will require more energy to review. I tried to make it as small as possible, but if there is a better way to accomplish what I'm trying to do I'm happy to rework the approach. I’d also be happy to discuss any details or adjustments to make this approach work smoothly for others. Thanks for reviewing this.