
Conversation

@Mytherin Mytherin commented Jun 5, 2025

@evertlammerts @carlopi I had to reconcile a merge conflict between #17708 and #17605 - could you double check that the code in setup.py is still correct?

krlmlr and others added 30 commits May 25, 2025 16:14
…nction

Move the version parsing and bumping logic to the top of the file and
consolidate version handling in a single bump_version function. Replace the
complex setuptools_scm parsing and version_scheme with a streamlined
implementation that handles the OVERRIDE_GIT_DESCRIBE environment variable.
…#17689)

`FileExists` returns true for root buckets on S3 (e.g.
`s3://root-bucket/`). This currently causes partitioned copy operations like
the following to fail:

```sql
copy (select 42 i, 1 p) to 's3://root-bucket/' (format parquet, partition_by p);
-- Cannot write to "s3://root-bucket/" - it exists and is a file, not a directory!
```

The check ("is this a file or is this a directory") doesn't really make
sense on blob stores to begin with, so we simply skip it for remote files.
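With the check skipped, the same statement succeeds and writes the partitioned files under the bucket root:

```sql
-- The previously failing COPY now works on a root bucket:
copy (select 42 i, 1 p) to 's3://root-bucket/' (format parquet, partition_by p);
```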
Fixes:
1. duckdb#17682 (a missing `!` led to using an
uninitialized variable in the Parquet BSS encoder)
2. duckdblabs/duckdb-internal#4999
(`ExternalFileCache` assertion failure because we exited the loop too early)
`GetFileHandle()` bypasses a `validate` check that tells the
caching file system to prefer file data already in the cache. Calling `CanSeek()`
first checks the cache for whether the file is present and whether
seeking is possible. This avoids an unnecessary HEAD request for full
file reads (such as Avro on Iceberg).
Currently we assume all plans can be cached; this change allows table
functions to opt out of statement caching. When a table function opts out, we
always rebind when re-executing a prepared statement instead.
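A minimal sketch of the observable effect, assuming for illustration that the table function behind the statement has opted out of caching:

```sql
-- Hypothetical example: if the underlying table function opted out of
-- statement caching, each EXECUTE rebinds the plan instead of reusing it.
PREPARE q AS SELECT count(*) FROM read_csv('data.csv');
EXECUTE q;  -- binds the plan
EXECUTE q;  -- rebinds rather than reusing a cached plan
```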
…`query` statement (duckdb#17710)

This PR is part of fixing
duckdblabs/duckdb-internal#5006

This is required for `duckdb-iceberg`, as it uses `<FILE>:` for its TPCH
tests, which requires a `__WORKING_DIRECTORY__` to function when called
from duckdb/duckdb
(duckdb/duckdb-iceberg#270)
* Only generate IN pushdown filters for equality conditions (see the sketch after this list)

fixes: duckdblabs/duckdb-internal#5022
* Add missing commutative variants for DOUBLE and BIGINT
* Fix broken test that no one seemed to have noticed...

fixes: duckdblabs/duckdb-internal#4995
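A sketch of the kind of predicates affected, using a hypothetical table `t`:

```sql
-- A disjunction of pure equality conditions on the same column can be
-- collapsed into an IN filter and pushed down into the scan:
SELECT * FROM t WHERE a = 1 OR a = 2 OR a = 3;
-- A disjunction mixing other comparisons no longer produces an IN filter:
SELECT * FROM t WHERE a = 1 OR a > 2;
```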
… is FIXED_LEN_BYTE_ARRAY (duckdb#17723)

Fixes a regression introduced in
duckdb#16161

Type length may also be set for variable-length byte arrays (in which
case it should be ignored).
Mytherin and others added 15 commits June 4, 2025 09:43
…duckdb#17581)

Fixes: duckdb#17008

This PR fixes an issue with type casting in lambda expressions used in
the `list_reduce` function.

For example, in the following query:
```sql
select list_reduce([0], (x, y) -> x > 3, 3.1)
```

The lambda expression was incorrectly bound as:
```sql
CAST((x > CAST(3 AS INTEGER)) AS DECIMAL(11,1))
```

Now a proper cast is implemented, targeting the maximum logical type of
the list child type and the initial value type:
```sql
CAST((x > CAST(3 AS DECIMAL(11,1))) AS DECIMAL(11,1))
```
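One way to inspect the resulting binding is via `EXPLAIN`; a sketch (the exact plan rendering varies by version):

```sql
-- The projection in the plan shows the cast applied to the lambda body.
EXPLAIN SELECT list_reduce([0], (x, y) -> x > 3, 3.1);
```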

### Test Cases
Added tests to verify the correct casting behaviour.
The current status of the cache shows that entries are not really long-lived,
with a sizeable share of the space (unmeasured, possibly 20%+) occupied by
repeated msys2 items that are only reused within the same PR.

Moved to the regular behaviour of caching only on the main branch.
Add infrastructure so that a subset of extensions can be tested on each PR,
which should speed up CI times with limited risk.

Currently this skips `encodings` and `spatial` on PRs. Which extensions to skip
is up for discussion; I would be in favour of expanding the list even further,
given that most extensions do not actually impact the CI surface.

I would like to see whether this actually works (minor implementation
details might be off), then we can discuss.
This could also be paired with other ideas to improve the PR process, such as
an "extra tests" tag to opt in to the full set of tests, but I think this is
worth having in isolation as well.
…_ptr (duckdb#17749)

This also simplifies the destruction code, since these strings will be
cleaned up with the `ArgMinMaxStateBase`
Very basic initial implementation. Adds a new log type
(`PhysicalOperator`) and adds logging for the hash join and the Parquet
writer. I've implemented a utility that can be passed into classes we
use during execution, such as `JoinHashTable` and `ParquetWriter`, that
logs messages while recording which operator they belong to:
```sql
D pragma enable_logging;
D set logging_level='DEBUG';
D set debug_force_external=true;
D set threads=1;
D copy (
      select t1.i
      from range(3_000_000) t1(i)
      join range(3_000_000) t2(i)
      using (i)
  ) to 'physical_operator_logging.parquet';
D pragma disable_logging;
┌──────────────────┬───────────────┬──────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│       type       │ operator_type │                                parameters                                │                                                           message                                                           │
│     varchar      │    varchar    │                          map(varchar, varchar)                           │                                                           varchar                                                           │
├──────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187250 rows, 7377554 bytes)                                                                         │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ External hash join: enabled. Size (118108864 bytes) greater than reservation (15782448 bytes)                               │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122896 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (64354 rows, 532480 bytes) to file "physical_operator_logging.parquet" (Combine)                         │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187018 rows, 7373610 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122880 rows, 998400 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded)  │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (188044 rows, 7391052 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (123530 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122880 rows, 998400 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded)  │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187019 rows, 7373627 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (124556 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187663 rows, 7384575 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (123531 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122880 rows, 998400 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded)  │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187163 rows, 7376075 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (124175 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (188208 rows, 7393840 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (123675 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122880 rows, 998400 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded)  │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187522 rows, 7382178 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (124720 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187690 rows, 7385034 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (124034 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122880 rows, 998400 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded)  │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187526 rows, 7382246 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (124202 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187171 rows, 7376211 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (124038 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122880 rows, 998400 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded)  │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187427 rows, 7380563 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (123683 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187701 rows, 7385221 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (123939 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122880 rows, 998400 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded)  │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187690 rows, 7385034 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (124213 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187374 rows, 7379662 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (124202 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (122880 rows, 998400 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded)  │
│ PhysicalOperator │ HASH_JOIN     │ {Join Type=INNER, Conditions='i = i', __estimated_cardinality__=3000000} │ Building JoinHashTable (187534 rows, 7382382 bytes)                                                                         │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (123886 rows, 1015040 bytes) to file "physical_operator_logging.parquet" (Sink: ROW_GROUP_SIZE exceeded) │
│ PhysicalOperator │ COPY_TO_FILE  │ {}                                                                       │ Flushing row group (93326 rows, 765440 bytes) to file "physical_operator_logging.parquet" (Combine)                         │
├──────────────────┴───────────────┴──────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 42 rows                                                                                                                                                                                                                         4 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```
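To inspect the collected entries afterwards, a sketch (assuming the log is exposed through the `duckdb_logs` table function):

```sql
-- Filter the log down to the new PhysicalOperator entries.
SELECT *
FROM duckdb_logs
WHERE type = 'PhysicalOperator';
```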
I'd be happy to receive any feedback on this :)
…midcomment line during sniffing (duckdb#17751)

This PR considers the null_padding option when detecting comments in a
CSV file.
It also quotes values that contain a possible comment character (i.e., '#').
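A sketch of the relevant reader options, using a hypothetical `data.csv`:

```sql
-- Combine null_padding with an explicit comment character so that
-- mid-row comments are detected and handled during sniffing.
SELECT *
FROM read_csv('data.csv', null_padding = true, comment = '#');
```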

Fix: duckdb#17744
Co-authored-by: Carlo Piovesan <piovesan.carlo@gmail.com>
@Mytherin Mytherin marked this pull request as draft June 5, 2025 08:34
@Mytherin Mytherin marked this pull request as ready for review June 5, 2025 08:34
@duckdb-draftbot duckdb-draftbot marked this pull request as draft June 5, 2025 08:37
@Mytherin Mytherin marked this pull request as ready for review June 5, 2025 08:49
@Mytherin Mytherin merged commit f85436e into duckdb:main Jun 5, 2025
53 of 54 checks passed
@carlopi carlopi mentioned this pull request Jun 6, 2025
Mytherin added a commit that referenced this pull request Jun 6, 2025
More merging in `main`, with the twist that I did not see the proper
merge conflict raised at
#17806 (comment).

(@evertlammerts)

This also includes #17831
@Mytherin Mytherin deleted the mergev13again branch June 12, 2025 15:27
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jun 21, 2025
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Jun 21, 2025
Merge v1.3 into main (duckdb/duckdb#17806)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>