Unnecessarily restrictive signature of list_reduce(..., initial) #17009

@soerenwolfers

Description

What happens?

The new list_reduce(..., initial) has the signature

list_reduce(seq: T[], fun: (T, T) -> T, initial: T)

instead of the "correct" signature

list_reduce(seq: T[], fun: (S, T) -> S, initial: S)

which means it can't do anything that couldn't already be done with

list_reduce([initial] + seq, fun)

(the latter even has the advantage that you don't need to guess whether fun or initial comes first).
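For illustration, here is a minimal sketch of a fold whose accumulator type S differs from the element type T, which the generic signature would support directly. The expected result assumes the (S, T) -> S signature; with the current all-T signature the BOOLEAN initial and the INTEGER elements would presumably be coerced to a common type first, as in the examples further below:

select list_reduce([1, -2, 3], (acc, x) -> acc and x > 0, true) as all_positive
-- expected with (S, T) -> S: false (not every element is positive)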

In particular, the simplest fold, the identity on lists, isn't possible under the current signature constraint:

select list_reduce([1], (x, y) -> x || [y], [])
Binder Error: Cannot concatenate types INTEGER and INTEGER[] - an explicit cast is required

and the casts unnecessarily inserted to enforce the all-T signature produce "obviously wrong" results:

select
    y1: list_reduce([0], (x, y) -> if(len(x::VARCHAR) - 1, true, false), false), -- should be `true`, is `0`
    y2: list_reduce([2], (x, y) -> ['3.1', '1.4'][y], 1) -- should be `1.4`, is `1`
┌───────┬───────┐
│  y1   │  y2   │
│ int32 │ int32 │
├───────┼───────┤
│     0 │     1 │
└───────┴───────┘

Given that the new list_reduce(..., initial) is logically just a rewrite of list_reduce([initial] || ...), it is not surprising that the latter produces the same results. That is just another reason why the initial variant shouldn't merely be a rewrite, though: in the rewritten form you can at least understand what is going on, because [initial] || seq must obviously be assigned some type, and it is excusable that the type inference for that choice doesn't consider the lambda function. With an explicit initial argument, on the other hand, one would intuitively expect that the types of initial and seq need not be related at all, and would thus be surprised by the resulting errors (or, worse, wrong results). In fact, I believe

list_reduce(<SeqExpr>, <LambdaArgAndValueExpr>, <InitExpr>)

should have the same type as the "super type" of <InitExpr> and

<LambdaValueExpr>(<InitExpr>, <SeqExpr>[0])

If implementing that isn't as easy as I imagine, I'd argue that one should simply take the type of <InitExpr> as given, and throw if <LambdaValueExpr>(<InitExpr>, <SeqExpr>[0]) isn't implicitly castable to that.
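As a sketch of what the proposed rule would give for the identity fold from above (expected results under the proposal, not current behaviour): the initial [] would take its type from the lambda result, so the whole expression would type-check as INTEGER[]:

select list_reduce([1, 2], (acc, x) -> acc || [x], []) as identity_fold
-- expected under the proposed rule: [1, 2]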

In summary, I think the danger of users getting wrong results without noticing (or spending time figuring out what's happening when they do notice) currently outweighs the benefit of the two characters saved by the new syntax in cases where the two-argument function was already sufficient.

Finally, I assume that the unnecessary casts play a part in the internal error reported in #17008.

To Reproduce

.

OS:

Linux

DuckDB Version:

'1.3.0-dev1976'

DuckDB Client:

Python

Hardware:

.

Full Name:

Soeren Wolfers

Affiliation:

G-Research

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
