Skip to content

Conversation

drizk1
Copy link
Contributor

@drizk1 drizk1 commented Sep 18, 2024

This PR adds support and tests for defining scalar UDFs in julia and using them in duckdb. Thank you for an incredible tool.

Below is an example.

using DuckDB
function MyAddition(info::DuckDB.duckdb_function_info, input::DuckDB.duckdb_data_chunk, output::DuckDB.duckdb_vector)
    input = DuckDB.DataChunk(input, false)
    n = DuckDB.get_size(input)
    a_data = DuckDB.get_array(DuckDB.get_vector(input, 1), Int64, n)
    b_data = DuckDB.get_array(DuckDB.get_vector(input, 2), Int64, n)

    output_data = DuckDB.get_array(DuckDB.Vec(output), Int64, n)
    for row in 1:n
        output_data[row] = a_data[row] + b_data[row]
    end
end


db = DuckDB.DB()
con = DuckDB.connect(db)
# create a scalar function
f = DuckDB.duckdb_create_scalar_function()
DuckDB.duckdb_scalar_function_set_name(f, "my_addition")
# add two bigint parameters
type = DuckDB.duckdb_create_logical_type(DuckDB.DUCKDB_TYPE_BIGINT)
DuckDB.duckdb_table_function_add_parameter(f, type)
DuckDB.duckdb_table_function_add_parameter(f, type)
# set the return type to bigint
DuckDB.duckdb_scalar_function_set_return_type(f, type)
DuckDB.duckdb_destroy_logical_type(type)
# set up the function
CMyAddition = @cfunction(MyAddition, Cvoid, (DuckDB.duckdb_function_info, DuckDB.duckdb_data_chunk, DuckDB.duckdb_vector))
DuckDB.duckdb_scalar_function_set_function(f, CMyAddition)
# register and cleanup
DuckDB.duckdb_register_scalar_function(con.handle, f)
DuckDB.duckdb_destroy_scalar_function(f)

DuckDB.query(con, "CREATE TABLE big_table AS SELECT i FROM range(9) t(i)") |> DataFrame
DuckDB.query(con, "SELECT my_addition(i, i) FROM big_table") |> DataFrame

@duckdb-draftbot duckdb-draftbot marked this pull request as draft September 18, 2024 21:47
@drizk1 drizk1 marked this pull request as ready for review September 18, 2024 21:47
@duckdb-draftbot duckdb-draftbot marked this pull request as draft September 18, 2024 22:00
@drizk1 drizk1 marked this pull request as ready for review September 18, 2024 22:08
@duckdb-draftbot duckdb-draftbot marked this pull request as draft September 18, 2024 23:52
@drizk1 drizk1 marked this pull request as ready for review September 18, 2024 23:57
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM - one comment

@duckdb-draftbot duckdb-draftbot marked this pull request as draft September 19, 2024 10:15
@drizk1 drizk1 marked this pull request as ready for review September 19, 2024 10:19
@Mytherin Mytherin merged commit 38bacb3 into duckdb:main Sep 19, 2024
8 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Sep 27, 2024
Fix duckdb/duckdb#13993 - avoid disabling optimizers for SET VARIABLE (duckdb/duckdb#14028)
Proper NULL handling in special json extraction functions (duckdb/duckdb#14032)
Adds Julia support for scalar UDFs (duckdb/duckdb#14024)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Sep 27, 2024
Fix duckdb/duckdb#13993 - avoid disabling optimizers for SET VARIABLE (duckdb/duckdb#14028)
Proper NULL handling in special json extraction functions (duckdb/duckdb#14032)
Adds Julia support for scalar UDFs (duckdb/duckdb#14024)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Mytherin added a commit that referenced this pull request Jan 3, 2025
This PR improves the Julia support for user defined UDFs.

It adds the `ScalarFunction` type and the `@create_scalar_function`
macro that automatically generates a DuckDB compatible wrapper. See also
discussion #13176 for details.

Numeric types, dates, strings, missing values and exceptions are
supported. Composite types are not yet implemented.

## Example Usage:

```julia
using DuckDB, DataFrames

f_add = (a, b) -> a + b
db = DuckDB.DB()
con = DuckDB.connect(db)

# Create the scalar function 
# the second argument can be omitted if the function name is identical to a global symbol
fun = DuckDB.@create_scalar_function f_add(a::Int, b::Int)::Int f_add
DuckDB.register_scalar_function(con, fun)

df = DataFrame(a = [1, 2, 3], b = [1, 2, 3])
DuckDB.register_table(con, df, "test1")

result = DuckDB.execute(con, "SELECT f_add(a, b) as result FROM test1") |> DataFrame
```

## Performance

The performance of the auto-generated wrapper is comparable to pure
DuckDB/Julia. I measured the elapsed time (in seconds) of adding 10
million numbers in a coarse benchmark:

|               | Int         | Float       |
| ------------- | ----------- | ----------- |
| DataFrames.jl | 0.092947083 | 0.090409625 |
| DuckDB        | 0.065306042 | 0.054156167 |
| UDF           | 0.078665125 | 0.080781    |



## Internals

The scalar functions are tracked in a dictionary in the DuckDBHandle
struct. Currently only registering scalar function is supported.

The wrapper is generated via the macro and should be fully type stable.
The wrapper is generated via the function `_udf_generate_wrapper()`.

Because of limitations of the `@cfunction` macro in Julia, the wrapper
needs to be globally accessible. I implemented this by introducing a
global (constant) dictionary variable in DuckDB `_UDF_WRAPPER_CACHE`
which is used to store the generated wrappers. This is defined in
`_udf_register_wrapper()`. This is, in my opinion not ideal but works. I
asked in Julia related forums for a better solution and will update the
code, if something better is possible.


This is an update to PR #14024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants