Add support for scanning over numpy arrays #6523

vlowingkloude · 2023-03-01T17:20:05Z

Support for scanning over numpy arrays.

Three types of NumPy-related parameters are accepted:

Dict[str (name), array] - use the keys of the dict to set the column names
List[array] - use default names (column0, column1, etc.)
array - same as List

To support scanning over numpy arrays, code for scanning pandas dataframes is reused.
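The naming rules listed above can be sketched in a few lines of Python. This is only an illustration: `name_columns` is a hypothetical helper, not a function from DuckDB, and treating a single array as a one-element list is an assumption drawn from the "same as List" note.

```python
import numpy as np


def name_columns(data):
    """Map the three accepted input shapes to {column_name: array}.

    Hypothetical helper for illustration only; not DuckDB's implementation.
    """
    if isinstance(data, dict):
        # Dict[str, array]: the keys become the column names.
        return {str(name): np.asarray(col) for name, col in data.items()}
    if isinstance(data, np.ndarray):
        # Single array: handled like a one-element list (assumption).
        data = [data]
    # List[array]: default names column0, column1, ...
    return {f"column{i}": np.asarray(col) for i, col in enumerate(data)}


named = name_columns({"a": np.array([1, 2, 3])})
unnamed = name_columns([np.array([1, 2, 3]), np.array([4.0, 5.0, 6.0])])
single = name_columns(np.array([7, 8, 9]))

print(list(named))    # ['a']
print(list(unnamed))  # ['column0', 'column1']
print(list(single))   # ['column0']
```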

Tishj

Thanks for the PR!

I definitely think it's a good start, but I'd like this to not piggy-back off of pandas, and instead create DuckDB vectors from numpy arrays directly.

The pandas code is essentially:
dataframe -> internal numpy arrays -> duckdb

In this case we can cut out the first step and deal directly with the numpy arrays - a lot of code should be able to be re-used for this.

tools/pythonpkg/src/include/duckdb_python/pyconnection.hpp

tools/pythonpkg/src/pyconnection.cpp

Tishj · 2023-03-01T21:50:53Z

Is this PR supposed to be related to this discussion?

It looks to me like this PR adds this:

dict_of_arrays = {'a': np.array([1,2,3])}
duckdb.sql('select * from dict_of_arrays')

As an alternate path for:

df = pd.DataFrame({'a': np.array([1,2,3])})
duckdb.sql('select * from df')

vlowingkloude · 2023-03-02T09:13:42Z

Thanks for the comment.

Yes, this PR is related to the discussion you linked.

I understand your point. I actually tried to implement this without converting the numpy arrays back to pandas. For numerical values, that implementation works perfectly. However, there are some bugs when dealing with strings. I'll try to fix them and then re-file the PR.

tools/pythonpkg/tests/fast/numpy/test_numpy_new_path.py

Mytherin · 2023-03-14T08:44:45Z

@Tishj could you have another look at this PR?

Tishj

Thanks for the changes, it looks like it's heading in the right direction!
I just have some questions and a few requests for cleanup.

tools/pythonpkg/src/pandas/scan.cpp

tools/pythonpkg/src/pyconnection.cpp

tools/pythonpkg/src/vector_conversion.cpp

tools/pythonpkg/tests/fast/numpy/test_numpy_new_path.py

tools/pythonpkg/src/vector_conversion.cpp

Tishj · 2023-03-24T11:46:39Z

I think your clang-format version might be too new, and is incompatible with the one the CI uses, please try to downgrade it:
python3 -m pip install clang-format==11.0.1

merged with latest duckdb code base

Mytherin · 2023-04-07T20:03:26Z

@Tishj could you have another look at this PR?

Tishj · 2023-04-07T21:24:27Z

tools/pythonpkg/src/vector_conversion.cpp

+		} else if (bind_data.pandas_type == PandasType::OBJECT && string(py::str(df_types[col_idx])) == "string") {
+			bind_data.pandas_type = PandasType::CATEGORY;
+			auto enum_name = string(py::str(df_columns[col_idx]));
+			auto uniq = py::cast<py::tuple>(py::module_::import("numpy").attr("unique")(column, false, true));


I feel like this needs a comment explaining that this produces a tuple containing the distinct entries (index 0) and, for each original value, its index into that array of distinct entries (index 1).

Probably not verbatim, but that line stumped me a little and I had to go into a python shell to look at help(numpy.unique) 😅
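For readers who have not used `numpy.unique` with these flags, a minimal sketch of what the call returns may help. The sample data is made up; the positional arguments mirror the `unique(column, false, true)` call in the diff, which corresponds to `return_index=False, return_inverse=True`.

```python
import numpy as np

# Made-up sample column of strings.
column = np.array(["b", "a", "b", "c", "a"])

# Positional arguments are (ar, return_index, return_inverse).
# With return_inverse=True the result is a tuple:
#   [0] the sorted distinct entries
#   [1] for each original value, its index into [0]
categories, codes = np.unique(column, False, True)

print(categories)  # ['a' 'b' 'c']
print(codes)       # [1 0 1 2 0]

# Indexing the distinct entries with the codes rebuilds the original column.
roundtrip = categories[codes]
print((roundtrip == column).all())  # True
```

This pair of arrays has the same shape as a dictionary-encoded (categorical) column: a list of distinct values plus one integer code per row.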

Tishj

Thanks for the changes, I just have a nitpick about a missing comment but other than that I think this is ready 👍

merged with latest duckdb

Tishj

Thanks for the changes, LGTM 👍

Mytherin · 2023-04-10T06:54:36Z

Thanks!

add support for scaning over numpy arrays

9586e71

Tishj suggested changes Mar 1, 2023

vlowingkloude added 5 commits March 3, 2023 00:36

support for scaning over numpy arrays without converting to pandas

ac7c65d

format fix

81b43e2

format fix

8f83042

format fix

a5eace5

tidy fix

c6bf66e

Mytherin reviewed Mar 8, 2023

tools/pythonpkg/tests/fast/numpy/test_numpy_new_path.py

add more test cases and a small fix on throwing exception in TryReplace

60c6810

Tishj reviewed Mar 14, 2023

vlowingkloude added 3 commits March 15, 2023 18:10

reformatting of some code; use pytest.raises for testing

6a272ca

create ENUM type for numpy array of strings

5dbed05

format fix

9ce3e69

vlowingkloude and others added 4 commits March 25, 2023 18:34

format fix

7897f8f

naming style fix

9b6f466

Merge pull request #1 from duckdb/master

18f8cda

merged with latest duckdb code base

merged with latest duckdb, changed uniqle to uniq

f2e75af

Tishj reviewed Apr 7, 2023

Tishj suggested changes Apr 7, 2023

Tishj mentioned this pull request Apr 8, 2023

[Python] Add support for Pandas 2.0.0 #7005

Merged

vlowingkloude and others added 3 commits April 8, 2023 23:03

add a comment on numpy.unique function call

8826209

Merge pull request #2 from duckdb/master

1d69aa0

merged with latest duckdb

small format fix

8ab8c73

Tishj approved these changes Apr 9, 2023

Mytherin merged commit 7bdaddf into duckdb:master Apr 10, 2023
