add support for scanning over numpy arrays #6523
Conversation
Thanks for the PR!
I definitely think it's a good start, but I'd like this to not piggy-back off of pandas, and instead create DuckDB vectors from numpy arrays directly.
The pandas code is essentially:
dataframe -> internal numpy arrays -> duckdb
In this case we can cut out the first step and deal directly with the numpy arrays - a lot of code should be able to be re-used for this.
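As a rough illustration of that pipeline (a sketch only; the real conversion happens in the C++ binding code), the pandas scan conceptually does something like:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# First step (the one suggested to be cut out): pull the underlying
# numpy array out of each DataFrame column.
numpy_columns = {name: df[name].to_numpy() for name in df.columns}

# Second step: those per-column numpy arrays are what actually get
# converted into DuckDB vectors. A direct numpy scan would start here.
print({name: arr.dtype for name, arr in numpy_columns.items()})
```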
Is this PR supposed to be related to this discussion? It looks to me like this PR adds this:

```python
dict_of_arrays = {'a': np.array([1, 2, 3])}
duckdb.sql('select * from dict_of_arrays')
```

as an alternate path for:

```python
df = pd.DataFrame({'a': np.array([1, 2, 3])})
duckdb.sql('select * from df')
```
Thanks for the comment. Yes, this PR is related to the discussion you mentioned. I understand your point: I actually did try to implement this without converting the numpy arrays back to pandas. For numerical values that implementation works perfectly, but there are still some bugs when dealing with strings. I'll try to fix them and update the PR.
@Tishj could you have another look at this PR?
Thanks for the changes, it looks like it's heading in the right direction!
I just have some questions and requests for some cleanup
I think your clang-format version might be too new and incompatible with the one the CI uses; please try to downgrade it.
merged with latest duckdb code base
@Tishj could you have another look at this PR?
```cpp
} else if (bind_data.pandas_type == PandasType::OBJECT && string(py::str(df_types[col_idx])) == "string") {
	bind_data.pandas_type = PandasType::CATEGORY;
	auto enum_name = string(py::str(df_columns[col_idx]));
	auto uniq = py::cast<py::tuple>(py::module_::import("numpy").attr("unique")(column, false, true));
```
I feel like this needs a comment to explain that this produces a tuple containing the distinct entries (0) and the indices into that array for the corresponding original values (1).
Probably not verbatim, but that line stumped me a little and I had to go into a Python shell to look at help(numpy.unique)
😅
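For reference, here is a quick Python illustration (my own example, not from the PR) of the two-element tuple that call produces:

```python
import numpy as np

column = np.array(["b", "a", "b", "c"])

# The positional arguments in the C++ snippet correspond to:
# return_index=False, return_inverse=True.
uniq, inverse = np.unique(column, return_inverse=True)

print(uniq)     # ['a' 'b' 'c']  -> (0): the distinct entries
print(inverse)  # [1 0 1 2]      -> (1): index into uniq for each original value
```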
Thanks for the changes, I just have a nitpick about a missing comment but other than that I think this is ready 👍
merged with latest duckdb
Thanks for the changes, LGTM 👍
Thanks!
Support for scanning over numpy arrays.
Three types of numpy-related parameters are accepted:
To support scanning over numpy arrays, the code for scanning pandas DataFrames is reused.
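A minimal usage sketch, assuming the dict-of-arrays form discussed above (the variable and column names here are illustrative):

```python
import duckdb
import numpy as np

# A dict mapping column names to 1-D numpy arrays, analogous to a DataFrame.
dict_of_arrays = {"a": np.array([1, 2, 3]), "b": np.array([4.0, 5.0, 6.0])}

# The replacement scan resolves the local variable by name,
# just as it already does for pandas DataFrames.
print(duckdb.sql("SELECT * FROM dict_of_arrays").fetchall())
```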