Skip to content

Conversation

vlowingkloude
Copy link
Contributor

Support for scanning over numpy arrays.

Three types of Numpy-related parameter are accepted:

  • Dict[str (name), array] - use the keys of the dict to set the column names
  • List[array] - use default names (column0, column1, etc..)
  • array - same as List

To support scanning over numpy arrays, code for scanning pandas dataframes is reused.

Copy link
Contributor

@Tishj Tishj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I definitely think it's a good start, but I'd like this to not piggy-back off of pandas, and instead create DuckDB vectors from numpy arrays directly.

The pandas code is essentially:
dataframe -> internal numpy arrays -> duckdb

In this case we can cut out the first step and deal directly with the numpy arrays - a lot of code should be able to be re-used for this.

@Tishj
Copy link
Contributor

Tishj commented Mar 1, 2023

Is this PR supposed related to this discussion?

It looks to me like this PR adds this:

dict_of_arrays = {'a': np.ndarray([1,2,3])}
duckdb.sql('select * from dict_of_arrays')

As an alternate path for:

df = pd.DataFrame({'a': np.ndarray([1,2,3])})
duckdb.sql('select * from df')

@vlowingkloude
Copy link
Contributor Author

Thanks for the comment.

Yes, this PR is related with your mentioned link.

I understand your point. Actually I tried to implement this without converting numpy arrays back to pandas. For numerical values, this implementation works perfectly. However there is some bugs when dealing with strings. I'll try to solve it and re-file a PR from now on.

@Mytherin
Copy link
Collaborator

@Tishj could you have another look at this PR?

Copy link
Contributor

@Tishj Tishj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, it looks like it's heading in the right direction!
I just have some questions and requests for some cleanup

@Tishj
Copy link
Contributor

Tishj commented Mar 24, 2023

I think your clang-format version might be too new, and is incompatible with the one the CI uses, please try to downgrade it:
python3 -m pip install clang-format==11.0.1

@Mytherin
Copy link
Collaborator

Mytherin commented Apr 7, 2023

@Tishj could you have another look at this PR?

} else if (bind_data.pandas_type == PandasType::OBJECT && string(py::str(df_types[col_idx])) == "string") {
bind_data.pandas_type = PandasType::CATEGORY;
auto enum_name = string(py::str(df_columns[col_idx]));
auto uniq = py::cast<py::tuple>(py::module_::import("numpy").attr("unique")(column, false, true));
Copy link
Contributor

@Tishj Tishj Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this needs a comment to explain that this produces a tuple containing the distinct entries (0) and the indices into that array for the corresponding original values (1)

Probably not verbatim, but that line stumped me a little and I had to go into a python shell to look at help(numpy.unique) 😅

Copy link
Contributor

@Tishj Tishj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, I just have a nitpick about a missing comment but other than that I think this is ready 👍

Copy link
Contributor

@Tishj Tishj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, LGTM 👍

@Mytherin Mytherin merged commit 7bdaddf into duckdb:master Apr 10, 2023
@Mytherin
Copy link
Collaborator

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants