[Python] Pandas Analyzer no longer trips up when the `pandas_analyze_sample` would only let it find nulls. #9811

Tishj · 2023-11-27T12:55:56Z

This PR fixes #6669

The logic for getting the offset for sampling is:

	auto sample = sample_size;
	if (sample > rows) {
		sample = rows;
	}
	return rows / sample;

This lets us scan roughly sample_size rows from the dataset.
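As a rough illustration, the offset logic above can be mirrored in Python. The names `sample_increment` and `sampled_rows` are hypothetical and not part of DuckDB, and the loop that steps through the rows is an assumption about how the offset is used. The sketch also assumes `rows` and `sample_size` are both positive.

```python
def sample_increment(rows: int, sample_size: int) -> int:
    """Mirror of the offset logic above: the step between sampled rows."""
    sample = min(sample_size, rows)  # clamp the sample to the number of rows
    return rows // sample            # integer division, as in the C++ snippet


def sampled_rows(rows: int, sample_size: int) -> list[int]:
    """Row indices visited when stepping through the data (assumed loop shape)."""
    step = sample_increment(rows, sample_size)
    return list(range(0, rows, step))


print(sample_increment(10, 5))    # 2
print(sampled_rows(10, 5))        # [0, 2, 4, 6, 8]
print(sample_increment(3, 1000))  # 1, because the sample is clamped to rows
```

Because the division is truncating, the number of visited rows can differ slightly from `sample_size` when `rows` is not a multiple of it.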

A problem with this is that if null values recur in the dataset at exactly this offset, the analyzer only finds nulls and sets the type to NULL. When a non-null value is later encountered, it is cast to the result type NULL, and that causes this issue.

To combat this issue, we now find the first non-null value starting from that offset, instead of just taking the value found at that offset.
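A minimal Python sketch of the idea, not the actual C++ in `analyzer.cpp`: `first_non_null` and `sample_column` are hypothetical helpers, and `None` stands in for a pandas null. How far the real analyzer scans past each offset is not stated here, so the sketch assumes it stops before the next sampled offset.

```python
from typing import Any, Optional, Sequence


def first_non_null(column: Sequence[Any], offset: int, span: int) -> Optional[Any]:
    """Return the first non-None value in column[offset:offset + span], or None."""
    for value in column[offset:offset + span]:
        if value is not None:
            return value
    return None


def sample_column(column: Sequence[Any], sample_size: int, skip_nulls: bool) -> list:
    """Collect one value per sampled offset, optionally skipping leading nulls."""
    rows = len(column)
    step = rows // min(sample_size, rows)  # same offset logic as above; assumes rows > 0
    span = step if skip_nulls else 1       # span of 1 reproduces the old behaviour
    return [first_non_null(column, i, min(span, rows - i)) for i in range(0, rows, step)]


# Nulls recur at exactly the sampled offsets (every even row).
column = [None, "a", None, "b", None, "c"]

print(sample_column(column, 3, skip_nulls=False))  # [None, None, None]
print(sample_column(column, 3, skip_nulls=True))   # ['a', 'b', 'c']
```

With `skip_nulls=False` the sample contains only nulls, which is the situation that led the analyzer to pick the NULL type. With `skip_nulls=True` each sampled offset yields a typed value.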

tools/pythonpkg/src/pandas/analyzer.cpp

tools/pythonpkg/tests/fast/pandas/test_df_object_resolution.py


Mytherin · 2023-11-28T15:38:43Z

Thanks!

Admolly · 2023-11-29T09:20:17Z

Question, would it also be possible to add an INFER SCHEMA FROM <table> statement to the SQL dialect that would allow a user to specify an existing table to use when inferring types of a pandas dataframe?

yuanweixin · 2023-12-08T03:05:27Z

+1 for having a way to tell duckdb to use the schema of an existing table, instead of doing inference on the data frame content.

I ran into this issue when importing data into tables I had already set up with a schema. To mitigate, I ended up setting the sample window to a very large value to practically guarantee seeing a non-null value.

Merge pull request duckdb/duckdb#9802 from Giorgi/master

Merge pull request duckdb/duckdb#9811 from Tishj/pandas_analyzer_skip_nulls

Merge pull request duckdb/duckdb#9827 from Mytherin/cifix

Merge pull request duckdb/duckdb#9826 from carlopi/verboseerror

find the first non-null value when analyzing a pandas dataframe

4ebb339

Mytherin reviewed Nov 27, 2023

tools/pythonpkg/src/pandas/analyzer.cpp

Mytherin added the Changes Requested label Nov 27, 2023

Mytherin reviewed Nov 27, 2023

tools/pythonpkg/tests/fast/pandas/test_df_object_resolution.py

add better coverage for pandas analyzer, fix bug if sample size is 1 and first object is none

0e93d1f

github-actions bot marked this pull request as draft November 28, 2023 08:29

Tishj added Ready For Review and removed Changes Requested labels Nov 28, 2023

Mytherin marked this pull request as ready for review November 28, 2023 11:10

Mytherin merged commit c55af63 into duckdb:main Nov 28, 2023

Tishj mentioned this pull request May 16, 2024

Modify the pandas analyzer code to always respect the sample size #12097

Merged
