
@Tishj Tishj commented Nov 27, 2023

This PR fixes #6669

The logic for getting the offset for sampling is:

	auto sample = sample_size;
	if (sample > rows) {
		sample = rows;
	}
	return rows / sample;

This lets us scan sample_size rows from the dataset.

The problem is that if null values recur in the dataset at exactly this offset, the analyzer finds only nulls and sets the type to NULL. When a non-null value is later encountered, it is cast to the result type NULL, which causes this issue.
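The failure mode can be reproduced outside DuckDB with a small sketch. Everything here apart from the offset computation quoted above is an assumption made for illustration: `Column`, `SampleIncrement`, `NaiveSampleSeesValue` and `MakePeriodicNullColumn` are hypothetical names, `std::nullopt` stands in for a pandas null, and the sampling loop is assumed to start at row 0 and step by the computed offset.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a pandas object column; std::nullopt models None/NaN.
using Column = std::vector<std::optional<std::string>>;

// Same offset computation as quoted in the PR description.
std::size_t SampleIncrement(std::size_t rows, std::size_t sample_size) {
	assert(rows > 0 && sample_size > 0); // avoid dividing by zero
	std::size_t sample = sample_size;
	if (sample > rows) {
		sample = rows;
	}
	return rows / sample;
}

// Assumed old behaviour: inspect only the value sitting exactly on each
// sampled position. Returns true if any sampled value was non-null.
bool NaiveSampleSeesValue(const Column &column, std::size_t sample_size) {
	const std::size_t increment = SampleIncrement(column.size(), sample_size);
	for (std::size_t i = 0; i < column.size(); i += increment) {
		if (column[i].has_value()) {
			return true;
		}
	}
	return false;
}

// Builds a column in which every index that is a multiple of `period` is null.
Column MakePeriodicNullColumn(std::size_t rows, std::size_t period) {
	Column column(rows);
	for (std::size_t i = 0; i < rows; i++) {
		if (i % period != 0) {
			column[i] = "value_" + std::to_string(i);
		}
	}
	return column;
}
```

With 1000 rows and a null at every tenth index, a sample size of 100 gives an offset of 10, so every sampled position lands on a null and `NaiveSampleSeesValue` returns false even though 90% of the column holds strings. Raising the sample size to 1000 gives an offset of 1, and the scan then finds a value.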

To combat this issue, we now find the first non-null value starting from that offset, instead of just taking whatever value sits at that offset.
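The idea can be sketched as follows. This is not the code merged in the PR: `Column`, `FirstNonNull` and `SampleSkippingNulls` are hypothetical names, and limiting the forward search to the rows before the next sampled offset is an assumption made here so that the total work stays bounded by one pass over the column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a pandas object column; std::nullopt models None/NaN.
using Column = std::vector<std::optional<std::string>>;

// Returns the index of the first non-null value in [start, start + range),
// clamped to the column size, or std::nullopt if that window holds only nulls.
std::optional<std::size_t> FirstNonNull(const Column &column, std::size_t start, std::size_t range) {
	const std::size_t end = std::min(start + range, column.size());
	for (std::size_t i = start; i < end; i++) {
		if (column[i].has_value()) {
			return i;
		}
	}
	return std::nullopt;
}

// For each sampled offset, picks the first non-null row at or after it.
// Windows that contain only nulls contribute nothing to the sample.
std::vector<std::size_t> SampleSkippingNulls(const Column &column, std::size_t increment) {
	assert(increment > 0); // a zero increment would never advance
	std::vector<std::size_t> picked;
	for (std::size_t offset = 0; offset < column.size(); offset += increment) {
		if (auto index = FirstNonNull(column, offset, increment)) {
			picked.push_back(*index);
		}
	}
	return picked;
}
```

For a six-row column with nulls at indices 0 and 3 and an increment of 3, the sampled indices become 1 and 4 rather than the two nulls, so the type inferred from the sample is based on real values. A column that is entirely null still yields an empty sample.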

@github-actions github-actions bot marked this pull request as draft November 28, 2023 08:29
@Mytherin Mytherin marked this pull request as ready for review November 28, 2023 11:10
@Mytherin Mytherin merged commit c55af63 into duckdb:main Nov 28, 2023
@Mytherin
Collaborator

Thanks!

@Admolly

Admolly commented Nov 29, 2023

Question: would it also be possible to add an INFER SCHEMA FROM <table> statement to the SQL dialect, so that a user could specify an existing table to use when inferring the types of a pandas DataFrame?

@yuanweixin

+1 for having a way to tell duckdb to use the schema of an existing table, instead of doing inference on the data frame content.

I ran into this issue when importing data into tables I had already set up with a schema. To mitigate it, I ended up setting the sample window to a very large value to practically guarantee seeing a non-null value.

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 14, 2023
Merge pull request duckdb/duckdb#9802 from Giorgi/master
Merge pull request duckdb/duckdb#9811 from Tishj/pandas_analyzer_skip_nulls
Merge pull request duckdb/duckdb#9827 from Mytherin/cifix
Merge pull request duckdb/duckdb#9826 from carlopi/verboseerror
Development

Successfully merging this pull request may close these issues.

Unimplemented type for cast (VARCHAR -> NULL) error for aggregation on large string column
4 participants