MultiFileReader - Provide unified methods for multi-file reader functions (Parquet, CSV, JSON) #6912
Conversation
Commits:
- …filename options together with filters
- … to work in recursive CTEs
- …fix them and re-enable
@Mytherin By the way, since you are looking into the handling of these options: if this is the case (please ignore if it isn't), I am seeing a segfault here with a similar stack trace. Perhaps it might be connected to this PR:
And the SQL file (note that the system paths need to be specified as absolute paths):

```sql
SET memory_limit='1GB';
SET temp_directory='/home/david/ddb_example/';

with tbl0 as (
    select
        *
    from read_parquet(
        '/home/david/ddb_example/*.parquet'
        , filename=true
        , file_row_number=true
        , union_by_name=true
    )
    where
        dt is not null
        and dt >= '2020-04-01'
        and dt <= '2020-08-01'
        and cat1 is not null
        and num1 is not null
)
, tbl_join1 as (
    select
        cat1
        , favg("num2") as "num2_avg"
        , count(*) as num2_cnt
    from tbl0
    where
        "num2" is not null
        and not isnan("num2"::float)
    group by
        cat1
)
, tbl_join2 as (
    select
        cat3
        , favg("num3") as "num3_avg"
        , count(*) as num3_cnt
    from tbl0
    where
        "num3" is not null
        and not isnan("num3"::float)
    group by cat3
)
select
    tbl0.cat1
    , tbl0.cat3
    , cat2
    , num1
    , num3
    , num2
    , dt
    , num2_avg
    , num2_cnt
    , num3_avg
    , num3_cnt
    , row_number() over(order by filename, file_row_number) as row_num
from tbl0
join tbl_join1
    on tbl_join1.cat1 = tbl0.cat1
join tbl_join2
    on tbl_join2.cat3 = tbl0.cat3
where
    "dt" is not null
order by filename, file_row_number;
```
The issue I found with …
Looks excellent!

The only thing I'm missing is verification that the projections are being pushed down on the CSV reader. I guess a test that regexes the plan should suffice?

Otherwise, we could also add tests like the one in this commit, to verify that we are executing the correct reader (parallel or single-threaded): cbdcb1b
https://github.com/duckdb/duckdb/blob/cbdcb1b8e90687f3b531c9d32f674db48a84c072/test/sql/copy/csv/parallel/test_parallel_option.test
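For instance, a plan-regex test might look roughly like the sketch below, written in the sqllogictest style of the file linked above. The file path, the generated CSV, and the `READ_CSV` operator name expected in the `EXPLAIN` output are assumptions for illustration, not part of this PR.

```
# hypothetical sketch of a projection-pushdown test for the CSV reader

statement ok
COPY (SELECT 1 AS a, 2 AS b) TO '__TEST_DIR__/proj.csv' (HEADER 1);

# if projection pushdown works, the physical plan should contain a CSV
# scan; a stricter variant would also assert that only column "a"
# appears in the scan's projection list
query II
EXPLAIN SELECT a FROM read_csv_auto('__TEST_DIR__/proj.csv');
----
physical_plan	<REGEX>:.*READ_CSV.*
```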
This PR creates a new `MultiFileReader` class that provides helper functions for reader functions that can read from multiple files (e.g. the Parquet, CSV and JSON functions). The goal of this class is to unify the logic for the various tasks that are required, including:

- The `filename` parameter (outputting file names)
- The `hive_partitioning` parameter (reading from Hive partitions and outputting extra columns based on partition info)
- The `union_by_name` parameter (unifying the schema of all files)

By unifying the code, we can provide a consistent experience for any function that operates on multiple files, while avoiding having to duplicate (often rather complex!) logic.

This PR unifies the logic of the Parquet reader and the CSV reader, and partly the JSON reader (but more work remains on the JSON reader side). This has the following positive effects:

- Support for `union_by_name`, `filename` and `hive_partitioning` in the parallel CSV reader
- Support for `union_by_name` with filter pushdown in the Parquet reader
- Support for `union_by_name` and `hive_partitioning` in the JSON reader
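As a rough illustration of the now-unified behavior, the same multi-file options can be passed to the different readers. The paths and column names below are made up for the example:

```sql
-- Parquet: union_by_name now combines with filter pushdown
select *
from read_parquet('data/*.parquet', union_by_name=true, filename=true)
where dt >= '2020-04-01';

-- CSV: filename, hive_partitioning and union_by_name are now also
-- supported by the parallel CSV reader
select *
from read_csv_auto('data/year=*/*.csv',
                   hive_partitioning=true,
                   union_by_name=true);
```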