Parquet Reader: avoid caching the compressed buffer in the ColumnReader #16263

Mytherin · 2025-02-17T12:53:08Z

Currently we keep the compressed buffer around, when we really only need it for a short amount of time (when we read data to then decompress it). Since we keep the compressed buffer around per column, these resizeable buffers can grow to a rather substantial amount of memory, which can lead to us using 50-100% extra memory when reading Parquet files. Destroying and re-allocating the buffer does not seem to greatly affect performance - below are ClickBench timings ran on my laptop on a single big file (which is when caching matters the most given that caches are re-instantiated between files anyway):

We can perhaps do something more clever where we keep around a single cache that we share across readers, but not for the bug-fix release.

Query	Old	New	Ratio
Q00	0.165	0.156	0.95
Q01	0.156	0.155	0.99
Q02	0.192	0.191	0.99
Q03	0.179	0.188	1.05
Q04	0.292	0.287	0.98
Q05	0.364	0.349	0.96
Q06	0.154	0.157	1.02
Q07	0.158	0.159	1.01
Q08	0.361	0.348	0.96
Q09	0.449	0.440	0.98
Q10	0.225	0.220	0.98
Q11	0.251	0.249	0.99
Q12	0.379	0.355	0.94
Q13	0.552	0.539	0.98
Q14	0.426	0.412	0.97
Q15	0.351	0.325	0.93
Q16	0.665	0.629	0.95
Q17	0.617	0.596	0.97
Q18	0.971	0.947	0.98
Q19	0.215	0.196	0.91
Q20	0.929	0.892	0.96
Q21	0.730	0.719	0.98
Q22	1.213	1.226	1.01
Q23	3.062	2.905	0.95
Q24	0.189	0.189	1.00
Q25	0.178	0.172	0.97
Q26	0.201	0.186	0.93
Q27	0.771	0.729	0.95
Q28	7.475	7.246	0.97
Q29	0.254	0.236	0.93
Q30	0.483	0.477	0.99
Q31	0.508	0.549	1.08
Q32	1.168	1.254	1.07
Q33	1.431	1.493	1.04
Q34	1.453	1.511	1.04
Q35	0.359	0.369	1.03
Q36	0.165	0.184	1.12
Q37	0.161	0.166	1.03
Q38	0.173	0.172	0.99
Q39	0.177	0.182	1.03
Q40	0.165	0.169	1.02
Q41	0.162	0.172	1.06

lnkuiper

LGTM!

Parquet Reader: avoid caching the compressed buffer in the ColumnReader (duckdb/duckdb#16263)

Avoid caching the compressed buffer in the ColumnReader

b27267e

Mytherin requested a review from lnkuiper February 17, 2025 12:53

lnkuiper approved these changes Feb 17, 2025

View reviewed changes

Mytherin merged commit c27c052 into duckdb:v1.2-histrionicus Feb 17, 2025
48 checks passed

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Mar 7, 2025

vendor: Update vendored sources to duckdb/duckdb@c27c052

5c691cb

Parquet Reader: avoid caching the compressed buffer in the ColumnReader (duckdb/duckdb#16263)

Mytherin deleted the dontcachecompressedbufg branch April 2, 2025 09:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Parquet Reader: avoid caching the compressed buffer in the ColumnReader #16263

Parquet Reader: avoid caching the compressed buffer in the ColumnReader #16263

Mytherin commented Feb 17, 2025

Uh oh!

lnkuiper left a comment

Uh oh!

Uh oh!

Uh oh!

Parquet Reader: avoid caching the compressed buffer in the ColumnReader #16263

Parquet Reader: avoid caching the compressed buffer in the ColumnReader #16263

Conversation

Mytherin commented Feb 17, 2025

Uh oh!

lnkuiper left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!