Skip to content

Conversation

Mytherin
Copy link
Collaborator

Currently we keep the compressed buffer around, when we really only need it for a short amount of time (when we read data to then decompress it). Since we keep the compressed buffer around per column, these resizeable buffers can grow to a rather substantial amount of memory, which can lead to us using 50-100% extra memory when reading Parquet files. Destroying and re-allocating the buffer does not seem to greatly affect performance - below are ClickBench timings ran on my laptop on a single big file (which is when caching matters the most given that caches are re-instantiated between files anyway):

We can perhaps do something more clever where we keep around a single cache that we share across readers, but not for the bug-fix release.

Query Old New Ratio
Q00 0.165 0.156 0.95
Q01 0.156 0.155 0.99
Q02 0.192 0.191 0.99
Q03 0.179 0.188 1.05
Q04 0.292 0.287 0.98
Q05 0.364 0.349 0.96
Q06 0.154 0.157 1.02
Q07 0.158 0.159 1.01
Q08 0.361 0.348 0.96
Q09 0.449 0.440 0.98
Q10 0.225 0.220 0.98
Q11 0.251 0.249 0.99
Q12 0.379 0.355 0.94
Q13 0.552 0.539 0.98
Q14 0.426 0.412 0.97
Q15 0.351 0.325 0.93
Q16 0.665 0.629 0.95
Q17 0.617 0.596 0.97
Q18 0.971 0.947 0.98
Q19 0.215 0.196 0.91
Q20 0.929 0.892 0.96
Q21 0.730 0.719 0.98
Q22 1.213 1.226 1.01
Q23 3.062 2.905 0.95
Q24 0.189 0.189 1.00
Q25 0.178 0.172 0.97
Q26 0.201 0.186 0.93
Q27 0.771 0.729 0.95
Q28 7.475 7.246 0.97
Q29 0.254 0.236 0.93
Q30 0.483 0.477 0.99
Q31 0.508 0.549 1.08
Q32 1.168 1.254 1.07
Q33 1.431 1.493 1.04
Q34 1.453 1.511 1.04
Q35 0.359 0.369 1.03
Q36 0.165 0.184 1.12
Q37 0.161 0.166 1.03
Q38 0.173 0.172 0.99
Q39 0.177 0.182 1.03
Q40 0.165 0.169 1.02
Q41 0.162 0.172 1.06

@Mytherin Mytherin requested a review from lnkuiper February 17, 2025 12:53
Copy link
Contributor

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Mytherin Mytherin merged commit c27c052 into duckdb:v1.2-histrionicus Feb 17, 2025
48 checks passed
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Mar 7, 2025
Parquet Reader: avoid caching the compressed buffer in the ColumnReader (duckdb/duckdb#16263)
@Mytherin Mytherin deleted the dontcachecompressedbufg branch April 2, 2025 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants