Parquet Writer: Early out creating dictionary #11461
Conversation
Thanks! I think this is a good idea, but I would like for it to be configurable through a setting. That way we can also disable dictionary writing entirely using the same setting by setting the ratio to 0 (which has been requested before). Could we also add some tests that explicitly trigger this behavior?
@Mytherin, thanks for the feedback. I've now made it configurable through a parameter. I thought it made more sense this way: if you only want to use dictionary compression when you have a good ratio, you can set the parameter to e.g. 2.0, and dictionary compression will only be applied when the compression ratio is greater than 2.0.
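For context, here is a hedged usage sketch of the new parameter. The parameter name is taken from the test file added in this PR (`dictionary_compression_ratio_threshold`); the exact `COPY` option syntax is an assumption, not copied from the diff:

```sql
-- Hypothetical usage; parameter name from the test file in this PR,
-- exact syntax assumed.
-- Only apply dictionary compression when it compresses at least 2x:
COPY lineitem TO 'lineitem.parquet'
    (FORMAT PARQUET, DICTIONARY_COMPRESSION_RATIO_THRESHOLD 2.0);
-- Per the review discussion, setting the ratio to 0 disables
-- dictionary writing entirely:
COPY lineitem TO 'lineitem.parquet'
    (FORMAT PARQUET, DICTIONARY_COMPRESSION_RATIO_THRESHOLD 0);
```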
test/sql/copy/parquet/dictionary_compression_ratio_threshold.test
Should there be a very early out that skips the routine entirely if the ratio is impossible to achieve? And should the check on compression being better than 1.0 at the end be restored? (It might also make sense to note that values between 0 and 1.0 are basically only useful for testing.)
If you set the parameter to
Why? Then the supplied parameter wouldn't be respected, no?
+1, missed the early out check on Max
Unsure; maybe elsewhere we are relying on some bound on the size.
Thanks!
Merge pull request duckdb/duckdb#11461 from lnkuiper/parquet_dict_early_out
We currently create the whole dictionary, and only after creating it do we check if the compression ratio is greater than 1. If yes, we apply dictionary compression.
However, creating the dictionary requires a lot of string hashing and comparisons. This PR checks whether the compression ratio is less than 1 after inserting 10k values into the dictionary. If yes, we stop inserting into the dictionary, preventing costly string hashing and comparisons. This speeds up writing lineitem at SF10 by 15-20%, as it can early out for the `l_comment` column.

Not sure if this is the best fix, so I'm happy to receive any feedback. I've sent this PR to `main`, but I'm also happy to change it to `feature` instead.
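As a rough illustration of the early-out idea described above (this is NOT the actual DuckDB writer code; `CHECK_AFTER`, the function name, and the ratio computation are assumptions for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: after CHECK_AFTER insertions, compare the plain (uncompressed)
// size against the dictionary size and abandon dictionary building if the
// ratio is below the configured threshold. The original PR hard-coded a
// threshold of 1.0; after review it became configurable.
static const size_t CHECK_AFTER = 10000;

bool TryBuildDictionary(const std::vector<std::string> &values, double threshold) {
	std::unordered_set<std::string> dictionary;
	size_t plain_size = 0; // total bytes if written as plain strings
	size_t dict_size = 0;  // bytes of the unique strings in the dictionary
	size_t count = 0;
	for (const auto &value : values) {
		plain_size += value.size();
		if (dictionary.insert(value).second) {
			dict_size += value.size();
		}
		if (++count == CHECK_AFTER) {
			// Early out: stop the costly string hashing/comparisons if the
			// compression ratio so far is below the threshold.
			const double ratio = dict_size == 0
			                         ? 1.0
			                         : static_cast<double>(plain_size) /
			                               static_cast<double>(dict_size);
			if (ratio < threshold) {
				return false; // fall back to plain encoding
			}
		}
	}
	return true; // dictionary encoding looks worthwhile
}
```

With mostly unique values (like `l_comment`), the ratio at the checkpoint is near 1.0, so the function bails out early instead of hashing the remaining strings.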