Parquet Writer: Early out creating dictionary #11461
Conversation
Thanks! I think this is a good idea, but I would like for it to be configurable through a setting. That way we can also disable dictionary writing entirely using the same setting by setting the ratio to 0 (which has been requested before). Could we also add some tests that explicitly trigger this behavior?
@Mytherin, thanks for the feedback. I've now made it configurable through a parameter. I thought it made more sense this way: if you only want to use dictionary compression when you have a good ratio, you can set the parameter to e.g. 2.0, and dictionary compression will only be applied when the compression ratio is greater than 2.0.
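For context, here is a hedged usage sketch of the new parameter. The parameter name is taken from the test file added in this PR (`dictionary_compression_ratio_threshold`); the exact `COPY` option syntax is an assumption, not copied from the diff:

```sql
-- Hypothetical usage; parameter name from the test file in this PR,
-- exact syntax assumed.
-- Only apply dictionary compression when it compresses at least 2x:
COPY lineitem TO 'lineitem.parquet'
    (FORMAT PARQUET, DICTIONARY_COMPRESSION_RATIO_THRESHOLD 2.0);
-- Per the review discussion, setting the ratio to 0 disables
-- dictionary writing entirely:
COPY lineitem TO 'lineitem.parquet'
    (FORMAT PARQUET, DICTIONARY_COMPRESSION_RATIO_THRESHOLD 0);
```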
test/sql/copy/parquet/dictionary_compression_ratio_threshold.test
Should there be a very early out that skips the routine entirely if the ratio is impossible to achieve? And should the check on compression being better than 1.0 at the end be restored? (It might also make sense to note that values between 0 and 1.0 are basically only useful for testing.)
If you set the parameter to
Why? Then the supplied parameter wouldn't be respected, no?
+1, missed the early out check on Max
Unsure; maybe elsewhere we are relying on some bound on the size.
Thanks!
Merge pull request duckdb/duckdb#11461 from lnkuiper/parquet_dict_early_out
We currently create the whole dictionary, and only after creating it do we check if the compression ratio is greater than 1. If yes, we apply dictionary compression.
However, creating the dictionary requires a lot of string hashing and comparisons. This PR checks whether the compression ratio is less than 1 after inserting 10k values into the dictionary. If yes, we stop inserting into the dictionary, preventing costly string hashing and comparisons. This speeds up writing lineitem at SF10 by 15-20%, as it can early out for the `l_comment` column.

Not sure if this is the best fix, so I'm happy to receive any feedback. I've sent this PR to `main`, but I'm also happy to change it to `feature` instead.
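As a rough illustration of the early-out idea described above (this is NOT the actual DuckDB writer code; `CHECK_AFTER`, the function name, and the ratio computation are assumptions for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: after CHECK_AFTER insertions, compare the plain (uncompressed)
// size against the dictionary size and abandon dictionary building if the
// ratio is below the configured threshold. The original PR hard-coded a
// threshold of 1.0; after review it became configurable.
static const size_t CHECK_AFTER = 10000;

bool TryBuildDictionary(const std::vector<std::string> &values, double threshold) {
	std::unordered_set<std::string> dictionary;
	size_t plain_size = 0; // total bytes if written as plain strings
	size_t dict_size = 0;  // bytes of the unique strings in the dictionary
	size_t count = 0;
	for (const auto &value : values) {
		plain_size += value.size();
		if (dictionary.insert(value).second) {
			dict_size += value.size();
		}
		if (++count == CHECK_AFTER) {
			// Early out: stop the costly string hashing/comparisons if the
			// compression ratio so far is below the threshold.
			const double ratio = dict_size == 0
			                         ? 1.0
			                         : static_cast<double>(plain_size) /
			                               static_cast<double>(dict_size);
			if (ratio < threshold) {
				return false; // fall back to plain encoding
			}
		}
	}
	return true; // dictionary encoding looks worthwhile
}
```

With mostly unique values (like `l_comment`), the ratio at the checkpoint is near 1.0, so the function bails out early instead of hashing the remaining strings.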