Skip to content

Conversation

lnkuiper
Copy link
Contributor

@lnkuiper lnkuiper commented Apr 2, 2024

We currently create the whole dictionary, and only after creating it do we check if the compression ratio is greater than 1. If yes, we apply dictionary compression.

However, creating the dictionary requires a lot of string hashing and comparisons. This PR checks whether the compression ratio is less than 1 after inserting 10k values into the dictionary. If yes, we stop inserting into the dictionary, preventing costly string hashing and comparisons. This speeds up writing lineitem at SF10 by 15-20%, as it can early out for the l_comment column.

Not sure if this is the best fix, so I'm happy to receive any feedback. I've sent this PR to main, but I'm also happy to change it to feature instead.

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think this is a good idea, but I would like for it to be configurable through a setting. That way we can also disable dictionary writing entirely using the same setting by setting the ratio to 0 (which has been requested before). Could we also add some tests that trigger explicitly this behavior?

@lnkuiper
Copy link
Contributor Author

lnkuiper commented Apr 2, 2024

@Mytherin, thanks for the feedback. I've now made it configurable through a parameter dictionary_compression_ratio_threshold. When set to 0, dictionary compression is always applied, when set to -1, dictionary compression is never applied.

I thought it made more sense this way, i.e., if you only want to use dictionary compression when you have a good ratio, you can set the parameter to e.g., 2.0, and it will start applying dictionary compression when the compression ratio is greater than 2.0.

@lnkuiper lnkuiper marked this pull request as draft April 3, 2024 14:28
@lnkuiper lnkuiper marked this pull request as ready for review April 3, 2024 14:28
@github-actions github-actions bot marked this pull request as draft April 4, 2024 14:54
@lnkuiper lnkuiper marked this pull request as ready for review April 4, 2024 14:54
@carlopi
Copy link
Contributor

carlopi commented Apr 4, 2024

Should there be a very early out if the ratio is impossible to achieve that skips the routine at all?

And the check on compression better than 1.0 at the end should be restored? (might also make sense to say that values between 0 and 1.0 are basically useful only for testing)

@lnkuiper
Copy link
Contributor Author

lnkuiper commented Apr 4, 2024

Should there be a very early out if the ratio is impossible to achieve that skips the routine at all?

If you set the parameter to -1 you get out immediately, and waste no time at all.

And the check on compression better than 1.0 at the end should be restored? (might also make sense to say that values between 0 and 1.0 are basically useful only for testing)

Why? Then the supplied parameter wouldn't be respected no?

@carlopi
Copy link
Contributor

carlopi commented Apr 4, 2024

If you set the parameter to -1 you get out immediately, and waste no time at all.

+1, missed the early out check on Max

Why? Then the supplied parameter wouldn't be respected no?

Unsure, maybe elsewhere we are relying on some bound on size, unsure.

@Mytherin
Copy link
Collaborator

Mytherin commented Apr 9, 2024

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Apr 9, 2024
Merge pull request duckdb/duckdb#11461 from lnkuiper/parquet_dict_early_out
Merge pull request duckdb/duckdb#11137 from Tishj/sqllogic_parser
Merge pull request duckdb/duckdb#11095 from Tishj/python_struct_child_count_mismatch
Merge pull request duckdb/duckdb#10382 from jzavala-gonzalez/python-write-csv-options
@lnkuiper lnkuiper deleted the parquet_dict_early_out branch May 29, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Documentation Use for issues or PRs that require changes in the documentation Ready For Review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants