Skip to content

Conversation

lnkuiper
Copy link
Contributor

No longer use the StringHeap when using the buffer manager allocator. When strings are read, we check if the pointers are still valid. If they are, we read as usual. If not, we will correct them. This should give around the same performance as the StringHeap if strings are not spilled, and only very slight overhead when they have not been spilled. This is realized by keeping track of just a little bit of meta data.

Strings that are larger than Storage::BLOCK_SIZE are written a single non-standard size block.

As a bonus, I've also parallelized unswizzling strings in the RowDataCollection for the hash join. This was the last part of the hash join that was not fully parallel.

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks great. One comment below. Perhaps it also makes sense to run the tests with both string inlining disabled and with buffer manager verification turned on, at least locally?

@lnkuiper
Copy link
Contributor Author

lnkuiper commented Dec 7, 2022

I fixed some CI issues, but now there is a stack overflow in one of the python windows tests. I suspect it is unrelated, but I can investigate if need be.

@lnkuiper
Copy link
Contributor Author

lnkuiper commented Dec 8, 2022

Test passed this time, all green now

@Mytherin Mytherin merged commit 2e664ac into duckdb:feature Dec 8, 2022
@Mytherin
Copy link
Collaborator

Mytherin commented Dec 8, 2022

Thanks for the updates! LGTM

@lnkuiper lnkuiper mentioned this pull request Apr 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants