-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Allow strings in ColumnDataCollection to be written to disk #5543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Looks great. One comment below. Perhaps it also makes sense to run the tests with both string inlining disabled and with buffer manager verification turned on, at least locally?
I fixed some CI issues, but now there is a stack overflow in one of the python windows tests. I suspect it is unrelated, but I can investigate if need be. |
Test passed this time, all green now |
Thanks for the updates! LGTM |
No longer use the
StringHeap
when using the buffer manager allocator. When strings are read, we check if the pointers are still valid. If they are, we read as usual. If not, we will correct them. This should give around the same performance as theStringHeap
if strings are not spilled, and only very slight overhead when they have not been spilled. This is realized by keeping track of just a little bit of meta data.Strings that are larger than
Storage::BLOCK_SIZE
are written a single non-standard size block.As a bonus, I've also parallelized unswizzling strings in the
RowDataCollection
for the hash join. This was the last part of the hash join that was not fully parallel.