Allow strings in ColumnDataCollection to be written to disk #5543

lnkuiper · 2022-11-29T19:09:50Z

No longer use the StringHeap when using the buffer manager allocator. When strings are read, we check whether the pointers are still valid. If they are, we read as usual. If not, we correct them. This should give around the same performance as the StringHeap if strings are not spilled, and only very slight overhead when they have been spilled. This is realized by keeping track of just a little bit of metadata.
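
The check-and-correct step can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from this PR: `StrRef`, `StringBlockMeta`, `FixPointersIfMoved`, and `Read` are hypothetical names, and the assumed metadata is simply the base address the string block had when the pointers were last valid.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a non-inlined string entry: a length plus a raw
// pointer into a block that holds the character data.
struct StrRef {
    uint32_t length;
    const char *ptr;
};

// Hypothetical per-block metadata: the address the string block had when the
// pointers in the entries were last known to be valid.
struct StringBlockMeta {
    const char *last_base;
};

// If the block is still at the recorded address, nothing is done (fast path).
// Otherwise every pointer is shifted by the distance the block moved.
// Returns true if the pointers had to be corrected.
inline bool FixPointersIfMoved(std::vector<StrRef> &entries, StringBlockMeta &meta,
                               const char *current_base) {
    assert(current_base != nullptr);
    if (meta.last_base == current_base) {
        return false; // block was not moved, pointers are still valid
    }
    const auto old_base = reinterpret_cast<std::uintptr_t>(meta.last_base);
    for (auto &entry : entries) {
        // Recover the offset inside the old block, then rebase it.
        const auto offset = reinterpret_cast<std::uintptr_t>(entry.ptr) - old_base;
        entry.ptr = current_base + offset;
    }
    meta.last_base = current_base;
    return true;
}

// Copies the referenced characters into an owning std::string.
inline std::string Read(const StrRef &ref) {
    return std::string(ref.ptr, ref.length);
}
```

In this sketch the unspilled case costs one pointer comparison per block, which is consistent with the claim that performance stays close to the StringHeap when nothing is spilled.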

Strings that are larger than Storage::BLOCK_SIZE are written to a single non-standard size block.
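
A rough sketch of how such a placement decision could look is given here. Only the oversized case comes from the description above; the other two cases, the names `Placement`, `PlacementDecision`, and `PlaceString`, and the `kBlockSize` value are assumptions made for illustration and are not taken from this PR.

```cpp
#include <cstddef>

// Placeholder for Storage::BLOCK_SIZE; the real value is not stated here.
constexpr std::size_t kBlockSize = 262144;

// Hypothetical outcomes for where a string's character data ends up.
enum class Placement { CURRENT_BLOCK, NEW_STANDARD_BLOCK, OVERSIZED_BLOCK };

struct PlacementDecision {
    Placement placement;
    std::size_t block_size; // size of the block that will hold the string
};

// Decide where a string of the given length goes, given the free space left
// in the block that is currently being filled.
inline PlacementDecision PlaceString(std::size_t string_length, std::size_t free_in_current) {
    if (string_length > kBlockSize) {
        // Too large for any standard block: one block sized to fit the string.
        return {Placement::OVERSIZED_BLOCK, string_length};
    }
    if (string_length <= free_in_current) {
        return {Placement::CURRENT_BLOCK, kBlockSize};
    }
    // Fits a standard block, but not the remaining space of the current one.
    return {Placement::NEW_STANDARD_BLOCK, kBlockSize};
}
```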

As a bonus, I've also parallelized unswizzling strings in the RowDataCollection for the hash join. This was the last part of the hash join that was not fully parallel.

Mytherin

Thanks for the PR! Looks great. One comment below. Perhaps it also makes sense to run the tests with both string inlining disabled and with buffer manager verification turned on, at least locally?

src/common/types/column_data_allocator.cpp

lnkuiper · 2022-12-07T08:55:01Z

I fixed some CI issues, but now there is a stack overflow in one of the python windows tests. I suspect it is unrelated, but I can investigate if need be.

lnkuiper · 2022-12-08T08:14:37Z

Test passed this time, all green now

Mytherin · 2022-12-08T08:16:44Z

Thanks for the updates! LGTM

lnkuiper added 11 commits November 22, 2022 10:16

progress: spilling strings to disk CDC

15aeb2a

progress with CDC string pointer swizzling

27ae31e

Merge branch 'master' into oochj

90ff052

string_t swizzling in CDC works - no BlockHandle callbacks

b3f2749

fix issue with allocating too many string vectors in CDC

90d8ad6

correctly use offset when indexing vector/validitymask

177afc0

unswizzle in parallel while finalizing hash table

a166b67

remove unused function

602d6cd

merge with master after buffer manager fixes

ee72f06

remove verify (can't tell if blob here)

4d10bee

Merge branch 'feature' into oochj

abd257c

Mytherin reviewed Nov 30, 2022

src/common/types/column_data_allocator.cpp

lnkuiper added 10 commits November 30, 2022 12:09

skip inlined strings (well-spotted, Mark)

c517fcf

Merge branch 'feature' into oochj

5a4d540

add dictionary_heap so strings stay valid

b53ed9a

check string length before adding to heap

27dcc3f

merge no string inline / destroy unpinned blocks

0c52333

add include for windows

f7f5991

add blob instead of string so we don't trigger the assert

5d1dad5

some CI fixes

c0fc948

some more DUCKDB_API for windows

f4a0c26

more DUCKDB_API

41383e3

Mytherin merged commit 2e664ac into duckdb:feature Dec 8, 2022

lnkuiper mentioned this pull request Apr 7, 2023

Tuple Data Collection #6998

Merged
