Rework vector_hash for ARRAYs
#11558
Merged
This PR reworks the hash operation for array vectors, which has been causing issues.

The underlying problem is that array vectors implicitly assume that the elements of the array at position `x` are located in the child vector at position `x * array_size + elem_offset`.
This works well in the general case, when you are handed an already-populated array vector. However, during hashing we want to create a temporary vector to store the hashes of the child vector, and because the input vector can be of different vector types and can additionally have an `rsel` selection vector applied, it becomes difficult to calculate the total required size for that temporary child hash vector: the `rsel` can end up indexing outside the assumed `count * array_size` child elements.

My previous band-aid fix was to look through any selection vectors for the "max offset" and simply create a hash vector that is always large enough to contain the furthest selection index, but this is pretty wasteful, especially if the `rsel` or an input dictionary vector is sparse.
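For reference, the old sizing logic amounted to something like the following (a hypothetical helper for illustration, not the code actually removed here):

```cpp
// Sketch of the previous band-aid (hypothetical helper, not the real code):
// scan the selection vector for the largest referenced array index and size
// the temporary hash vector to cover it, even if most slots go unused.
#include <algorithm>
#include <cstddef>
#include <vector>

static std::size_t RequiredChildSize(const std::vector<std::size_t> &rsel,
                                     std::size_t array_size) {
	std::size_t max_offset = 0;
	for (auto idx : rsel) {
		// idx selects an array; its elements end at (idx + 1) * array_size
		max_offset = std::max(max_offset, (idx + 1) * array_size);
	}
	return max_offset; // sparse selections waste most of this allocation
}
```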
Now there are two paths: a fast path when the input and output vectors are contiguous (flat/constant and no `rsel`), where we just hash all `count * array_size` child elements in one go, and a slow(er) path where we hash the elements of each input array one at a time (while reusing the temporary hash vector). In theory you could optimize this even further and size the temporary hash vector by figuring out the largest contiguous segment, so that the elements of multiple "adjacent" arrays could be hashed at once, but the current approach is more memory-friendly. (A rough sketch of the two paths is included below.)

Closes #11552
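As a rough illustration of the two paths, under assumed names and a stand-in hash combiner (this is not the actual `vector_hash` code):

```cpp
// Rough sketch of the two hashing paths (hypothetical names; not the actual
// DuckDB vector_hash implementation).
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Hash = std::uint64_t;

// Stand-in for DuckDB's actual hash-combine function.
static Hash CombineHash(Hash a, Hash b) {
	return a ^ (b + 0x9e3779b97f4a7c15ULL + (a << 6) + (a >> 2));
}

static Hash HashElement(int value) {
	return std::hash<int>{}(value);
}

// Fast path: input is contiguous (flat/constant, no rsel), so all
// count * array_size child elements can be hashed in one pass.
static std::vector<Hash> HashArraysFast(const std::vector<int> &child,
                                        std::size_t count,
                                        std::size_t array_size) {
	std::vector<Hash> child_hashes(count * array_size);
	for (std::size_t i = 0; i < child_hashes.size(); i++) {
		child_hashes[i] = HashElement(child[i]);
	}
	std::vector<Hash> result(count, 0);
	for (std::size_t x = 0; x < count; x++) {
		for (std::size_t e = 0; e < array_size; e++) {
			result[x] = CombineHash(result[x], child_hashes[x * array_size + e]);
		}
	}
	return result;
}

// Slow(er) path: an rsel selects arbitrary arrays, so hash one array's
// elements at a time, reusing a temporary hash vector of size array_size.
static std::vector<Hash> HashArraysSlow(const std::vector<int> &child,
                                        const std::vector<std::size_t> &rsel,
                                        std::size_t array_size) {
	std::vector<Hash> tmp(array_size); // reused across arrays
	std::vector<Hash> result(rsel.size(), 0);
	for (std::size_t out = 0; out < rsel.size(); out++) {
		const std::size_t x = rsel[out];
		for (std::size_t e = 0; e < array_size; e++) {
			tmp[e] = HashElement(child[x * array_size + e]);
		}
		for (std::size_t e = 0; e < array_size; e++) {
			result[out] = CombineHash(result[out], tmp[e]);
		}
	}
	return result;
}
```

The slow path trades throughput for a fixed `array_size`-element temporary allocation, which matches the memory-friendliness argument above.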