Skip to content

Conversation

Maxxen
Copy link
Member

@Maxxen Maxxen commented Apr 8, 2024

This PR reworks the hash operation for array vectors thats been having issues.

The underlying problem is that array vectors implicitly assumes that the array elements corresponding to the array at position x will be located in the child vector at position x * (array_size) + (elem_offset). This works well in the general case when you get a populated array vector supplied to you. However, during hashing we want to create a temporary vector to store the hashes of the child vector, but as our input vector can be of different vector types, as well as have an additional rsel selection vector applied it becomes pretty difficult to calculate the total required size for the temporary child hash vector as the rsel can end up indexing outside the assumed count * array_size number of child elements.

My previous band-aide fix was to look through any selection vectors to find the "max offset" and simply create a hash vector thats always large enough to contain the furthest selection index, but this is pretty wasteful, especially if the rsel or an input dictionary vector is sparse.

Now there's two paths: a fast path if the input and output vectors are contiguous (flat/constant and no rsel) where we just hash all count * array_size child elements in one go, as well as a slow(er) path where we hash the elements of every input array one at a time (although reusing the temporary hash vector). In theory you could optimize this even further and size the temporary hash vector by figuring out the largest contiguous segment so you could hash the elements of multiple "adjacent" arrays at once. But this is more memory-friendly.

Closes #11552

@Mytherin Mytherin merged commit cc3547d into duckdb:main Apr 9, 2024
@Mytherin
Copy link
Collaborator

Mytherin commented Apr 9, 2024

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Apr 9, 2024
Merge pull request duckdb/duckdb#11558 from Maxxen/bugfixes
Merge pull request duckdb/duckdb#11532 from Tmonster/run_new_micro_benchmarks_to_check_for_improvement
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect Results in Joins with Array Keys Containing Null Values
2 participants