Rework vector_hash for ARRAYs
#11558
Merged
This PR reworks the hash operation for array vectors, which has been causing issues.

The underlying problem is that array vectors implicitly assume that the elements of the array at position `x` are located in the child vector at position `x * array_size + elem_offset`.
This works well in the general case, when you are handed an already-populated array vector. However, during hashing we want to create a temporary vector to store the hashes of the child vector, and because the input vector can be of different vector types and can additionally have an `rsel` selection vector applied, it becomes difficult to calculate the total required size for that temporary child hash vector: the `rsel` can end up indexing outside the assumed `count * array_size` child elements.

My previous band-aid fix was to look through any selection vectors for the "max offset" and simply create a hash vector that is always large enough to contain the furthest selection index, but this is pretty wasteful, especially if the `rsel` or an input dictionary vector is sparse.
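For reference, the old sizing logic amounted to something like the following (a hypothetical helper for illustration, not the code actually removed here):

```cpp
// Sketch of the previous band-aid (hypothetical helper, not the real code):
// scan the selection vector for the largest referenced array index and size
// the temporary hash vector to cover it, even if most slots go unused.
#include <algorithm>
#include <cstddef>
#include <vector>

static std::size_t RequiredChildSize(const std::vector<std::size_t> &rsel,
                                     std::size_t array_size) {
	std::size_t max_offset = 0;
	for (auto idx : rsel) {
		// idx selects an array; its elements end at (idx + 1) * array_size
		max_offset = std::max(max_offset, (idx + 1) * array_size);
	}
	return max_offset; // sparse selections waste most of this allocation
}
```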
Now there are two paths: a fast path when the input and output vectors are contiguous (flat/constant and no `rsel`), where we just hash all `count * array_size` child elements in one go, and a slow(er) path where we hash the elements of each input array one at a time (while reusing the temporary hash vector). In theory you could optimize this even further and size the temporary hash vector by figuring out the largest contiguous segment, so that the elements of multiple "adjacent" arrays could be hashed at once, but the current approach is more memory-friendly. (A rough sketch of the two paths is included below.)

Closes #11552
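As a rough illustration of the two paths, under assumed names and a stand-in hash combiner (this is not the actual `vector_hash` code):

```cpp
// Rough sketch of the two hashing paths (hypothetical names; not the actual
// DuckDB vector_hash implementation).
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Hash = std::uint64_t;

// Stand-in for DuckDB's actual hash-combine function.
static Hash CombineHash(Hash a, Hash b) {
	return a ^ (b + 0x9e3779b97f4a7c15ULL + (a << 6) + (a >> 2));
}

static Hash HashElement(int value) {
	return std::hash<int>{}(value);
}

// Fast path: input is contiguous (flat/constant, no rsel), so all
// count * array_size child elements can be hashed in one pass.
static std::vector<Hash> HashArraysFast(const std::vector<int> &child,
                                        std::size_t count,
                                        std::size_t array_size) {
	std::vector<Hash> child_hashes(count * array_size);
	for (std::size_t i = 0; i < child_hashes.size(); i++) {
		child_hashes[i] = HashElement(child[i]);
	}
	std::vector<Hash> result(count, 0);
	for (std::size_t x = 0; x < count; x++) {
		for (std::size_t e = 0; e < array_size; e++) {
			result[x] = CombineHash(result[x], child_hashes[x * array_size + e]);
		}
	}
	return result;
}

// Slow(er) path: an rsel selects arbitrary arrays, so hash one array's
// elements at a time, reusing a temporary hash vector of size array_size.
static std::vector<Hash> HashArraysSlow(const std::vector<int> &child,
                                        const std::vector<std::size_t> &rsel,
                                        std::size_t array_size) {
	std::vector<Hash> tmp(array_size); // reused across arrays
	std::vector<Hash> result(rsel.size(), 0);
	for (std::size_t out = 0; out < rsel.size(); out++) {
		const std::size_t x = rsel[out];
		for (std::size_t e = 0; e < array_size; e++) {
			tmp[e] = HashElement(child[x * array_size + e]);
		}
		for (std::size_t e = 0; e < array_size; e++) {
			result[out] = CombineHash(result[out], tmp[e]);
		}
	}
	return result;
}
```

The slow path trades throughput for a fixed `array_size`-element temporary allocation, which matches the memory-friendliness argument above.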