Skip to content

Conversation

mhilton
Copy link
Contributor

@mhilton mhilton commented Jun 24, 2025

Update flux's String array type to either be an arrow Binary array or an arrow Dictionary with Binary values. This removes the non-arrow compatible single string variant, instead using a dictionary to provide the low memory version for repeated values. A dictionary provides a more general purpose implementation of the same idea to not keep repeating identical values.

The StringBuilder still swaps to a standard Binary array after a second unique value of a String is observed, but does not do so just because a NULL is added to the array. In the future the heuristic could be changed to provide memory efficient string representations in other contexts.

Moving to a completely arrow-compatible interface makes the String array type much less fragile. It is now possible to use the String array in any context that an arrow Array can be used, and removes the special-case code previously required to split a String array.

Checklist

Dear Author 👋, the following checks should be completed (or explicitly dismissed) before merging.

  • ✏️ Write a PR description, regardless of triviality, to include the value of this PR
  • 🔗 Reference related issues
  • 🏃 Test cases are included to exercise the new code
  • 🧪 If new packages are being introduced to stdlib, link to Working Group discussion notes and ensure it lands under experimental/
  • 📖 If language features are changing, ensure docs/Spec.md has been updated

Dear Reviewer(s) 👋, you are responsible (among others) for ensuring the completeness and quality of the above before approval.

mhilton added 2 commits June 24, 2025 14:28
Update flux's String array type to either be an arrow Binary array or an
arrow Dictionary with Binary values. This removes the non-arrow
compatible single string variant, instead using a dictionary to provide
the low memory version for repeated values. A dictionary provides a more
general purpose implementation of the same idea to not keep repeating
identical values.

The StringBuilder still swaps to a standard Binary array after a second
unique value of a String is observed, but does not do so just because a
NULL is added to the array. In the future the heuristic could be
changed to provide memory efficient string representations in other
contexts.

Moving to a completely arrow-compatible interface makes the String array
type much less fragile. It is now possible to use the String array in
any context that an arrow Array can be used, and removes the
special-case code previously required to split a String array.
The NewStringFromBinaryArray function is only used in one place, where
it can be replaced with NewStringData to perform a more succinct
conversion. Repurpose the NewStringFromBinaryArray test to test
NewStringData.
@mhilton mhilton requested a review from a team as a code owner June 24, 2025 14:07
Comment on lines +89 to +90
values.Release()
indices.Release()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like I'm back in my C++ days 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is because the go API is modelled off of the C++ one.

@mhilton mhilton merged commit 35d8e67 into master Jun 25, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants