mirror of
https://github.com/langchain-ai/arrow-rs.git
synced 2026-06-30 21:47:55 -04:00
1bed04c1e0
# Which issue does this PR close?

- Part of https://github.com/apache/arrow-rs/issues/7456

# Rationale for this change

Currently the `coalesce` kernel buffers views / data until there are enough rows and then concatenates the results together. `StringViewArray`s can be even worse, as there is a second copy in `gc_string_view_batch`.

This is wasteful because it:

1. Buffers memory (has 2x the peak usage)
2. Copies the data twice

We can make it faster and more memory efficient by directly creating the output array.

# What changes are included in this PR?

1. Add a specialization for incrementally building `StringViewArray` without buffering

Note this PR does NOT (yet) add specialized filtering -- instead it focuses on reducing the overhead of appending views by not copying them (again!) with `gc_string_view_batch`.

# Open questions:

1. There is substantial overlap / duplication with `StringViewBuilder` -- I wonder if we can / should consolidate them somehow

The differences from `StringViewBuilder` are:

1. Block size calculation management (aka looking at the buffer sizes of the incoming buffers)
2. Finishing the array allocates sufficient space for views

# Are there any user-facing changes?

The kernel is faster, no API changes
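
For readers unfamiliar with the idea, here is a minimal, std-only sketch of the "build the output directly" approach described in the rationale. It is not the code in this PR and does not use the arrow-rs types: `ToyViewArray` and `IncrementalCoalescer` are hypothetical stand-ins, and the (offset, length) pair is a simplified assumption about what a view looks like. The point is only that each incoming row is copied once, straight into the in-progress output, instead of being buffered and concatenated later.

```rust
/// Hypothetical toy stand-in for a string-view style array: one contiguous
/// data buffer plus (offset, len) "views" into it. Not the arrow-rs layout.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ToyViewArray {
    data: Vec<u8>,
    views: Vec<(usize, usize)>,
}

impl ToyViewArray {
    /// Reserve room for `rows` views up front so appends do not reallocate.
    pub fn with_capacity(rows: usize) -> Self {
        Self { data: Vec::new(), views: Vec::with_capacity(rows) }
    }

    pub fn from_strs(values: &[&str]) -> Self {
        let mut out = Self::with_capacity(values.len());
        for v in values {
            out.push(v);
        }
        out
    }

    /// Copy the bytes once and record a view pointing at them.
    pub fn push(&mut self, value: &str) {
        let offset = self.data.len();
        self.data.extend_from_slice(value.as_bytes());
        self.views.push((offset, value.len()));
    }

    pub fn len(&self) -> usize {
        self.views.len()
    }

    pub fn value(&self, i: usize) -> &str {
        let (offset, len) = self.views[i];
        std::str::from_utf8(&self.data[offset..offset + len]).expect("valid UTF-8")
    }
}

/// Hypothetical incremental coalescer: rows go directly into the in-progress
/// output, so no input batches are held and no final concat pass is needed.
pub struct IncrementalCoalescer {
    target_rows: usize,
    in_progress: ToyViewArray,
    completed: Vec<ToyViewArray>,
}

impl IncrementalCoalescer {
    pub fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be non-zero");
        Self {
            target_rows,
            in_progress: ToyViewArray::with_capacity(target_rows),
            completed: Vec::new(),
        }
    }

    /// Append every row of `batch`, emitting an output array each time the
    /// in-progress array reaches `target_rows`.
    pub fn push_batch(&mut self, batch: &ToyViewArray) {
        for i in 0..batch.len() {
            self.in_progress.push(batch.value(i));
            if self.in_progress.len() == self.target_rows {
                let fresh = ToyViewArray::with_capacity(self.target_rows);
                let full = std::mem::replace(&mut self.in_progress, fresh);
                self.completed.push(full);
            }
        }
    }

    /// Flush any partially filled array and return all completed outputs.
    pub fn finish(mut self) -> Vec<ToyViewArray> {
        if self.in_progress.len() > 0 {
            let last = std::mem::take(&mut self.in_progress);
            self.completed.push(last);
        }
        self.completed
    }
}
```

In this sketch, pushing two small batches with a target of three rows yields one full output array and one partial array on `finish`, with no intermediate list of buffered inputs. The real kernel additionally has to manage block sizes based on the incoming buffers, which this toy version ignores.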