arrow-rs

mirror of https://github.com/langchain-ai/arrow-rs.git synced 2026-07-01 21:34:01 -04:00

Author	SHA1	Message	Date
Andrew Lamb	df702cfc71	Prepare for `55.2.0` release (#7722 ) # Which issue does this PR close? - Part of https://github.com/apache/arrow-rs/issues/7394 # Rationale for this change Prepare for next software release # What changes are included in this PR? 1. Update version to `55.2.0` 2. See rendered changelog here: https://github.com/alamb/arrow-rs/blob/alamb/prepare_55.2.0/CHANGELOG.md # Are there any user-facing changes? New release version	2025-06-22 09:08:47 -04:00
Nick Lanham	2788762c63	fix JSON decoder error checking for UTF16 / surrogate parsing panic (#7721 ) # Which issue does this PR close? - Closes #7712 . # Rationale for this change Shouldn't panic, especially in a fallible function. # What changes are included in this PR? Validate that the high and low surrogates are in the expected range, which guarantees that the subtractions won't overflow. # Are there any user-facing changes? No (well, things that used to panic now won't, but I don't think that counts)	2025-06-22 08:34:28 -04:00
Oleks V	e54b72bc4d	fix: Do not add null buffer for `NullArray` in MutableArrayData (#7726 ) # Which issue does this PR close? - Closes #7725. # Rationale for this change Do not populate null buffers when building `MutableArrayData` for `NullArray` # What changes are included in this PR? # Are there any user-facing changes?	2025-06-22 08:09:05 -04:00
Matthew Kim	1ededfe024	[Variant] Introduce new type over &str for ShortString (#7718 ) # Which issue does this PR close? - Closes https://github.com/apache/arrow-rs/issues/7700 This commit introduces `ShortString`, a newtype that wraps `&str` and enforces a maximum length constraint. This also allows us to perform validation once and removes a superfluous validation check in `append_value`. The now-superfluous validation check was needed since users could construct `Variant::ShortString`s directly, without doing input validation. This meant you could have a short string variant which actually contained a string longer than 63 bytes. But since we now enforce this check upon construction, we can directly match against `Variant::String` and `Variant::ShortString` arms with their respective appending functions (`append_string` and `append_short_string`).	2025-06-21 07:16:51 -04:00
Frederic Branczyk	7b374b9b7a	arrow-array: Implement PartialEq for RunArray (#7727 ) # Which issue does this PR close? Closes #7691 # Are there any user-facing changes? No, just implementing for ease of use and consistency in tests and elsewhere across the wider repository. @alamb	2025-06-21 06:22:43 -04:00
Bruno	469c7ee177	Define an "arrow-pyarrow" crate to implement the "pyarrow" feature. (#7694 ) This follows the pattern of other parts of the arrow-rs codebase: arrow-array, arrow-schema, etc. With this change, polyglot codebases can use pyarrow without making all their crates that use arrow pull in pyarrow (& pyo3). It also allows interfacing with PyArrow without pulling in Arrow. # Which issue does this PR close? Closes https://github.com/apache/arrow-rs/issues/7668. # Rationale for this change Part of a codebase can use pyarrow without arrow pulling in pyo3 across the codebase. # Are there any user-facing changes? Nope.	2025-06-20 15:55:10 -04:00
Qi Zhu	fbaf7cea2d	Support write to buffer api for SerializedFileWriter (#7714 ) # Which issue does this PR close? Currently there is no public API for writing to the internal buffer of `SerializedFileWriter`. Such an API is very helpful when adding low-level APIs, for example: - https://github.com/apache/datafusion/issues/16374 - https://github.com/apache/datafusion/pull/16395 We want to keep the buffer's bytes-written count up to date, but if we write using the buffer's internal file directly, we can't update that count. Consistently updating the bytes-written metric is key for our custom index write. # Rationale for this change Add an API that supports writing while updating the buffer's bytes-written count. # What changes are included in this PR? Add an API that supports writing while updating the buffer's bytes-written count. # Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-20 12:08:30 -04:00
Andrew Lamb	1bed04c1e0	Optimize coalesce kernel for StringView (10-50% faster) (#7650 ) # Which issue does this PR close? - Part of https://github.com/apache/arrow-rs/issues/7456 # Rationale for this change Currently the `coalesce` kernel buffers views / data until there are enough rows and then concat's the results together. StringViewArrays can be even worse as there is a second copy in `gc_string_view_batch` This is wasteful because it 1. Buffers memory (has 2x the peak usage) 2. Copies the data twice We can make it faster and more memory efficient by directly creating the output array # What changes are included in this PR? 1. Add a specialization for incrementally building `StringViewArray` without buffering Note this PR does NOT (yet) add specialized filtering -- instead it focuses on reducing the overhead of appending views by not copying them (again!) with `gc_string_view_batch` # Open questions: 1. There is substantial overlap / duplication with StringViewBuilder -- I wonder if we can / should consolidate them somehow The differences are that the 1. Block size calculation management (aka look at the buffer sizes of the incoming buffers) 2. Finishing array allocates sufficient space for views # Are there any user-facing changes? The kernel is faster, no API changes	2025-06-20 09:24:22 -04:00
Ryan Johnson	7276819d0d	Split out variant code into several new sub-modules (#7717 ) # Which issue does this PR close? Housekeeping, part of * https://github.com/apache/arrow-rs/issues/6736 # Rationale for this change The variant module was starting to become unwieldy. # What changes are included in this PR? Split out metadata, object, and list to sub-modules; move `OffsetSize` to the decoder module where it arguably belongs. Result: variant.rs is "only" ~900 LoC instead of ~2kLoc. # Are there any user-facing changes? No. Public re-exports should hide the change from users. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-20 09:21:47 -04:00
Alex Wilcoxson	75008eb580	feat: add min max aggregate support for FixedSizeBinary (#7675 ) # Which issue does this PR close? Closes #7674. # Rationale for this change Adding support for these min/max functions [so DataFusion can utilize them](https://github.com/apache/datafusion/blob/dd936cb1b25cb685e0e146f297950eb00048c64c/datafusion/functions-aggregate/src/min_max.rs#L600) # What changes are included in this PR? Added new min and max functions for fixed size binary and updated existing tests. # Are there any user-facing changes? Yes new functions `min_fixed_size_binary` and `max_fixed_size_binary` added.	2025-06-20 09:14:27 -04:00
Frederic Branczyk	ecd2905cc2	arrow-data: Add REE support for `build_extend` and `build_extend_nulls` (#7671 ) # Which issue does this PR close? Part 3 of https://github.com/apache/datafusion/issues/16011 # What changes are included in this PR? This is a piece of the puzzle to support aggregating on REE arrays. # Are there any user-facing changes? No user facing changes, just extending functionality of existing APIs to support extracting rows from REE arrays. @alamb	2025-06-19 20:19:37 -04:00
Andrew Lamb	fe65b8d937	[Variant] Add variant docs and examples (#7661 ) # Which issue does this PR close? - Follow on to https://github.com/apache/arrow-rs/pull/7644 - Part of https://github.com/apache/arrow-rs/issues/6736 # Rationale for this change Using the parquet APIs came up in https://github.com/apache/arrow-rs/pull/7644#discussion_r2145228349 so I wanted to help contribute some additional documentation / tests # What changes are included in this PR? Add documentation and tests about `Variant`, specifically some examples of how to create `Variant` values # Are there any user-facing changes? More docs --------- Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>	2025-06-19 06:56:08 -04:00
Ryan Johnson	20c1c34cce	Make variant iterators safely infallible (#7704 ) # Which issue does this PR close? * Closes https://github.com/apache/arrow-rs/issues/7684 * Closes https://github.com/apache/arrow-rs/issues/7685 * Part of https://github.com/apache/arrow-rs/issues/6736 # Rationale for this change Infallible iteration is _much_ easier to work with, vs. Iterator of Result or Result of Iterator. Iteration and validation are strongly correlated, because the iterator can only be infallible if the constructor previously validated everything the iterator depends on. # What changes are included in this PR? In all three of `VariantMetadata`, `VariantList`, and `VariantObject`: * The header object is cleaned up to _only_ consider actual header state. Other state is moved to the object itself. * Constructors fully validate the object by consuming a fallible iterator * The externally visible iterator does a `map(Result::unwrap)` on the same fallible iterator, relying on the constructor to prove the unwrap is safe. * The externally visible iterator is obtained by calling `iter()` method. In addition: * `VariantObject` methods no longer materialize the whole offset+field array * Removed validation that is covered by the new iterator testing * A bunch of dead code removed, several methods renamed for clarity * `first_byte_from_slice` now returns `u8` instead of `&u8` # Are there any user-facing changes? Visibility and signatures of some methods changed. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-19 06:31:59 -04:00
Daniël Heres	6227419d22	Speedup `interleave_views` (4-7x faster) (#7695 ) # Which issue does this PR close? Closes https://github.com/apache/arrow-rs/issues/7688 # Rationale for this change `interleave_views` is really slow, taking up ~25% of the samples in `SortPreservingMergeExec`. We can make it faster. <details> ``` interleave str_view(0.0) 100 [0..100, 100..230, 450..1000] time: [369.33 ns 371.42 ns 374.48 ns] change: [−77.355% −77.199% −77.051%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low mild 3 (3.00%) high severe interleave str_view(0.0) 400 [0..100, 100..230, 450..1000] time: [932.11 ns 937.68 ns 945.43 ns] change: [−84.672% −84.528% −84.382%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000] time: [2.0938 µs 2.1058 µs 2.1235 µs] change: [−86.449% −86.310% −86.167%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) high mild 2 (2.00%) high severe interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000] time: [2.2045 µs 2.2098 µs 2.2170 µs] change: [−84.595% −84.493% −84.401%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe ``` </details> # What changes are included in this PR? # Are there any user-facing changes?	2025-06-18 07:57:49 +02:00
Li Jiaying	56ac4dc242	Initial Builder API for Creating Variant Values (#7653 ) # Which issue does this PR close? - Closes #7424 - Part of https://github.com/apache/arrow-rs/issues/6736 # Rationale for this change This PR introduces a basic builder API for creating Variant values, building on the foundation laid by @mkarbo. The builder provides a user-friendly nested API while maintaining performance through a single-buffer design. The design was shaped with huge help from feedback by @alamb, @scovich and @Weijun-H, and draws much inspiration from the excellent work by @zeroshade. This is an initial version and does not yet support nested values, metadata key sorting, and so on. # What changes are included in this PR? - Adds VariantBuilder, ObjectBuilder, ArrayBuilder # Are there any user-facing changes? The new APIs added in parquet-variant will be user facing. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-17 18:32:43 -04:00
Connor Sanders	ed25bbaf9d	Implement Array Decoding in arrow-avro (#7559 ) # Which issue does this PR close? Part of https://github.com/apache/arrow-rs/issues/4886 Related to https://github.com/apache/arrow-rs/pull/6965 # Rationale for this change Avro supports arrays as a core data type, but previously arrow-avro had incomplete decoding logic to handle them. As a result, any Avro file containing array fields would fail to parse correctly within the Arrow ecosystem. This PR addresses this gap by: 1. Completing the implementation of explicit `Array` -> `List` decoding: It completes the `Decoder::Array` logic that reads array blocks in Avro format and constructs an Arrow `ListArray`. Overall, these changes expand Arrow’s Avro reader capabilities, allowing users to work with array-encoded data in a standardized Arrow format. # What changes are included in this PR? 1. arrow-avro/src/reader/record.rs: * Completed the Array decoding path which leverages blockwise reads of Avro array data. * Implemented decoder unit tests for Array types. # Are there any user-facing changes? N/A	2025-06-17 16:57:15 -04:00
Andrew Lamb	f37b1149db	Document REE row format and add some more tests (#7680 ) ~Draft until https://github.com/apache/arrow-rs/pull/7649 is merged~ # Which issue does this PR close? - Follow on to https://github.com/apache/arrow-rs/pull/7649 from @brancz # Rationale for this change I noticed some extra testing and docs I would like to see so I made a PR to add them # What changes are included in this PR? 1. Add docs + additional tests # Are there any user-facing changes? No code changes, only some docs (and more tests)	2025-06-17 15:53:41 -04:00
Frederic Branczyk	3837ac01dc	arrow-row: Add support for REE (#7649 ) # Which issue does this PR close? Part 2 of https://github.com/apache/datafusion/issues/16011 # Are there any user-facing changes? No user facing changes, just extending functionality of existing APIs to support extracting rows from REE arrays. @alamb	2025-06-17 15:05:51 -04:00
Emil Ernerfeldt	e6c93c02fd	Add `RecordBatch::schema_metadata_mut` and `Field::metadata_mut` (#7664 ) # Which issue does this PR close? * Closes https://github.com/apache/arrow-rs/issues/7628 # Rationale for this change Allows for fast and convenient mutating of the metadata of record batches and fields. # What changes are included in this PR? Added: * `RecordBatch::schema_metadata_mut` * `Field::metadata_mut` # Why call it `schema_metadata_mut` and not just `metadata_mut`? See https://github.com/apache/arrow-rs/issues/7628#issuecomment-2970823649 for motivation --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-17 12:21:18 -04:00
Andrew Lamb	a19fc628b9	Add `BatchCoalescer::push_filtered_batch` and docs (#7652 ) # Which issue does this PR close? - Part of https://github.com/apache/arrow-rs/pull/7650 # Rationale for this change Coalescing the result of applying a filter currently requires first copying the results into an intermediate array (calling `filter`). My plan is to remove this extra copy by incrementally building the final array directly. To do so, there needs to be an API that can take the original data and the filter. # What changes are included in this PR? 1. Add `BatchCoalescer::push_filtered_batch` and docs 2. Update benchmarks to use it # Are there any user-facing changes? New API	2025-06-17 07:37:39 -04:00
Matthijs Brobbel	e1ade7b036	chore: fix a typo in `ExtensionType::supports_data_type` docs (#7682 ) # Which issue does this PR close? None # Rationale for this change Noticed a typo. # What changes are included in this PR? Fixes the typo. # Are there any user-facing changes? Updated docs.	2025-06-17 07:26:46 -04:00
Ryan Johnson	f5f09eaa71	Finish implementing Variant::Object and Variant::List (#7666 ) # Which issue does this PR close? - Closes https://github.com/apache/arrow-rs/issues/7665 # Rationale for this change Continuing the ongoing variant implementation effort. # What changes are included in this PR? As per title -- implement fairly complete support for variant objects and arrays. Also add some unit tests. Note: This PR renames `VariantArray` as `VariantList` to align with parquet and arrow terminology, and to not conflict with the `VariantArray` we will eventually need to define for holding an arrow array of variant-typed data. # Are there any user-facing changes? Those variant subtypes should now be usable. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-17 07:16:02 -04:00
Matthijs Brobbel	639b5bb93e	chore(dependabot): explicitly include root workspace and arrow-pyarrow-integration-testing (#7673 ) # Which issue does this PR close? The attempt in #7672 to include `arrow-pyarrow-integration-testing` using a wildcard did not work: https://github.com/apache/arrow-rs/actions/runs/15678211946/job/44163362777#step:3:6376 # Rationale for this change Using a wildcard does not work. # What changes are included in this PR? Explicitly include root workspace and `arrow-pyarrow-integration-testing` in dependabot config for cargo. # Are there any user-facing changes? No.	2025-06-16 13:58:56 -04:00
Andrew Lamb	3a15f84e81	[Variant] Simplify creation of Variants from metadata and value (#7663 ) # Which issue does this PR close? - Part of https://github.com/apache/arrow-rs/issues/6736 # Rationale for this change While making documentation / examples for working with `Variant` in https://github.com/apache/arrow-rs/pull/7661, I found it was somewhat awkward to make `Variant` values directly from the metadata and value. Specifically you have to ```rust let metadata = [0x01, 0x00, 0x00]; let value = [0x09, 0x48, 0x49]; // parse the header metadata let metadata = VariantMetadata::try_new(&metadata).unwrap(); // and only then can you make the Variant Variant::try_new(&metadata, &value).unwrap() ``` I would really like to be able to create `Variant` directly from `metadata` and `value` without having to make a `VariantMetadata` structure # What changes are included in this PR? This PR proposes a small change to the API so creating a Variant now looks like: ```rust let metadata = [0x01, 0x00, 0x00]; let value = [0x09, 0x48, 0x49]; // You can now make the Variant directly from the metadata and value Variant::try_new(&metadata, &value).unwrap() ``` # Are there any user-facing changes? Yes, the API for creating Variants is slightly different (and I think better)	2025-06-16 13:55:58 -04:00
Matthijs Brobbel	f48efc21a2	chore(dependabot): update all Cargo manifests (#7672 ) # Which issue does this PR close? None # Rationale for this change There are more `Cargo.toml` files that should be considered by Dependabot. # What changes are included in this PR? Search recursively for manifests instead of just the one in the root. # Are there any user-facing changes? No	2025-06-16 06:21:02 -04:00
Expyron	58b34cbabb	Remove `lazy_static` dependency (#7669 ) # Which issue does this PR close? None # Rationale for this change This removes a dependency by using the new LazyLock feature available since Rust 1.80. This crate already has an MSRV of 1.81, so this is not a breaking change. # Are there any user-facing changes? No user-facing changes.	2025-06-16 11:06:20 +02:00
Matthijs Brobbel	1029974bc0	chore: group prost dependabot updates (#7659 ) # Which issue does this PR close? None. # Rationale for this change Group PRs like - #7656 - #7657 - #7658 # What changes are included in this PR? Group for prost updates in dependabot config. # Are there any user-facing changes? No.	2025-06-13 22:11:16 +02:00
张林伟	c87a4d9d5e	Add `pretty_format_batches_with_schema` function (#7642 ) # Which issue does this PR close? Closes #NNN. # Rationale for this change Improve empty batches format. # What changes are included in this PR? Add new `pretty_format_batches_with_schema` function. # Are there any user-facing changes? Yes. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-13 15:56:51 -04:00
albertlockett	2f2e705734	feat: add constructor to help efficiently upgrade key for GenericByteDictionaryBuilder (#7611 ) # Which issue does this PR close? - Closes https://github.com/apache/arrow-rs/issues/7610 # Rationale for this change I'm adding this because I would like to have a more efficient method for upgrading the key type of a dictionary builder in the case where my dictionary keys have overflowed. # What changes are included in this PR? This adds a method called `try_new_from_builder` to `GenericByteDictionaryBuilder` that can be used to construct a new builder from the passed argument with the same values and internal state, but a keys array builder of a different type (the motivation being that the new key type could hold more values). # Are there any user-facing changes? --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-13 12:20:49 -04:00
superserious-dev	71ee9d9aa0	[Variant] Implement read support for remaining primitive types (#7644 ) # Which issue does this PR close? Closes #7630. # What changes are included in this PR? This PR implements support for the following primitive variant types: - Binary - Date - TimestampMicros - TimestampNtzMicros - Int16 - Int32 - Int64 - Decimal4 - Decimal8 - Decimal16 - Float - Double The following types are not yet implemented (see [here](https://github.com/apache/parquet-testing/blob/b68bea40fed8d1a780a9e09dd2262017e04b19ad/variant/regen.py#L78-L83) for details): - TimeNTZ - TimestampNanos - TimestampNtzNanos - UUID # Are there any user-facing changes? Users who opt in to the Variant feature can use these primitives.	2025-06-13 10:34:04 -04:00
Adam Reeve	e32f545c75	Use approximate comparisons for pow tests (#7646 )	2025-06-11 21:50:31 -07:00
Ed Seidl	8d6cd7627c	Ensure page encoding statistics are written to Parquet file (#7643 )	2025-06-11 21:18:34 -07:00
Sergei Grebnov	e5ad232c90	Update FlightSQL `GetDbSchemas` and `GetTables` schemas to fully match the protocol (#7638 ) # Which issue does this PR close? PR updates FlightSQL `GetDbSchemas` and `GetTables` schemas to fully match the FlightSQL protocol (fields nullability). Fixes - https://github.com/apache/arrow-rs/issues/7637 # Are there any user-facing changes? It could technically be considered a user-facing breaking change, as the schema returned by the `CommandGetDbSchemas` and `CommandGetTables` FlightSQL commands will change. However, since the change only affects field nullability, there should be no practical impact, or it is very unlikely.	2025-06-11 11:40:23 -04:00
Ed Seidl	2be261b78b	Deprecate old Parquet page index parsing functions (#7640 ) # Which issue does this PR close? - Closes #6447. # Rationale for this change This deprecates the last of the old standalone Parquet metadata parsing functions that have since been replaced by `ParquetMetaDataReader`. # What changes are included in this PR? # Are there any user-facing changes? No, only adds deprecation warnings to public API	2025-06-11 00:24:19 -04:00
Ed Seidl	3fe458ef85	Minor: Add version to deprecation notice for `ParquetMetaDataReader::decode_footer` (#7639 ) # Which issue does this PR close? Found a `#[deprecated]` missing a `since` while preparing to remove deprecated APIs. # Rationale for this change # What changes are included in this PR? # Are there any user-facing changes? No, just adds clarification	2025-06-11 00:23:02 -04:00
Ed Seidl	04300b4deb	Minor: Remove outdated FIXME from `ParquetMetaDataReader` (#7635 ) # Which issue does this PR close? Related to #6447. While reviewing other PRs I happened to notice an old FIXME I left behind that should have been removed in #6639. # Rationale for this change # What changes are included in this PR? # Are there any user-facing changes? No, just removes a comment	2025-06-11 00:20:06 -04:00
Adam Reeve	857614c87e	Fix reading encrypted Parquet pages when using the page index (#7633 ) # Which issue does this PR close? Closes #7629. I also noticed that skipping pages in encrypted files was broken so have fixed that too. # What changes are included in this PR? * Refactors `SerializedPageReader` to reduce the use of `#[cfg(...)]` inline. To work with the borrow checker, I created a new `SerializedPageReaderContext` type to hold the `CryptoContext`. * Updates `SerializedPageReader::get_next_page` so that page headers and page data are decrypted when page indexes are used. * Updates `SerializedPageReader::skip_next_page` to update the page index so that encryption AADs are calculated correctly. * Adds new unit tests for reading with a page index and skipping pages in encrypted files. # Are there any user-facing changes? Only bug fixes. --------- Co-authored-by: Ed Seidl <etseidl@live.com>	2025-06-11 16:15:06 +12:00
albertlockett	721150286b	feat: support append_nulls on additional builders (#7606 ) # Which issue does this PR close? - Closes https://github.com/apache/arrow-rs/issues/7605 # Rationale for this change I thought it would be nice if `append_nulls` was supported for additional types of array builders. Currently it is available on some builder types, but not all. # What changes are included in this PR? Add an `append_nulls` method to: - FixedSizeBinaryDictionaryBuilder - FixedSizeBinaryBuilder - GenericByteBuilder - GenericListBuilder - StructBuilder # Are there any user-facing changes? --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-10 12:29:23 -04:00
xudong.w	13fc3c8ef1	Fix the error info of `StructArray::try_new` (#7634 ) # Rationale for this change There isn't `StructArray::new_empty`, it's `StructArray::new_empty_fields`, the error info is misleading.	2025-06-10 23:17:31 +08:00
Christian	9482f785fc	[array] Remove unwrap checks from GenericByteArray::value_unchecked (#7573 ) # Which issue does this PR close? `GenericByteArray::value_unchecked` permits `unsafe` code, but still introduces a check due to `unwrap` being called here: ```diff let b = std::slice::from_raw_parts( - self.value_data.as_ptr().offset(start.to_isize().unwrap()), - (end - start).to_usize().unwrap(), + self.value_data.as_ptr().offset(start.to_isize().unwrap_unchecked()), + (end - start).to_usize().unwrap_unchecked(), ); ``` I believe it is sensible to use `unwrap_unchecked` here instead. While the compiler may be able to prune the first unwrap as unreachable, I believe it cannot prove at compile time that `end >= start` and eliminate the second unwrap. This is an invariant of GenericByteArray. # Are there any user-facing changes? No. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-09 14:38:21 -04:00
Julian Popescu	05363f6ae2	feat: add AsyncArrowWriter::into_inner (#7604 ) # Which issue does this PR close? - Closes #7603. # What changes are included in this PR? I've added an `into_inner` function for the `AsyncArrowWriter` # Are there any user-facing changes? This is not a breaking change. I've added some documentation to describe the methods and possible pitfalls --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-09 14:32:07 -04:00
Christian	8d4beaeb32	Optimize length calculation in row encoding for fixed-length columns (#7564 ) # Rationale for this change When converting data into row format, a significant portion of cycles is spent determining the lengths of the rows to be created. For columns with fixed-size elements (determined by datatype), this calculation can be optimized by avoiding writes to an intermediate vector for length tracking. # What changes are included in this PR? - Implements `LengthTracker` which only materializes lengths for variable-size columns - Updates length calculation in `row_lengths(..)` and offset computation in `RowConverter::append` to use the `LengthTracker` # Are there any user-facing changes? No.	2025-06-09 18:13:51 +02:00
Andrew Lamb	375bee76b1	[Variant] Add commented out primitive test cases (#7631 ) # Which issue does this PR close? - Part of https://github.com/apache/arrow-rs/issues/7630 # Rationale for this change Make it easy to add this feature by preparing the path with tests # What changes are included in this PR? Add tests (commented out) that should pass after https://github.com/apache/arrow-rs/issues/7630 is done # Are there any user-facing changes? No	2025-06-09 09:40:53 -04:00
Andrew Lamb	312e2fd44a	Move variant interop test to Rust integration test (#7602 ) # Which issue does this PR close? - Part of https://github.com/apache/arrow-rs/issues/6736 # Rationale for this change Rust integration tests (in `parquet-variant/tests`) are compiled as an external program would be compiled and thus can only use the exposed API. This helps verify that the crate is usable # What changes are included in this PR? 1. Move the tests that read/write variant values into `variant_interop` test (`cargo test --test variant_interop`) 2. Publicly expose `pub` structures # Are there any user-facing changes? There are now pub APIs in the parquet-variant crate	2025-06-09 09:07:12 -04:00
Andrew Lamb	23e18bceba	Improve `coalesce` kernel tests (#7626 ) # Which issue does this PR close? - Follow on to https://github.com/apache/arrow-rs/pull/7625 from @Dandandan # Rationale for this change I want to eventually remove `gc_string_view` but currently the unit tests are in terms of that function # What changes are included in this PR? Rewrite tests to be in terms of `coalesce` instead Also, 1. Add additional coverage for the issue we saw in https://github.com/apache/arrow-rs/pull/7623 2. Add coverage for the case where there are data buffers in the view, but they are not referenced by any view https://github.com/apache/arrow-rs/pull/7625#discussion_r2134634467 Codecov of this module is now 100% # Are there any user-facing changes?	2025-06-08 20:02:22 -04:00
Jigao Luo	9d172a860a	Adding Encoding argument in `parquet-rewrite` (#7576 ) # Which issue does this PR close? - Closes #7575. # Rationale for this change Need an option to set encoding for all columns. # What changes are included in this PR? This PR: - introduces an encoding parameter for `set_encoding`. - groups the encoding-related code together in the file. # Are there any user-facing changes? No --------- Signed-off-by: Jigao Luo <jigao.luo@outlook.com>	2025-06-08 09:52:13 -04:00
Daniël Heres	52d8d568f4	Revert "Revert "Improve `coalesce` and `concat` performance for views… (#7625 ) … (#7614)" (#7623)" This reverts commit `da461c8754`. This adds a test and fix for the wrong index issue. I also verified the change for DataFusion (and benchmarks show notable improvements). # Which issue does this PR close? Closes #NNN. # Rationale for this change # What changes are included in this PR? # Are there any user-facing changes?	2025-06-08 09:40:22 -04:00
Daniël Heres	da461c8754	Revert "Improve `coalesce` and `concat` performance for views (#7614)" (#7623) This reverts commit `7739a83fe0`. # Which issue does this PR close? # Rationale for this change I found that this causes errors in DataFusion (see https://github.com/apache/datafusion/pull/16249#issuecomment-2952353060), so let's revert it and find the error. # What changes are included in this PR? # Are there any user-facing changes?	2025-06-07 07:55:16 -04:00
Daniël Heres	7739a83fe0	Improve `coalesce` and `concat` performance for views (#7614) # Which issue does this PR close? - Closes #7615 - Follow on to https://github.com/apache/arrow-rs/pull/7597 # Rationale for this change Improve performance of `gc_string_view_batch`
```
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.001  1.00  30.4±1.05ms  ? ?/sec  1.29  39.3±0.88ms  ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.01  1.00  4.3±0.17ms  ? ?/sec  1.20  5.2±0.15ms  ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.1  1.00  1805.1±25.77µs  ? ?/sec  1.32  2.4±0.20ms  ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.8  1.00  2.6±0.12ms  ? ?/sec  1.48  3.8±0.11ms  ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.001  1.00  42.5±0.48ms  ? ?/sec  1.23  52.2±1.33ms  ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.01  1.00  5.8±0.12ms  ? ?/sec  1.28  7.4±0.20ms  ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.1  1.00  2.2±0.02ms  ? ?/sec  1.37  3.1±0.18ms  ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.8  1.00  3.6±0.15ms  ? ?/sec  1.43  5.1±0.12ms  ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001  1.00  51.0±0.59ms  ? ?/sec  1.38  70.3±1.11ms  ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01  1.00  6.7±0.03ms  ? ?/sec  1.32  8.8±0.16ms  ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1  1.00  3.0±0.01ms  ? ?/sec  1.41  4.3±0.09ms  ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8  1.00  4.5±0.34ms  ? ?/sec  1.71  7.7±0.28ms  ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001  1.00  64.2±0.74ms  ? ?/sec  1.33  85.1±1.52ms  ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01  1.00  9.4±0.09ms  ? ?/sec  1.35  12.6±0.26ms  ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1  1.00  3.8±0.03ms  ? ?/sec  1.46  5.6±0.11ms  ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8  1.00  5.7±0.28ms  ? ?/sec  1.73  9.9±0.27ms  ? ?/sec
```
# What changes are included in this PR? * Avoiding recreating the views from scratch. * Specialize concat for view types * Takes owned RecordBatch (effect on performance is small, might be measurable with smaller batch size / more columns). # Are there any user-facing changes? no --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2025-06-07 11:56:45 +02:00
Andrew Lamb	44d7194712	Improve coalesce_kernel benchmark to capture inline vs non inline views (#7619) # Which issue does this PR close? - Follow on to https://github.com/apache/arrow-rs/pull/7597 # Rationale for this change While reviewing the code and the concat kernel for - https://github.com/apache/arrow-rs/pull/7617 I realized there is a non-trivial difference between all inlined views, some inlined views, and mostly large strings, so the benchmarks should capture that # What changes are included in this PR? 1. Add variations of benchmark with different size strings in StringViewArray # Are there any user-facing changes? If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out.	2025-06-07 00:19:23 +02:00

6571 Commits