6571 Commits

Author SHA1 Message Date
Andrew Lamb df702cfc71 Prepare for 55.2.0 release (#7722)
# Which issue does this PR close?

- Part of https://github.com/apache/arrow-rs/issues/7394

# Rationale for this change

Prepare for next software release

# What changes are included in this PR?
1. Update version to `55.2.0`
2. See rendered changelog here:
https://github.com/alamb/arrow-rs/blob/alamb/prepare_55.2.0/CHANGELOG.md


# Are there any user-facing changes?

New release version
2025-06-22 09:08:47 -04:00
Nick Lanham 2788762c63 fix JSON decoder error checking for UTF16 / surrogate parsing panic (#7721)
# Which issue does this PR close?

- Closes #7712 .

# Rationale for this change

Shouldn't panic, especially in a fallible function.

# What changes are included in this PR?

Validate that the high and low surrogates are in the expected range,
which guarantees that the subtractions won't overflow.
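
A minimal sketch of the range check described above, using only the standard library. The function name `combine_surrogates` is hypothetical and this is not the decoder's actual code; it only illustrates why validating both halves first makes the subtractions safe.

```rust
/// Combine a UTF-16 surrogate pair into a `char`.
/// Returns `None` instead of panicking when either half is out of range.
/// (Hypothetical helper for illustration, not the arrow-json implementation.)
fn combine_surrogates(high: u16, low: u16) -> Option<char> {
    // High surrogates are 0xD800..=0xDBFF, low surrogates are 0xDC00..=0xDFFF.
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    // After the range checks, neither subtraction can underflow.
    let hi = (high as u32 - 0xD800) << 10;
    let lo = low as u32 - 0xDC00;
    char::from_u32(0x10000 + hi + lo)
}
```

Returning `Option` keeps the fallible path explicit, so a caller can turn an invalid pair into a decode error rather than a panic.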

# Are there any user-facing changes?

No (well, things that used to panic now won't, but I don't think that
counts)
2025-06-22 08:34:28 -04:00
Oleks V e54b72bc4d fix: Do not add null buffer for NullArray in MutableArrayData (#7726)
# Which issue does this PR close?

- Closes #7725.

# Rationale for this change

Do not populate null buffers when building `MutableArrayData` for
`NullArray`

# What changes are included in this PR?

# Are there any user-facing changes?

2025-06-22 08:09:05 -04:00
Matthew Kim 1ededfe024 [Variant] Introduce new type over &str for ShortString (#7718)
# Which issue does this PR close?

- Closes https://github.com/apache/arrow-rs/issues/7700

This commit introduces `ShortString`, a newtype that wraps `&str` and
enforces a maximum length constraint. This also allows us to perform
validation once and removes a superfluous validation check in
`append_value`.

The now-superfluous validation check was needed because users could
construct `Variant::ShortString`s directly, without any input
validation. This meant you could have a short string variant that
actually contained a string longer than 63 bytes.

But since we enforce this check upon construction, we can directly match
against `Variant::String` and `Variant::ShortString` arms with their
respective appending functions (`append_string` and
`append_short_string`).
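
The validate-once newtype pattern described above might look roughly like the following. This is a sketch under assumptions: the constant name, the `try_new` constructor, and the error type are hypothetical and not taken from the crate; only the 63-byte limit comes from the description.

```rust
/// Maximum length in bytes, per the description above.
/// (The constant name is hypothetical.)
const MAX_SHORT_STRING_BYTES: usize = 63;

/// Hypothetical stand-in for the `ShortString` newtype.
/// The field is private, so the only way to obtain a value is `try_new`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ShortString<'a>(&'a str);

impl<'a> ShortString<'a> {
    /// Validate the length once, at construction time.
    fn try_new(value: &'a str) -> Result<Self, String> {
        if value.len() > MAX_SHORT_STRING_BYTES {
            return Err(format!(
                "short string is {} bytes, maximum is {}",
                value.len(),
                MAX_SHORT_STRING_BYTES
            ));
        }
        Ok(Self(value))
    }

    /// Access the wrapped string; no further validation is needed.
    fn as_str(&self) -> &'a str {
        self.0
    }
}
```

Because every `ShortString` has already passed the check, code that consumes one can skip revalidation.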
2025-06-21 07:16:51 -04:00
Frederic Branczyk 7b374b9b7a arrow-array: Implement PartialEq for RunArray (#7727)
# Which issue does this PR close?

Closes #7691

# Are there any user-facing changes?

No, just implementing for ease of use and consistency in tests and
elsewhere across the wider repository.

@alamb
2025-06-21 06:22:43 -04:00
Bruno 469c7ee177 Define an "arrow-pyarrow" crate to implement the "pyarrow" feature. (#7694)
This follows the pattern of other parts of the arrow-rs codebase:
arrow-array, arrow-schema, etc.

With this change, polyglot codebases can use pyarrow without making all
their crates that use arrow pull in pyarrow (& pyo3).

It also allows interfacing with PyArrow without pulling in Arrow.

# Which issue does this PR close?

Closes https://github.com/apache/arrow-rs/issues/7668.

# Rationale for this change

Part of a codebase can use pyarrow without arrow pulling in pyo3 across
the codebase.

# Are there any user-facing changes?

Nope.
2025-06-20 15:55:10 -04:00
Qi Zhu fbaf7cea2d Support write to buffer api for SerializedFileWriter (#7714)
# Which issue does this PR close?

Currently there is no public API for writing to the internal buffer of
`SerializedFileWriter`. Such an API is very helpful when adding
low-level functionality, for example:
- https://github.com/apache/datafusion/issues/16374
- https://github.com/apache/datafusion/pull/16395

We want the bytes-written count to stay accurate: if we write to the
underlying file directly, the writer's internal bytes-written count is
not updated.

Keeping the bytes-written metric consistent is key for our custom index
write.


# Rationale for this change

Add an API that writes to the buffer while updating its bytes-written
count.

# What changes are included in this PR?

Add an API that writes to the buffer while updating its bytes-written
count.

# Are there any user-facing changes?
No


---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-20 12:08:30 -04:00
Andrew Lamb 1bed04c1e0 Optimize coalesce kernel for StringView (10-50% faster) (#7650)
# Which issue does this PR close?


- Part of https://github.com/apache/arrow-rs/issues/7456

# Rationale for this change

Currently the `coalesce` kernel buffers views / data until there are
enough rows and then concatenates the results. `StringViewArray`s can
be even worse, as there is a second copy in `gc_string_view_batch`.

This is wasteful because it
1. Buffers memory (has 2x the peak usage)
2. Copies the data twice

We can make it faster and more memory efficient by directly creating the
output array

# What changes are included in this PR?
1. Add a specialization for incrementally building `StringViewArray`
without buffering

Note this PR does NOT (yet) add specialized filtering -- instead it
focuses on reducing the
overhead of appending views by not copying them (again!) with
`gc_string_view_batch`

# Open questions:
1. There is substantial overlap / duplication with StringViewBuilder --
I wonder if we can / should consolidate them somehow

The differences are:
1. Block size calculation (i.e. it looks at the buffer sizes of the
incoming buffers)
2. Finishing the array allocates sufficient space for views

# Are there any user-facing changes?
The kernel is faster, no API changes
2025-06-20 09:24:22 -04:00
Ryan Johnson 7276819d0d Split out variant code into several new sub-modules (#7717)
# Which issue does this PR close?

Housekeeping, part of
* https://github.com/apache/arrow-rs/issues/6736

# Rationale for this change

The variant module was starting to become unwieldy.

# What changes are included in this PR?

Split out metadata, object, and list to sub-modules; move `OffsetSize`
to the decoder module where it arguably belongs.

Result: variant.rs is "only" ~900 LoC instead of ~2k LoC.

# Are there any user-facing changes?

No. Public re-exports should hide the change from users.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-20 09:21:47 -04:00
Alex Wilcoxson 75008eb580 feat: add min max aggregate support for FixedSizeBinary (#7675)
# Which issue does this PR close?

Closes #7674.

# Rationale for this change

Adding support for these min/max functions [so DataFusion can utilize
them](https://github.com/apache/datafusion/blob/dd936cb1b25cb685e0e146f297950eb00048c64c/datafusion/functions-aggregate/src/min_max.rs#L600)

# What changes are included in this PR?

Added new min and max functions for fixed size binary and updated
existing tests.

# Are there any user-facing changes?

Yes, new functions `min_fixed_size_binary` and `max_fixed_size_binary`
were added.
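
As a rough illustration of what min/max over fixed-size binary values means, the sketch below assumes nulls are skipped and values compare lexicographically by byte. Both are assumptions about the semantics, not details taken from this PR, and `min_bytes` / `max_bytes` are hypothetical helpers rather than the arrow functions named above.

```rust
/// Smallest non-null value, comparing byte slices lexicographically.
/// (Hypothetical helper; nulls are modeled as `None`.)
fn min_bytes<'a>(values: &[Option<&'a [u8]>]) -> Option<&'a [u8]> {
    values.iter().flatten().copied().min()
}

/// Largest non-null value, comparing byte slices lexicographically.
fn max_bytes<'a>(values: &[Option<&'a [u8]>]) -> Option<&'a [u8]> {
    values.iter().flatten().copied().max()
}
```

Both helpers return `None` when every input is null, which mirrors how an aggregate over an all-null column has no defined result.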
2025-06-20 09:14:27 -04:00
Frederic Branczyk ecd2905cc2 arrow-data: Add REE support for build_extend and build_extend_nulls (#7671)
# Which issue does this PR close?

Part 3 of https://github.com/apache/datafusion/issues/16011

# What changes are included in this PR?

This is a piece of the puzzle to support aggregating on REE arrays.

# Are there any user-facing changes?

No user facing changes, just extending functionality of existing APIs to
support extracting rows from REE arrays.
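
For readers unfamiliar with run-end encoding (REE): a run-end encoded array stores one value per run plus the exclusive logical end index of each run. The sketch below shows how a logical row index maps to a run using a binary search. It uses plain slices and a hypothetical `ree_value_at` helper, not the arrow-data code touched by this PR.

```rust
/// Look up the value at logical position `index` in a run-end encoded
/// sequence. `run_ends[i]` is the exclusive logical end of run `i`, and
/// `values[i]` is that run's value. (Hypothetical helper for illustration.)
fn ree_value_at<'a, T>(run_ends: &[usize], values: &'a [T], index: usize) -> Option<&'a T> {
    // Find the first run whose end is strictly greater than `index`.
    let run = run_ends.partition_point(|&end| end <= index);
    values.get(run)
}
```

With `run_ends = [2, 5, 6]` and `values = ['a', 'b', 'c']`, logical rows 0 and 1 map to `'a'`, rows 2 to 4 map to `'b'`, and row 5 maps to `'c'`.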

@alamb
2025-06-19 20:19:37 -04:00
Andrew Lamb fe65b8d937 [Variant] Add variant docs and examples (#7661)
# Which issue does this PR close?

- Follow on to https://github.com/apache/arrow-rs/pull/7644
- Part of https://github.com/apache/arrow-rs/issues/6736

# Rationale for this change

Using the parquet APIs came up in
https://github.com/apache/arrow-rs/pull/7644#discussion_r2145228349 so I
wanted to help contribute some additional documentation / tests

# What changes are included in this PR?

Add documentation and tests about `Variant`, specifically some examples
of how to create `Variant` values

# Are there any user-facing changes?

More docs

---------

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
2025-06-19 06:56:08 -04:00
Ryan Johnson 20c1c34cce Make variant iterators safely infallible (#7704)
# Which issue does this PR close?

* Closes https://github.com/apache/arrow-rs/issues/7684
* Closes https://github.com/apache/arrow-rs/issues/7685
* Part of https://github.com/apache/arrow-rs/issues/6736  

# Rationale for this change

Infallible iteration is _much_ easier to work with, vs. Iterator of
Result or Result of Iterator. Iteration and validation are strongly
correlated, because the iterator can only be infallible if the
constructor previously validated everything the iterator depends on.

# What changes are included in this PR?

In all three of `VariantMetadata`, `VariantList`, and `VariantObject`:
* The header object is cleaned up to _only_ consider actual header
state. Other state is moved to the object itself.
* Constructors fully validate the object by consuming a fallible
iterator
* The externally visible iterator does a `map(Result::unwrap)` on the
same fallible iterator, relying on the constructor to prove the unwrap
is safe.
* The externally visible iterator is obtained by calling `iter()`
method.

In addition:
* `VariantObject` methods no longer materialize the whole offset+field
array
* Removed validation that is covered by the new iterator testing
* A bunch of dead code removed, several methods renamed for clarity
* `first_byte_from_slice` now returns `u8` instead of `&u8`
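
The "validate in the constructor, then unwrap in the iterator" pattern described above can be sketched as follows. This is a simplified illustration on a made-up format (length-prefixed UTF-8 strings); `StrList`, `try_iter`, and the encoding are hypothetical and not the variant types from this PR.

```rust
/// A list of length-prefixed UTF-8 strings: each entry is one length byte
/// followed by that many bytes. (Made-up format for illustration.)
struct StrList<'a> {
    data: &'a [u8],
}

impl<'a> StrList<'a> {
    /// Fully validate the input by draining the fallible iterator once.
    fn try_new(data: &'a [u8]) -> Result<Self, String> {
        let list = Self { data };
        for item in list.try_iter() {
            item?;
        }
        Ok(list)
    }

    /// Fallible iterator; stops after the first error.
    fn try_iter(&self) -> impl Iterator<Item = Result<&'a str, String>> + '_ {
        let data = self.data;
        let mut pos = 0usize;
        let mut failed = false;
        std::iter::from_fn(move || {
            if failed || pos >= data.len() {
                return None;
            }
            let start = pos + 1;
            let end = start + data[pos] as usize;
            let result = data
                .get(start..end)
                .ok_or_else(|| format!("truncated entry at byte {pos}"))
                .and_then(|bytes| std::str::from_utf8(bytes).map_err(|e| e.to_string()));
            failed = result.is_err();
            pos = end;
            Some(result)
        })
    }

    /// Infallible iterator: the unwrap cannot fail because `try_new`
    /// already ran the same fallible iterator to completion.
    fn iter(&self) -> impl Iterator<Item = &'a str> + '_ {
        self.try_iter().map(Result::unwrap)
    }
}
```

The design keeps a single decoding path, so the validation and the public iterator cannot drift apart.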

# Are there any user-facing changes?

Visibility and signatures of some methods changed.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-19 06:31:59 -04:00
Daniël Heres 6227419d22 Speedup interleave_views (4-7x faster) (#7695)
# Which issue does this PR close?

Closes https://github.com/apache/arrow-rs/issues/7688



# Rationale for this change

interleave_views is *really* slow - taking up ~25% of the samples in
`SortPreservingMergeExec`

We can make it faster.

<details>

```
interleave str_view(0.0) 100 [0..100, 100..230, 450..1000]
                        time:   [369.33 ns 371.42 ns 374.48 ns]
                        change: [−77.355% −77.199% −77.051%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high severe

interleave str_view(0.0) 400 [0..100, 100..230, 450..1000]
                        time:   [932.11 ns 937.68 ns 945.43 ns]
                        change: [−84.672% −84.528% −84.382%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000]
                        time:   [2.0938 µs 2.1058 µs 2.1235 µs]
                        change: [−86.449% −86.310% −86.167%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
                        time:   [2.2045 µs 2.2098 µs 2.2170 µs]
                        change: [−84.595% −84.493% −84.401%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
```

</details>


# What changes are included in this PR?

# Are there any user-facing changes?
2025-06-18 07:57:49 +02:00
Li Jiaying 56ac4dc242 Initial Builder API for Creating Variant Values (#7653)
# Which issue does this PR close?

- Closes #7424
- Part of https://github.com/apache/arrow-rs/issues/6736

# Rationale for this change

This PR introduces a basic builder API for creating Variant values,
building on the foundation laid by @mkarbo. The builder provides a
user-friendly nested API while maintaining performance through a
single-buffer design. The design was shaped with huge help from @alamb,
@scovich and @Weijun-H's feedback, and draws much inspiration from the
excellent work by @zeroshade.

This is an initial version and does not yet support nested values,
metadata key sorting, and so on.

# What changes are included in this PR?

- Adds VariantBuilder, ObjectBuilder, ArrayBuilder

# Are there any user-facing changes?

The new APIs added in parquet-variant will be user-facing.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-17 18:32:43 -04:00
Connor Sanders ed25bbaf9d Implement Array Decoding in arrow-avro (#7559)
# Which issue does this PR close?

Part of https://github.com/apache/arrow-rs/issues/4886

Related to https://github.com/apache/arrow-rs/pull/6965

# Rationale for this change
 
Avro supports arrays as a core data type, but previously arrow-avro had
incomplete decoding logic to handle them. As a result, any Avro file
containing array fields would fail to parse correctly within the Arrow
ecosystem. This PR addresses this gap by:

1. Completing the implementation of explicit `Array` -> `List` decoding:
It completes the `Decoder::Array` logic that reads array blocks in Avro
format and constructs an Arrow `ListArray`.

Overall, these changes expand Arrow’s Avro reader capabilities, allowing
users to work with array-encoded data in a standardized Arrow format.

# What changes are included in this PR?

**1. arrow-avro/src/reader/record.rs:**

* Completed the Array decoding path which leverages blockwise reads of
Avro array data.
* Implemented decoder unit tests for Array types.
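
To illustrate the blockwise layout mentioned above: in the Avro binary encoding, an array is a series of blocks, each starting with a zigzag-encoded `long` item count, and a count of zero ends the array. A negative count means its absolute value is the item count and a `long` byte size follows. The sketch below decodes arrays of `long` into list offsets plus a flat values buffer. The function names are hypothetical and this is not the arrow-avro `Decoder::Array` code.

```rust
/// Read one Avro `long` (zigzag-encoded variable-length integer).
fn read_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        if shift > 63 {
            return None; // too many continuation bytes
        }
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo the zigzag mapping.
    Some((value >> 1) as i64 ^ -((value & 1) as i64))
}

/// Decode back-to-back Avro arrays of `long` into list-style offsets and
/// a flat values buffer. Returns `None` on truncated input.
fn decode_long_arrays(buf: &[u8]) -> Option<(Vec<usize>, Vec<i64>)> {
    let mut pos = 0;
    let mut offsets = vec![0];
    let mut values = Vec::new();
    while pos < buf.len() {
        loop {
            let count = read_long(buf, &mut pos)?;
            if count == 0 {
                break; // end of this array
            }
            if count < 0 {
                // A negative count is followed by the block size in bytes,
                // which this sketch reads and ignores.
                read_long(buf, &mut pos)?;
            }
            for _ in 0..count.unsigned_abs() {
                values.push(read_long(buf, &mut pos)?);
            }
        }
        offsets.push(values.len());
    }
    Some((offsets, values))
}
```

The offsets-plus-values result has the same shape as an Arrow list array: row `i` covers `values[offsets[i]..offsets[i + 1]]`.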

# Are there any user-facing changes?

N/A
2025-06-17 16:57:15 -04:00
Andrew Lamb f37b1149db Document REE row format and add some more tests (#7680)
~Draft until https://github.com/apache/arrow-rs/pull/7649 is merged~

# Which issue does this PR close?
- Follow on to https://github.com/apache/arrow-rs/pull/7649 from @brancz

# Rationale for this change
I noticed some extra testing and docs I would like to see so I made a PR
to add them

# What changes are included in this PR?

1. Add docs + additional tests
# Are there any user-facing changes?
No code changes, only some docs (and more tests)
2025-06-17 15:53:41 -04:00
Frederic Branczyk 3837ac01dc arrow-row: Add support for REE (#7649)
# Which issue does this PR close?

Part 2 of https://github.com/apache/datafusion/issues/16011

# Are there any user-facing changes?

No user facing changes, just extending functionality of existing APIs to
support extracting rows from REE arrays.

@alamb
2025-06-17 15:05:51 -04:00
Emil Ernerfeldt e6c93c02fd Add RecordBatch::schema_metadata_mut and Field::metadata_mut (#7664)
# Which issue does this PR close?
* Closes https://github.com/apache/arrow-rs/issues/7628

# Rationale for this change
Allows for fast and convenient mutating of the metadata of record
batches and fields.

# What changes are included in this PR?
Added:
* `RecordBatch::schema_metadata_mut`
* `Field::metadata_mut`

# Why call it `schema_metadata_mut` and not just `metadata_mut`?
See
https://github.com/apache/arrow-rs/issues/7628#issuecomment-2970823649
for motivation

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-17 12:21:18 -04:00
Andrew Lamb a19fc628b9 Add BatchCoalescer::push_filtered_batch and docs (#7652)
# Which issue does this PR close?

- Part of https://github.com/apache/arrow-rs/pull/7650

# Rationale for this change

Coalescing the result of applying a filter currently requires first
copying the results into an intermediate array (by calling `filter`).

My plan is to remove this extra copy by incrementally building the final
array directly.

To do so, there needs to be an API that can take the original data and
the filter.

# What changes are included in this PR?

1. Add `BatchCoalescer::push_filtered_batch` and docs
2. Update benchmarks to use it
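
The idea of pushing a batch together with its filter, instead of filtering first and coalescing afterwards, can be sketched on plain vectors. The `Coalescer` type and its methods below are hypothetical simplifications and do not reflect the real `BatchCoalescer` signatures.

```rust
/// Collects selected rows into output batches of a fixed target size.
/// (Hypothetical, simplified stand-in for a batch coalescer.)
struct Coalescer {
    target: usize,
    in_progress: Vec<i32>,
    completed: Vec<Vec<i32>>,
}

impl Coalescer {
    fn new(target: usize) -> Self {
        assert!(target > 0, "target batch size must be non-zero");
        Self {
            target,
            in_progress: Vec::with_capacity(target),
            completed: Vec::new(),
        }
    }

    /// Copy the rows selected by `mask` straight into the in-progress
    /// output, without materializing an intermediate filtered array.
    fn push_filtered(&mut self, values: &[i32], mask: &[bool]) {
        assert_eq!(values.len(), mask.len());
        for (value, keep) in values.iter().zip(mask) {
            if *keep {
                self.in_progress.push(*value);
                if self.in_progress.len() == self.target {
                    let full = std::mem::replace(
                        &mut self.in_progress,
                        Vec::with_capacity(self.target),
                    );
                    self.completed.push(full);
                }
            }
        }
    }

    /// Flush any partially filled batch and return all output batches.
    fn finish(mut self) -> Vec<Vec<i32>> {
        if !self.in_progress.is_empty() {
            let rest = std::mem::take(&mut self.in_progress);
            self.completed.push(rest);
        }
        self.completed
    }
}
```

Each selected row is copied exactly once, which is the saving the rationale above is after.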

# Are there any user-facing changes?
New API
2025-06-17 07:37:39 -04:00
Matthijs Brobbel e1ade7b036 chore: fix a typo in ExtensionType::supports_data_type docs (#7682)
# Which issue does this PR close?

None

# Rationale for this change

Noticed a typo.

# What changes are included in this PR?

Fixes the typo.

# Are there any user-facing changes?

Updated docs.
2025-06-17 07:26:46 -04:00
Ryan Johnson f5f09eaa71 Finish implementing Variant::Object and Variant::List (#7666)
# Which issue does this PR close?

- Closes https://github.com/apache/arrow-rs/issues/7665

# Rationale for this change

Continuing the ongoing variant implementation effort.

# What changes are included in this PR?

As per title -- implement fairly complete support for variant objects
and arrays. Also add some unit tests.

Note: This PR renames `VariantArray` as `VariantList` to align with
parquet and arrow terminology, and to not conflict with the
`VariantArray` we will eventually need to define for holding an arrow
array of variant-typed data.

# Are there any user-facing changes?

Those variant subtypes should now be usable.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-17 07:16:02 -04:00
Matthijs Brobbel 639b5bb93e chore(dependabot): explicitly include root workspace and arrow-pyarrow-integration-testing (#7673)
# Which issue does this PR close?

The attempt in #7672 to include `arrow-pyarrow-integration-testing`
using a wildcard did not work:
https://github.com/apache/arrow-rs/actions/runs/15678211946/job/44163362777#step:3:6376

# Rationale for this change

Using a wildcard does not work.

# What changes are included in this PR?

Explicitly include root workspace and
`arrow-pyarrow-integration-testing` in dependabot config for cargo.

# Are there any user-facing changes?

No.
2025-06-16 13:58:56 -04:00
Andrew Lamb 3a15f84e81 [Variant] Simplify creation of Variants from metadata and value (#7663)
# Which issue does this PR close?

- Part of https://github.com/apache/arrow-rs/issues/6736

# Rationale for this change

While making documentation / examples for working with `Variant` in
https://github.com/apache/arrow-rs/pull/7661, I found it was somewhat
awkward to make `Variant` values directly from the metadata and value.
Specifically you have to

```rust
let metadata = [0x01, 0x00, 0x00];
let value = [0x09, 0x48, 0x49];
// parse the header metadata
let metadata = VariantMetadata::try_new(&metadata).unwrap();
// and only then can you make the Variant
Variant::try_new(&metadata, &value).unwrap()
```

I would really like to be able to create `Variant` directly from
`metadata` and `value` without having to make a `VariantMetadata`
structure

# What changes are included in this PR?

This PR proposes a small change to the API so creating a Variant now
looks like:

```rust
let metadata = [0x01, 0x00, 0x00];
let value = [0x09, 0x48, 0x49];
// You can now make the Variant directly from the metadata and value
Variant::try_new(&metadata, &value).unwrap()
```


# Are there any user-facing changes?
Yes, the API for creating `Variant`s is slightly different (and I think
better)
2025-06-16 13:55:58 -04:00
Matthijs Brobbel f48efc21a2 chore(dependabot): update all Cargo manifests (#7672)
# Which issue does this PR close?

None

# Rationale for this change

There are more `Cargo.toml` files that should be considered by
Dependabot.

# What changes are included in this PR?

Search recursively for manifests instead of just the one in the root.

# Are there any user-facing changes?

No
2025-06-16 06:21:02 -04:00
Expyron 58b34cbabb Remove lazy_static dependency (#7669)
# Which issue does this PR close?

None

# Rationale for this change

This removes a dependency by using the new LazyLock feature available
since Rust 1.80.
This crate already has an MSRV of 1.81, so this is not a breaking
change.
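
For reference, a lazily initialized static using the standard library's `std::sync::LazyLock` (stable since Rust 1.80) looks like the sketch below. The `TYPE_WIDTHS` table and `width_of` function are hypothetical examples, not code from this PR.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

/// Initialized on first access, then shared for the rest of the program.
/// (Hypothetical lookup table for illustration.)
static TYPE_WIDTHS: LazyLock<HashMap<&'static str, usize>> =
    LazyLock::new(|| HashMap::from([("int32", 4), ("int64", 8)]));

/// Look up the byte width of a type name in the lazily built table.
fn width_of(name: &str) -> Option<usize> {
    TYPE_WIDTHS.get(name).copied()
}
```

This covers the common `lazy_static!` use case without a macro or an external crate.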

# Are there any user-facing changes?

No user-facing changes.
2025-06-16 11:06:20 +02:00
Matthijs Brobbel 1029974bc0 chore: group prost dependabot updates (#7659)
# Which issue does this PR close?

None.

# Rationale for this change

Group PRs like
- #7656 
- #7657
- #7658

# What changes are included in this PR?

Group for prost updates in dependabot config.

# Are there any user-facing changes?

No.
2025-06-13 22:11:16 +02:00
张林伟 c87a4d9d5e Add pretty_format_batches_with_schema function (#7642)
# Which issue does this PR close?

Closes #NNN.

# Rationale for this change

Improve empty batches format.

# What changes are included in this PR?

Add new `pretty_format_batches_with_schema` function.

# Are there any user-facing changes?

Yes.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-13 15:56:51 -04:00
albertlockett 2f2e705734 feat: add constructor to help efficiently upgrade key for GenericByteDictionaryBuilder (#7611)
# Which issue does this PR close?

- Closes https://github.com/apache/arrow-rs/issues/7610

# Rationale for this change


I'm adding this because I would like to have a more efficient method for
upgrading the key type of a dictionary builder in the case where my
dictionary keys have overflowed.

# What changes are included in this PR?


This adds a method called `try_new_from_builder` to
`GenericByteDictionaryBuilder` that can be used to construct a new
builder from the passed argument with the same values and internal
state, but a keys array builder of a different type (the motivation
being that the new key type could hold more values).
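
The key-upgrade idea can be illustrated with a toy dictionary builder. Everything below is a simplified sketch: `DictBuilder` is hypothetical, and the signature given to `try_new_from_builder` here is an assumption for illustration, not the real `GenericByteDictionaryBuilder` API.

```rust
use std::collections::HashMap;

/// Toy dictionary builder: `keys` index into the deduplicated `values`.
struct DictBuilder<K> {
    keys: Vec<K>,
    values: Vec<String>,
    dedup: HashMap<String, usize>,
}

impl<K: Copy + TryFrom<usize>> DictBuilder<K> {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new(), dedup: HashMap::new() }
    }

    /// Append a value, failing (without changing state) if the key type
    /// cannot represent the new dictionary index.
    fn append(&mut self, value: &str) -> Result<(), String> {
        let index = match self.dedup.get(value) {
            Some(&index) => index,
            None => self.values.len(),
        };
        let key = K::try_from(index).map_err(|_| format!("key overflow at index {index}"))?;
        if index == self.values.len() {
            self.values.push(value.to_string());
            self.dedup.insert(value.to_string(), index);
        }
        self.keys.push(key);
        Ok(())
    }

    /// Build a new builder with a different key type, reusing the values
    /// and dedup state and converting only the keys.
    fn try_new_from_builder<S>(source: DictBuilder<S>) -> Result<Self, String>
    where
        S: Copy,
        K: TryFrom<S>,
    {
        let keys = source
            .keys
            .iter()
            .map(|&k| {
                <K as TryFrom<S>>::try_from(k)
                    .map_err(|_| "key does not fit in the new key type".to_string())
            })
            .collect::<Result<Vec<K>, String>>()?;
        Ok(Self { keys, values: source.values, dedup: source.dedup })
    }
}
```

After a `u8`-keyed builder overflows, converting it to `u16` keys lets appending continue without re-inserting every value.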

# Are there any user-facing changes?

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-13 12:20:49 -04:00
superserious-dev 71ee9d9aa0 [Variant] Implement read support for remaining primitive types (#7644)
# Which issue does this PR close?

Closes #7630.

# What changes are included in this PR?

This PR implements support for the following primitive variant types:
- Binary
- Date
- TimestampMicros
- TimestampNtzMicros
- Int16
- Int32
- Int64
- Decimal4
- Decimal8
- Decimal16
- Float
- Double

The following types are not yet implemented (see
[here](https://github.com/apache/parquet-testing/blob/b68bea40fed8d1a780a9e09dd2262017e04b19ad/variant/regen.py#L78-L83)
for details):
- TimeNTZ
- TimestampNanos
- TimestampNtzNanos
- UUID

# Are there any user-facing changes?

Users who opt-in to the Variant feature can use these primitives.
2025-06-13 10:34:04 -04:00
Adam Reeve e32f545c75 Use approximate comparisons for pow tests (#7646) 2025-06-11 21:50:31 -07:00
Ed Seidl 8d6cd7627c Ensure page encoding statistics are written to Parquet file (#7643) 2025-06-11 21:18:34 -07:00
Sergei Grebnov e5ad232c90 Update FlightSQL GetDbSchemas and GetTables schemas to fully match the protocol (#7638)
# Which issue does this PR close?

PR updates FlightSQL `GetDbSchemas` and `GetTables` schemas to fully
match the FlightSQL protocol (fields nullability).

Fixes  
- https://github.com/apache/arrow-rs/issues/7637

# Are there any user-facing changes?

It could technically be considered a user-facing breaking change, as the
schema returned by the `CommandGetDbSchemas` and `CommandGetTables`
FlightSQL commands will change. However, since the change only affects
field nullability, there should be no practical impact, or it is very
unlikely.
2025-06-11 11:40:23 -04:00
Ed Seidl 2be261b78b Deprecate old Parquet page index parsing functions (#7640)
# Which issue does this PR close?

- Closes #6447.

# Rationale for this change

This deprecates the last of the old standalone Parquet metadata parsing
functions that have since been replaced by `ParquetMetaDataReader`.

# What changes are included in this PR?

# Are there any user-facing changes?

No, only adds deprecation warnings to public API
2025-06-11 00:24:19 -04:00
Ed Seidl 3fe458ef85 Minor: Add version to deprecation notice for ParquetMetaDataReader::decode_footer (#7639)
# Which issue does this PR close?

Found a `#[deprecated]` missing a `since` while preparing to remove
deprecated APIs.

# Rationale for this change

# What changes are included in this PR?

# Are there any user-facing changes?

No, just adds clarification
2025-06-11 00:23:02 -04:00
Ed Seidl 04300b4deb Minor: Remove outdated FIXME from ParquetMetaDataReader (#7635)
# Which issue does this PR close?
Related to #6447.

While reviewing other PRs I happened to notice an old FIXME I left
behind that should have been removed in #6639.

# Rationale for this change


# What changes are included in this PR?


# Are there any user-facing changes?
No, just removes a comment
2025-06-11 00:20:06 -04:00
Adam Reeve 857614c87e Fix reading encrypted Parquet pages when using the page index (#7633)
# Which issue does this PR close?

Closes #7629.

I also noticed that skipping pages in encrypted files was broken so have
fixed that too.

# What changes are included in this PR?

* Refactors `SerializedPageReader` to reduce the use of `#[cfg(...)]`
inline. To work with the borrow checker, I created a new
`SerializedPageReaderContext` type to hold the `CryptoContext`.
* Updates `SerializedPageReader::get_next_page` so that page headers and
page data are decrypted when page indexes are used.
* Updates `SerializedPageReader::skip_next_page` to update the page
index so that encryption AADs are calculated correctly.
* Adds new unit tests for reading with a page index and skipping pages
in encrypted files.

# Are there any user-facing changes?

Only bug fixes.

---------

Co-authored-by: Ed Seidl <etseidl@live.com>
2025-06-11 16:15:06 +12:00
albertlockett 721150286b feat: support append_nulls on additional builders (#7606)
# Which issue does this PR close?


- Closes https://github.com/apache/arrow-rs/issues/7605

# Rationale for this change

I thought it would be nice if `append_nulls` was supported for
additional types of array builders. Currently it is available on some
builder types, but not all.

# What changes are included in this PR?

Add an `append_nulls` method to:
- FixedSizeBinaryDictionaryBuilder
- FixedSizeBinaryBuilder
- GenericByteBuilder
- GenericListBuilder
- StructBuilder

# Are there any user-facing changes?

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-10 12:29:23 -04:00
xudong.w 13fc3c8ef1 Fix the error info of StructArray::try_new (#7634)
# Rationale for this change

There is no `StructArray::new_empty`; the function is
`StructArray::new_empty_fields`, so the error message was misleading.
2025-06-10 23:17:31 +08:00
Christian 9482f785fc [array] Remove unwrap checks from GenericByteArray::value_unchecked (#7573)
# Which issue does this PR close?

`GenericByteArray::value_unchecked` permits `unsafe` code, but still
introduces a check due to `unwrap` being called here:

```diff
        let b = std::slice::from_raw_parts(
-           self.value_data.as_ptr().offset(start.to_isize().unwrap()),
-           (end - start).to_usize().unwrap(),
+           self.value_data.as_ptr().offset(start.to_isize().unwrap_unchecked()),
+           (end - start).to_usize().unwrap_unchecked(),
        );
```

I believe it is sensible to use `unwrap_unchecked` here instead. While
the compiler may be able to prune the first unwrap as unreachable, I
believe it **cannot** prove at compile time that `end >= start` and
eliminate the second unwrap. This is an invariant of `GenericByteArray`.

# Are there any user-facing changes?

No.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-09 14:38:21 -04:00
Julian Popescu 05363f6ae2 feat: add AsyncArrowWriter::into_inner (#7604)
# Which issue does this PR close?

- Closes #7603.

# What changes are included in this PR?

I've added an `into_inner` function for the `AsyncArrowWriter`

# Are there any user-facing changes?

This is not a breaking change. I've added some documentation to describe
the methods and possible pitfalls

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-09 14:32:07 -04:00
Christian 8d4beaeb32 Optimize length calculation in row encoding for fixed-length columns (#7564)
# Rationale for this change
 
When converting data into row format, a significant portion of cycles is
spent determining the lengths of the rows to be created. For columns
with fixed-size elements (determined by datatype), this calculation can
be optimized by avoiding writes to an intermediate vector for length
tracking.

# What changes are included in this PR?

- Implements `LengthTracker` which only materializes lengths for
variable-size columns
- Updates length calculation in `row_lengths(..)` and offset computation
in `RowConverter::append` to use the `LengthTracker`
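
A simplified sketch of the idea: fixed-width columns only add to a single running total, and a per-row vector is allocated only once a variable-width column shows up. The struct below is a hypothetical illustration with made-up method names, not the `LengthTracker` added in this PR.

```rust
/// Tracks encoded row lengths without materializing a per-row vector
/// for fixed-width columns. (Hypothetical sketch.)
struct LengthTracker {
    num_rows: usize,
    /// Sum of the widths of all fixed-width columns, shared by every row.
    fixed_length: usize,
    /// Per-row lengths; stays empty until a variable-width column is seen.
    variable_lengths: Vec<usize>,
}

impl LengthTracker {
    fn new(num_rows: usize) -> Self {
        Self { num_rows, fixed_length: 0, variable_lengths: Vec::new() }
    }

    /// A fixed-width column adds the same amount to every row.
    fn push_fixed(&mut self, width: usize) {
        self.fixed_length += width;
    }

    /// A variable-width column contributes one length per row.
    fn push_variable(&mut self, lengths: impl ExactSizeIterator<Item = usize>) {
        assert_eq!(lengths.len(), self.num_rows);
        if self.variable_lengths.is_empty() {
            self.variable_lengths = lengths.collect();
        } else {
            for (total, len) in self.variable_lengths.iter_mut().zip(lengths) {
                *total += len;
            }
        }
    }

    /// Start offset of every row, plus the total length as the last entry.
    fn offsets(&self) -> Vec<usize> {
        let mut offsets = Vec::with_capacity(self.num_rows + 1);
        let mut next = 0;
        offsets.push(next);
        for row in 0..self.num_rows {
            let variable = self.variable_lengths.get(row).copied().unwrap_or(0);
            next += self.fixed_length + variable;
            offsets.push(next);
        }
        offsets
    }
}
```

When every column is fixed-width, no per-row vector is ever written, which is where the saving described above comes from.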

# Are there any user-facing changes?

No.
2025-06-09 18:13:51 +02:00
Andrew Lamb 375bee76b1 [Variant] Add commented out primitive test cases (#7631)
# Which issue does this PR close?

- Part of https://github.com/apache/arrow-rs/issues/7630

# Rationale for this change

Make it easy to add this feature by preparing the path with tests

# What changes are included in this PR?

Add tests (commented out) that should pass after
https://github.com/apache/arrow-rs/issues/7630 is done

# Are there any user-facing changes?

No
2025-06-09 09:40:53 -04:00
Andrew Lamb 312e2fd44a Move variant interop test to Rust integration test (#7602)
# Which issue does this PR close?
- Part of https://github.com/apache/arrow-rs/issues/6736

# Rationale for this change

Rust integration tests (in `parquet-variant/tests`) are compiled as an
external program would be compiled and thus can only use the exposed
API. This helps verify that the crate is usable.

# What changes are included in this PR?
1. Move the tests that read/write variant values into `variant_interop`
test (`cargo test --test variant_interop`)
2. Publicly expose `pub` structures


# Are there any user-facing changes?
There are now pub APIs in the parquet-variant crate
2025-06-09 09:07:12 -04:00
Andrew Lamb 23e18bceba Improve coalesce kernel tests (#7626)
# Which issue does this PR close?

- Follow on to https://github.com/apache/arrow-rs/pull/7625 from
@Dandandan

# Rationale for this change

I want to eventually remove `gc_string_view` but currently the unit
tests are in terms of that function

# What changes are included in this PR?

Rewrite tests to be in terms of `coalesce` instead

Also, 
1. Add additional coverage for the issue we saw in
https://github.com/apache/arrow-rs/pull/7623
2. Add coverage for the case where there are data buffers in the
view, but they are not referenced by any view
https://github.com/apache/arrow-rs/pull/7625#discussion_r2134634467

Codecov of this module is now 100%

# Are there any user-facing changes?

2025-06-08 20:02:22 -04:00
Jigao Luo 9d172a860a Adding Encoding argument in parquet-rewrite (#7576)
# Which issue does this PR close?

- Closes #7575.

# Rationale for this change
 
Need an option to set the encoding for all columns.

# What changes are included in this PR?

This PR:
- introduces an encoding parameter that is passed to `set_encoding`.
- groups the encoding-related code together in the file.

# Are there any user-facing changes?

No

---------

Signed-off-by: Jigao Luo <jigao.luo@outlook.com>
2025-06-08 09:52:13 -04:00
Daniël Heres 52d8d568f4 Revert "Revert "Improve coalesce and concat performance for views… (#7625)
… (#7614)" (#7623)"

This reverts commit da461c8754.

This adds a test and fix for the wrong index issue.
I also verified the change for DataFusion (and benchmarks show notable
improvements).

# Which issue does this PR close?



# Rationale for this change


# What changes are included in this PR?



# Are there any user-facing changes?
2025-06-08 09:40:22 -04:00
Daniël Heres da461c8754 Revert "Improve coalesce and concat performance for views (#7614)" (#7623)
This reverts commit 7739a83fe0.

# Which issue does this PR close?



# Rationale for this change
I found that this change causes errors in DataFusion (see
https://github.com/apache/datafusion/pull/16249#issuecomment-2952353060),
so let's revert it and track down the cause.


# What changes are included in this PR?


# Are there any user-facing changes?
2025-06-07 07:55:16 -04:00
Daniël Heres 7739a83fe0 Improve coalesce and concat performance for views (#7614)
# Which issue does this PR close?
- Closes #7615
- Follow on to https://github.com/apache/arrow-rs/pull/7597



# Rationale for this change

Improve performance of `gc_string_view_batch`

```
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.001       1.00     30.4±1.05ms        ? ?/sec    1.29     39.3±0.88ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.01        1.00      4.3±0.17ms        ? ?/sec    1.20      5.2±0.15ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.1         1.00  1805.1±25.77µs        ? ?/sec    1.32      2.4±0.20ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.8         1.00      2.6±0.12ms        ? ?/sec    1.48      3.8±0.11ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.001     1.00     42.5±0.48ms        ? ?/sec    1.23     52.2±1.33ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.01      1.00      5.8±0.12ms        ? ?/sec    1.28      7.4±0.20ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.1       1.00      2.2±0.02ms        ? ?/sec    1.37      3.1±0.18ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.8       1.00      3.6±0.15ms        ? ?/sec    1.43      5.1±0.12ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001      1.00     51.0±0.59ms        ? ?/sec    1.38     70.3±1.11ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01       1.00      6.7±0.03ms        ? ?/sec    1.32      8.8±0.16ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1        1.00      3.0±0.01ms        ? ?/sec    1.41      4.3±0.09ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8        1.00      4.5±0.34ms        ? ?/sec    1.71      7.7±0.28ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001    1.00     64.2±0.74ms        ? ?/sec    1.33     85.1±1.52ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01     1.00      9.4±0.09ms        ? ?/sec    1.35     12.6±0.26ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1      1.00      3.8±0.03ms        ? ?/sec    1.46      5.6±0.11ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8      1.00      5.7±0.28ms        ? ?/sec    1.73      9.9±0.27ms        ? ?/sec
```

# What changes are included in this PR?

* Avoid recreating the views from scratch.
* Specialize concat for view types
* Take an owned `RecordBatch` (effect on performance is small, might be
measurable with smaller batch size / more columns).
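
As an illustration only (this is not the code in this PR), one way a view-specialized concat can avoid rebuilding views is to append the input buffers and shift each view's buffer index by the number of buffers already collected. The `View`, `ViewArray`, `concat_views`, and `value` names below are hypothetical and model the idea without the arrow-rs types.

```rust
/// Hypothetical, simplified view: either the bytes themselves (short values)
/// or a reference into one of the array's data buffers.
#[derive(Debug, Clone, PartialEq)]
enum View {
    Inline(Vec<u8>),
    Ref { buffer_index: u32, offset: u32, len: u32 },
}

/// Hypothetical stand-in for a view array: views plus the buffers they point at.
struct ViewArray {
    views: Vec<View>,
    buffers: Vec<Vec<u8>>,
}

/// Concatenate by reusing the existing views. Only the buffer index of each
/// `Ref` view changes; no value bytes are inspected or re-encoded.
fn concat_views(arrays: &[ViewArray]) -> ViewArray {
    let mut views = Vec::new();
    let mut buffers = Vec::new();
    for array in arrays {
        // Buffers of this input land after all previously appended buffers.
        let shift = buffers.len() as u32;
        // This sketch copies the bytes; a real implementation could share
        // reference-counted buffers instead.
        buffers.extend(array.buffers.iter().cloned());
        views.extend(array.views.iter().map(|v| match v {
            View::Inline(bytes) => View::Inline(bytes.clone()),
            View::Ref { buffer_index, offset, len } => View::Ref {
                buffer_index: *buffer_index + shift,
                offset: *offset,
                len: *len,
            },
        }));
    }
    ViewArray { views, buffers }
}

/// Resolve the bytes of the `i`-th value.
fn value(array: &ViewArray, i: usize) -> &[u8] {
    match &array.views[i] {
        View::Inline(bytes) => bytes.as_slice(),
        View::Ref { buffer_index, offset, len } => {
            let start = *offset as usize;
            &array.buffers[*buffer_index as usize][start..start + *len as usize]
        }
    }
}
```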

# Are there any user-facing changes?

no

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2025-06-07 11:56:45 +02:00
Andrew Lamb 44d7194712 Improve coalesce_kernel benchmark to capture inline vs non inline views (#7619)
# Which issue does this PR close?

- Follow on to https://github.com/apache/arrow-rs/pull/7597

# Rationale for this change


While reviewing the code and the concat kernel for
- https://github.com/apache/arrow-rs/pull/7617

I realized there is a non-trivial difference between all inlined views,
some inlined views, and mostly large strings, so the benchmarks should
capture that


# What changes are included in this PR?

1. Add variations of the benchmark with different string sizes in
`StringViewArray`
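
For context, the Arrow view layout stores values of at most 12 bytes inside the 16-byte view itself, while longer values keep a 4-byte prefix plus a buffer index and offset. The sketch below models that split so the benchmark variations are easier to picture. `View`, `make_view`, and `count_inlined` are hypothetical names and are not part of the arrow-rs API.

```rust
/// Values of at most this many bytes are stored inside the 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical, simplified model of one 16-byte view.
#[derive(Debug, PartialEq)]
enum View {
    /// Short value: length plus the bytes themselves, zero padded.
    Inline { len: u32, data: [u8; 12] },
    /// Long value: length, first four bytes, and where the full value lives.
    Buffer { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Build the view for `value`. `buffer_index` and `offset` are only used
/// when the value is too long to inline.
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> View {
    let len = value.len() as u32;
    if value.len() <= MAX_INLINE_LEN {
        let mut data = [0u8; 12];
        data[..value.len()].copy_from_slice(value);
        View::Inline { len, data }
    } else {
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&value[..4]);
        View::Buffer { len, prefix, buffer_index, offset }
    }
}

/// Return `(inlined, buffer_backed)` counts for a set of strings, which is
/// the property the benchmark variations differ in.
fn count_inlined(values: &[&str]) -> (usize, usize) {
    let inlined = values.iter().filter(|v| v.len() <= MAX_INLINE_LEN).count();
    (inlined, values.len() - inlined)
}
```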

# Are there any user-facing changes?

2025-06-07 00:19:23 +02:00