# Which issue does this PR close?
- Closes #7712.
# Rationale for this change
Shouldn't panic, especially in a fallible function.
# What changes are included in this PR?
Validate that the high and low surrogates are in the expected range,
which guarantees that the subtractions won't overflow.
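As a rough illustration of the range check described above, here is a minimal sketch. It is not the code changed in this PR, and `combine_surrogates` is a hypothetical name chosen for the example.

```rust
/// Combines a UTF-16 surrogate pair into a `char`.
/// Hypothetical helper for illustration only; not the function from this PR.
fn combine_surrogates(high: u16, low: u16) -> Option<char> {
    // Reject anything outside the surrogate ranges up front, so the
    // subtractions below can never underflow.
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    let high_bits = u32::from(high - 0xD800) << 10;
    let low_bits = u32::from(low - 0xDC00);
    // Supplementary-plane code points start at U+10000.
    char::from_u32(0x10000 + high_bits + low_bits)
}
```

Returning `None` (or an error) for out-of-range input keeps the fallible function fallible rather than letting it panic.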
# Are there any user-facing changes?
No (well, things that used to panic now won't, but I don't think that
counts)
# Which issue does this PR close?
- Closes #7725.
# Rationale for this change
Do not populate null buffers when building `MutableArrayData` for
`NullArray`
# What changes are included in this PR?
# Are there any user-facing changes?
# Which issue does this PR close?
- Closes https://github.com/apache/arrow-rs/issues/7700
This commit introduces `ShortString`, a newtype that wraps around `&str`
that enforces a maximum length constraint. This also allows us to
perform validation once and removes a superfluous validation check in
`append_value`.
The now-superfluous validation check was needed since users could
construct `Variant::ShortString`s directly, without doing input
validation. This meant you could have a short string variant which
actually contained a string longer than 63 bytes.
But since we enforce this check upon construction, we can directly match
against `Variant::String` and `Variant::ShortString` arms with their
respective appending functions (`append_string` and
`append_short_string`).
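The validate-once idea can be sketched as follows. This is a simplified illustration, not the crate's actual definition: the `try_new` and `as_str` names, the `String` error type, and the constant name are assumptions made for the example.

```rust
/// Maximum length in bytes, taken from the 63-byte limit mentioned above.
const MAX_SHORT_STRING_BYTES: usize = 63;

/// Newtype around `&str` that can only be built through validation.
/// Illustrative sketch; the real type may differ in API and error type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ShortString<'a>(&'a str);

impl<'a> ShortString<'a> {
    /// Validates the length once, at construction time.
    pub fn try_new(value: &'a str) -> Result<Self, String> {
        if value.len() > MAX_SHORT_STRING_BYTES {
            return Err(format!(
                "short string is {} bytes, maximum is {}",
                value.len(),
                MAX_SHORT_STRING_BYTES
            ));
        }
        Ok(Self(value))
    }

    /// Returns the wrapped string; callers can rely on the length invariant.
    pub fn as_str(&self) -> &'a str {
        self.0
    }
}
```

Because the inner field is private, code that receives a `ShortString` does not need to re-check the length.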
# Which issue does this PR close?
Closes #7691
# Are there any user-facing changes?
No, just implementing for ease of use and consistency in tests and
elsewhere across the wider repository.
@alamb
This follows the pattern of other parts of the arrow-rs codebase:
arrow-array, arrow-schema, etc.
With this change, polyglot codebases can use pyarrow without making all
their crates that use arrow pull in pyarrow (& pyo3).
It also allows interfacing with PyArrow without pulling in Arrow.
# Which issue does this PR close?
Closes https://github.com/apache/arrow-rs/issues/7668.
# Rationale for this change
Part of a codebase can use pyarrow without arrow pulling in pyo3 across
the codebase.
# Are there any user-facing changes?
Nope.
# Which issue does this PR close?
Currently there is no public API for writing through the internal buffer of
`SerializedFileWriter`. Such an API would be very helpful when adding low-level
APIs, for example:
- https://github.com/apache/datafusion/issues/16374
- https://github.com/apache/datafusion/pull/16395
We need the buffer's bytes-written count to be updated. If we write
directly to the file inside the buffer, that count is not updated.
Keeping the bytes-written metric consistent is key for our
custom index write.
# Rationale for this change
Add an API that writes through the internal buffer and keeps its bytes-written count up to date.
# What changes are included in this PR?
Add an API that writes through the internal buffer and keeps its bytes-written count up to date.
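To show why writes need to go through the counting layer, here is a minimal sketch of a writer that tracks bytes written. `CountingWriter` is a hypothetical type for illustration; it is not the parquet crate's internal buffer or the API added in this PR.

```rust
use std::io::{self, Write};

/// Hypothetical wrapper that counts every byte passed to the inner writer.
struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: usize,
}

impl<W: Write> CountingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    /// Total bytes written through this wrapper so far.
    fn bytes_written(&self) -> usize {
        self.bytes_written
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Only count the bytes the inner writer actually accepted.
        let n = self.inner.write(buf)?;
        self.bytes_written += n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

Any bytes written directly to `inner` would bypass the counter, which is the inconsistency described above.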
# Are there any user-facing changes?
No
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
- Part of https://github.com/apache/arrow-rs/issues/7456
# Rationale for this change
Currently the `coalesce` kernel buffers views / data until there are
enough rows and then concatenates the results together. `StringViewArray`s can
be even worse, as there is a second copy in `gc_string_view_batch`.
This is wasteful because it
1. Buffers memory (has 2x the peak usage)
2. Copies the data twice
We can make it faster and more memory efficient by directly creating the
output array.
# What changes are included in this PR?
1. Add a specialization for incrementally building `StringViewArray`
without buffering
Note this PR does NOT (yet) add specialized filtering -- instead it
focuses on reducing the overhead of appending views by not copying them
(again!) with `gc_string_view_batch`.
# Open questions:
1. There is substantial overlap / duplication with `StringViewBuilder` --
I wonder if we can / should consolidate them somehow.
The differences are:
1. Block size calculation management (i.e. looking at the buffer sizes of
the incoming buffers)
2. Finishing the array allocates sufficient space for views
# Are there any user-facing changes?
The kernel is faster, no API changes
# Which issue does this PR close?
Housekeeping, part of
* https://github.com/apache/arrow-rs/issues/6736
# Rationale for this change
The variant module was starting to become unwieldy.
# What changes are included in this PR?
Split out metadata, object, and list to sub-modules; move `OffsetSize`
to the decoder module where it arguably belongs.
Result: `variant.rs` is "only" ~900 LoC instead of ~2k LoC.
# Are there any user-facing changes?
No. Public re-exports should hide the change from users.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
Part 3 of https://github.com/apache/datafusion/issues/16011
# What changes are included in this PR?
This is a piece of the puzzle to support aggregating on REE arrays.
# Are there any user-facing changes?
No user facing changes, just extending functionality of existing APIs to
support extracting rows from REE arrays.
@alamb
# Which issue does this PR close?
- Follow on to https://github.com/apache/arrow-rs/pull/7644
- Part of https://github.com/apache/arrow-rs/issues/6736
# Rationale for this change
Using the parquet APIs came up in
https://github.com/apache/arrow-rs/pull/7644#discussion_r2145228349 so I
wanted to help contribute some additional documentation / tests
# What changes are included in this PR?
Add documentation and tests about `Variant`, specifically some examples
of how to create `Variant` values
# Are there any user-facing changes?
More docs
---------
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
# Which issue does this PR close?
* Closes https://github.com/apache/arrow-rs/issues/7684
* Closes https://github.com/apache/arrow-rs/issues/7685
* Part of https://github.com/apache/arrow-rs/issues/6736
# Rationale for this change
Infallible iteration is _much_ easier to work with, vs. Iterator of
Result or Result of Iterator. Iteration and validation are strongly
correlated, because the iterator can only be infallible if the
constructor previously validated everything the iterator depends on.
# What changes are included in this PR?
In all three of `VariantMetadata`, `VariantList`, and `VariantObject`:
* The header object is cleaned up to _only_ consider actual header
state. Other state is moved to the object itself.
* Constructors fully validate the object by consuming a fallible
iterator
* The externally visible iterator does a `map(Result::unwrap)` on the
same fallible iterator, relying on the constructor to prove the unwrap
is safe.
* The externally visible iterator is obtained by calling the `iter()`
method.
In addition:
* `VariantObject` methods no longer materialize the whole offset+field
array
* Removed validation that is covered by the new iterator testing
* A bunch of dead code removed, several methods renamed for clarity
* `first_byte_from_slice` now returns `u8` instead of `&u8`
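The validate-in-constructor pattern described above can be sketched like this. `Offsets` and `fallible_offsets` are hypothetical stand-ins, not the actual `VariantMetadata`, `VariantList`, or `VariantObject` code; the example assumes a toy format of little-endian `u32` offsets that must be non-decreasing.

```rust
/// Fallible iterator over little-endian `u32` offsets (toy format).
fn fallible_offsets(bytes: &[u8]) -> impl Iterator<Item = Result<u32, String>> + '_ {
    let mut prev = 0u32;
    bytes.chunks(4).map(move |chunk| {
        let raw: [u8; 4] = chunk
            .try_into()
            .map_err(|_| "truncated offset".to_string())?;
        let value = u32::from_le_bytes(raw);
        if value < prev {
            return Err(format!("offset {value} is smaller than previous {prev}"));
        }
        prev = value;
        Ok(value)
    })
}

/// Hypothetical validated container.
struct Offsets {
    bytes: Vec<u8>,
}

impl Offsets {
    /// Full validation: drive the fallible iterator to completion once.
    fn try_new(bytes: Vec<u8>) -> Result<Self, String> {
        for item in fallible_offsets(&bytes) {
            item?;
        }
        Ok(Self { bytes })
    }

    /// Infallible iteration: the constructor already proved every item is `Ok`,
    /// so the unwrap cannot panic.
    fn iter(&self) -> impl Iterator<Item = u32> + '_ {
        fallible_offsets(&self.bytes).map(Result::unwrap)
    }
}
```

Both paths share one iterator, so validation and iteration cannot drift apart.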
# Are there any user-facing changes?
Visibility and signatures of some methods changed.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
Closes https://github.com/apache/arrow-rs/issues/7688
# Rationale for this change
interleave_views is *really* slow - taking up ~25% of the samples in
`SortPreservingMergeExec`
We can make it faster.
<details>
```
interleave str_view(0.0) 100 [0..100, 100..230, 450..1000]
time: [369.33 ns 371.42 ns 374.48 ns]
change: [−77.355% −77.199% −77.051%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
3 (3.00%) high severe
interleave str_view(0.0) 400 [0..100, 100..230, 450..1000]
time: [932.11 ns 937.68 ns 945.43 ns]
change: [−84.672% −84.528% −84.382%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000]
time: [2.0938 µs 2.1058 µs 2.1235 µs]
change: [−86.449% −86.310% −86.167%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]
time: [2.2045 µs 2.2098 µs 2.2170 µs]
change: [−84.595% −84.493% −84.401%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
```
</details>
# What changes are included in this PR?
# Are there any user-facing changes?
# Which issue does this PR close?
- Closes #7424
- Part of https://github.com/apache/arrow-rs/issues/6736
# Rationale for this change
This PR introduces a basic builder API for creating Variant values,
building on the foundation laid by @mkarbo. The builder provides a
user-friendly nested API while maintaining performance through a
single-buffer design. The design was shaped with huge help from @alamb,
@scovich and @Weijun-H's feedback, and draws much inspiration from the
excellent work by @zeroshade.
This is an initial version and does not yet support nested values,
metadata key sorting, and so on.
# What changes are included in this PR?
- Adds `VariantBuilder`, `ObjectBuilder`, and `ArrayBuilder`
# Are there any user-facing changes?
The new APIs added in parquet-variant will be user-facing.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
Part of https://github.com/apache/arrow-rs/issues/4886
Related to https://github.com/apache/arrow-rs/pull/6965
# Rationale for this change
Avro supports arrays as a core data type, but previously arrow-avro had
incomplete decoding logic to handle them. As a result, any Avro file
containing array fields would fail to parse correctly within the Arrow
ecosystem. This PR addresses this gap by:
1. Completing the implementation of explicit `Array` -> `List` decoding:
It completes the `Decoder::Array` logic that reads array blocks in Avro
format and constructs an Arrow `ListArray`.
Overall, these changes expand Arrow’s Avro reader capabilities, allowing
users to work with array-encoded data in a standardized Arrow format.
# What changes are included in this PR?
**1. arrow-avro/src/reader/record.rs:**
* Completed the Array decoding path which leverages blockwise reads of
Avro array data.
* Implemented decoder unit tests for Array types.
# Are there any user-facing changes?
N/A
~Draft until https://github.com/apache/arrow-rs/pull/7649 is merged~
# Which issue does this PR close?
- Follow on to https://github.com/apache/arrow-rs/pull/7649 from @brancz
# Rationale for this change
I noticed some extra testing and docs I would like to see so I made a PR
to add them
# What changes are included in this PR?
1. Add docs + additional tests
# Are there any user-facing changes?
No code changes, only some docs (and more tests)
# Which issue does this PR close?
Part 2 of https://github.com/apache/datafusion/issues/16011
# Are there any user-facing changes?
No user facing changes, just extending functionality of existing APIs to
support extracting rows from REE arrays.
@alamb
# Which issue does this PR close?
- Part of https://github.com/apache/arrow-rs/pull/7650
# Rationale for this change
Coalescing the result of applying a filter currently requires
first copying the results into an intermediate array (calling `filter`).
My plan is to remove this extra copy by incrementally building the final
array directly.
To do so, there needs to be an API that can take the original data
and the filter.
# What changes are included in this PR?
1. Add `BatchCoalescer::push_filtered_batch` and docs
2. Update benchmarks to use it
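The idea of pushing data together with its filter can be sketched with a toy coalescer over plain `i32` values. `ToyCoalescer` and its methods are hypothetical; the sketch does not reproduce the signature or behavior of `BatchCoalescer::push_filtered_batch`.

```rust
/// Toy stand-in for a batch coalescer; works on `i32` slices, not Arrow arrays.
struct ToyCoalescer {
    target_len: usize,
    in_progress: Vec<i32>,
    completed: Vec<Vec<i32>>,
}

impl ToyCoalescer {
    fn new(target_len: usize) -> Self {
        assert!(target_len > 0, "target_len must be non-zero");
        Self {
            target_len,
            in_progress: Vec::with_capacity(target_len),
            completed: Vec::new(),
        }
    }

    /// Appends only the values whose filter entry is `true`, writing them
    /// straight into the in-progress output (no intermediate filtered copy).
    fn push_filtered(&mut self, values: &[i32], filter: &[bool]) {
        assert_eq!(values.len(), filter.len(), "filter length must match");
        for (&value, &keep) in values.iter().zip(filter) {
            if keep {
                self.in_progress.push(value);
                if self.in_progress.len() == self.target_len {
                    let fresh = Vec::with_capacity(self.target_len);
                    let full = std::mem::replace(&mut self.in_progress, fresh);
                    self.completed.push(full);
                }
            }
        }
    }

    /// Returns the oldest completed output of exactly `target_len` values, if any.
    fn next_completed(&mut self) -> Option<Vec<i32>> {
        if self.completed.is_empty() {
            None
        } else {
            Some(self.completed.remove(0))
        }
    }
}
```

Taking the data and the filter in one call is what lets the coalescer skip the intermediate array.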
# Are there any user-facing changes?
New API
# Which issue does this PR close?
None
# Rationale for this change
Noticed a typo.
# What changes are included in this PR?
Fixes the typo.
# Are there any user-facing changes?
Updated docs.
# Which issue does this PR close?
- Closes https://github.com/apache/arrow-rs/issues/7665
# Rationale for this change
Continuing the ongoing variant implementation effort.
# What changes are included in this PR?
As per title -- implement fairly complete support for variant objects
and arrays. Also add some unit tests.
Note: This PR renames `VariantArray` as `VariantList` to align with
parquet and arrow terminology, and to not conflict with the
`VariantArray` we will eventually need to define for holding an arrow
array of variant-typed data.
# Are there any user-facing changes?
Those variant subtypes should now be usable.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
The attempt in #7672 to include `arrow-pyarrow-integration-testing`
using a wildcard did not work:
https://github.com/apache/arrow-rs/actions/runs/15678211946/job/44163362777#step:3:6376
# Rationale for this change
Using a wildcard does not work.
# What changes are included in this PR?
Explicitly include root workspace and
`arrow-pyarrow-integration-testing` in dependabot config for cargo.
# Are there any user-facing changes?
No.
# Which issue does this PR close?
- Part of https://github.com/apache/arrow-rs/issues/6736
# Rationale for this change
While making documentation / examples for working with `Variant` in
https://github.com/apache/arrow-rs/pull/7661, I found it was somewhat
awkward to make `Variant` values directly from the metadata and value.
Specifically you have to
```rust
let metadata = [0x01, 0x00, 0x00];
let value = [0x09, 0x48, 0x49];
// parse the header metadata
let metadata = VariantMetadata::try_new(&metadata).unwrap();
// and only then can you make the Variant
Variant::try_new(&metadata, &value).unwrap()
```
I would really like to be able to create a `Variant` directly from
`metadata` and `value` without having to make a `VariantMetadata`
structure.
# What changes are included in this PR?
This PR proposes a small change to the API so creating a Variant now
looks like:
```rust
let metadata = [0x01, 0x00, 0x00];
let value = [0x09, 0x48, 0x49];
// You can now make the Variant directly from the metadata and value
Variant::try_new(&metadata, &value).unwrap()
```
# Are there any user-facing changes?
Yes, the API for creating `Variant`s is slightly different (and I think
better)
# Which issue does this PR close?
None
# Rationale for this change
There are more `Cargo.toml` files that should be considered by
Dependabot.
# What changes are included in this PR?
Search recursively for manifests instead of just the one in the root.
# Are there any user-facing changes?
No
# Which issue does this PR close?
None
# Rationale for this change
This removes a dependency by using the new LazyLock feature available
since Rust 1.80.
This crate already has an MSRV of 1.81, so this is not a breaking
change.
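For reference, a minimal sketch of lazy static initialization with the standard library's `std::sync::LazyLock`. The `UNIT_SECONDS` table and `seconds_per` function are made up for illustration and are unrelated to the code touched in this PR.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

/// Hypothetical lookup table, built once on first access.
static UNIT_SECONDS: LazyLock<HashMap<&'static str, u32>> = LazyLock::new(|| {
    HashMap::from([("second", 1), ("minute", 60), ("hour", 3600)])
});

/// Looks up the number of seconds in a named unit.
fn seconds_per(unit: &str) -> Option<u32> {
    UNIT_SECONDS.get(unit).copied()
}
```

The initializer closure runs at most once, even when several threads access the static concurrently.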
# Are there any user-facing changes?
No user-facing changes.
# Which issue does this PR close?
None.
# Rationale for this change
Group PRs like
- #7656
- #7657
- #7658
# What changes are included in this PR?
Group for prost updates in dependabot config.
# Are there any user-facing changes?
No.
# Which issue does this PR close?
Closes #NNN.
# Rationale for this change
Improve empty batches format.
# What changes are included in this PR?
Add new `pretty_format_batches_with_schema` function.
# Are there any user-facing changes?
Yes.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
- Closes https://github.com/apache/arrow-rs/issues/7610
# Rationale for this change
I'm adding this because I would like to have a more efficient method for
upgrading the key type of a dictionary builder in the case where my
dictionary keys have overflowed.
# What changes are included in this PR?
This adds a method called `try_new_from_builder` to
`GenericByteDictionaryBuilder` that can be used to construct a new
builder from the passed argument with the same values and internal
state, but a keys array builder of a different type (the motivation
being that the new key type could hold more values).
# Are there any user-facing changes?
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
Closes #7630.
# What changes are included in this PR?
This PR implements support for the following primitive variant types:
- Binary
- Date
- TimestampMicros
- TimestampNtzMicros
- Int16
- Int32
- Int64
- Decimal4
- Decimal8
- Decimal16
- Float
- Double
The following types are not yet implemented (see
[here](https://github.com/apache/parquet-testing/blob/b68bea40fed8d1a780a9e09dd2262017e04b19ad/variant/regen.py#L78-L83)
for details):
- TimeNTZ
- TimestampNanos
- TimestampNtzNanos
- UUID
# Are there any user-facing changes?
Users who opt-in to the Variant feature can use these primitives.
# Which issue does this PR close?
PR updates FlightSQL `GetDbSchemas` and `GetTables` schemas to fully
match the FlightSQL protocol (fields nullability).
Fixes
- https://github.com/apache/arrow-rs/issues/7637
# Are there any user-facing changes?
It could technically be considered a user-facing breaking change, as the
schema returned by the `CommandGetDbSchemas` and `CommandGetTables`
FlightSQL commands will change. However, since the change only affects
field nullability, there should be no practical impact, or it is very
unlikely.
# Which issue does this PR close?
- Closes #6447.
# Rationale for this change
This deprecates the last of the old standalone Parquet metadata parsing
functions that have since been replaced by `ParquetMetaDataReader`.
# What changes are included in this PR?
# Are there any user-facing changes?
No, only adds deprecation warnings to public API
# Which issue does this PR close?
Found a `#[deprecated]` missing a `since` while preparing to remove
deprecated APIs.
# Rationale for this change
# What changes are included in this PR?
# Are there any user-facing changes?
No, just adds clarification
# Which issue does this PR close?
Related to #6447.
While reviewing other PRs I happened to notice an old FIXME I left
behind that should have been removed in #6639.
# Rationale for this change
# What changes are included in this PR?
# Are there any user-facing changes?
No, just removes a comment
# Which issue does this PR close?
Closes #7629.
I also noticed that skipping pages in encrypted files was broken, so I have
fixed that too.
# What changes are included in this PR?
* Refactors `SerializedPageReader` to reduce the use of `#[cfg(...)]`
inline. To work with the borrow checker, I created a new
`SerializedPageReaderContext` type to hold the `CryptoContext`.
* Updates `SerializedPageReader::get_next_page` so that page headers and
page data are decrypted when page indexes are used.
* Updates `SerializedPageReader::skip_next_page` to update the page
index so that encryption AADs are calculated correctly.
* Adds new unit tests for reading with a page index and skipping pages
in encrypted files.
# Are there any user-facing changes?
Only bug fixes.
---------
Co-authored-by: Ed Seidl <etseidl@live.com>
# Which issue does this PR close?
- Closes https://github.com/apache/arrow-rs/issues/7605
# Rationale for this change
I thought it would be nice if `append_nulls` was supported for
additional types of array builders. Currently it is available on some
builder types, but not all.
# What changes are included in this PR?
Add an `append_nulls` method to:
- FixedSizeBinaryDictionaryBuilder
- FixedSizeBinaryBuilder
- GenericBytesBuilder
- GenericListBuilder
- StructBuilder
# Are there any user-facing changes?
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
`GenericByteArray::value_unchecked` permits `unsafe` code, but still
introduces a check due to `unwrap` being called here:
```diff
let b = std::slice::from_raw_parts(
- self.value_data.as_ptr().offset(start.to_isize().unwrap()),
- (end - start).to_usize().unwrap(),
+ self.value_data.as_ptr().offset(start.to_isize().unwrap_unchecked()),
+ (end - start).to_usize().unwrap_unchecked(),
);
```
I believe it is sensible to use `unwrap_unchecked` here instead. While the
compiler may be able to prune the first unwrap as unreachable, I believe
it can **not** prove at compile time that `end >= start` and eliminate
the second unwrap. This is an invariant of `GenericByteArray`.
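A standalone sketch of the same idea using only the standard library follows. `slice_unchecked` is a hypothetical function, and it uses `usize::try_from` where the real code uses `to_isize` / `to_usize`.

```rust
/// Returns `data[start..end]` without bounds or conversion checks.
///
/// # Safety
/// The caller must guarantee `0 <= start <= end <= data.len()`.
unsafe fn slice_unchecked(data: &[u8], start: i64, end: i64) -> &[u8] {
    unsafe {
        // `unwrap_unchecked` tells the compiler the `Err` case cannot happen,
        // so no panic branch is emitted. That is only sound because the
        // caller upholds the invariant in the safety contract above.
        let offset = usize::try_from(start).unwrap_unchecked();
        let len = usize::try_from(end - start).unwrap_unchecked();
        std::slice::from_raw_parts(data.as_ptr().add(offset), len)
    }
}
```

If the invariant is ever violated, the result is undefined behavior rather than a panic, which is why the function stays `unsafe`.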
# Are there any user-facing changes?
No.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
- Closes #7603.
# What changes are included in this PR?
I've added an `into_inner` function for the `AsyncArrowWriter`
# Are there any user-facing changes?
This is not a breaking change. I've added some documentation to describe
the methods and possible pitfalls
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Rationale for this change
When converting data into row format, a significant portion of cycles is
spent determining the lengths of the rows to be created. For columns
with fixed-size elements (determined by datatype), this calculation can
be optimized by avoiding writes to an intermediate vector for length
tracking.
# What changes are included in this PR?
- Implements `LengthTracker` which only materializes lengths for
variable-size columns
- Updates length calculation in `row_lengths(..)` and offset computation
in `RowConverter::append` to use the `LengthTracker`
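The optimization can be sketched as follows. `ToyLengthTracker` is a simplified, hypothetical stand-in and does not mirror the actual `LengthTracker` in the row format code.

```rust
/// Tracks row lengths: fixed-size columns add to one shared scalar, and
/// per-row lengths are only materialized once a variable-size column appears.
struct ToyLengthTracker {
    num_rows: usize,
    fixed_length: usize,
    /// Empty until the first variable-size column is pushed.
    variable_lengths: Vec<usize>,
}

impl ToyLengthTracker {
    fn new(num_rows: usize) -> Self {
        Self { num_rows, fixed_length: 0, variable_lengths: Vec::new() }
    }

    /// A fixed-size column contributes the same width to every row.
    fn push_fixed(&mut self, width: usize) {
        self.fixed_length += width;
    }

    /// A variable-size column contributes one length per row.
    fn push_variable(&mut self, lengths: &[usize]) {
        assert_eq!(lengths.len(), self.num_rows, "one length per row expected");
        if self.variable_lengths.is_empty() {
            self.variable_lengths = lengths.to_vec();
        } else {
            for (total, len) in self.variable_lengths.iter_mut().zip(lengths) {
                *total += len;
            }
        }
    }

    /// Total encoded length of one row.
    fn row_length(&self, row: usize) -> usize {
        assert!(row < self.num_rows, "row index out of range");
        self.fixed_length + self.variable_lengths.get(row).copied().unwrap_or(0)
    }
}
```

When every column is fixed-size, no per-row vector is ever allocated or written.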
# Are there any user-facing changes?
No.
# Which issue does this PR close?
- Part of https://github.com/apache/arrow-rs/issues/7630
# Rationale for this change
Make it easy to add this feature by preparing the path with tests
# What changes are included in this PR?
Add tests (commented out) that should pass after
https://github.com/apache/arrow-rs/issues/7630 is done
# Are there any user-facing changes?
No
# Which issue does this PR close?
- Part of https://github.com/apache/arrow-rs/issues/6736
# Rationale for this change
Rust integration tests (in `parquet-variant/tests`) are compiled as an
external program would be compiled and thus can only use the exposed
API. This helps verify that the crate is usable.
# What changes are included in this PR?
1. Move the tests that read/write variant values into `variant_interop`
test (`cargo test --test variant_interop`)
2. Publicly expose `pub` structures
# Are there any user-facing changes?
There are now pub APIs in the parquet-variant crate
# Which issue does this PR close?
- Follow on to https://github.com/apache/arrow-rs/pull/7625 from
@Dandandan
# Rationale for this change
I want to eventually remove `gc_string_view` but currently the unit
tests are in terms of that function
# What changes are included in this PR?
Rewrite tests to be in terms of `coalesce` instead
Also,
1. Add additional coverage for the issue we saw in
https://github.com/apache/arrow-rs/pull/7623
2. Add coverage for the case where there are data buffers in the
view, but they are not referenced by any view
https://github.com/apache/arrow-rs/pull/7625#discussion_r2134634467
Codecov of this module is now 100%
# Are there any user-facing changes?
# Which issue does this PR close?
- Closes #7575.
# Rationale for this change
We need an option to set the encoding for all columns.
# What changes are included in this PR?
This PR:
- introduces an encoding parameter for `set_encoding`.
- groups the encoding-related code together in the file.
# Are there any user-facing changes?
No
---------
Signed-off-by: Jigao Luo <jigao.luo@outlook.com>
… (#7614)" (#7623)"
This reverts commit da461c8754.
This adds a test and fix for the wrong index issue.
I also verified the change for DataFusion (and benchmarks show notable
improvements).
# Which issue does this PR close?
Closes #NNN.
# Rationale for this change
# What changes are included in this PR?
# Are there any user-facing changes?
# Which issue does this PR close?
- Follow on to https://github.com/apache/arrow-rs/pull/7597
# Rationale for this change
While reviewing the code and the concat kernel for
- https://github.com/apache/arrow-rs/pull/7617
I realized there is a non-trivial difference between all inlined
views, some inlined views, and mostly large strings, so the benchmarks
should capture that.
# What changes are included in this PR?
1. Add variations of benchmark with different size strings in
StringViewArray
# Are there any user-facing changes?