12768 Commits

Author SHA1 Message Date
Andrew Lamb 8df75c3f04 Document guidance on how to evaluate breaking API changes (#20584)
## Which issue does this PR close?


## Rationale for this change

DataFusion does make API changes from time to time, and that is a normal
part of software development. However, it is important to evaluate the
impact of those API changes on downstream users and to ensure that the
benefits of the change are clear to those users.

I found a few times where API changes were made with the justification
that "some APIs in DataFusion are cleaner" or "this is more consistent
with other APIs". While those may be valid justifications, it is painful
for downstream users who have change their code to accommodate the API
change when they get nothing in return

This most recently happened in this PR
-
https://github.com/apache/datafusion/pull/19790#pullrequestreview-3863480182

thus I think the contributor guide should include some guidance on how
to evaluate breaking API changes and to ensure that the benefits of the
change are clear to downstream users.


## What changes are included in this PR?

Polish up the API guidance section

## Are these changes tested?

By CI

## Are there any user-facing changes?

Better / clearer docs
2026-02-28 14:16:28 +00:00
Neil Conway 3a23bb2531 perf: Optimize array_agg() using GroupsAccumulator (#20504)
## Which issue does this PR close?

- Closes #20465.
- Closes #17446.

## Rationale for this change

This PR optimizes the performance of `array_agg()` by adding support for
the `GroupsAccumulator` API.

The design tries to minimize the amount of per-batch work done in
`update_batch()`: we store a reference to the batch, and a `(group_idx,
row_idx)` pair for each row. In `evaluate()`, we assemble all the
requested output with a single `interleave` call.

This turns out to be significantly faster, because we copy much less
data and assembling the results can be vectorized more effectively. For
example, on a benchmark with 5000 groups and 5000 int64 values per
group, this approach is roughly 190x faster than the previous approach.

Releasing memory after a partial emit is a little more involved than the
previous approach, but with some determination it is still possible.

## What changes are included in this PR?

* Implement the `GroupsAccumulator` API for `array_agg()`
* Add benchmark for `array_agg` of a named struct over a dict, following
the workload in #17446
* Add unit tests
* Improve SLT test coverage
* Remove a redundant SLT test

## Are these changes tested?

Yes, and benchmarked.

## Are there any user-facing changes?

No.

## AI usage

Iterated with the help of multiple AI tools; I've reviewed and
understand the resulting code.
2026-02-28 13:21:11 +00:00
Andrew Lamb 73fbd48070 Upgrade DataFusion to arrow-rs/parquet 58.0.0 / object_store 0.13.0 (#19728)
## Which issue does this PR close?

- Follow on to https://github.com/apache/datafusion/pull/19355
- related to https://github.com/apache/arrow-rs/issues/8466
- Closes https://github.com/apache/datafusion/issues/17455

## Rationale for this change

Keep datafusion up to date (and test Arrow using DataFusion tests)

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Daniël Heres <danielheres@gmail.com>
2026-02-28 10:43:51 +00:00
Zhang Xiaofeng acec058cb5 perf: Use Arrow vectorized eq kernel for IN list with column references (#20528)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Relates to #20427 .

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

When the IN list contains column references (e.g. `SELECT * FROM t WHERE
a IN (b, c, d, e)`), DataFusion falls back to a row-by-row
`make_comparator` path which is significantly slower than it needs to
be. Arrow provides SIMD-optimized `eq` kernels that can compare entire
arrays in one call.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- Use Arrow's vectorized `eq` kernel instead of row-by-row
`make_comparator` for non-nested types (primitive, string, binary) in
the column-reference IN list evaluation path
- For nested types (Struct, List, etc.), fall back to `make_comparator`
since Arrow's `eq` kernel does not support them
- Add 6 unit tests covering the column-reference evaluation path (Int32,
Utf8, NOT IN, NULL handling, NaN semantics)

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

  Yes. 6 new unit tests added:
  - `test_in_list_with_columns_int32_scalars`
  - `test_in_list_with_columns_int32_column_refs`
  - `test_in_list_with_columns_utf8_column_refs`
  - `test_in_list_with_columns_negated`
  - `test_in_list_with_columns_null_in_list`
  - `test_in_list_with_columns_float_nan`


## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

No API changes. Queries with column-reference IN lists will run faster.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
2026-02-28 04:40:16 +00:00
Eren Avsarogullari 5d8249ff16 fix: Fix and Refactor Spark shuffle function (#20484)
## Which issue does this PR close?
- Closes #20483.

## Rationale for this change
Currently, Spark `shuffle` function returns following error message when
`seed` is `null`. This needs to be fixed by exposing `NULL` instead of
`'Int64'`.

**Current:**
```
query error
SELECT shuffle([2, 1], NULL);
----
DataFusion error: Execution error: shuffle seed must be Int64 type, got 'Int64'
```

**New:**
```
query error DataFusion error: Execution error: shuffle seed must be Int64 type but got 'NULL'
SELECT shuffle([1, 2, 3], NULL);
```

In addition to this fix, this PR also introduces following refactoring
to `shuffle` function:
- Combining args validation checks with `single` error message,
- Extending current error message with expected data types:
```
Current:
shuffle does not support type '{array_type}'.

New:
shuffle does not support type '{array_type}'; expected types: List, LargeList, FixedSizeList or Null." 
```
- Adding new UT coverages for both `shuffle.rs` and `shuffle.slt`.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?
Yes, being added new UT cases.

## Are there any user-facing changes?
Yes, updating Spark `shuffle` functions error messages.
2026-02-28 00:05:26 +00:00
Brent Gardner e567cb91f4 Fix serde of window lead/lag defaults (#20608)
## Which issue does this PR close?

- Closes #20607.

## Rationale for this change

Don't lose values in serde

## What changes are included in this PR?

Preservation of window function arguments, particularly default value

## Are these changes tested?

A RTT is included

## Are there any user-facing changes?

Users with distributed query engines such as Ballista will have more
queries work than before

**note**: AI was used to create this PR
2026-02-27 21:58:29 +00:00
Neil Conway 451c79fc00 fix: Fix array_to_string with columnar third arg (#20536)
## Which issue does this PR close?

- Closes #20535 

## Rationale for this change

The previous coding used the `null_string` value for the first row as
the value for the remainder of the rows. This is wrong if `null_string`
is columnar.

## What changes are included in this PR?

## Are these changes tested?

Yes; added new SLT test.

## Are there any user-facing changes?

No, other than fixing the behavior of `array_to_string` in this
scenario.
2026-02-27 16:03:32 +00:00
Neil Conway 7ef62b988d chore: Enable workspace lint for all workspace members (#20577)
## Which issue does this PR close?

N/A

## Rationale for this change

Several workspace members did not opt into workspace-level lints. This
was almost certainly an oversight.

## What changes are included in this PR?

* Enable workspace-level lints for all workspace members
* Fix up code to fix various linter errors

This PR pulls out some of the changes made in #20250, although I didn't
actually use that PR.

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
2026-02-27 16:00:37 +00:00
Kumar Ujjawal c63ca330d7 fix: increase ROUND decimal precision to prevent overflow truncation (#19926)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #19921.

## Rationale for this change

`SELECT ROUND('999.9'::DECIMAL(4,1))` incorrectly returned `100.0`
instead of `1000`.

When rounding a decimal value causes a carry-over that increases the
number of digits (e.g., 999.9 → 1000.0), the result overflows the
original precision constraint. The overflow was silently truncated
during display, producing incorrect results.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Fixes ROUND on DECIMAL so carry-over doesn’t silently truncate/produce
wrong results.
- When decimal_places is a constant (including omitted/NULL), ROUND now
reduces the output scale to min(input_scale, max(decimal_places, 0))
(Spark/DuckDB-style), reclaiming
    precision for the integer part.
- When decimal_places is not a constant (e.g. a column/array), ROUND
keeps the original scale and may increase precision by 1 (capped); if
precision is already max and the rounded
value can’t fit, it returns an overflow error instead of a wrong value.
  - Adds/updates sqllogictests for these behaviors and edge cases.

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

- Added new sqllogictest case for the specific bug scenario
- Updated 2 existing tests with new expected precision values
- All existing tests pass

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

| Aspect | Before | After |
  | --- | --- | --- |
  | ROUND('999.9'::DECIMAL(4,1)) | 100.0 (wrong) | 1000 (correct) |
| Return type (default/literal dp) | Decimal128(4, 1) | Decimal128(4, 0)
|

- Return type for DECIMAL inputs can change: with literal dp it
generally reduces scale; with non-literal dp it keeps scale and may
increase precision by 1.
- New error case: when precision is already max and dp is non-literal,
ROUND may now error on overflow rather than return a truncated/wrong
decimal.
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
2026-02-27 14:10:20 +00:00
kosiew e583fe9d22 Add deterministic per-file timing summary to sqllogictest runner (#20569)
## Which issue does this PR close?

* Part of #20524.

## Rationale for this change

The sqllogictest runner executes files in parallel, but it was hard to
pinpoint which test files dominate wall-clock time. This change adds
**deterministic per-file elapsed timing observability** so we can
identify long-tail files and prioritize follow-up optimization work,
while keeping default output usable for both local development (TTY) and
CI (non-TTY).

## What changes are included in this PR?

* Collect per-file elapsed durations in the sqllogictest runner and
aggregate them at end-of-run.
* Print a **deterministic timing summary** (stable sort: elapsed desc,
path asc; stable formatting) via `MultiProgress` to avoid interleaved
progress-bar noise.
* Add CLI flags and environment variables to control output:

  * `--timing-summary auto|off|top|full` (also `SLT_TIMING_SUMMARY`)
  * `--timing-top-n <N>` (also `SLT_TIMING_TOP_N`, must be `>= 1`)
* Default behavior:

* `auto` maps to `off` for local TTY runs and `top` for CI/non-TTY runs.
* Add optional debug logging for slow files (over 30s) behind
`SLT_TIMING_DEBUG_SLOW_FILES=1`.
* Update `datafusion/sqllogictest/README.md` with usage examples.

## Are these changes tested?

* Covered by existing `sqllogictests` integration test execution; no new
unit tests were added.
* Manual validation plan (ran locally / in CI as applicable):

* `cargo test --test sqllogictests -- push_down_filter_ --test-threads
16`
  * `cargo test --test sqllogictests -- --test-threads 16`
* `cargo test --test sqllogictests -- --timing-summary top
--timing-top-n 10`
  * `cargo test --test sqllogictests -- --timing-summary full`
* Verified output properties:

* Summary ordering is deterministic across repeated runs (elapsed desc,
path asc).
* `auto` mode is quiet on TTY but prints a top-N summary on non-TTY/CI.
  * Pass/fail behavior and error reporting are unchanged.

## Are there any user-facing changes?

Yes (test-runner UX only):

* New optional timing summary output for `sqllogictests`.
* New CLI flags / env vars documented in
`datafusion/sqllogictest/README.md`:

  * `--timing-summary auto|off|top|full` / `SLT_TIMING_SUMMARY`
  * `--timing-top-n <N>` / `SLT_TIMING_TOP_N`
* `SLT_TIMING_DEBUG_SLOW_FILES=1` (optional debug logging for slow files
>30s)

No public DataFusion APIs are changed.

## LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated
content has been manually reviewed and tested.
2026-02-27 09:43:47 +00:00
kosiew bc600b3090 Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime (#20566)
## Which issue does this PR close?

* Part of #20524 

## Rationale for this change

`datafusion/sqllogictest/test_files/push_down_filter.slt` had grown into
a large sqllogictest file. Since the sqllogictest runner parallelizes at
**file granularity**, a single heavyweight file can become a straggler
and dominate wall-clock time.

This PR performs a non-invasive split of that file into smaller,
self-contained `.slt` files so the runner can distribute work more
evenly across threads, improving overall suite balance without changing
SQL semantics or test coverage.

## What changes are included in this PR?

* Removed the monolithic `push_down_filter.slt`.
* Added new standalone sqllogictest files, each with the minimal
setup/teardown required to run independently:

* `push_down_filter_unnest.slt` — unnest filter pushdown coverage
(including struct/field cases).
* `push_down_filter_parquet.slt` — parquet filter pushdown + limit +
cast predicate behavior + dynamic filter pushdown (swapped join inputs).
* `push_down_filter_outer_joins.slt` — LEFT/RIGHT join and anti-join
logical filter pushdown checks.
* `push_down_filter_regression.slt` — regression coverage for issues
#17188 and #17512, plus aggregate dynamic filter pushdown checks.
* Updated scratch output paths to be file-scoped (e.g.
`test_files/scratch/push_down_filter_parquet/...`) to reduce the chance
of conflicts when tests execute in parallel.
* Preserved all original query expectations and explain-plan assertions;
changes are organizational only.

## Are these changes tested?

Yes, with a python script to compare text blocks in the new slt files vs
old single slt file.

```
python - <<'PY'
import subprocess, re
from collections import defaultdict, deque

repo = '.'
old_spec = '692a7cb67^:datafusion/sqllogictest/test_files/push_down_filter.slt'
new_specs = [
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_outer_joins.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_parquet.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_regression.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_unnest.slt',
]

def git_show(spec):
  return subprocess.check_output(['git', '-C', repo, 'show', spec], text=True)

def normalize_sql(sql):
  s = sql.strip().lower()
  s = re.sub(
    r"test_files/scratch/push_down_filter(?:_[^'\s;)/]+)?[^'\s;)]*",
    "test_files/scratch/__norm__",
    s,
  )
  s = re.sub(r'\s+', ' ', s)
  return s

def blocks(text):
  lines = text.splitlines()
  out = []
  i = 0
  while i < len(lines):
    m = lines[i].strip()
    if m.startswith('query '):
      i += 1
      b = []
      while i < len(lines) and lines[i].strip() != '----':
        if not lines[i].lstrip().startswith('#'):
          b.append(lines[i])
        i += 1
      sql = '\n'.join(b).strip()
      if sql:
        out.append(('query', normalize_sql(sql)))
    elif m.startswith('statement '):
      i += 1
      b = []
      while i < len(lines):
        s = lines[i].strip()
        if s == '':
          break
        if s.startswith('query ') or s.startswith('statement '):
          i -= 1
          break
        if not lines[i].lstrip().startswith('#'):
          b.append(lines[i])
        i += 1
      sql = '\n'.join(b).strip()
      if sql:
        out.append(('statement', normalize_sql(sql)))
    i += 1
  return out

old_blocks = blocks(git_show(old_spec))
new_blocks = []
for s in new_specs:
  new_blocks.extend(blocks(git_show(s)))

q = defaultdict(deque)
for item in new_blocks:
  q[item].append(item)

missing = 0
extra = 0
for item in old_blocks:
  if q[item]:
    q[item].popleft()
  else:
    missing += 1

for v in q.values():
  extra += len(v)

print(f'old_blocks={len(old_blocks)}')
print(f'new_blocks={len(new_blocks)}')
print(f'missing={missing}')
print(f'extra={extra}')
print(f'baseline={old_spec}')
if missing != 0:
  raise SystemExit(1)
PY
```

Output:
```
old_blocks=107
new_blocks=108
missing=0
extra=1
```

The extra(1) is this statement block:

set datafusion.explain.physical_plan_only = true;
Why it shows as extra:

In split files, it appears 3 times:
push_down_filter_parquet.slt:21
push_down_filter_unnest.slt:21
push_down_filter_regression.slt:129
In the baseline monolithic file at e937cadbc^, it appears 2 times.
So comparison reports 3 - 2 = extra 1.

## Are there any user-facing changes?

No user-facing behavior changes. This is a test-suite
organization/performance improvement only.

## Note before merging

Revert e8369bb (it is a commit to trigger the CI extented tests for
sqllogictest)

## LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated
content has been manually reviewed and tested.
2026-02-27 08:36:31 +00:00
Neil Conway a79e6e6b39 fix(substrait): Correctly parse field references in subqueries (#20439)
## Which issue does this PR close?

- Closes #20438.

## Rationale for this change

The substrait consumer parsed field references in correlated subqueries
incorrectly. Field references were always resolved relative to the
schema of the current (innermost) subquery, leading to incorrect
results.

## What changes are included in this PR?

We now maintain a stack of outer query schemas, and pushes/pops elements
from it as we traverse subqueries. When resolving field references, we
now use `FieldReference.root_type` to detect outer query field
references and resolve them against the appropriate schema.

This commit updates the expected results for parsing TPC-H queries,
because several of them were parsed incorrectly (the misparsing was
probably not detected because the incorrect parse didn't result in any
illegal queries, by sheer luck). This also means we can enable Q17,
which failed to parse before.

## Are these changes tested?

Yes. Test results updated to reflect new, correct behavior, and new unit
tests added.

## Are there any user-facing changes?

The behavior of the substrait consumer has changed, although the
previous behavior was wrong and it seems a bit unlikely anyone would
have dependend on it. The `DefaultSubstraitConsumer` API is slightly
changed (new private field).
2026-02-27 07:14:05 +00:00
mishop-15 a257c29c26 add redirect for old upgrading.html URL to fix broken changelog links (#20582)
## Which issue does this PR close?

- Closes #20572

## Rationale for this change

When upgrade guides were split into separate pages (#20183), the old
`upgrading.html` URL broke. All changelog files still reference this old
URL, causing 404 errors for users.

## What changes are included in this PR?

Added a redirect in `docs/source/conf.py` using the existing
`sphinx_reredirects` extension to redirect
`library-user-guide/upgrading.html` to
`library-user-guide/upgrading/index.html`.

This preserves all existing changelog links without needing to update
historical files.

## Are these changes tested?

Tested locally - the redirect works correctly, including with anchor
links (e.g., `upgrading.html#datafusion-46-0-0`).

## Are there any user-facing changes?

Yes - users clicking old changelog links will now be redirected to the
correct page instead of getting a 404.
2026-02-26 21:23:26 +00:00
Gabriel Ferraté 3ab1301c53 fix: handle empty delimiter in split_part (closes #20503) (#20542)
## Which issue does this PR close?

- Closes #20503 

## Rationale for this change

`split_part` did not handle empty delimiters in a PostgreSQL-compatible
way (`split("")` in Rust creates leading/trailing empty fields).
This could return unexpected results for positions like `1` / `-1` and
out-of-range values.
This PR aligns behavior with Postgres semantics for empty delimiters.

## What changes are included in this PR?

Small change in how we treat the 1, -1

## Are these changes tested?

Indeed!

## Are there any user-facing changes?

Yes, behavior is now more consistent with PostgreSQL for
`split_part(str, '', n)`.
No API changes.
2026-02-26 21:03:34 +00:00
Acfboy e76f0eebe3 fix: IS NULL panic with invalid function without input arguments (#20306)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #20201 .

## Rationale for this change

Unlike `Projection`, expressions within a `Filter` node do not always
have their types resolved during the initial LogicalPlan generation.
Validation is often deferred until the type_coercion phase. When
invoking f_up on the expression tree to perform type coercion, the check
is bypassed for function nodes with empty arguments, leading to a panic
during subsequent execution.

For example, all statements below cause a panic:
```
SELECT * FROM (SELECT 1) WHERE (STARTS_WITH() IS NULL);
SELECT * FROM (SELECT 1) WHERE (STARTS_WITH() IS NOT NULL);
SELECT * FROM (SELECT 'a') WHERE (STARTS_WITH() SIMILAR TO 'abc%');
SELECT * FROM (SELECT 1) WHERE CAST(STARTS_WITH() AS STRING) = 'x';
SELECT * FROM (SELECT 1) WHERE TRY_CAST(STARTS_WITH() AS INT) IS NULL;
```

This pr aims to stop panic. It would be better if reject these invalid
cases at planning.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

```diff
 fn coerce_arguments_for_signature<F: UDFCoercionExt>(
      schema: &DFSchema,
      func: &F,
  ) -> Result<Vec<Expr>> {
-    if expressions.is_empty() {
-        return Ok(expressions);
-    }
```

Deleted the early return. Thanks to @neilconway .

## Are these changes tested?
All original tests passed. Added some unit tests and sqllogictests.


## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
No.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2026-02-26 20:58:54 +00:00
Neil Conway d6fb3608b0 perf: Optimize array_position for scalar needle (#20532)
## Which issue does this PR close?

- Closes #20530 

## Rationale for this change

The previous implementation of `array_position` used
`compare_element_to_list` for every input row. When the needle is a
scalar (quite common), we can do much better by searching over the
entire flat haystack values array with a single call to
`arrow_ord::cmp::not_distinct`. We can then iterate over the resulting
set bits to determine per-row results.

This is ~5-10x faster than the previous implementation for typical
inputs.

## What changes are included in this PR?

* Implement new fast path for `array_position` with scalar needle
* Improve docs for `array_position`
* Don't use `internal_err` to report a user-visible error

## Are these changes tested?

Yes, and benchmarked. Additional tests added in a separate PR (#20531)

## Are there any user-facing changes?

No.
2026-02-26 18:40:55 +00:00
Adam Gutglick a026e7da2f perf: Optimize heap handling in TopK operator (#20556)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

This change to make a significant performance impact in the `TopK`
operator, which is a commonly used operator.

## What changes are included in this PR?

Instead of doing two operations on the inner heap (pop than push), we
use `Binary::peek_mut`, which allows us to replace the heap item
in-place and then sift it to its proper location in the heap.

Some SLT results seem to change, the only explanation I can find for it
is that pop/push vs the sift_down that `PeekMut` uses have some subtle
differences that resolve ties in a different way, ending up with a
slightly different result.

On my macbook, running the `topk_aggregate` benchmark, most benchmarks
are not changed significantly, aside from the following:
```
distinct 10000000 rows desc [no TopK]
                        time:   [554.69 ms 903.25 ms 1.3318 s]
                        change: [−82.888% −69.587% −47.591%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe

Benchmarking distinct 10000000 rows asc [no TopK]: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 113.7s, or reduce sample count to 10.
distinct 10000000 rows asc [no TopK]
                        time:   [405.87 ms 702.47 ms 1.0583 s]
                        change: [−86.490% −75.215% −51.486%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  3 (3.00%) high mild
  14 (14.00%) high severe

distinct 10000000 rows desc [TopK]
                        time:   [6.8372 ms 6.9933 ms 7.1523 ms]
                        change: [−0.5254% +2.2409% +5.0920%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

distinct 10000000 rows asc [TopK]
                        time:   [6.8731 ms 6.9952 ms 7.1226 ms]
                        change: [+3.3252% +5.3824% +7.5131%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
```

## Are these changes tested?

Existing test suite.

## Are there any user-facing changes?

No API changes, seems like some ordering might change in queries that
use the `TopK` operator, but in a way that seems correct.
2026-02-26 12:31:09 +00:00
Emily Matheys bcd42b090a fix: Unaccounted spill sort in row_hash (#20314)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #20313 .

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

We must not use that much memory without reserving it.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Added a reservation before the sort, made a shrink call for the group
values after the emit and updated the reservation so the reservation
will be possible.

Moved the sort to use sort_chunked so we can immediately drop the
original batch and shrink the reservation to the used sizes, added a new
spill method for iterators, so we can use an accurate memory accounting.

If said reservation did not succeed, fallback to an incrementing sort
method which holds the original batch the whole time, and outputs one
batch at the time, this requires a much smaller reservation.
Made the reservation much more robust(otherwise the fuzz tests were
failing now that we actually reserve the memory in the sort)

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Current tests should still function, but memory should be reserved.
Added test that specifically verifies that we error on this when we
shouldn't do the sort.
Modified the tests that used to test the splitting function in the spill
to test the new iter spilling function

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

No
2026-02-26 08:22:25 +00:00
Andy Grove 33b86fe02e perf: Cache num_output_rows in sort merge join to avoid O(n) recount (#20478)
## Which issue does this PR close?

N/A - performance optimization

## Rationale for this change

In the SMJ tight loop (`join_partial`), `num_unfrozen_pairs()` was
called **twice per iteration**: once in the loop guard and once inside
`append_output_pair`. This method iterates all chunks in
`output_indices` and sums their lengths — O(num_chunks). Over a full
batch of `batch_size` iterations, this makes the inner loop O(batch_size
* num_chunks) instead of O(batch_size).

## What changes are included in this PR?

Add a `num_output_rows` field to `StreamedBatch` that is incremented on
each append and reset on freeze, replacing the O(n) summation with an
O(1) field read.

- Added `num_output_rows: usize` field to `StreamedBatch`, initialized
to `0`
- Increment `num_output_rows` in `append_output_pair()` after each
append
- `num_output_rows()` now returns the cached field directly
- Reset to `0` in `freeze_streamed()` when `output_indices` is cleared
- Removed the `num_unfrozen_pairs` parameter from `append_output_pair()`
since it can now read `self.num_output_rows` directly

## Are these changes tested?

Yes — all 48 existing `sort_merge_join` tests pass. This is a pure
refactor of an internal counter with no behavioral change.

## Performance

Very minor improvement.

### Before

```
sort_merge_join/inner_1to1/100000
                        time:   [3.8146 ms 3.8229 ms 3.8314 ms]
sort_merge_join/inner_1to10/100000
                        time:   [16.094 ms 16.125 ms 16.161 ms]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
sort_merge_join/left_1to1_unmatched/100000
                        time:   [3.7823 ms 3.7861 ms 3.7902 ms]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
sort_merge_join/left_semi_1to10/100000
                        time:   [3.0523 ms 3.0755 ms 3.1023 ms]
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe
sort_merge_join/left_anti_partial/100000
                        time:   [3.3458 ms 3.3498 ms 3.3542 ms]
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
```

### After

```
sort_merge_join/inner_1to1/100000
                        time:   [3.7162 ms 3.7207 ms 3.7254 ms]
                        change: [−4.2320% −3.9309% −3.6431%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
sort_merge_join/inner_1to10/100000
                        time:   [15.556 ms 15.589 ms 15.626 ms]
                        change: [−5.2786% −4.8329% −4.4351%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
sort_merge_join/left_1to1_unmatched/100000
                        time:   [3.7059 ms 3.7101 ms 3.7146 ms]
                        change: [−4.4526% −4.1565% −3.8660%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
sort_merge_join/left_semi_1to10/100000
                        time:   [3.0832 ms 3.0899 ms 3.0981 ms]
                        change: [−4.0965% −3.4158% −2.7657%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
sort_merge_join/left_anti_partial/100000
                        time:   [3.2963 ms 3.3048 ms 3.3153 ms]
                        change: [−3.9413% −3.5316% −3.0884%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
```


## Are there any user-facing changes?

No.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 23:21:48 +00:00
Jack Kleeman 3a970c58ff Clamp early aggregation emit to the sort boundary when using partial group ordering (#20446)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #20445.

## What changes are included in this PR?

Fix a panic on early emit with partial sort aggregations, by clamping
our emit point to the sort boundary

## Are these changes tested?

Yes

## Are there any user-facing changes?

No
2026-02-25 22:03:50 +00:00
Neil Conway e894a03bea perf: Use Hashbrown for array_distinct (#20538)
## Which issue does this PR close?

N/A

## Rationale for this change

#20364 recently optimized `array_distinct` to use batched row
conversion. As part of that PR, `std::HashSet` was used. This PR just
replaces `std::HashSet` with `hashbrown::HashSet`, which measurably
improves performance.

## What changes are included in this PR?

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
2026-02-25 18:12:42 +00:00
Neil Conway e6849945bf fix: cardinality() of an empty array should be zero (#20533)
## Which issue does this PR close?

- Closes #20526.

## Rationale for this change

Per Postgres and the SQL spec, `cardinality()` of an empty array should
be zero; we previously returned `NULL`.

Along the way, fix another bug: we previously returned `0` for the
cardinality of an untyped `NULL` and `NULL` for the cardinality of a
typed null (e.g., `NULL::int[]`). We should return `NULL` in both cases.

## What changes are included in this PR?

Bug fixes, update SLT.

## Are these changes tested?

Yes.

## Are there any user-facing changes?

Yes: the behavior of `cardinality` has changed, albeit the previous
behavior was incorrect.
2026-02-25 14:02:43 +00:00
Kazantsev Maksim d7d646164d feat: Implement Spark bin function (#20479)
## Which issue does this PR close?

N/A

## Rationale for this change

Add new function:
https://spark.apache.org/docs/latest/api/sql/index.html#bin

## What changes are included in this PR?

- Implementation
- Unit Tests
- SLT tests

## Are these changes tested?

Yes, tests added as part of this PR.

## Are there any user-facing changes?

No, these are new function.

---------

Co-authored-by: Kazantsev Maksim <mn.kazantsev@gmail.com>
2026-02-25 11:38:51 +00:00
Lía Adriana e937cadbcc [fix] Add type coercion from NULL to Interval to make date_bin more postgres compatible (#20499)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes https://github.com/apache/datafusion/issues/20502

## Rationale for this change

The following query is failing with the following error:

`SELECT date_bin(NULL, TIMESTAMP '2023-01-01 12:30:00', TIMESTAMP
'2023-01-01 12:00:00')
`

`Error: Error during planning: Failed to coerce arguments to satisfy a
call to 'date_bin' function: coercion from Null, Timestamp(ns),
Timestamp(ns) to the signature OneOf([....])`

## What changes are included in this PR?

Fix `date_bin(NULL, ...)` to return `NULL` instead of a planning error
by allowing Nulls to coerce to Interva.

## Are these changes tested?

I added a sqllogictest case to verify the query executes and returns
`NULL`.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

Yes, previously `date_bin(NULL, ...) `returned a planning error. It now
returns NULL.
2026-02-25 08:02:30 +00:00
kosiew d75fcb83e3 Fix physical expr adapter to resolve physical fields by name, not column index (#20485)
## Which issue does this PR close?

*
[Comment](https://github.com/apache/datafusion/pull/20202#discussion_r2804840366)
on #20202

## Rationale for this change

When adapting physical expressions across differing logical/physical
schemas, relying on `Column::index()` can be incorrect if the physical
schema column ordering differs from the logical plan (or if a `Column`
is constructed with an index that doesn’t match the current physical
schema). This can lead to looking up the wrong physical field, causing
incorrect casts, type mismatches, or runtime failures.

This change ensures the adapter always resolves the physical field using
the column **name** against the physical file schema, making expression
rewriting robust to schema reordering and avoiding subtle bugs where an
index points at an unrelated column.

## What changes are included in this PR?

* Updated `create_cast_column_expr` to resolve the physical field via
`physical_file_schema.index_of(column.name())` instead of
`column.index()`.
* Added a regression test that deliberately supplies a mismatched
`Column` index and asserts the rewriter still selects the correct
physical field by name and produces the expected `CastColumnExpr`.

## Are these changes tested?

Yes.

* Added `test_create_cast_column_expr_uses_name_lookup_not_column_index`
which covers the scenario where physical and logical schemas have
different column orders and the provided `Column` index is incorrect.

## Are there any user-facing changes?

No direct user-facing changes.

This is an internal correctness fix that improves robustness of physical
expression adaptation when schema ordering differs between logical and
physical plans.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
2026-02-25 07:52:59 +00:00
Haresh Khanna 2347306943 [Minor] Fix error messages for shrink and try_shrink (#20422)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
In the following code, when we fetch `prev` again to construct the error
message, the value we get may be different from the value that failed
`checked_sub` in the first place which would get us out of the
fetch_update CAS loop. Instead we should use the prev value that
`fetch_update` returned in the error message.

```rust
pub fn try_shrink(&self, capacity: usize) -> Result<usize> {
    let prev = self
        .size
        .fetch_update(
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
            |prev| prev.checked_sub(capacity),
        )
        .map_err(|_| {
            let prev = self.size.load(atomic::Ordering::Relaxed);
            internal_datafusion_err!(
                "Cannot free the capacity {capacity} out of allocated size {prev}"
            )
        })?;

    self.registration.pool.shrink(self, capacity);
    Ok(prev - capacity)
}
```

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->
Yes, with existing tests.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
No
2026-02-25 01:47:39 +00:00
Albert Skalt 387e20cc58 Improve HashJoinExecBuilder to save state from previous fields (#20276)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

Closes #20270

Prior the patch HashJoinExecBuilder constructed from an existing node
reseted some fields of the node, e.g. dynamic filters, metrics. It
significantly reduces usage scope of the builder.

## What changes are included in this PR?

This patch improves the implementation. Now builder created from the
existing node preserves all fields in case they have not been explicitly
updated. Also builder now tracks flag if it must recompute plan
properties.

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2026-02-24 22:15:34 +00:00
Neil Conway 585bbf35d3 perf: Optimize array_has_any() with scalar arg (#20385)
## Which issue does this PR close?

- Closes #20384.
- See #18181 for related context.

## Rationale for this change

When `array_has_any` is passed a scalar for either of its arguments, we
can use a much faster algorithm: rather than doing O(N*M) comparisons
for each row of the columnar arg, we can build a hash table on the
scalar argument and probe it instead.

## What changes are included in this PR?

* Add benchmark to cover the one-scalar-arg case
* Implement optimization as described above

Note that we fallback to a linear scan when the scalar arg is smaller
than a threshold (<= 8 elements), because benchmarks suggested probing a
HashSet is not profitable for very small arrays.

## Are these changes tested?

Yes. Tests pass and benchmarked.

## Are there any user-facing changes?

No.

---------

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
2026-02-24 20:59:08 +00:00
Albert Skalt 34dad2ccee Cache PlanProperties, add fast-path for with_new_children (#19792)
- closes https://github.com/apache/datafusion/issues/19796

This patch aims to implement a fast-path for the
ExecutionPlan::with_new_children function for some plans, moving closer
to a physical plan re-use implementation and improving planning
performance. If the passed children properties are the same as in self,
we do not actually recompute self's properties (which could be costly if
projection mapping is required). Instead, we just replace the children
and re-use self's properties as-is.

To be able to compare two different properties --
ExecutionPlan::properties(...) signature is modified and now returns
`&Arc<PlanProperties>`. If `children` properties are the same in
`with_new_children` -- we clone our properties arc and then a parent
plan will consider our properties as unchanged, doing the same.

- Return `&Arc<PlanProperties>` from `ExecutionPlan::properties(...)`
instead of a reference.
- Implement `with_new_children` fast-path if there is no children
properties changes for all
  major plans.

Note: currently, `reset_plan_states` does not allow to re-use plan in
general: it is not
supported for dynamic filters and recursive queries features, as in this
case state reset
should update pointers in the children plans.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2026-02-24 20:58:06 +00:00
Ganesh Patil b8cebdde2a Fix incorrect regex pattern in regex_replace_posix_groups (#19827)
The `regex_replace_posix_groups` method was using the pattern `(\d*)` to
match
POSIX capture group references like `\1`. However, `*` matches zero or
more
digits, which caused a lone backslash `\` to incorrectly become `${}`.

Changed to `(\d+)` which requires at least one digit, fixing the issue.

Added unit tests to validate correct behavior.

- Fixes #19766

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2026-02-24 20:26:06 +00:00
Adam Gutglick e80694e369 Remove recursive const check in simplify_const_expr (#20234)
## Which issue does this PR close?

- Closes #20134 .

## Rationale for this change

The check for simplifying const expressions was recursive and expensive,
repeatedly checking the expression's children in a recursive way.

I've tried other approached like pre-computing the result for all
expressions outside of the loop and using that cache during the
traversal, but I've found that it only yielded between 5-8% improvement
while adding complexity, while this approach simplifies the code and
seems to be more performant in my benchmarks (change is compared to
current main branch):
```
tpc-ds/q76/cs/16        time:   [27.112 µs 27.159 µs 27.214 µs]
                        change: [−13.533% −13.167% −12.801%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

tpc-ds/q76/ws/16        time:   [26.175 µs 26.280 µs 26.394 µs]
                        change: [−14.312% −13.833% −13.346%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

tpc-ds/q76/cs/128       time:   [195.79 µs 196.17 µs 196.56 µs]
                        change: [−14.362% −14.080% −13.816%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild

tpc-ds/q76/ws/128       time:   [197.08 µs 197.61 µs 198.23 µs]
                        change: [−13.531% −13.142% −12.737%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
```

## What changes are included in this PR?

1. `simplify_const_expr` now only checks itself and whether all of its
children are literals, because it assumes the order of simplification is
bottoms-up.
2. Removes some code from the public API, see the last section for the
full details.

## Are these changes tested?

Existing test suite

## Are there any user-facing changes?

I suggest removing some of the physical expression simplification code
from the public API, which I believe reduces the maintenance burden
here. These changes also helps removing code like the distinct
`simplify_const_expr` and `simplify_const_expr_with_dummy`.

1. Makes all `datafusion-physical-expr::simplifier` sub-modules (`not`
and `const_evaluator`) private, including their key functions. They are
not used externally, and being able to change their behavior seems more
valuable long term. The simplifier is also not currently an extension
point as far as I can tell, so there's no value in providing atomic
building blocks like them for now.
2. Removes `has_column_references` completely, its trivial to
re-implement and isn't used anywhere in the codebase.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2026-02-24 19:49:37 +00:00
Andrew Lamb fdd36d0d21 Update comments on OptimizerRule about function name matching (#20346)
## Which issue does this PR close?

- Related to  https://github.com/apache/datafusion/pull/20180



## Rationale for this change

I gave feedback to @devanshu0987
https://github.com/apache/datafusion/pull/20180/changes#r2800720037 that
it was not a good idea to check for function names in optimizer rules,
but then I realized that the rationale for this is not written down
anywhere.

## What changes are included in this PR?

Document why checking for function names in optimizer rules is not good
and offer alternatives

## Are these changes tested?

By CI

## Are there any user-facing changes?

Just docs, no functional changes
2026-02-24 19:46:32 +00:00
Raz Luvaton b16ad9badc fix: SortMergeJoin don't wait for all input before emitting (#20482)
## Which issue does this PR close?

N/A

## Rationale for this change

I noticed while playing around with local tests and debugging memory
issue, that `SortMergeJoinStream` wait for all input before start
emitting, which shouldn't be the case as we can emit early when we have
enough data.

also, this cause huge memory pressure

## What changes are included in this PR?

Trying to fix the issue, not sure yet

## Are these changes tested?

Yes

## Are there any user-facing changes?


-----


## TODO:
- [x] update docs
- [x] finish fix
2026-02-24 19:12:42 +00:00
Neil Conway db5197b742 chore: Replace matches! on fieldless enums with == (#20525)
## Which issue does this PR close?

N/A

## Rationale for this change

When comparing a value with a field-less enum that implements
`PartialEq`, `==` is simpler and more readable than `matches!`.

## What changes are included in this PR?

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
2026-02-24 15:48:06 +00:00
dependabot[bot] 932418b20c chore(deps): bump strum_macros from 0.27.2 to 0.28.0 (#20521)
Bumps [strum_macros](https://github.com/Peternator7/strum) from 0.27.2
to 0.28.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/Peternator7/strum/blob/master/CHANGELOG.md">strum_macros's
changelog</a>.</em></p>
<blockquote>
<h2>0.28.0</h2>
<ul>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/461">#461</a>:
Allow any kind of passthrough attributes on
<code>EnumDiscriminants</code>.</p>
<ul>
<li>Previously only list-style attributes (e.g.
<code>#[strum_discriminants(derive(...))]</code>) were supported. Now
path-only
(e.g. <code>#[strum_discriminants(non_exhaustive)]</code>) and
name/value (e.g. <code>#[strum_discriminants(doc =
&quot;foo&quot;)]</code>)
attributes are also supported.</li>
</ul>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/462">#462</a>:
Add missing <code>#[automatically_derived]</code> to generated impls not
covered by <a
href="https://redirect.github.com/Peternator7/strum/pull/444">#444</a>.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/466">#466</a>:
Bump MSRV to 1.71, required to keep up with updated <code>syn</code> and
<code>windows-sys</code> dependencies. This is a breaking change if
you're on an old version of rust.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/469">#469</a>:
Use absolute paths in generated proc macro code to avoid
potential name conflicts.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/465">#465</a>:
Upgrade <code>phf</code> dependency to v0.13.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/473">#473</a>:
Fix <code>cargo fmt</code> / <code>clippy</code> issues and add GitHub
Actions CI.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/477">#477</a>:
<code>strum::ParseError</code> now implements
<code>core::fmt::Display</code> instead
<code>std::fmt::Display</code> to make it <code>#[no_std]</code>
compatible. Note the <code>Error</code> trait wasn't available in core
until <code>1.81</code>
so <code>strum::ParseError</code> still only implements that in std.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/476">#476</a>:
<strong>Breaking Change</strong> - <code>EnumString</code> now
implements <code>From&lt;&amp;str&gt;</code>
(infallible) instead of <code>TryFrom&lt;&amp;str&gt;</code> when the
enum has a <code>#[strum(default)]</code> variant. This more accurately
reflects that parsing cannot fail in that case. If you need the old
<code>TryFrom</code> behavior, you can opt back in using
<code>parse_error_ty</code> and <code>parse_error_fn</code>:</p>
<pre lang="rust"><code>#[derive(EnumString)]
#[strum(parse_error_ty = strum::ParseError, parse_error_fn =
make_error)]
pub enum Color {
    Red,
    #[strum(default)]
    Other(String),
}
<p>fn make_error(x: &amp;str) -&gt; strum::ParseError {
strum::ParseError::VariantNotFound
}
</code></pre></p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/431">#431</a>:
Fix bug where <code>EnumString</code> ignored the
<code>parse_err_ty</code>
attribute when the enum had a <code>#[strum(default)]</code>
variant.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/474">#474</a>:
EnumDiscriminants will now copy <code>default</code> over from the
original enum to the Discriminant enum.</p>
<pre lang="rust"><code>#[derive(Debug, Default, EnumDiscriminants)]
#[strum_discriminants(derive(Default))] // &lt;- Remove this in 0.28.
enum MyEnum {
    #[default] // &lt;- Will be the #[default] on the MyEnumDiscriminant
    #[strum_discriminants(default)] // &lt;- Remove this in 0.28
    Variant0,
    Variant1 { a: NonDefault },
}
</code></pre>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/Peternator7/strum/commit/7376771128834d28bb9beba5c39846cba62e71ec"><code>7376771</code></a>
Peternator7/0.28 (<a
href="https://redirect.github.com/Peternator7/strum/issues/475">#475</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/26e63cd964a2e364331a5dd977d589bb9f649d8c"><code>26e63cd</code></a>
Display exists in core (<a
href="https://redirect.github.com/Peternator7/strum/issues/477">#477</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/9334c728eedaa8a992d1388a8f4564bbccad1934"><code>9334c72</code></a>
Make TryFrom and FromStr infallible if there's a default (<a
href="https://redirect.github.com/Peternator7/strum/issues/476">#476</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/0ccbbf823c16e827afc263182cd55e99e3b2a52e"><code>0ccbbf8</code></a>
Honor parse_err_ty attribute when the enum has a default variant (<a
href="https://redirect.github.com/Peternator7/strum/issues/431">#431</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/2c9e5a9259189ce8397f2f4967060240c6bafd74"><code>2c9e5a9</code></a>
Automatically add Default implementation to EnumDiscriminant if it
exists on ...</li>
<li><a
href="https://github.com/Peternator7/strum/commit/e241243e48359b8b811b8eaccdcfa1ae87138e0d"><code>e241243</code></a>
Fix existing cargo fmt + clippy issues and add GH actions (<a
href="https://redirect.github.com/Peternator7/strum/issues/473">#473</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/639b67fefd20eaead1c5d2ea794e9afe70a00312"><code>639b67f</code></a>
feat: allow any kind of passthrough attributes on
<code>EnumDiscriminants</code> (<a
href="https://redirect.github.com/Peternator7/strum/issues/461">#461</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/0ea1e2d0fd1460e7492ea32e6b460394d9199ff8"><code>0ea1e2d</code></a>
docs: Fix typo (<a
href="https://redirect.github.com/Peternator7/strum/issues/463">#463</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/36c051b91086b37d531c63ccf5a49266832a846d"><code>36c051b</code></a>
Upgrade <code>phf</code> to v0.13 (<a
href="https://redirect.github.com/Peternator7/strum/issues/465">#465</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/9328b38617dc6f4a3bc5fdac03883d3fc766cf34"><code>9328b38</code></a>
Use absolute paths in proc macro (<a
href="https://redirect.github.com/Peternator7/strum/issues/469">#469</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/Peternator7/strum/compare/v0.27.2...v0.28.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=strum_macros&package-manager=cargo&previous-version=0.27.2&new-version=0.28.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-24 15:47:50 +00:00
Neil Conway e71e7a39bf chore: Cleanup code to use repeat_n in a few places (#20527)
## Which issue does this PR close?

N/A

## Rationale for this change

Using `repeat_n` is more readable and slightly faster than
`(0..n).map(|_| ...)`.

## What changes are included in this PR?

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
2026-02-24 15:46:58 +00:00
Adrian Garcia Badaracco 670dbf481c fix: prevent duplicate alias collision with user-provided __datafusion_extracted names (#20432)
## Summary
- Fixes a bug where the optimizer's `AliasGenerator` could produce alias
names that collide with`__datafusion_extracted_N` aliases, causing a
"Schema contains duplicate unqualified field name" error
- I don't expect users themselves to create these aliases, but if you
run the optimizers twice (with different `AliasGenerator` instances)
you'll hit this.
- Adds `AliasGenerator::update_min_id()` to advance the counter past
existing aliases
- Scans each plan node's expressions during `ExtractLeafExpressions`
traversal to seed the generator before any extraction occurs
- Switches to controlling the traversal which also means the
config-based short circuit more clearly skips the entire rule.

Closes https://github.com/apache/datafusion/issues/20430

## Test plan
- [x] Unit test: `test_user_provided_extracted_alias_no_collision` in
`extract_leaf_expressions`
- [x] SLT regression test in `projection_pushdown.slt` with explicit
`__datafusion_extracted_2` alias

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 15:02:59 +00:00
mishop-15 17d770d6e5 fix: handle out of range errors in DATE_BIN instead of panicking (#20221)
## Which issue does this PR close?

Closes #20219

## Rationale for this change

The DATE_BIN function was panicking when datetime operations went out of
range instead of returning proper errors. The two specific cases were:
1. Month subtraction going out of range causing `DateTime - Months`
panic
2. `timestamp_nanos_opt()` returning None and then unwrapping

## What changes are included in this PR?

- Changed `date_bin_months_interval` and `to_utc_date_time` to return
`Result` instead of panicking
- Replaced `origin_date - Months` and `origin_date + Months` with
`checked_sub_months` and `checked_add_months`
- Replaced `.unwrap()` calls with proper `match` statements and error
handling
- Updated all callers throughout the file to handle `Result` types

## Are these changes tested?

Tested manually with the exact queries from the issue that were
panicking:
```sql
select DATE_BIN('1637426858', TO_TIMESTAMP_MILLIS(1040292460), TIMESTAMP '1984-01-07 00:00:00');
select DATE_BIN('1637426858', TO_TIMESTAMP_MILLIS(-1040292460), TIMESTAMP '1984-01-07 00:00:00');
```

Both queries now return NULL instead of panicking. All existing unit
tests pass.

## Are there any user-facing changes?

Yes - queries with DATE_BIN that would previously panic now return NULL
when datetime operations go out of range.
2026-02-24 13:55:32 +00:00
Neil Conway 9c85ac608f perf: Fix quadratic behavior of to_array_of_size (#20459)
## Which issue does this PR close?

- Closes #20458.
- Closes #18159.

## Rationale for this change

When `array_to_size(n)` was called on a `List`-like object containing a
`StringViewArray` with `b` data buffers, the previous implementation
returned a list containing a `StringViewArray` with `n*b` buffers, which
results in catastrophically bad performance if `b` grows even somewhat
large.

This issue was previously noticed causing poor nested loop join
performance. #18161 adjusted the NLJ code to avoid calling
`to_array_of_size` for this reason, but didn't attempt to fix the
underlying issue in `to_array_of_size`. This PR doesn't attempt to
revert the change to the NLJ code: the special-case code added in #18161
is still slightly faster than `to_array_of_size` after this
optimization. It might be possible to address that in a future PR.

## What changes are included in this PR?
* Instead of using `repeat_n` + `concat` to merge together `n` copies of
the `StringViewArray`, we instead use `take`, which preserves the same
number of buffers as the input `StringViewArray`.
* Add a new benchmark for this situation
* Add more unit tests for `to_array_of_size`

## Are these changes tested?

Yes and benchmarked.

## Are there any user-facing changes?

No.

## AI usage

Iterated on the problem with Claude Code; I understand the problem and
the solution.
2026-02-24 13:53:10 +00:00
Tim Saucer a9c090141d Add support for FFI config extensions (#19469)
## Which issue does this PR close?

This addresses part of https://github.com/apache/datafusion/issues/17035

This is also a blocker for
https://github.com/apache/datafusion/issues/20450

## Rationale for this change

Currently we cannot support user defined configuration extensions via
FFI. This is because much of the infrastructure on how to add and
extract custom extensions relies on knowing concrete types of the
extensions. This is not supported in FFI. This PR adds an implementation
of configuration extensions that can be used across a FFI boundary.

## What changes are included in this PR?

- Implement `FFI_ExtensionOptions`.
- Update `ConfigOptions` to check if a `datafusion_ffi` namespace exists
when setting values
- Add unit test

## Are these changes tested?

Unit test added.

Also tested against `datafusion-python` locally. With this code I have
the following test that passes. I have created a simple python exposed
`MyConfig`:

```python
from datafusion import SessionConfig
from datafusion_ffi_example import MyConfig

def test_catalog_provider():
    config = MyConfig()
    config = SessionConfig().with_extension(config)
    config.set("my_config.baz_count", "42")
```

## Are there any user-facing changes?

New addition only.
2026-02-24 13:18:02 +00:00
kosiew 4a41587bdf Make custom_file_casts example schema nullable to allow null id values during casting (#20486)
## Which issue does this PR close?

*
[Comment](https://github.com/apache/datafusion/pull/20202#discussion_r2804841561)
on #20202

---

## Rationale for this change

The `custom_file_casts` example defines a *logical/table* schema that
uses `id: Int32` as the target type. In practice, casting and projection
paths in DataFusion can produce **nulls** (e.g. failed casts, missing
values, or intermediate expressions), and examples should avoid implying
that nulls are impossible when demonstrating casting behavior.

Marking the `id` field as **nullable** makes the example more realistic
and prevents confusion when users follow or adapt the example to
scenarios where nulls may appear.

---

## What changes are included in this PR?

* Update the logical/table schema in `custom_file_casts.rs` to define
`id` as **nullable** (`Field::new("id", DataType::Int32, true)`).
* Adjust the inline comment to reflect the nullable schema.

---

## Are these changes tested?

No new tests were added.

This is a documentation/example-only change that updates a schema
definition and comment. The example continues to compile and can be
exercised by running the `custom_file_casts` example as before.

---

## Are there any user-facing changes?

Yes (example behavior/expectations):

* The `custom_file_casts` example now documents `id` as nullable,
aligning the example schema with situations where cast/projection may
yield null values.
* No public APIs are changed and no breaking behavior is introduced.
2026-02-24 12:24:42 +00:00
Tim-53 0dfa542201 fix: HashJoin panic with dictionary-encoded columns in multi-key joins (#20441)
## Which issue does this PR close?

- Closes #20437


## Rationale for this change
`flatten_dictionary_array` returned only the unique values rather then
the full expanded array when being called on a `DictionaryArray`. When
building a `StructArray` this caused a length mismatch panic.


## What changes are included in this PR?
Replaced `array.values()` with `arrow::compute::cast(array, value_type)`
in `flatten_dictionary_array`, which properly expands the dictionary
into a full length array matching the row count.

## Are these changes tested?

Yes, both a new unit test aswell as a regression test were added.

## Are there any user-facing changes?

Nope

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2026-02-24 12:10:15 +00:00
dependabot[bot] 6c793694e9 chore(deps): bump the all-other-cargo-deps group with 2 updates (#20519)
Bumps the all-other-cargo-deps group with 2 updates:
[chrono](https://github.com/chronotope/chrono) and
[wasm-bindgen-test](https://github.com/wasm-bindgen/wasm-bindgen).

Updates `chrono` from 0.4.43 to 0.4.44
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/chronotope/chrono/releases">chrono's
releases</a>.</em></p>
<blockquote>
<h2>0.4.44</h2>
<h2>What's Changed</h2>
<ul>
<li>docs: match MSRV with <code>Cargo.toml</code> contents by <a
href="https://github.com/coryan"><code>@​coryan</code></a> in <a
href="https://redirect.github.com/chronotope/chrono/pull/1772">chronotope/chrono#1772</a></li>
<li>Add track_caller to non-deprecated functions by <a
href="https://github.com/svix-jplatte"><code>@​svix-jplatte</code></a>
in <a
href="https://redirect.github.com/chronotope/chrono/pull/1774">chronotope/chrono#1774</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/chronotope/chrono/commit/c14b4599d07ef36ffa1f8a531fb0bc7eb3b42464"><code>c14b459</code></a>
Bump version to 0.4.44</li>
<li><a
href="https://github.com/chronotope/chrono/commit/ea832c5090369eefa2cb6a47d643e2f7ade7ffa7"><code>ea832c5</code></a>
Add track_caller to non-deprecated functions</li>
<li><a
href="https://github.com/chronotope/chrono/commit/cfae889a3a23507acf49b605794abba17effd2d7"><code>cfae889</code></a>
Fix panic message in to_rfc2822</li>
<li><a
href="https://github.com/chronotope/chrono/commit/f8900b5a44228a7f6282c65e8c407d3ecb6dcb7b"><code>f8900b5</code></a>
docs: match MSRV with <code>Cargo.toml</code> contents</li>
<li>See full diff in <a
href="https://github.com/chronotope/chrono/compare/v0.4.43...v0.4.44">compare
view</a></li>
</ul>
</details>
<br />

Updates `wasm-bindgen-test` from 0.3.61 to 0.3.62
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/wasm-bindgen/wasm-bindgen/commits">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore <dependency name> major version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's major version (unless you unignore this specific
dependency's major version or upgrade to it yourself)
- `@dependabot ignore <dependency name> minor version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's minor version (unless you unignore this specific
dependency's minor version or upgrade to it yourself)
- `@dependabot ignore <dependency name>` will close this group update PR
and stop Dependabot creating any more for the specific dependency
(unless you unignore this specific dependency or upgrade to it yourself)
- `@dependabot unignore <dependency name>` will remove all of the ignore
conditions of the specified dependency
- `@dependabot unignore <dependency name> <ignore condition>` will
remove the ignore condition of the specified dependency and ignore
conditions


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-24 11:24:24 +00:00
dependabot[bot] 4c0a6531ca chore(deps): bump taiki-e/install-action from 2.68.6 to 2.68.8 (#20518)
Bumps
[taiki-e/install-action](https://github.com/taiki-e/install-action) from
2.68.6 to 2.68.8.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/taiki-e/install-action/releases">taiki-e/install-action's
releases</a>.</em></p>
<blockquote>
<h2>2.68.8</h2>
<ul>
<li>
<p>Update <code>cargo-nextest@latest</code> to 0.9.129.</p>
</li>
<li>
<p>Update <code>mise@latest</code> to 2026.2.19.</p>
</li>
<li>
<p>Update <code>tombi@latest</code> to 0.7.32.</p>
</li>
</ul>
<h2>2.68.7</h2>
<ul>
<li>
<p>Update <code>mise@latest</code> to 2026.2.18.</p>
</li>
<li>
<p>Update <code>wasm-bindgen@latest</code> to 0.2.111.</p>
</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/taiki-e/install-action/blob/main/CHANGELOG.md">taiki-e/install-action's
changelog</a>.</em></p>
<blockquote>
<h1>Changelog</h1>
<p>All notable changes to this project will be documented in this
file.</p>
<p>This project adheres to <a href="https://semver.org">Semantic
Versioning</a>.</p>
<!-- raw HTML omitted -->
<h2>[Unreleased]</h2>
<ul>
<li>
<p>Update <code>wasm-bindgen@latest</code> to 0.2.112.</p>
</li>
<li>
<p>Update <code>uv@latest</code> to 0.10.5.</p>
</li>
</ul>
<h2>[2.68.8] - 2026-02-23</h2>
<ul>
<li>
<p>Update <code>cargo-nextest@latest</code> to 0.9.129.</p>
</li>
<li>
<p>Update <code>mise@latest</code> to 2026.2.19.</p>
</li>
<li>
<p>Update <code>tombi@latest</code> to 0.7.32.</p>
</li>
</ul>
<h2>[2.68.7] - 2026-02-22</h2>
<ul>
<li>
<p>Update <code>mise@latest</code> to 2026.2.18.</p>
</li>
<li>
<p>Update <code>wasm-bindgen@latest</code> to 0.2.111.</p>
</li>
</ul>
<h2>[2.68.6] - 2026-02-21</h2>
<ul>
<li>Update <code>wasm-bindgen@latest</code> to 0.2.110.</li>
</ul>
<h2>[2.68.5] - 2026-02-20</h2>
<ul>
<li>Update <code>wasm-bindgen@latest</code> to 0.2.109.</li>
</ul>
<h2>[2.68.4] - 2026-02-20</h2>
<ul>
<li>Update <code>cargo-nextest@latest</code> to 0.9.128.</li>
</ul>
<h2>[2.68.3] - 2026-02-19</h2>
<ul>
<li>
<p>Update <code>mise@latest</code> to 2026.2.17.</p>
</li>
<li>
<p>Update <code>cargo-tarpaulin@latest</code> to 0.35.2.</p>
</li>
<li>
<p>Update <code>syft@latest</code> to 1.42.1.</p>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/taiki-e/install-action/commit/cfdb446e391c69574ebc316dfb7d7849ec12b940"><code>cfdb446</code></a>
Release 2.68.8</li>
<li><a
href="https://github.com/taiki-e/install-action/commit/350f13bd74589d52195d1aed8e04b35b616a9c49"><code>350f13b</code></a>
Update <code>cargo-nextest@latest</code> to 0.9.129</li>
<li><a
href="https://github.com/taiki-e/install-action/commit/8ba6eccac43cdb9aa5b83627b897b416b206be2a"><code>8ba6ecc</code></a>
Update <code>mise@latest</code> to 2026.2.19</li>
<li><a
href="https://github.com/taiki-e/install-action/commit/cf805946ef1da29d90b652c870b7a19aae44b0f5"><code>cf80594</code></a>
Update <code>tombi@latest</code> to 0.7.32</li>
<li><a
href="https://github.com/taiki-e/install-action/commit/f92912fad184299a31e22ad070a5059fd07d4f59"><code>f92912f</code></a>
Release 2.68.7</li>
<li><a
href="https://github.com/taiki-e/install-action/commit/4970026aba514ced4229209c822802e1bff68b3e"><code>4970026</code></a>
Update <code>mise@latest</code> to 2026.2.18</li>
<li><a
href="https://github.com/taiki-e/install-action/commit/6043f02f023f20fde8f9436e0d500ee1391fab70"><code>6043f02</code></a>
Update <code>wasm-bindgen@latest</code> to 0.2.111</li>
<li>See full diff in <a
href="https://github.com/taiki-e/install-action/compare/470679bc3a1580072dac4e67535d1aa3a3dcdf51...cfdb446e391c69574ebc316dfb7d7849ec12b940">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=taiki-e/install-action&package-manager=github_actions&previous-version=2.68.6&new-version=2.68.8)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-24 11:24:23 +00:00
dependabot[bot] 3aa34b33f5 chore(deps): bump strum from 0.27.2 to 0.28.0 (#20520)
Bumps [strum](https://github.com/Peternator7/strum) from 0.27.2 to
0.28.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/Peternator7/strum/blob/master/CHANGELOG.md">strum's
changelog</a>.</em></p>
<blockquote>
<h2>0.28.0</h2>
<ul>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/461">#461</a>:
Allow any kind of passthrough attributes on
<code>EnumDiscriminants</code>.</p>
<ul>
<li>Previously only list-style attributes (e.g.
<code>#[strum_discriminants(derive(...))]</code>) were supported. Now
path-only
(e.g. <code>#[strum_discriminants(non_exhaustive)]</code>) and
name/value (e.g. <code>#[strum_discriminants(doc =
&quot;foo&quot;)]</code>)
attributes are also supported.</li>
</ul>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/462">#462</a>:
Add missing <code>#[automatically_derived]</code> to generated impls not
covered by <a
href="https://redirect.github.com/Peternator7/strum/pull/444">#444</a>.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/466">#466</a>:
Bump MSRV to 1.71, required to keep up with updated <code>syn</code> and
<code>windows-sys</code> dependencies. This is a breaking change if
you're on an old version of rust.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/469">#469</a>:
Use absolute paths in generated proc macro code to avoid
potential name conflicts.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/465">#465</a>:
Upgrade <code>phf</code> dependency to v0.13.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/473">#473</a>:
Fix <code>cargo fmt</code> / <code>clippy</code> issues and add GitHub
Actions CI.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/477">#477</a>:
<code>strum::ParseError</code> now implements
<code>core::fmt::Display</code> instead
<code>std::fmt::Display</code> to make it <code>#[no_std]</code>
compatible. Note the <code>Error</code> trait wasn't available in core
until <code>1.81</code>
so <code>strum::ParseError</code> still only implements that in std.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/476">#476</a>:
<strong>Breaking Change</strong> - <code>EnumString</code> now
implements <code>From&lt;&amp;str&gt;</code>
(infallible) instead of <code>TryFrom&lt;&amp;str&gt;</code> when the
enum has a <code>#[strum(default)]</code> variant. This more accurately
reflects that parsing cannot fail in that case. If you need the old
<code>TryFrom</code> behavior, you can opt back in using
<code>parse_error_ty</code> and <code>parse_error_fn</code>:</p>
<pre lang="rust"><code>#[derive(EnumString)]
#[strum(parse_error_ty = strum::ParseError, parse_error_fn =
make_error)]
pub enum Color {
    Red,
    #[strum(default)]
    Other(String),
}
<p>fn make_error(x: &amp;str) -&gt; strum::ParseError {
strum::ParseError::VariantNotFound
}
</code></pre></p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/431">#431</a>:
Fix bug where <code>EnumString</code> ignored the
<code>parse_err_ty</code>
attribute when the enum had a <code>#[strum(default)]</code>
variant.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/Peternator7/strum/pull/474">#474</a>:
EnumDiscriminants will now copy <code>default</code> over from the
original enum to the Discriminant enum.</p>
<pre lang="rust"><code>#[derive(Debug, Default, EnumDiscriminants)]
#[strum_discriminants(derive(Default))] // &lt;- Remove this in 0.28.
enum MyEnum {
    #[default] // &lt;- Will be the #[default] on the MyEnumDiscriminant
    #[strum_discriminants(default)] // &lt;- Remove this in 0.28
    Variant0,
    Variant1 { a: NonDefault },
}
</code></pre>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/Peternator7/strum/commit/7376771128834d28bb9beba5c39846cba62e71ec"><code>7376771</code></a>
Peternator7/0.28 (<a
href="https://redirect.github.com/Peternator7/strum/issues/475">#475</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/26e63cd964a2e364331a5dd977d589bb9f649d8c"><code>26e63cd</code></a>
Display exists in core (<a
href="https://redirect.github.com/Peternator7/strum/issues/477">#477</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/9334c728eedaa8a992d1388a8f4564bbccad1934"><code>9334c72</code></a>
Make TryFrom and FromStr infallible if there's a default (<a
href="https://redirect.github.com/Peternator7/strum/issues/476">#476</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/0ccbbf823c16e827afc263182cd55e99e3b2a52e"><code>0ccbbf8</code></a>
Honor parse_err_ty attribute when the enum has a default variant (<a
href="https://redirect.github.com/Peternator7/strum/issues/431">#431</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/2c9e5a9259189ce8397f2f4967060240c6bafd74"><code>2c9e5a9</code></a>
Automatically add Default implementation to EnumDiscriminant if it
exists on ...</li>
<li><a
href="https://github.com/Peternator7/strum/commit/e241243e48359b8b811b8eaccdcfa1ae87138e0d"><code>e241243</code></a>
Fix existing cargo fmt + clippy issues and add GH actions (<a
href="https://redirect.github.com/Peternator7/strum/issues/473">#473</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/639b67fefd20eaead1c5d2ea794e9afe70a00312"><code>639b67f</code></a>
feat: allow any kind of passthrough attributes on
<code>EnumDiscriminants</code> (<a
href="https://redirect.github.com/Peternator7/strum/issues/461">#461</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/0ea1e2d0fd1460e7492ea32e6b460394d9199ff8"><code>0ea1e2d</code></a>
docs: Fix typo (<a
href="https://redirect.github.com/Peternator7/strum/issues/463">#463</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/36c051b91086b37d531c63ccf5a49266832a846d"><code>36c051b</code></a>
Upgrade <code>phf</code> to v0.13 (<a
href="https://redirect.github.com/Peternator7/strum/issues/465">#465</a>)</li>
<li><a
href="https://github.com/Peternator7/strum/commit/9328b38617dc6f4a3bc5fdac03883d3fc766cf34"><code>9328b38</code></a>
Use absolute paths in proc macro (<a
href="https://redirect.github.com/Peternator7/strum/issues/469">#469</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/Peternator7/strum/compare/v0.27.2...v0.28.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=strum&package-manager=cargo&previous-version=0.27.2&new-version=0.28.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-24 11:23:48 +00:00
Dmitrii Blaginin 11ef486e6c Runs-on for extended CI checks (#20511)
part of https://github.com/apache/datafusion/issues/20052

## Which issue does this PR close?


example run:
https://github.com/apache/datafusion/actions/runs/22325922758

this recused the run time from 3h to 1h. still a lot (on my mac it runs
in 5m!) but that's a start

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-24 10:34:49 +00:00
Xander d59cdfe999 Fix name tracker (#19856)
## Which issue does this PR close?

- Closes #17508

## Rationale for this change

The previous implementation used UUID-based aliasing as a workaround to
prevent duplicate names for literals in Substrait plans. This approach
had several drawbacks:
- Non-deterministic plan names that made testing difficult (requiring
UUID regex filters)
- Only addressed literal naming conflicts, not the broader issue of name
deduplication
- Added unnecessary dependency on the `uuid` crate
- Didn't properly handle cases where the same qualified name could
appear with different schema representations

## What changes are included in this PR?

  1. Enhanced NameTracker: Refactored to detect two types of conflicts:
- Duplicate schema names: Tracked via schema_name() to prevent
validate_unique_names failures (e.g., two Utf8(NULL) literals)
- Ambiguous references: Tracked via qualified_name() to prevent
DFSchema::check_names failures when a qualified field (e.g.,
left.Utf8(NULL)) and unqualified field (e.g., Utf8(NULL)) share the same
column name
2. **Removed UUID dependency**: Eliminated the `uuid` crate from
`datafusion/substrait`
3. **Removed literal-specific aliasing**: The UUID-based workaround in
`project_rel.rs` is no longer needed as the improved NameTracker handles
all naming conflicts consistently
4. **Deterministic naming**: Name conflicts now use predictable
`__temp__N` suffixes instead of random UUIDs

Note: This doesn't fully fix all the issues in #17508 which allow some
special casing of `CAST` which are not included here.
## Are these changes tested?

Yes:
- Updated snapshot tests to reflect the new deterministic naming (e.g.,
`Utf8("people")__temp__0` instead of UUID-based names)
- Modified some roundtrip tests to verify semantic equivalence (schema
matching and execution) rather than exact string matching, which is more
robust
- All existing integration tests pass with the new naming scheme

## Are there any user-facing changes?

Minimal. The generated plan names are now deterministic and more
readable (using `__temp__N` suffixes instead of UUIDs), but this is
primarily an internal representation change. The functional behavior and
query results remain unchanged.
2026-02-24 08:15:59 +00:00
Neil Conway b6d46a6382 perf: Optimize initcap() (#20352)
## Which issue does this PR close?

- Closes #20351.

## Rationale for this change

When all values in a `Utf8`/`LargeUtf8` array are ASCII, we can skip
using `GenericStringBuilder` and instead process the entire input buffer
in a single pass using byte-level operations. This also avoids
recomputing the offsets and nulls arrays. A similar optimization is
already used for lower() and upper().

Along the way, optimize `initcap_string()` for ASCII-only inputs. It
already had an ASCII-only fastpath but there was room for further
optimization, by iterating over bytes rather than characters.

## What changes are included in this PR?

* Cleanup benchmarks: we ran the scalar benchmark for different array
sizes, despite the fact that it is invariant to the array size
* Add benchmark for different string lengths
* Add benchmark for Unicode array input
* Optimize for ASCII-only inputs as described above
* Add test case for ASCII-only input that is a sliced array
* Add test case variants for `LargeStringArray`

## Are these changes tested?

Yes, plus an additional test added.

## Are there any user-facing changes?

No.
2026-02-24 06:11:08 +00:00
Dmitrii Blaginin 7602913b0f Switch to the latest Mac OS (#20510) 2026-02-23 22:57:49 +00:00
Andrew Lamb b9328b9734 Upgrade to sqlparser 0.61.0 (#20177)
DRAFT until SQL parser is released

## Which issue does this PR close?

- part of https://github.com/apache/datafusion-sqlparser-rs/issues/2117


## Rationale for this change

Keep up to date with dependencies

I think @Samyak2 specifically would like access to the `:` field syntax

## What changes are included in this PR?
1. Update to 0.61.0
2. Update APIs 

## Are these changes tested?
Yes by existing tests

## Are there any user-facing changes?
New dependency

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
2026-02-23 18:49:08 +00:00