mirror of
https://github.com/langchain-ai/datafusion.git
synced 2026-06-30 21:27:59 -04:00
main
12768 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8df75c3f04 |
Document guidance on how to evaluate breaking API changes (#20584)
## Which issue does this PR close? ## Rationale for this change DataFusion does make API changes from time to time, and that is a normal part of software development. However, it is important to evaluate the impact of those API changes on downstream users and to ensure that the benefits of the change are clear to those users. I found a few times where API changes were made with the justification that "some APIs in DataFusion are cleaner" or "this is more consistent with other APIs". While those may be valid justifications, it is painful for downstream users who have change their code to accommodate the API change when they get nothing in return This most recently happened in this PR - https://github.com/apache/datafusion/pull/19790#pullrequestreview-3863480182 thus I think the contributor guide should include some guidance on how to evaluate breaking API changes and to ensure that the benefits of the change are clear to downstream users. ## What changes are included in this PR? Polish up the API guidance section ## Are these changes tested? By CI ## Are there any user-facing changes? Better / clearer docs |
||
|
|
3a23bb2531 |
perf: Optimize array_agg() using GroupsAccumulator (#20504)
## Which issue does this PR close? - Closes #20465. - Closes #17446. ## Rationale for this change This PR optimizes the performance of `array_agg()` by adding support for the `GroupsAccumulator` API. The design tries to minimize the amount of per-batch work done in `update_batch()`: we store a reference to the batch, and a `(group_idx, row_idx)` pair for each row. In `evaluate()`, we assemble all the requested output with a single `interleave` call. This turns out to be significantly faster, because we copy much less data and assembling the results can be vectorized more effectively. For example, on a benchmark with 5000 groups and 5000 int64 values per group, this approach is roughly 190x faster than the previous approach. Releasing memory after a partial emit is a little more involved than the previous approach, but with some determination it is still possible. ## What changes are included in this PR? * Implement the `GroupsAccumulator` API for `array_agg()` * Add benchmark for `array_agg` of a named struct over a dict, following the workload in #17446 * Add unit tests * Improve SLT test coverage * Remove a redundant SLT test ## Are these changes tested? Yes, and benchmarked. ## Are there any user-facing changes? No. ## AI usage Iterated with the help of multiple AI tools; I've reviewed and understand the resulting code. |
||
|
|
73fbd48070 |
Upgrade DataFusion to arrow-rs/parquet 58.0.0 / object_store 0.13.0 (#19728)
## Which issue does this PR close? - Follow on to https://github.com/apache/datafusion/pull/19355 - related to https://github.com/apache/arrow-rs/issues/8466 - Closes https://github.com/apache/datafusion/issues/17455 ## Rationale for this change Keep datafusion up to date (and test Arrow using DataFusion tests) ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Daniël Heres <danielheres@gmail.com> |
||
|
|
acec058cb5 |
perf: Use Arrow vectorized eq kernel for IN list with column references (#20528)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Relates to #20427 . ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> When the IN list contains column references (e.g. `SELECT * FROM t WHERE a IN (b, c, d, e)`), DataFusion falls back to a row-by-row `make_comparator` path which is significantly slower than it needs to be. Arrow provides SIMD-optimized `eq` kernels that can compare entire arrays in one call. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> - Use Arrow's vectorized `eq` kernel instead of row-by-row `make_comparator` for non-nested types (primitive, string, binary) in the column-reference IN list evaluation path - For nested types (Struct, List, etc.), fall back to `make_comparator` since Arrow's `eq` kernel does not support them - Add 6 unit tests covering the column-reference evaluation path (Int32, Utf8, NOT IN, NULL handling, NaN semantics) ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Yes. 6 new unit tests added: - `test_in_list_with_columns_int32_scalars` - `test_in_list_with_columns_int32_column_refs` - `test_in_list_with_columns_utf8_column_refs` - `test_in_list_with_columns_negated` - `test_in_list_with_columns_null_in_list` - `test_in_list_with_columns_float_nan` ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> No API changes. Queries with column-reference IN lists will run faster. <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> |
||
|
|
5d8249ff16 |
fix: Fix and Refactor Spark shuffle function (#20484)
## Which issue does this PR close? - Closes #20483. ## Rationale for this change Currently, Spark `shuffle` function returns following error message when `seed` is `null`. This needs to be fixed by exposing `NULL` instead of `'Int64'`. **Current:** ``` query error SELECT shuffle([2, 1], NULL); ---- DataFusion error: Execution error: shuffle seed must be Int64 type, got 'Int64' ``` **New:** ``` query error DataFusion error: Execution error: shuffle seed must be Int64 type but got 'NULL' SELECT shuffle([1, 2, 3], NULL); ``` In addition to this fix, this PR also introduces following refactoring to `shuffle` function: - Combining args validation checks with `single` error message, - Extending current error message with expected data types: ``` Current: shuffle does not support type '{array_type}'. New: shuffle does not support type '{array_type}'; expected types: List, LargeList, FixedSizeList or Null." ``` - Adding new UT coverages for both `shuffle.rs` and `shuffle.slt`. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes, being added new UT cases. ## Are there any user-facing changes? Yes, updating Spark `shuffle` functions error messages. |
||
|
|
e567cb91f4 |
Fix serde of window lead/lag defaults (#20608)
## Which issue does this PR close? - Closes #20607. ## Rationale for this change Don't lose values in serde ## What changes are included in this PR? Preservation of window function arguments, particularly default value ## Are these changes tested? A RTT is included ## Are there any user-facing changes? Users with distributed query engines such as Ballista will have more queries work than before **note**: AI was used to create this PR |
||
|
|
451c79fc00 |
fix: Fix array_to_string with columnar third arg (#20536)
## Which issue does this PR close? - Closes #20535 ## Rationale for this change The previous coding used the `null_string` value for the first row as the value for the remainder of the rows. This is wrong if `null_string` is columnar. ## What changes are included in this PR? ## Are these changes tested? Yes; added new SLT test. ## Are there any user-facing changes? No, other than fixing the behavior of `array_to_string` in this scenario. |
||
|
|
7ef62b988d |
chore: Enable workspace lint for all workspace members (#20577)
## Which issue does this PR close? N/A ## Rationale for this change Several workspace members did not opt into workspace-level lints. This was almost certainly an oversight. ## What changes are included in this PR? * Enable workspace-level lints for all workspace members * Fix up code to fix various linter errors This PR pulls out some of the changes made in #20250, although I didn't actually use that PR. ## Are these changes tested? Yes. ## Are there any user-facing changes? No. |
||
|
|
c63ca330d7 |
fix: increase ROUND decimal precision to prevent overflow truncation (#19926)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #19921. ## Rationale for this change `SELECT ROUND('999.9'::DECIMAL(4,1))` incorrectly returned `100.0` instead of `1000`. When rounding a decimal value causes a carry-over that increases the number of digits (e.g., 999.9 → 1000.0), the result overflows the original precision constraint. The overflow was silently truncated during display, producing incorrect results. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - Fixes ROUND on DECIMAL so carry-over doesn’t silently truncate/produce wrong results. - When decimal_places is a constant (including omitted/NULL), ROUND now reduces the output scale to min(input_scale, max(decimal_places, 0)) (Spark/DuckDB-style), reclaiming precision for the integer part. - When decimal_places is not a constant (e.g. a column/array), ROUND keeps the original scale and may increase precision by 1 (capped); if precision is already max and the rounded value can’t fit, it returns an overflow error instead of a wrong value. - Adds/updates sqllogictests for these behaviors and edge cases. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? - Added new sqllogictest case for the specific bug scenario - Updated 2 existing tests with new expected precision values - All existing tests pass <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? | Aspect | Before | After | | --- | --- | --- | | ROUND('999.9'::DECIMAL(4,1)) | 100.0 (wrong) | 1000 (correct) | | Return type (default/literal dp) | Decimal128(4, 1) | Decimal128(4, 0) | - Return type for DECIMAL inputs can change: with literal dp it generally reduces scale; with non-literal dp it keeps scale and may increase precision by 1. - New error case: when precision is already max and dp is non-literal, ROUND may now error on overflow rather than return a truncated/wrong decimal. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com> |
||
|
|
e583fe9d22 |
Add deterministic per-file timing summary to sqllogictest runner (#20569)
## Which issue does this PR close? * Part of #20524. ## Rationale for this change The sqllogictest runner executes files in parallel, but it was hard to pinpoint which test files dominate wall-clock time. This change adds **deterministic per-file elapsed timing observability** so we can identify long-tail files and prioritize follow-up optimization work, while keeping default output usable for both local development (TTY) and CI (non-TTY). ## What changes are included in this PR? * Collect per-file elapsed durations in the sqllogictest runner and aggregate them at end-of-run. * Print a **deterministic timing summary** (stable sort: elapsed desc, path asc; stable formatting) via `MultiProgress` to avoid interleaved progress-bar noise. * Add CLI flags and environment variables to control output: * `--timing-summary auto|off|top|full` (also `SLT_TIMING_SUMMARY`) * `--timing-top-n <N>` (also `SLT_TIMING_TOP_N`, must be `>= 1`) * Default behavior: * `auto` maps to `off` for local TTY runs and `top` for CI/non-TTY runs. * Add optional debug logging for slow files (over 30s) behind `SLT_TIMING_DEBUG_SLOW_FILES=1`. * Update `datafusion/sqllogictest/README.md` with usage examples. ## Are these changes tested? * Covered by existing `sqllogictests` integration test execution; no new unit tests were added. * Manual validation plan (ran locally / in CI as applicable): * `cargo test --test sqllogictests -- push_down_filter_ --test-threads 16` * `cargo test --test sqllogictests -- --test-threads 16` * `cargo test --test sqllogictests -- --timing-summary top --timing-top-n 10` * `cargo test --test sqllogictests -- --timing-summary full` * Verified output properties: * Summary ordering is deterministic across repeated runs (elapsed desc, path asc). * `auto` mode is quiet on TTY but prints a top-N summary on non-TTY/CI. * Pass/fail behavior and error reporting are unchanged. ## Are there any user-facing changes? Yes (test-runner UX only): * New optional timing summary output for `sqllogictests`. * New CLI flags / env vars documented in `datafusion/sqllogictest/README.md`: * `--timing-summary auto|off|top|full` / `SLT_TIMING_SUMMARY` * `--timing-top-n <N>` / `SLT_TIMING_TOP_N` * `SLT_TIMING_DEBUG_SLOW_FILES=1` (optional debug logging for slow files >30s) No public DataFusion APIs are changed. ## LLM-generated code disclosure This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested. |
||
|
|
bc600b3090 |
Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime (#20566)
## Which issue does this PR close? * Part of #20524 ## Rationale for this change `datafusion/sqllogictest/test_files/push_down_filter.slt` had grown into a large sqllogictest file. Since the sqllogictest runner parallelizes at **file granularity**, a single heavyweight file can become a straggler and dominate wall-clock time. This PR performs a non-invasive split of that file into smaller, self-contained `.slt` files so the runner can distribute work more evenly across threads, improving overall suite balance without changing SQL semantics or test coverage. ## What changes are included in this PR? * Removed the monolithic `push_down_filter.slt`. * Added new standalone sqllogictest files, each with the minimal setup/teardown required to run independently: * `push_down_filter_unnest.slt` — unnest filter pushdown coverage (including struct/field cases). * `push_down_filter_parquet.slt` — parquet filter pushdown + limit + cast predicate behavior + dynamic filter pushdown (swapped join inputs). * `push_down_filter_outer_joins.slt` — LEFT/RIGHT join and anti-join logical filter pushdown checks. * `push_down_filter_regression.slt` — regression coverage for issues #17188 and #17512, plus aggregate dynamic filter pushdown checks. * Updated scratch output paths to be file-scoped (e.g. `test_files/scratch/push_down_filter_parquet/...`) to reduce the chance of conflicts when tests execute in parallel. * Preserved all original query expectations and explain-plan assertions; changes are organizational only. ## Are these changes tested? Yes, with a python script to compare text blocks in the new slt files vs old single slt file. ``` python - <<'PY' import subprocess, re from collections import defaultdict, deque repo = '.' old_spec = '692a7cb67^:datafusion/sqllogictest/test_files/push_down_filter.slt' new_specs = [ 'HEAD:datafusion/sqllogictest/test_files/push_down_filter_outer_joins.slt', 'HEAD:datafusion/sqllogictest/test_files/push_down_filter_parquet.slt', 'HEAD:datafusion/sqllogictest/test_files/push_down_filter_regression.slt', 'HEAD:datafusion/sqllogictest/test_files/push_down_filter_unnest.slt', ] def git_show(spec): return subprocess.check_output(['git', '-C', repo, 'show', spec], text=True) def normalize_sql(sql): s = sql.strip().lower() s = re.sub( r"test_files/scratch/push_down_filter(?:_[^'\s;)/]+)?[^'\s;)]*", "test_files/scratch/__norm__", s, ) s = re.sub(r'\s+', ' ', s) return s def blocks(text): lines = text.splitlines() out = [] i = 0 while i < len(lines): m = lines[i].strip() if m.startswith('query '): i += 1 b = [] while i < len(lines) and lines[i].strip() != '----': if not lines[i].lstrip().startswith('#'): b.append(lines[i]) i += 1 sql = '\n'.join(b).strip() if sql: out.append(('query', normalize_sql(sql))) elif m.startswith('statement '): i += 1 b = [] while i < len(lines): s = lines[i].strip() if s == '': break if s.startswith('query ') or s.startswith('statement '): i -= 1 break if not lines[i].lstrip().startswith('#'): b.append(lines[i]) i += 1 sql = '\n'.join(b).strip() if sql: out.append(('statement', normalize_sql(sql))) i += 1 return out old_blocks = blocks(git_show(old_spec)) new_blocks = [] for s in new_specs: new_blocks.extend(blocks(git_show(s))) q = defaultdict(deque) for item in new_blocks: q[item].append(item) missing = 0 extra = 0 for item in old_blocks: if q[item]: q[item].popleft() else: missing += 1 for v in q.values(): extra += len(v) print(f'old_blocks={len(old_blocks)}') print(f'new_blocks={len(new_blocks)}') print(f'missing={missing}') print(f'extra={extra}') print(f'baseline={old_spec}') if missing != 0: raise SystemExit(1) PY ``` Output: ``` old_blocks=107 new_blocks=108 missing=0 extra=1 ``` The extra(1) is this statement block: set datafusion.explain.physical_plan_only = true; Why it shows as extra: In split files, it appears 3 times: push_down_filter_parquet.slt:21 push_down_filter_unnest.slt:21 push_down_filter_regression.slt:129 In the baseline monolithic file at e937cadbc^, it appears 2 times. So comparison reports 3 - 2 = extra 1. ## Are there any user-facing changes? No user-facing behavior changes. This is a test-suite organization/performance improvement only. ## Note before merging Revert e8369bb (it is a commit to trigger the CI extented tests for sqllogictest) ## LLM-generated code disclosure This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested. |
||
|
|
a79e6e6b39 |
fix(substrait): Correctly parse field references in subqueries (#20439)
## Which issue does this PR close? - Closes #20438. ## Rationale for this change The substrait consumer parsed field references in correlated subqueries incorrectly. Field references were always resolved relative to the schema of the current (innermost) subquery, leading to incorrect results. ## What changes are included in this PR? We now maintain a stack of outer query schemas, and pushes/pops elements from it as we traverse subqueries. When resolving field references, we now use `FieldReference.root_type` to detect outer query field references and resolve them against the appropriate schema. This commit updates the expected results for parsing TPC-H queries, because several of them were parsed incorrectly (the misparsing was probably not detected because the incorrect parse didn't result in any illegal queries, by sheer luck). This also means we can enable Q17, which failed to parse before. ## Are these changes tested? Yes. Test results updated to reflect new, correct behavior, and new unit tests added. ## Are there any user-facing changes? The behavior of the substrait consumer has changed, although the previous behavior was wrong and it seems a bit unlikely anyone would have dependend on it. The `DefaultSubstraitConsumer` API is slightly changed (new private field). |
||
|
|
a257c29c26 |
add redirect for old upgrading.html URL to fix broken changelog links (#20582)
## Which issue does this PR close? - Closes #20572 ## Rationale for this change When upgrade guides were split into separate pages (#20183), the old `upgrading.html` URL broke. All changelog files still reference this old URL, causing 404 errors for users. ## What changes are included in this PR? Added a redirect in `docs/source/conf.py` using the existing `sphinx_reredirects` extension to redirect `library-user-guide/upgrading.html` to `library-user-guide/upgrading/index.html`. This preserves all existing changelog links without needing to update historical files. ## Are these changes tested? Tested locally - the redirect works correctly, including with anchor links (e.g., `upgrading.html#datafusion-46-0-0`). ## Are there any user-facing changes? Yes - users clicking old changelog links will now be redirected to the correct page instead of getting a 404. |
||
|
|
3ab1301c53 |
fix: handle empty delimiter in split_part (closes #20503) (#20542)
## Which issue does this PR close? - Closes #20503 ## Rationale for this change `split_part` did not handle empty delimiters in a PostgreSQL-compatible way (`split("")` in Rust creates leading/trailing empty fields). This could return unexpected results for positions like `1` / `-1` and out-of-range values. This PR aligns behavior with Postgres semantics for empty delimiters. ## What changes are included in this PR? Small change in how we treat the 1, -1 ## Are these changes tested? Indeed! ## Are there any user-facing changes? Yes, behavior is now more consistent with PostgreSQL for `split_part(str, '', n)`. No API changes. |
||
|
|
e76f0eebe3 |
fix: IS NULL panic with invalid function without input arguments (#20306)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #20201 . ## Rationale for this change Unlike `Projection`, expressions within a `Filter` node do not always have their types resolved during the initial LogicalPlan generation. Validation is often deferred until the type_coercion phase. When invoking f_up on the expression tree to perform type coercion, the check is bypassed for function nodes with empty arguments, leading to a panic during subsequent execution. For example, all statements below cause a panic: ``` SELECT * FROM (SELECT 1) WHERE (STARTS_WITH() IS NULL); SELECT * FROM (SELECT 1) WHERE (STARTS_WITH() IS NOT NULL); SELECT * FROM (SELECT 'a') WHERE (STARTS_WITH() SIMILAR TO 'abc%'); SELECT * FROM (SELECT 1) WHERE CAST(STARTS_WITH() AS STRING) = 'x'; SELECT * FROM (SELECT 1) WHERE TRY_CAST(STARTS_WITH() AS INT) IS NULL; ``` This pr aims to stop panic. It would be better if reject these invalid cases at planning. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ```diff fn coerce_arguments_for_signature<F: UDFCoercionExt>( schema: &DFSchema, func: &F, ) -> Result<Vec<Expr>> { - if expressions.is_empty() { - return Ok(expressions); - } ``` Deleted the early return. Thanks to @neilconway . ## Are these changes tested? All original tests passed. Added some unit tests and sqllogictests. ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> No. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> |
||
|
|
d6fb3608b0 |
perf: Optimize array_position for scalar needle (#20532)
## Which issue does this PR close? - Closes #20530 ## Rationale for this change The previous implementation of `array_position` used `compare_element_to_list` for every input row. When the needle is a scalar (quite common), we can do much better by searching over the entire flat haystack values array with a single call to `arrow_ord::cmp::not_distinct`. We can then iterate over the resulting set bits to determine per-row results. This is ~5-10x faster than the previous implementation for typical inputs. ## What changes are included in this PR? * Implement new fast path for `array_position` with scalar needle * Improve docs for `array_position` * Don't use `internal_err` to report a user-visible error ## Are these changes tested? Yes, and benchmarked. Additional tests added in a separate PR (#20531) ## Are there any user-facing changes? No. |
||
|
|
a026e7da2f |
perf: Optimize heap handling in TopK operator (#20556)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #. ## Rationale for this change This change to make a significant performance impact in the `TopK` operator, which is a commonly used operator. ## What changes are included in this PR? Instead of doing two operations on the inner heap (pop than push), we use `Binary::peek_mut`, which allows us to replace the heap item in-place and then sift it to its proper location in the heap. Some SLT results seem to change, the only explanation I can find for it is that pop/push vs the sift_down that `PeekMut` uses have some subtle differences that resolve ties in a different way, ending up with a slightly different result. On my macbook, running the `topk_aggregate` benchmark, most benchmarks are not changed significantly, aside from the following: ``` distinct 10000000 rows desc [no TopK] time: [554.69 ms 903.25 ms 1.3318 s] change: [−82.888% −69.587% −47.591%] (p = 0.00 < 0.05) Performance has improved. Found 17 outliers among 100 measurements (17.00%) 5 (5.00%) high mild 12 (12.00%) high severe Benchmarking distinct 10000000 rows asc [no TopK]: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 113.7s, or reduce sample count to 10. distinct 10000000 rows asc [no TopK] time: [405.87 ms 702.47 ms 1.0583 s] change: [−86.490% −75.215% −51.486%] (p = 0.00 < 0.05) Performance has improved. Found 17 outliers among 100 measurements (17.00%) 3 (3.00%) high mild 14 (14.00%) high severe distinct 10000000 rows desc [TopK] time: [6.8372 ms 6.9933 ms 7.1523 ms] change: [−0.5254% +2.2409% +5.0920%] (p = 0.13 > 0.05) No change in performance detected. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild distinct 10000000 rows asc [TopK] time: [6.8731 ms 6.9952 ms 7.1226 ms] change: [+3.3252% +5.3824% +7.5131%] (p = 0.00 < 0.05) Performance has regressed. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild ``` ## Are these changes tested? Existing test suite. ## Are there any user-facing changes? No API changes, seems like some ordering might change in queries that use the `TopK` operator, but in a way that seems correct. |
||
|
|
bcd42b090a |
fix: Unaccounted spill sort in row_hash (#20314)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #20313 . ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> We must not use that much memory without reserving it. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Added a reservation before the sort, made a shrink call for the group values after the emit and updated the reservation so the reservation will be possible. Moved the sort to use sort_chunked so we can immediately drop the original batch and shrink the reservation to the used sizes, added a new spill method for iterators, so we can use an accurate memory accounting. If said reservation did not succeed, fallback to an incrementing sort method which holds the original batch the whole time, and outputs one batch at the time, this requires a much smaller reservation. Made the reservation much more robust(otherwise the fuzz tests were failing now that we actually reserve the memory in the sort) ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Current tests should still function, but memory should be reserved. Added test that specifically verifies that we error on this when we shouldn't do the sort. Modified the tests that used to test the splitting function in the spill to test the new iter spilling function ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> No |
||
|
|
33b86fe02e |
perf: Cache num_output_rows in sort merge join to avoid O(n) recount (#20478)
## Which issue does this PR close?
N/A - performance optimization
## Rationale for this change
In the SMJ tight loop (`join_partial`), `num_unfrozen_pairs()` was
called **twice per iteration**: once in the loop guard and once inside
`append_output_pair`. This method iterates all chunks in
`output_indices` and sums their lengths — O(num_chunks). Over a full
batch of `batch_size` iterations, this makes the inner loop O(batch_size
* num_chunks) instead of O(batch_size).
## What changes are included in this PR?
Add a `num_output_rows` field to `StreamedBatch` that is incremented on
each append and reset on freeze, replacing the O(n) summation with an
O(1) field read.
- Added `num_output_rows: usize` field to `StreamedBatch`, initialized
to `0`
- Increment `num_output_rows` in `append_output_pair()` after each
append
- `num_output_rows()` now returns the cached field directly
- Reset to `0` in `freeze_streamed()` when `output_indices` is cleared
- Removed the `num_unfrozen_pairs` parameter from `append_output_pair()`
since it can now read `self.num_output_rows` directly
## Are these changes tested?
Yes — all 48 existing `sort_merge_join` tests pass. This is a pure
refactor of an internal counter with no behavioral change.
## Performance
Very minor improvement.
### Before
```
sort_merge_join/inner_1to1/100000
time: [3.8146 ms 3.8229 ms 3.8314 ms]
sort_merge_join/inner_1to10/100000
time: [16.094 ms 16.125 ms 16.161 ms]
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
sort_merge_join/left_1to1_unmatched/100000
time: [3.7823 ms 3.7861 ms 3.7902 ms]
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
sort_merge_join/left_semi_1to10/100000
time: [3.0523 ms 3.0755 ms 3.1023 ms]
Found 14 outliers among 100 measurements (14.00%)
3 (3.00%) high mild
11 (11.00%) high severe
sort_merge_join/left_anti_partial/100000
time: [3.3458 ms 3.3498 ms 3.3542 ms]
Found 12 outliers among 100 measurements (12.00%)
8 (8.00%) high mild
4 (4.00%) high severe
```
### After
```
sort_merge_join/inner_1to1/100000
time: [3.7162 ms 3.7207 ms 3.7254 ms]
change: [−4.2320% −3.9309% −3.6431%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
sort_merge_join/inner_1to10/100000
time: [15.556 ms 15.589 ms 15.626 ms]
change: [−5.2786% −4.8329% −4.4351%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
sort_merge_join/left_1to1_unmatched/100000
time: [3.7059 ms 3.7101 ms 3.7146 ms]
change: [−4.4526% −4.1565% −3.8660%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
sort_merge_join/left_semi_1to10/100000
time: [3.0832 ms 3.0899 ms 3.0981 ms]
change: [−4.0965% −3.4158% −2.7657%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
sort_merge_join/left_anti_partial/100000
time: [3.2963 ms 3.3048 ms 3.3153 ms]
change: [−3.9413% −3.5316% −3.0884%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
```
## Are there any user-facing changes?
No.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
3a970c58ff |
Clamp early aggregation emit to the sort boundary when using partial group ordering (#20446)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #20445. ## What changes are included in this PR? Fix a panic on early emit with partial sort aggregations, by clamping our emit point to the sort boundary ## Are these changes tested? Yes ## Are there any user-facing changes? No |
||
|
|
e894a03bea |
perf: Use Hashbrown for array_distinct (#20538)
## Which issue does this PR close? N/A ## Rationale for this change #20364 recently optimized `array_distinct` to use batched row conversion. As part of that PR, `std::HashSet` was used. This PR just replaces `std::HashSet` with `hashbrown::HashSet`, which measurably improves performance. ## What changes are included in this PR? ## Are these changes tested? Yes. ## Are there any user-facing changes? No. |
||
|
|
e6849945bf |
fix: cardinality() of an empty array should be zero (#20533)
## Which issue does this PR close? - Closes #20526. ## Rationale for this change Per Postgres and the SQL spec, `cardinality()` of an empty array should be zero; we previously returned `NULL`. Along the way, fix another bug: we previously returned `0` for the cardinality of an untyped `NULL` and `NULL` for the cardinality of a typed null (e.g., `NULL::int[]`). We should return `NULL` in both cases. ## What changes are included in this PR? Bug fixes, update SLT. ## Are these changes tested? Yes. ## Are there any user-facing changes? Yes: the behavior of `cardinality` has changed, albeit the previous behavior was incorrect. |
||
|
|
d7d646164d |
feat: Implement Spark bin function (#20479)
## Which issue does this PR close? N/A ## Rationale for this change Add new function: https://spark.apache.org/docs/latest/api/sql/index.html#bin ## What changes are included in this PR? - Implementation - Unit Tests - SLT tests ## Are these changes tested? Yes, tests added as part of this PR. ## Are there any user-facing changes? No, these are new function. --------- Co-authored-by: Kazantsev Maksim <mn.kazantsev@gmail.com> |
||
|
|
e937cadbcc |
[fix] Add type coercion from NULL to Interval to make date_bin more postgres compatible (#20499)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes https://github.com/apache/datafusion/issues/20502 ## Rationale for this change The following query is failing with the following error: `SELECT date_bin(NULL, TIMESTAMP '2023-01-01 12:30:00', TIMESTAMP '2023-01-01 12:00:00') ` `Error: Error during planning: Failed to coerce arguments to satisfy a call to 'date_bin' function: coercion from Null, Timestamp(ns), Timestamp(ns) to the signature OneOf([....])` ## What changes are included in this PR? Fix `date_bin(NULL, ...)` to return `NULL` instead of a planning error by allowing Nulls to coerce to Interva. ## Are these changes tested? I added a sqllogictest case to verify the query executes and returns `NULL`. ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> Yes, previously `date_bin(NULL, ...) `returned a planning error. It now returns NULL. |
||
|
|
d75fcb83e3 |
Fix physical expr adapter to resolve physical fields by name, not column index (#20485)
## Which issue does this PR close? * [Comment](https://github.com/apache/datafusion/pull/20202#discussion_r2804840366) on #20202 ## Rationale for this change When adapting physical expressions across differing logical/physical schemas, relying on `Column::index()` can be incorrect if the physical schema column ordering differs from the logical plan (or if a `Column` is constructed with an index that doesn’t match the current physical schema). This can lead to looking up the wrong physical field, causing incorrect casts, type mismatches, or runtime failures. This change ensures the adapter always resolves the physical field using the column **name** against the physical file schema, making expression rewriting robust to schema reordering and avoiding subtle bugs where an index points at an unrelated column. ## What changes are included in this PR? * Updated `create_cast_column_expr` to resolve the physical field via `physical_file_schema.index_of(column.name())` instead of `column.index()`. * Added a regression test that deliberately supplies a mismatched `Column` index and asserts the rewriter still selects the correct physical field by name and produces the expected `CastColumnExpr`. ## Are these changes tested? Yes. * Added `test_create_cast_column_expr_uses_name_lookup_not_column_index` which covers the scenario where physical and logical schemas have different column orders and the provided `Column` index is incorrect. ## Are there any user-facing changes? No direct user-facing changes. This is an internal correctness fix that improves robustness of physical expression adaptation when schema ordering differs between logical and physical plans. <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> |
||
|
|
2347306943 |
[Minor] Fix error messages for shrink and try_shrink (#20422)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> In the following code, when we fetch `prev` again to construct the error message, the value we get may be different from the value that failed `checked_sub` in the first place which would get us out of the fetch_update CAS loop. Instead we should use the prev value that `fetch_update` returned in the error message. ```rust pub fn try_shrink(&self, capacity: usize) -> Result<usize> { let prev = self .size .fetch_update( atomic::Ordering::Relaxed, atomic::Ordering::Relaxed, |prev| prev.checked_sub(capacity), ) .map_err(|_| { let prev = self.size.load(atomic::Ordering::Relaxed); internal_datafusion_err!( "Cannot free the capacity {capacity} out of allocated size {prev}" ) })?; self.registration.pool.shrink(self, capacity); Ok(prev - capacity) } ``` ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Yes, with existing tests. ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> No |
||
|
|
387e20cc58 |
Improve HashJoinExecBuilder to save state from previous fields (#20276)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> Closes #20270 Prior the patch HashJoinExecBuilder constructed from an existing node reseted some fields of the node, e.g. dynamic filters, metrics. It significantly reduces usage scope of the builder. ## What changes are included in this PR? This patch improves the implementation. Now builder created from the existing node preserves all fields in case they have not been explicitly updated. Also builder now tracks flag if it must recompute plan properties. Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> |
||
|
|
585bbf35d3 |
perf: Optimize array_has_any() with scalar arg (#20385)
## Which issue does this PR close? - Closes #20384. - See #18181 for related context. ## Rationale for this change When `array_has_any` is passed a scalar for either of its arguments, we can use a much faster algorithm: rather than doing O(N*M) comparisons for each row of the columnar arg, we can build a hash table on the scalar argument and probe it instead. ## What changes are included in this PR? * Add benchmark to cover the one-scalar-arg case * Implement optimization as described above Note that we fallback to a linear scan when the scalar arg is smaller than a threshold (<= 8 elements), because benchmarks suggested probing a HashSet is not profitable for very small arrays. ## Are these changes tested? Yes. Tests pass and benchmarked. ## Are there any user-facing changes? No. --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com> Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com> |
||
|
|
34dad2ccee |
Cache PlanProperties, add fast-path for with_new_children (#19792)
- closes https://github.com/apache/datafusion/issues/19796 This patch aims to implement a fast-path for the ExecutionPlan::with_new_children function for some plans, moving closer to a physical plan re-use implementation and improving planning performance. If the passed children properties are the same as in self, we do not actually recompute self's properties (which could be costly if projection mapping is required). Instead, we just replace the children and re-use self's properties as-is. To be able to compare two different properties -- ExecutionPlan::properties(...) signature is modified and now returns `&Arc<PlanProperties>`. If `children` properties are the same in `with_new_children` -- we clone our properties arc and then a parent plan will consider our properties as unchanged, doing the same. - Return `&Arc<PlanProperties>` from `ExecutionPlan::properties(...)` instead of a reference. - Implement `with_new_children` fast-path if there is no children properties changes for all major plans. Note: currently, `reset_plan_states` does not allow to re-use plan in general: it is not supported for dynamic filters and recursive queries features, as in this case state reset should update pointers in the children plans. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> |
||
|
|
b8cebdde2a |
Fix incorrect regex pattern in regex_replace_posix_groups (#19827)
The `regex_replace_posix_groups` method was using the pattern `(\d*)` to
match
POSIX capture group references like `\1`. However, `*` matches zero or
more
digits, which caused a lone backslash `\` to incorrectly become `${}`.
Changed to `(\d+)` which requires at least one digit, fixing the issue.
Added unit tests to validate correct behavior.
- Fixes #19766
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
||
|
|
e80694e369 |
Remove recursive const check in simplify_const_expr (#20234)
## Which issue does this PR close? - Closes #20134 . ## Rationale for this change The check for simplifying const expressions was recursive and expensive, repeatedly checking the expression's children in a recursive way. I've tried other approached like pre-computing the result for all expressions outside of the loop and using that cache during the traversal, but I've found that it only yielded between 5-8% improvement while adding complexity, while this approach simplifies the code and seems to be more performant in my benchmarks (change is compared to current main branch): ``` tpc-ds/q76/cs/16 time: [27.112 µs 27.159 µs 27.214 µs] change: [−13.533% −13.167% −12.801%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 4 (4.00%) high mild 2 (2.00%) high severe tpc-ds/q76/ws/16 time: [26.175 µs 26.280 µs 26.394 µs] change: [−14.312% −13.833% −13.346%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild tpc-ds/q76/cs/128 time: [195.79 µs 196.17 µs 196.56 µs] change: [−14.362% −14.080% −13.816%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 1 (1.00%) low severe 1 (1.00%) low mild 3 (3.00%) high mild tpc-ds/q76/ws/128 time: [197.08 µs 197.61 µs 198.23 µs] change: [−13.531% −13.142% −12.737%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 2 (2.00%) high mild ``` ## What changes are included in this PR? 1. `simplify_const_expr` now only checks itself and whether all of its children are literals, because it assumes the order of simplification is bottoms-up. 2. Removes some code from the public API, see the last section for the full details. ## Are these changes tested? Existing test suite ## Are there any user-facing changes? I suggest removing some of the physical expression simplification code from the public API, which I believe reduces the maintenance burden here. These changes also helps removing code like the distinct `simplify_const_expr` and `simplify_const_expr_with_dummy`. 1. Makes all `datafusion-physical-expr::simplifier` sub-modules (`not` and `const_evaluator`) private, including their key functions. They are not used externally, and being able to change their behavior seems more valuable long term. The simplifier is also not currently an extension point as far as I can tell, so there's no value in providing atomic building blocks like them for now. 2. Removes `has_column_references` completely, its trivial to re-implement and isn't used anywhere in the codebase. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> |
||
|
|
fdd36d0d21 |
Update comments on OptimizerRule about function name matching (#20346)
## Which issue does this PR close? - Related to https://github.com/apache/datafusion/pull/20180 ## Rationale for this change I gave feedback to @devanshu0987 https://github.com/apache/datafusion/pull/20180/changes#r2800720037 that it was not a good idea to check for function names in optimizer rules, but then I realized that the rationale for this is not written down anywhere. ## What changes are included in this PR? Document why checking for function names in optimizer rules is not good and offer alternatives ## Are these changes tested? By CI ## Are there any user-facing changes? Just docs, no functional changes |
||
|
|
b16ad9badc |
fix: SortMergeJoin don't wait for all input before emitting (#20482)
## Which issue does this PR close? N/A ## Rationale for this change I noticed while playing around with local tests and debugging memory issue, that `SortMergeJoinStream` wait for all input before start emitting, which shouldn't be the case as we can emit early when we have enough data. also, this cause huge memory pressure ## What changes are included in this PR? Trying to fix the issue, not sure yet ## Are these changes tested? Yes ## Are there any user-facing changes? ----- ## TODO: - [x] update docs - [x] finish fix |
||
|
|
db5197b742 |
chore: Replace matches! on fieldless enums with == (#20525)
## Which issue does this PR close? N/A ## Rationale for this change When comparing a value with a field-less enum that implements `PartialEq`, `==` is simpler and more readable than `matches!`. ## What changes are included in this PR? ## Are these changes tested? Yes. ## Are there any user-facing changes? No. |
||
|
|
932418b20c |
chore(deps): bump strum_macros from 0.27.2 to 0.28.0 (#20521)
Bumps [strum_macros](https://github.com/Peternator7/strum) from 0.27.2 to 0.28.0. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/Peternator7/strum/blob/master/CHANGELOG.md">strum_macros's changelog</a>.</em></p> <blockquote> <h2>0.28.0</h2> <ul> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/461">#461</a>: Allow any kind of passthrough attributes on <code>EnumDiscriminants</code>.</p> <ul> <li>Previously only list-style attributes (e.g. <code>#[strum_discriminants(derive(...))]</code>) were supported. Now path-only (e.g. <code>#[strum_discriminants(non_exhaustive)]</code>) and name/value (e.g. <code>#[strum_discriminants(doc = "foo")]</code>) attributes are also supported.</li> </ul> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/462">#462</a>: Add missing <code>#[automatically_derived]</code> to generated impls not covered by <a href="https://redirect.github.com/Peternator7/strum/pull/444">#444</a>.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/466">#466</a>: Bump MSRV to 1.71, required to keep up with updated <code>syn</code> and <code>windows-sys</code> dependencies. This is a breaking change if you're on an old version of rust.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/469">#469</a>: Use absolute paths in generated proc macro code to avoid potential name conflicts.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/465">#465</a>: Upgrade <code>phf</code> dependency to v0.13.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/473">#473</a>: Fix <code>cargo fmt</code> / <code>clippy</code> issues and add GitHub Actions CI.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/477">#477</a>: <code>strum::ParseError</code> now implements <code>core::fmt::Display</code> instead <code>std::fmt::Display</code> to make it <code>#[no_std]</code> compatible. Note the <code>Error</code> trait wasn't available in core until <code>1.81</code> so <code>strum::ParseError</code> still only implements that in std.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/476">#476</a>: <strong>Breaking Change</strong> - <code>EnumString</code> now implements <code>From<&str></code> (infallible) instead of <code>TryFrom<&str></code> when the enum has a <code>#[strum(default)]</code> variant. This more accurately reflects that parsing cannot fail in that case. If you need the old <code>TryFrom</code> behavior, you can opt back in using <code>parse_error_ty</code> and <code>parse_error_fn</code>:</p> <pre lang="rust"><code>#[derive(EnumString)] #[strum(parse_error_ty = strum::ParseError, parse_error_fn = make_error)] pub enum Color { Red, #[strum(default)] Other(String), } <p>fn make_error(x: &str) -> strum::ParseError { strum::ParseError::VariantNotFound } </code></pre></p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/431">#431</a>: Fix bug where <code>EnumString</code> ignored the <code>parse_err_ty</code> attribute when the enum had a <code>#[strum(default)]</code> variant.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/474">#474</a>: EnumDiscriminants will now copy <code>default</code> over from the original enum to the Discriminant enum.</p> <pre lang="rust"><code>#[derive(Debug, Default, EnumDiscriminants)] #[strum_discriminants(derive(Default))] // <- Remove this in 0.28. enum MyEnum { #[default] // <- Will be the #[default] on the MyEnumDiscriminant #[strum_discriminants(default)] // <- Remove this in 0.28 Variant0, Variant1 { a: NonDefault }, } </code></pre> </li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/Peternator7/strum/commit/7376771128834d28bb9beba5c39846cba62e71ec"><code>7376771</code></a> Peternator7/0.28 (<a href="https://redirect.github.com/Peternator7/strum/issues/475">#475</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/26e63cd964a2e364331a5dd977d589bb9f649d8c"><code>26e63cd</code></a> Display exists in core (<a href="https://redirect.github.com/Peternator7/strum/issues/477">#477</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/9334c728eedaa8a992d1388a8f4564bbccad1934"><code>9334c72</code></a> Make TryFrom and FromStr infallible if there's a default (<a href="https://redirect.github.com/Peternator7/strum/issues/476">#476</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/0ccbbf823c16e827afc263182cd55e99e3b2a52e"><code>0ccbbf8</code></a> Honor parse_err_ty attribute when the enum has a default variant (<a href="https://redirect.github.com/Peternator7/strum/issues/431">#431</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/2c9e5a9259189ce8397f2f4967060240c6bafd74"><code>2c9e5a9</code></a> Automatically add Default implementation to EnumDiscriminant if it exists on ...</li> <li><a href="https://github.com/Peternator7/strum/commit/e241243e48359b8b811b8eaccdcfa1ae87138e0d"><code>e241243</code></a> Fix existing cargo fmt + clippy issues and add GH actions (<a href="https://redirect.github.com/Peternator7/strum/issues/473">#473</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/639b67fefd20eaead1c5d2ea794e9afe70a00312"><code>639b67f</code></a> feat: allow any kind of passthrough attributes on <code>EnumDiscriminants</code> (<a href="https://redirect.github.com/Peternator7/strum/issues/461">#461</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/0ea1e2d0fd1460e7492ea32e6b460394d9199ff8"><code>0ea1e2d</code></a> docs: Fix typo (<a href="https://redirect.github.com/Peternator7/strum/issues/463">#463</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/36c051b91086b37d531c63ccf5a49266832a846d"><code>36c051b</code></a> Upgrade <code>phf</code> to v0.13 (<a href="https://redirect.github.com/Peternator7/strum/issues/465">#465</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/9328b38617dc6f4a3bc5fdac03883d3fc766cf34"><code>9328b38</code></a> Use absolute paths in proc macro (<a href="https://redirect.github.com/Peternator7/strum/issues/469">#469</a>)</li> <li>Additional commits viewable in <a href="https://github.com/Peternator7/strum/compare/v0.27.2...v0.28.0">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
e71e7a39bf |
chore: Cleanup code to use repeat_n in a few places (#20527)
## Which issue does this PR close? N/A ## Rationale for this change Using `repeat_n` is more readable and slightly faster than `(0..n).map(|_| ...)`. ## What changes are included in this PR? ## Are these changes tested? Yes. ## Are there any user-facing changes? No. |
||
|
|
670dbf481c |
fix: prevent duplicate alias collision with user-provided __datafusion_extracted names (#20432)
## Summary - Fixes a bug where the optimizer's `AliasGenerator` could produce alias names that collide with`__datafusion_extracted_N` aliases, causing a "Schema contains duplicate unqualified field name" error - I don't expect users themselves to create these aliases, but if you run the optimizers twice (with different `AliasGenerator` instances) you'll hit this. - Adds `AliasGenerator::update_min_id()` to advance the counter past existing aliases - Scans each plan node's expressions during `ExtractLeafExpressions` traversal to seed the generator before any extraction occurs - Switches to controlling the traversal which also means the config-based short circuit more clearly skips the entire rule. Closes https://github.com/apache/datafusion/issues/20430 ## Test plan - [x] Unit test: `test_user_provided_extracted_alias_no_collision` in `extract_leaf_expressions` - [x] SLT regression test in `projection_pushdown.slt` with explicit `__datafusion_extracted_2` alias 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
17d770d6e5 |
fix: handle out of range errors in DATE_BIN instead of panicking (#20221)
## Which issue does this PR close? Closes #20219 ## Rationale for this change The DATE_BIN function was panicking when datetime operations went out of range instead of returning proper errors. The two specific cases were: 1. Month subtraction going out of range causing `DateTime - Months` panic 2. `timestamp_nanos_opt()` returning None and then unwrapping ## What changes are included in this PR? - Changed `date_bin_months_interval` and `to_utc_date_time` to return `Result` instead of panicking - Replaced `origin_date - Months` and `origin_date + Months` with `checked_sub_months` and `checked_add_months` - Replaced `.unwrap()` calls with proper `match` statements and error handling - Updated all callers throughout the file to handle `Result` types ## Are these changes tested? Tested manually with the exact queries from the issue that were panicking: ```sql select DATE_BIN('1637426858', TO_TIMESTAMP_MILLIS(1040292460), TIMESTAMP '1984-01-07 00:00:00'); select DATE_BIN('1637426858', TO_TIMESTAMP_MILLIS(-1040292460), TIMESTAMP '1984-01-07 00:00:00'); ``` Both queries now return NULL instead of panicking. All existing unit tests pass. ## Are there any user-facing changes? Yes - queries with DATE_BIN that would previously panic now return NULL when datetime operations go out of range. |
||
|
|
9c85ac608f |
perf: Fix quadratic behavior of to_array_of_size (#20459)
## Which issue does this PR close? - Closes #20458. - Closes #18159. ## Rationale for this change When `array_to_size(n)` was called on a `List`-like object containing a `StringViewArray` with `b` data buffers, the previous implementation returned a list containing a `StringViewArray` with `n*b` buffers, which results in catastrophically bad performance if `b` grows even somewhat large. This issue was previously noticed causing poor nested loop join performance. #18161 adjusted the NLJ code to avoid calling `to_array_of_size` for this reason, but didn't attempt to fix the underlying issue in `to_array_of_size`. This PR doesn't attempt to revert the change to the NLJ code: the special-case code added in #18161 is still slightly faster than `to_array_of_size` after this optimization. It might be possible to address that in a future PR. ## What changes are included in this PR? * Instead of using `repeat_n` + `concat` to merge together `n` copies of the `StringViewArray`, we instead use `take`, which preserves the same number of buffers as the input `StringViewArray`. * Add a new benchmark for this situation * Add more unit tests for `to_array_of_size` ## Are these changes tested? Yes and benchmarked. ## Are there any user-facing changes? No. ## AI usage Iterated on the problem with Claude Code; I understand the problem and the solution. |
||
|
|
a9c090141d |
Add support for FFI config extensions (#19469)
## Which issue does this PR close? This addresses part of https://github.com/apache/datafusion/issues/17035 This is also a blocker for https://github.com/apache/datafusion/issues/20450 ## Rationale for this change Currently we cannot support user defined configuration extensions via FFI. This is because much of the infrastructure on how to add and extract custom extensions relies on knowing concrete types of the extensions. This is not supported in FFI. This PR adds an implementation of configuration extensions that can be used across a FFI boundary. ## What changes are included in this PR? - Implement `FFI_ExtensionOptions`. - Update `ConfigOptions` to check if a `datafusion_ffi` namespace exists when setting values - Add unit test ## Are these changes tested? Unit test added. Also tested against `datafusion-python` locally. With this code I have the following test that passes. I have created a simple python exposed `MyConfig`: ```python from datafusion import SessionConfig from datafusion_ffi_example import MyConfig def test_catalog_provider(): config = MyConfig() config = SessionConfig().with_extension(config) config.set("my_config.baz_count", "42") ``` ## Are there any user-facing changes? New addition only. |
||
|
|
4a41587bdf |
Make custom_file_casts example schema nullable to allow null id values during casting (#20486)
## Which issue does this PR close? * [Comment](https://github.com/apache/datafusion/pull/20202#discussion_r2804841561) on #20202 --- ## Rationale for this change The `custom_file_casts` example defines a *logical/table* schema that uses `id: Int32` as the target type. In practice, casting and projection paths in DataFusion can produce **nulls** (e.g. failed casts, missing values, or intermediate expressions), and examples should avoid implying that nulls are impossible when demonstrating casting behavior. Marking the `id` field as **nullable** makes the example more realistic and prevents confusion when users follow or adapt the example to scenarios where nulls may appear. --- ## What changes are included in this PR? * Update the logical/table schema in `custom_file_casts.rs` to define `id` as **nullable** (`Field::new("id", DataType::Int32, true)`). * Adjust the inline comment to reflect the nullable schema. --- ## Are these changes tested? No new tests were added. This is a documentation/example-only change that updates a schema definition and comment. The example continues to compile and can be exercised by running the `custom_file_casts` example as before. --- ## Are there any user-facing changes? Yes (example behavior/expectations): * The `custom_file_casts` example now documents `id` as nullable, aligning the example schema with situations where cast/projection may yield null values. * No public APIs are changed and no breaking behavior is introduced. |
||
|
|
0dfa542201 |
fix: HashJoin panic with dictionary-encoded columns in multi-key joins (#20441)
## Which issue does this PR close? - Closes #20437 ## Rationale for this change `flatten_dictionary_array` returned only the unique values rather then the full expanded array when being called on a `DictionaryArray`. When building a `StructArray` this caused a length mismatch panic. ## What changes are included in this PR? Replaced `array.values()` with `arrow::compute::cast(array, value_type)` in `flatten_dictionary_array`, which properly expands the dictionary into a full length array matching the row count. ## Are these changes tested? Yes, both a new unit test aswell as a regression test were added. ## Are there any user-facing changes? Nope --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> |
||
|
|
6c793694e9 |
chore(deps): bump the all-other-cargo-deps group with 2 updates (#20519)
Bumps the all-other-cargo-deps group with 2 updates: [chrono](https://github.com/chronotope/chrono) and [wasm-bindgen-test](https://github.com/wasm-bindgen/wasm-bindgen). Updates `chrono` from 0.4.43 to 0.4.44 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/chronotope/chrono/releases">chrono's releases</a>.</em></p> <blockquote> <h2>0.4.44</h2> <h2>What's Changed</h2> <ul> <li>docs: match MSRV with <code>Cargo.toml</code> contents by <a href="https://github.com/coryan"><code>@coryan</code></a> in <a href="https://redirect.github.com/chronotope/chrono/pull/1772">chronotope/chrono#1772</a></li> <li>Add track_caller to non-deprecated functions by <a href="https://github.com/svix-jplatte"><code>@svix-jplatte</code></a> in <a href="https://redirect.github.com/chronotope/chrono/pull/1774">chronotope/chrono#1774</a></li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/chronotope/chrono/commit/c14b4599d07ef36ffa1f8a531fb0bc7eb3b42464"><code>c14b459</code></a> Bump version to 0.4.44</li> <li><a href="https://github.com/chronotope/chrono/commit/ea832c5090369eefa2cb6a47d643e2f7ade7ffa7"><code>ea832c5</code></a> Add track_caller to non-deprecated functions</li> <li><a href="https://github.com/chronotope/chrono/commit/cfae889a3a23507acf49b605794abba17effd2d7"><code>cfae889</code></a> Fix panic message in to_rfc2822</li> <li><a href="https://github.com/chronotope/chrono/commit/f8900b5a44228a7f6282c65e8c407d3ecb6dcb7b"><code>f8900b5</code></a> docs: match MSRV with <code>Cargo.toml</code> contents</li> <li>See full diff in <a href="https://github.com/chronotope/chrono/compare/v0.4.43...v0.4.44">compare view</a></li> </ul> </details> <br /> Updates `wasm-bindgen-test` from 0.3.61 to 0.3.62 <details> <summary>Commits</summary> <ul> <li>See full diff in <a href="https://github.com/wasm-bindgen/wasm-bindgen/commits">compare view</a></li> </ul> </details> <br /> Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore <dependency name> major version` will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself) - `@dependabot ignore <dependency name> minor version` will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself) - `@dependabot ignore <dependency name>` will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself) - `@dependabot unignore <dependency name>` will remove all of the ignore conditions of the specified dependency - `@dependabot unignore <dependency name> <ignore condition>` will remove the ignore condition of the specified dependency and ignore conditions </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
4c0a6531ca |
chore(deps): bump taiki-e/install-action from 2.68.6 to 2.68.8 (#20518)
Bumps [taiki-e/install-action](https://github.com/taiki-e/install-action) from 2.68.6 to 2.68.8. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/taiki-e/install-action/releases">taiki-e/install-action's releases</a>.</em></p> <blockquote> <h2>2.68.8</h2> <ul> <li> <p>Update <code>cargo-nextest@latest</code> to 0.9.129.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.2.19.</p> </li> <li> <p>Update <code>tombi@latest</code> to 0.7.32.</p> </li> </ul> <h2>2.68.7</h2> <ul> <li> <p>Update <code>mise@latest</code> to 2026.2.18.</p> </li> <li> <p>Update <code>wasm-bindgen@latest</code> to 0.2.111.</p> </li> </ul> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/taiki-e/install-action/blob/main/CHANGELOG.md">taiki-e/install-action's changelog</a>.</em></p> <blockquote> <h1>Changelog</h1> <p>All notable changes to this project will be documented in this file.</p> <p>This project adheres to <a href="https://semver.org">Semantic Versioning</a>.</p> <!-- raw HTML omitted --> <h2>[Unreleased]</h2> <ul> <li> <p>Update <code>wasm-bindgen@latest</code> to 0.2.112.</p> </li> <li> <p>Update <code>uv@latest</code> to 0.10.5.</p> </li> </ul> <h2>[2.68.8] - 2026-02-23</h2> <ul> <li> <p>Update <code>cargo-nextest@latest</code> to 0.9.129.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.2.19.</p> </li> <li> <p>Update <code>tombi@latest</code> to 0.7.32.</p> </li> </ul> <h2>[2.68.7] - 2026-02-22</h2> <ul> <li> <p>Update <code>mise@latest</code> to 2026.2.18.</p> </li> <li> <p>Update <code>wasm-bindgen@latest</code> to 0.2.111.</p> </li> </ul> <h2>[2.68.6] - 2026-02-21</h2> <ul> <li>Update <code>wasm-bindgen@latest</code> to 0.2.110.</li> </ul> <h2>[2.68.5] - 2026-02-20</h2> <ul> <li>Update <code>wasm-bindgen@latest</code> to 0.2.109.</li> </ul> <h2>[2.68.4] - 2026-02-20</h2> <ul> <li>Update <code>cargo-nextest@latest</code> to 0.9.128.</li> </ul> <h2>[2.68.3] - 2026-02-19</h2> <ul> <li> <p>Update <code>mise@latest</code> to 2026.2.17.</p> </li> <li> <p>Update <code>cargo-tarpaulin@latest</code> to 0.35.2.</p> </li> <li> <p>Update <code>syft@latest</code> to 1.42.1.</p> </li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/taiki-e/install-action/commit/cfdb446e391c69574ebc316dfb7d7849ec12b940"><code>cfdb446</code></a> Release 2.68.8</li> <li><a href="https://github.com/taiki-e/install-action/commit/350f13bd74589d52195d1aed8e04b35b616a9c49"><code>350f13b</code></a> Update <code>cargo-nextest@latest</code> to 0.9.129</li> <li><a href="https://github.com/taiki-e/install-action/commit/8ba6eccac43cdb9aa5b83627b897b416b206be2a"><code>8ba6ecc</code></a> Update <code>mise@latest</code> to 2026.2.19</li> <li><a href="https://github.com/taiki-e/install-action/commit/cf805946ef1da29d90b652c870b7a19aae44b0f5"><code>cf80594</code></a> Update <code>tombi@latest</code> to 0.7.32</li> <li><a href="https://github.com/taiki-e/install-action/commit/f92912fad184299a31e22ad070a5059fd07d4f59"><code>f92912f</code></a> Release 2.68.7</li> <li><a href="https://github.com/taiki-e/install-action/commit/4970026aba514ced4229209c822802e1bff68b3e"><code>4970026</code></a> Update <code>mise@latest</code> to 2026.2.18</li> <li><a href="https://github.com/taiki-e/install-action/commit/6043f02f023f20fde8f9436e0d500ee1391fab70"><code>6043f02</code></a> Update <code>wasm-bindgen@latest</code> to 0.2.111</li> <li>See full diff in <a href="https://github.com/taiki-e/install-action/compare/470679bc3a1580072dac4e67535d1aa3a3dcdf51...cfdb446e391c69574ebc316dfb7d7849ec12b940">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
3aa34b33f5 |
chore(deps): bump strum from 0.27.2 to 0.28.0 (#20520)
Bumps [strum](https://github.com/Peternator7/strum) from 0.27.2 to 0.28.0. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/Peternator7/strum/blob/master/CHANGELOG.md">strum's changelog</a>.</em></p> <blockquote> <h2>0.28.0</h2> <ul> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/461">#461</a>: Allow any kind of passthrough attributes on <code>EnumDiscriminants</code>.</p> <ul> <li>Previously only list-style attributes (e.g. <code>#[strum_discriminants(derive(...))]</code>) were supported. Now path-only (e.g. <code>#[strum_discriminants(non_exhaustive)]</code>) and name/value (e.g. <code>#[strum_discriminants(doc = "foo")]</code>) attributes are also supported.</li> </ul> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/462">#462</a>: Add missing <code>#[automatically_derived]</code> to generated impls not covered by <a href="https://redirect.github.com/Peternator7/strum/pull/444">#444</a>.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/466">#466</a>: Bump MSRV to 1.71, required to keep up with updated <code>syn</code> and <code>windows-sys</code> dependencies. This is a breaking change if you're on an old version of rust.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/469">#469</a>: Use absolute paths in generated proc macro code to avoid potential name conflicts.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/465">#465</a>: Upgrade <code>phf</code> dependency to v0.13.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/473">#473</a>: Fix <code>cargo fmt</code> / <code>clippy</code> issues and add GitHub Actions CI.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/477">#477</a>: <code>strum::ParseError</code> now implements <code>core::fmt::Display</code> instead <code>std::fmt::Display</code> to make it <code>#[no_std]</code> compatible. Note the <code>Error</code> trait wasn't available in core until <code>1.81</code> so <code>strum::ParseError</code> still only implements that in std.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/476">#476</a>: <strong>Breaking Change</strong> - <code>EnumString</code> now implements <code>From<&str></code> (infallible) instead of <code>TryFrom<&str></code> when the enum has a <code>#[strum(default)]</code> variant. This more accurately reflects that parsing cannot fail in that case. If you need the old <code>TryFrom</code> behavior, you can opt back in using <code>parse_error_ty</code> and <code>parse_error_fn</code>:</p> <pre lang="rust"><code>#[derive(EnumString)] #[strum(parse_error_ty = strum::ParseError, parse_error_fn = make_error)] pub enum Color { Red, #[strum(default)] Other(String), } <p>fn make_error(x: &str) -> strum::ParseError { strum::ParseError::VariantNotFound } </code></pre></p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/431">#431</a>: Fix bug where <code>EnumString</code> ignored the <code>parse_err_ty</code> attribute when the enum had a <code>#[strum(default)]</code> variant.</p> </li> <li> <p><a href="https://redirect.github.com/Peternator7/strum/pull/474">#474</a>: EnumDiscriminants will now copy <code>default</code> over from the original enum to the Discriminant enum.</p> <pre lang="rust"><code>#[derive(Debug, Default, EnumDiscriminants)] #[strum_discriminants(derive(Default))] // <- Remove this in 0.28. enum MyEnum { #[default] // <- Will be the #[default] on the MyEnumDiscriminant #[strum_discriminants(default)] // <- Remove this in 0.28 Variant0, Variant1 { a: NonDefault }, } </code></pre> </li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/Peternator7/strum/commit/7376771128834d28bb9beba5c39846cba62e71ec"><code>7376771</code></a> Peternator7/0.28 (<a href="https://redirect.github.com/Peternator7/strum/issues/475">#475</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/26e63cd964a2e364331a5dd977d589bb9f649d8c"><code>26e63cd</code></a> Display exists in core (<a href="https://redirect.github.com/Peternator7/strum/issues/477">#477</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/9334c728eedaa8a992d1388a8f4564bbccad1934"><code>9334c72</code></a> Make TryFrom and FromStr infallible if there's a default (<a href="https://redirect.github.com/Peternator7/strum/issues/476">#476</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/0ccbbf823c16e827afc263182cd55e99e3b2a52e"><code>0ccbbf8</code></a> Honor parse_err_ty attribute when the enum has a default variant (<a href="https://redirect.github.com/Peternator7/strum/issues/431">#431</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/2c9e5a9259189ce8397f2f4967060240c6bafd74"><code>2c9e5a9</code></a> Automatically add Default implementation to EnumDiscriminant if it exists on ...</li> <li><a href="https://github.com/Peternator7/strum/commit/e241243e48359b8b811b8eaccdcfa1ae87138e0d"><code>e241243</code></a> Fix existing cargo fmt + clippy issues and add GH actions (<a href="https://redirect.github.com/Peternator7/strum/issues/473">#473</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/639b67fefd20eaead1c5d2ea794e9afe70a00312"><code>639b67f</code></a> feat: allow any kind of passthrough attributes on <code>EnumDiscriminants</code> (<a href="https://redirect.github.com/Peternator7/strum/issues/461">#461</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/0ea1e2d0fd1460e7492ea32e6b460394d9199ff8"><code>0ea1e2d</code></a> docs: Fix typo (<a href="https://redirect.github.com/Peternator7/strum/issues/463">#463</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/36c051b91086b37d531c63ccf5a49266832a846d"><code>36c051b</code></a> Upgrade <code>phf</code> to v0.13 (<a href="https://redirect.github.com/Peternator7/strum/issues/465">#465</a>)</li> <li><a href="https://github.com/Peternator7/strum/commit/9328b38617dc6f4a3bc5fdac03883d3fc766cf34"><code>9328b38</code></a> Use absolute paths in proc macro (<a href="https://redirect.github.com/Peternator7/strum/issues/469">#469</a>)</li> <li>Additional commits viewable in <a href="https://github.com/Peternator7/strum/compare/v0.27.2...v0.28.0">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
11ef486e6c |
Runs-on for extended CI checks (#20511)
part of https://github.com/apache/datafusion/issues/20052 ## Which issue does this PR close? example run: https://github.com/apache/datafusion/actions/runs/22325922758 this recused the run time from 3h to 1h. still a lot (on my mac it runs in 5m!) but that's a start --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> |
||
|
|
d59cdfe999 |
Fix name tracker (#19856)
## Which issue does this PR close? - Closes #17508 ## Rationale for this change The previous implementation used UUID-based aliasing as a workaround to prevent duplicate names for literals in Substrait plans. This approach had several drawbacks: - Non-deterministic plan names that made testing difficult (requiring UUID regex filters) - Only addressed literal naming conflicts, not the broader issue of name deduplication - Added unnecessary dependency on the `uuid` crate - Didn't properly handle cases where the same qualified name could appear with different schema representations ## What changes are included in this PR? 1. Enhanced NameTracker: Refactored to detect two types of conflicts: - Duplicate schema names: Tracked via schema_name() to prevent validate_unique_names failures (e.g., two Utf8(NULL) literals) - Ambiguous references: Tracked via qualified_name() to prevent DFSchema::check_names failures when a qualified field (e.g., left.Utf8(NULL)) and unqualified field (e.g., Utf8(NULL)) share the same column name 2. **Removed UUID dependency**: Eliminated the `uuid` crate from `datafusion/substrait` 3. **Removed literal-specific aliasing**: The UUID-based workaround in `project_rel.rs` is no longer needed as the improved NameTracker handles all naming conflicts consistently 4. **Deterministic naming**: Name conflicts now use predictable `__temp__N` suffixes instead of random UUIDs Note: This doesn't fully fix all the issues in #17508 which allow some special casing of `CAST` which are not included here. ## Are these changes tested? Yes: - Updated snapshot tests to reflect the new deterministic naming (e.g., `Utf8("people")__temp__0` instead of UUID-based names) - Modified some roundtrip tests to verify semantic equivalence (schema matching and execution) rather than exact string matching, which is more robust - All existing integration tests pass with the new naming scheme ## Are there any user-facing changes? Minimal. The generated plan names are now deterministic and more readable (using `__temp__N` suffixes instead of UUIDs), but this is primarily an internal representation change. The functional behavior and query results remain unchanged. |
||
|
|
b6d46a6382 |
perf: Optimize initcap() (#20352)
## Which issue does this PR close? - Closes #20351. ## Rationale for this change When all values in a `Utf8`/`LargeUtf8` array are ASCII, we can skip using `GenericStringBuilder` and instead process the entire input buffer in a single pass using byte-level operations. This also avoids recomputing the offsets and nulls arrays. A similar optimization is already used for lower() and upper(). Along the way, optimize `initcap_string()` for ASCII-only inputs. It already had an ASCII-only fastpath but there was room for further optimization, by iterating over bytes rather than characters. ## What changes are included in this PR? * Cleanup benchmarks: we ran the scalar benchmark for different array sizes, despite the fact that it is invariant to the array size * Add benchmark for different string lengths * Add benchmark for Unicode array input * Optimize for ASCII-only inputs as described above * Add test case for ASCII-only input that is a sliced array * Add test case variants for `LargeStringArray` ## Are these changes tested? Yes, plus an additional test added. ## Are there any user-facing changes? No. |
||
|
|
7602913b0f | Switch to the latest Mac OS (#20510) | ||
|
|
b9328b9734 |
Upgrade to sqlparser 0.61.0 (#20177)
DRAFT until SQL parser is released ## Which issue does this PR close? - part of https://github.com/apache/datafusion-sqlparser-rs/issues/2117 ## Rationale for this change Keep up to date with dependencies I think @Samyak2 specifically would like access to the `:` field syntax ## What changes are included in this PR? 1. Update to 0.61.0 2. Update APIs ## Are these changes tested? Yes by existing tests ## Are there any user-facing changes? New dependency --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com> |