There never has been a valid negative version in the Delta protocol. I'm
not sure why this was even here as i64.
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
# Description
Tighten the Python compatibility handling in `DeltaTable.create()` and
`DeltaTable.vacuum()` ensuring duplicate values are rejected when legacy
positional arguments are mixed with keywords.
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
# Description
Added `deprecate_positional_commit_args` to `_util.py` as a helper that
preserves legacy positional behavior per method, emits a
DeprecationWarning, rejects invalid usage, and normalizes to canonical
commit_properties-first order.
Wired into all public mutating APIs across `table.py`,
`writer/convert_to.py`, and `transaction.py`. `write_deltalake` params
reordered to canonical order (already keyword-only, no shim needed).
`restore` left untouched (already keyword-only).
Unit tests added in `tests/test_util.py`.
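The compatibility behavior described above can be sketched roughly as follows. This is not the actual `deprecate_positional_commit_args` code: the function name `normalize_commit_args`, its signature, and the legacy ordering used in the usage note are assumptions made for illustration.

```python
import warnings


def normalize_commit_args(method, legacy_order, args, kwargs):
    """Map legacy positional arguments onto keyword arguments.

    Hypothetical helper: `legacy_order` lists the keyword names that the
    positional slots of `method` used to stand for.
    """
    if not args:
        return dict(kwargs)
    if len(args) > len(legacy_order):
        raise TypeError(
            f"{method}() takes at most {len(legacy_order)} positional arguments"
        )
    used = ", ".join(legacy_order[: len(args)])
    warnings.warn(
        f"Passing {used} positionally to {method}() is deprecated; use keywords.",
        DeprecationWarning,
        stacklevel=3,
    )
    merged = dict(kwargs)
    for name, value in zip(legacy_order, args):
        # Reject a value supplied both positionally and by keyword.
        if name in merged:
            raise TypeError(f"{method}() got multiple values for argument '{name}'")
        merged[name] = value
    return merged
```

For example, with an assumed legacy order of `["post_commithook_properties", "commit_properties"]`, calling `normalize_commit_args("delete", order, ("hooks",), {"commit_properties": "cp"})` warns and returns both values keyed by name, while passing `"hooks"` positionally together with `post_commithook_properties=...` raises `TypeError`.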
# Related Issue(s)
- closes #4252
Notes:
- `_internal.pyi` stubs are already in canonical order
- `create()` and `vacuum()` handle legacy positional args inline (they
had extra trailing params beyond the commit args, so the shared helper
is bypassed for those two)
Follow-up PR will make all APIs keyword-only and remove the
compatibility path
AI disclosure: I used Claude as a coding assistant to help map out
affected methods. I reviewed every modification, ran the full test suite
locally, and understand each change made.
---------
Signed-off-by: Bhavana Sundar <bhavana7899@gmail.com>
Co-authored-by: Ethan Urbanski <ethanurbanski@gmail.com>
# Description
This test fails locally when run repeatedly.
`tempfile.gettempdir()` always returns the same path, which causes the
test to fail on the second run.
The existing `tmp_path` fixture returns a unique path for each test.
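A small standard-library sketch of the difference, assuming nothing about the test itself. `tempfile.TemporaryDirectory` stands in here for pytest's `tmp_path` fixture, which likewise hands each test its own directory.

```python
import tempfile

# gettempdir() names the shared system temp directory, so repeated calls
# (and repeated test runs) see the same location and any leftover files.
shared_first = tempfile.gettempdir()
shared_second = tempfile.gettempdir()
same_shared_path = shared_first == shared_second

# TemporaryDirectory() creates a fresh directory on every use and removes
# it on exit, so state from one run cannot leak into the next.
with tempfile.TemporaryDirectory() as run_a, tempfile.TemporaryDirectory() as run_b:
    distinct_paths = run_a != run_b
```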
# Related Issue(s)
# Documentation
NA
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
# Description
When we write data with unsupported Arrow types (`Date64`,
`Timestamp(ns)`), writes either fail with confusing kernel errors or
produce tables with incompatible schemas.
The goal was to centralize the normalization (including the opportunity
to refactor `convert_to_delta`) into `normalize_for_delta` to convert
unsupported Arrow types to their Delta-compatible equivalents before
writing.
This matches the Delta protocol specification and is consistent with how
Spark handles these types.
# Related Issue(s)
- Fixes #3877
- Fixes #1721
## Types conversion
| Arrow type | Delta-compatible type |
|---|---|
| `Date64` | `Date32` |
| `Timestamp(s/ms/ns, tz)` | `Timestamp(us, tz)` |
Nested types (Struct, List, FixedSizeList, Map) are normalized
recursively.
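The recursive mapping in the table can be modelled in plain Python. This is only a sketch of the type-level rule: the classes below are stand-ins invented for illustration rather than Arrow or delta-rs types, only list and struct nesting is modelled, and converting the underlying values is out of scope.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Date:
    bits: int  # 32 or 64


@dataclass(frozen=True)
class Timestamp:
    unit: str  # "s", "ms", "us" or "ns"
    tz: Optional[str] = None


@dataclass(frozen=True)
class ListOf:
    item: object


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Tuple[str, object], ...]


def normalize(dtype):
    """Return the Delta-compatible counterpart of a (model) Arrow type."""
    if isinstance(dtype, Date):
        return Date(32)
    if isinstance(dtype, Timestamp):
        # Keep the timezone, force microsecond precision.
        return Timestamp("us", dtype.tz)
    if isinstance(dtype, ListOf):
        return ListOf(normalize(dtype.item))
    if isinstance(dtype, Struct):
        return Struct(tuple((name, normalize(t)) for name, t in dtype.fields))
    return dtype  # anything else is left unchanged
```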
---------
Signed-off-by: Florian Valeye <florian.valeye@gmail.com>
# Description
Found this while working on #4266.
Merge target subset filters can retain decimal precision/scale from the
source expression instead of the target schema. For example `decimal(4,
1)` when the column is `decimal(6, 1)`. The newer file skipping path
rejects the mismatch.
The fix normalizes `target_subset_filter` against the target schema
before simplification and extends literal coercion to handle `BETWEEN`
bounds.
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
During some experimentation somewhere along the line, the actual
implementation was commented out. It seems this wasn't caught in CI
because we have no test coverage of the function!
Welp, now we do!
Fixes #4126
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
# Description
Adds `post_commithook_properties` support to the `table.alter` metadata
methods that were missing it, so cleanup and checkpoint creation
behavior can be controlled consistently via `cleanup_expired_logs` and
`create_checkpoint` respectively.
## Validation
- `make tests`
- `cargo check -p deltalake-python`
- `cargo fmt --all --check`
Signed-off-by: vsmanish1772 <smanish1772@gmail.com>
# Description
Fixes #4235: `DeltaTable.deletion_vectors()` returned truncated
selection vectors when the highest deleted row index was below the
file's total row count.
Kernel returns a sparse DV mask (up to the highest deleted index). The
API returned that raw mask directly, which could be shorter than
`numRecords`.
**What Changed**
- Plumb `num_records: Option<u64>` through scan replay DV side channel
- Pad short masks with `true` up to `numRecords` at the API boundary
- Error if mask exceeds `numRecords` or `numRecords` is missing
This is a stricter contract: `deletion_vectors()` now fails if a DV file
is missing `numRecords` instead of returning a truncated mask.
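The normalization rule listed under "What Changed" can be sketched in a few lines of Python. The function name `pad_selection_vector` and the use of plain lists are assumptions for illustration; the real change lives in the Rust scan replay path.

```python
from typing import List, Optional


def pad_selection_vector(mask: List[bool], num_records: Optional[int]) -> List[bool]:
    """Extend a sparse selection mask to cover every row in the file.

    True means the row is kept, so rows past the highest deleted index
    are padded with True.
    """
    if num_records is None:
        raise ValueError("numRecords is missing; cannot size the selection vector")
    if len(mask) > num_records:
        raise ValueError(
            f"selection vector has {len(mask)} entries but numRecords is {num_records}"
        )
    return mask + [True] * (num_records - len(mask))
```

With a file of four rows where only row 1 is deleted, a sparse mask of `[True, False]` becomes `[True, False, True, True]`.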
**Upstream Kernel Note**
If kernel can return full-length selection vectors, this normalization
will not be needed. I will look into whether an upstream delta-kernel
feature for a length-aware selection vector API would be welcome.
# Related Issue(s)
- #4235
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
# Description
Some follow-up/hardening changes from the recent partition-only delete
work.
The DELETE partition-only fallback and add action evaluation could
materialize all actions into a single batch, which breaks on large
tables.
Changes:
- DELETE fallback uses batched partition metadata instead of single
batch materialization
- Shared partition metadata MemTable builder across scan and DELETE
paths
- Snapshot fast path for partition only column projection
- add_actions coalescing streams directly into BatchCoalescer instead of
pre-collecting
- Python docs note get_add_actions() return type migration
---------
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
# Description
Vacuum Lite mode only deletes stale tombstone files, but the current
implementation does a full file listing regardless of Lite or Full mode.
This change avoids listing storage for Lite mode and tries to simplify
and clarify the logic by separating the concerns of each mode.
# Related Issue(s)
- closes [#4228](https://github.com/delta-io/delta-rs/issues/4228)
# Documentation
Added test cases to test and clarify intent
---------
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Added disk spilling for `merge`, similar to the optimize functions, to
allow for merges which touch many files in the target
# Description
I have added functionality for spilling to disk similar to how it works
in the optimize functions. If nothing is provided it works as before.
I have added tests similar to those for the other spill functions.
I have tested my cases in #4217, which now successfully complete the
merge without OOM.
I have used AI (Opus 4.6) for getting an overview of the project
structure and for writing most of the code. I have reviewed and verified
the code myself.
Work done:
- create `create_session_state_with_spill_config` (which is just a move
and rename of `create_session_state_for_optimize`)
- use `create_session_state_with_spill_config` in existing optimize
functions
- use `create_session_state_with_spill_config` for `merge`
# Related Issue(s)
Closes #4217
---------
Signed-off-by: Thomas Frederik Hoeck <tfh@norden.com>
Co-authored-by: Thomas Frederik Hoeck <tfh@norden.com>
# Description
Upgrades the Python DataFusion path to 52.x and makes the integration
lane blocking in CI
---------
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
# Description
For delta tables, table level statistics are not quite as useful as file
level stats. However, we do go through quite some trouble to expose table
level stats, which also assume we always have a materialised log to
expose these stats. As such, it hinders us in migrating to a lazy
architecture.
In fact, the datafusion native file-based table implementations (parquet,
json, csv, ...) only expose stats on the execution plan level, and not
on the table provider level.
In this PR we therefore remove the table level stats from the current
table provider and remove the associated code.
Signed-off-by: Robert Pack <robstar.pack@gmail.com>
# Description
- Add a runtime version check in __datafusion_table_provider__ to
prevent FFI ABI mismatch segfaults
- Block capsule export when installed datafusion major != 52
- Provide actionable error text with QueryBuilder workaround
Changes:
- lib.rs: add REQUIRED_DATAFUSION_PY_MAJOR, datafusion_python_version(),
guard at method start
- test_datafusion.py: add incompatible version and not installed tests
Note: This guard is a temporary safety net to prevent segfaults until
DataFusion 52 Python wheels are available on PyPI. Once wheels land,
users can install datafusion==52.* and use SessionContext registration
normally.
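The guard logic amounts to a major-version comparison before the capsule is exported. The sketch below expresses it in Python for illustration only; the actual check is implemented in `lib.rs`, and the function names and error messages here are assumptions.

```python
from importlib import metadata
from typing import Optional

REQUIRED_DATAFUSION_MAJOR = 52


def installed_datafusion_version() -> Optional[str]:
    """Return the installed `datafusion` version string, or None if absent."""
    try:
        return metadata.version("datafusion")
    except metadata.PackageNotFoundError:
        return None


def check_datafusion_version(installed: Optional[str]) -> None:
    """Raise before any FFI export if the major version does not match."""
    if installed is None:
        raise RuntimeError("datafusion is not installed; use QueryBuilder instead")
    major = int(installed.split(".", 1)[0])
    if major != REQUIRED_DATAFUSION_MAJOR:
        raise RuntimeError(
            f"datafusion {installed} is incompatible; "
            f"expected major version {REQUIRED_DATAFUSION_MAJOR}"
        )
```

Calling `check_datafusion_version(installed_datafusion_version())` at the start of the export method would fail fast with a readable error rather than risking an ABI mismatch.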
# Related Issue(s)
- #4135
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>