mirror of https://github.com/langchain-ai/delta-rs.git
synced 2026-07-01 20:34:35 -04:00
committed by R. Tyler Croy
parent 7a378c210b
commit b014e5e0d4
@@ -0,0 +1,23 @@
`deltalake` is a Rust-based re-implementation of the Delta Lake protocol originally developed at Databricks. The `deltalake` library has APIs in Rust and Python. The `deltalake` implementation has no dependencies on Java, Spark or Databricks.

## Contributing

The Delta Lake community welcomes contributions from all developers, regardless of experience or programming background.

You can write Rust code, Python code, or documentation, report bugs, or give talks to the community. We welcome all of these contributions.

Feel free to [join our Slack](https://go.delta.io/slack) and message us in the #delta-rs channel any time!

We value kind communication and building a productive, friendly environment for maximum collaboration and fun.


## Important terminology

* `deltalake` refers to the Rust or Python API of delta-rs
* "Delta Spark" refers to the Scala implementation of the Delta Lake transaction log protocol. This depends on Spark and Java.

## Why implement the Delta Lake transaction log protocol in Rust?

Delta Spark depends on Java and Spark, which is fine for many use cases, but not all Delta Lake users want to depend on these libraries. `deltalake` allows you to manage your dataset using a Delta Lake approach without any Java or Spark dependencies.

A `DeltaTable` on disk is simply a directory that stores metadata in JSON files and data in Parquet files.
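
As a rough illustration of that directory layout, the sketch below uses only the Python standard library to split the files under a table directory into transaction log files and data files. It assumes the log lives in a `_delta_log` subdirectory, as the Delta Lake protocol specifies. The function name `summarize_delta_dir` is made up for this example and is not part of `deltalake`.

```python
from pathlib import Path


def summarize_delta_dir(table_path):
    """Split the files under a Delta table directory into log and data files.

    Hypothetical helper for illustration only; it inspects file names and
    does not read or validate the table.
    """
    root = Path(table_path)
    log_dir = root / "_delta_log"
    # Commit files in the transaction log are JSON documents.
    log_files = sorted(p.name for p in log_dir.glob("*.json"))
    # Data files are Parquet files outside the log directory, including
    # any partition subdirectories. Checkpoint Parquet files inside
    # _delta_log are deliberately excluded.
    data_files = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*.parquet")
        if log_dir not in p.parents
    )
    return {"log": log_files, "data": data_files}
```

Calling `summarize_delta_dir("tmp/some-table")` on a table written by `write_deltalake` should list at least one JSON commit file and one Parquet data file. The exact data file names are generated by the writer.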
@@ -0,0 +1,5 @@
## Project history

Check out this video by Denny Lee & QP Hou to learn about the genesis of the delta-rs project:

<iframe width="560" height="315" src="https://www.youtube.com/embed/ZQdEdifcBh8?si=ytGW7FB-kwl6VqsV" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+34
-77
@@ -1,91 +1,48 @@
`deltalake` is an open source library that makes working with tabular datasets easier, more robust and more performant. With deltalake you can add, remove or update rows in a dataset as new data arrives. You can time travel back to earlier versions of a dataset. You can optimize dataset storage from small files to large files.
`deltalake` is an open source library that makes working with tabular datasets easier, more robust and more performant. With `deltalake` you can add, remove or update rows in a dataset as new data arrives. You can time travel back to earlier versions of a dataset. You can optimize dataset storage from small files to large files.

`deltalake` can be used to manage data stored on a local file system or in the cloud. `deltalake` integrates with data manipulation libraries such as Pandas, Polars, DuckDB and DataFusion.
With `deltalake` you can manage data stored on a local file system or in the cloud. `deltalake` integrates with data manipulation libraries such as Pandas, Polars, DuckDB and DataFusion.

`deltalake` uses a lakehouse framework for managing datasets. With this lakehouse approach you manage your datasets with a `DeltaTable` object and then `deltalake` takes care of the underlying files. Within a `DeltaTable` your data is stored in high performance Parquet files while metadata is stored in a set of JSON files called a transaction log.

`deltalake` is a Rust-based re-implementation of the Delta Lake protocol originally developed at Databricks. The `deltalake` library has APIs in Rust and Python. The `deltalake` implementation has no dependencies on Java, Spark or Databricks.


## Important terminology

* `deltalake` refers to the Rust or Python API of delta-rs
* "Delta Spark" refers to the Scala implementation of the Delta Lake transaction log protocol. This depends on Spark and Java.

## Why implement the Delta Lake transaction log protocol in Rust?

Delta Spark depends on Java and Spark, which is fine for many use cases, but not all Delta Lake users want to depend on these libraries. `deltalake` allows you to manage your dataset using a Delta Lake approach without any Java or Spark dependencies.

A `DeltaTable` on disk is simply a directory that stores metadata in JSON files and data in Parquet files.
`deltalake` uses a lakehouse framework where you manage your datasets with a `DeltaTable` object and `deltalake` takes care of the underlying files.

## Quick start

You can install `deltalake` in Python with `pip`
```bash
pip install deltalake
```
We create a Pandas `DataFrame` and write it to a `DeltaTable`:
```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake
1. Install the Python dependencies with `pip`:

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Aadhya", "Bob", "Chen"],
    }
)
```bash
pip install deltalake pyarrow tabulate
```

(
    write_deltalake(
        table_or_uri="delta_table_dir",
        data=df,
    )
)
```
We create a `DeltaTable` object that holds the metadata for the Delta table:
```python
dt = DeltaTable("delta_table_dir")
```
We load the `DeltaTable` into a Pandas `DataFrame` with `to_pandas` on a `DeltaTable`:
```python
new_df = dt.to_pandas()
```
- `pyarrow` is needed for the DataFrame import
- `tabulate` is needed to print the DataFrame in the example

Or we can load the data into a Polars `DataFrame` with `pl.read_delta`:
```python
import polars as pl
new_df = pl.read_delta("delta_table_dir")
```
1. Create a Pandas `DataFrame` and write it to a `DeltaTable`:

Or we can load the data with DuckDB:
```python
import duckdb
duckdb.query("SELECT * FROM delta_scan('./delta_table_dir')")
```
```python
from deltalake import write_deltalake, DeltaTable
import pandas as pd

Or we can load the data with DataFusion:
```python
from datafusion import SessionContext
# Create a Pandas DataFrame and write it to a DeltaTable:
df = pd.DataFrame({"num": [8, 9], "letter": ["aa", "bb"]})
write_deltalake("tmp/some-table", df)

ctx = SessionContext()
ctx.register_dataset("my_delta_table", dt.to_pyarrow_dataset())
ctx.sql("select * from my_delta_table")
```
# Create a DeltaTable object to track metadata for the Delta table
dt = DeltaTable("tmp/some-table")

# Overwrite the DataFrame with new data
df = pd.DataFrame({"num": [11, 22], "letter": ["dd", "ee"]})
write_deltalake("tmp/some-table", df, mode="overwrite")

# Load version 0 of the table (time travel; the table itself is not modified)
dt_v0 = DeltaTable("tmp/some-table", version=0)

# Print the original version 0 data
print(dt_v0.to_pandas().to_markdown())
```


## Contributing
## Next steps

The Delta Lake community welcomes contributions from all developers, regardless of experience or programming background.

You can write Rust code, Python code, or documentation, report bugs, or give talks to the community. We welcome all of these contributions.

Feel free to [join our Slack](https://go.delta.io/slack) and message us in the #delta-rs channel any time!

We value kind communication and building a productive, friendly environment for maximum collaboration and fun.

## Project history

Check out this video by Denny Lee & QP Hou to learn about the genesis of the delta-rs project:

<iframe width="560" height="315" src="https://www.youtube.com/embed/ZQdEdifcBh8?si=ytGW7FB-kwl6VqsV" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
- Learn about Querying Delta Tables
- Learn about using `deltalake` with Polars
- Learn about using `deltalake` with DuckDB
- Learn about using `deltalake` with DataFusion
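
The quick start's step that loads `version=0` works because each successful write adds one commit file to the table's transaction log. The Delta Lake protocol names those files with the version number zero-padded to 20 digits, so the first write produces `00000000000000000000.json` and the overwrite produces `00000000000000000001.json`. A minimal standard-library sketch of that mapping follows. `commit_file_name` and `available_versions` are hypothetical helpers, not `deltalake` APIs.

```python
import re
from pathlib import Path

# Commit files are named <20-digit version>.json; other files in the log
# directory (checkpoints, _last_checkpoint) do not match this pattern.
COMMIT_RE = re.compile(r"^(\d{20})\.json$")


def commit_file_name(version):
    """Return the log file name for a table version, e.g. 1 -> '000...001.json'."""
    return f"{version:020d}.json"


def available_versions(table_path):
    """List the versions that have a commit file under <table_path>/_delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    versions = []
    for path in log_dir.glob("*.json"):
        match = COMMIT_RE.match(path.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)
```

In practice `DeltaTable.version()` and `DeltaTable.history()` are the supported ways to inspect versions. A commit file being present does not guarantee that the version is still loadable, for example after old data files have been vacuumed.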
+4
-1
@@ -49,7 +49,6 @@ nav:
- Home: index.md
- Why Use Delta Lake: why-use-delta-lake.md
- Delta Lake for big and small data: delta-lake-big-data-small-data.md
- Best practices: delta-lake-best-practices.md
- Usage:
- Installation: usage/installation.md
- Overview: usage/overview.md
@@ -66,6 +65,7 @@ nav:
- usage/writing/index.md
- usage/writing/writing-to-s3-with-locking-provider.md
- Deleting rows from a table: usage/deleting-rows-from-delta-lake-table.md
- Best practices: usage/delta-lake-best-practices.md
- Optimize:
- Small file compaction: usage/optimize/small-file-compaction-with-optimize.md
- Z Order: usage/optimize/delta-lake-z-order.md
@@ -104,6 +104,9 @@ nav:
- File skipping: how-delta-lake-works/delta-lake-file-skipping.md
- Upgrade guides:
- Version 1.0.0: upgrade-guides/guide-1.0.0.md
- About:
- Contributing: about/contributing.md
- History: about/history.md
not_in_nav: |
/_build/