@@ -254,7 +254,8 @@ Here’s the output of the command:
'preserveInsertionOrder': True}
```
The optimize operation has added 5 new files and marked 100 exisitng files for removal (this is also known as “tombstoning” files). It has compacted the 100 tiny files into 5 larger files.
The optimize operation has added 5 new files and marked 100 existing files for removal
(this is also known as “tombstoning” files). It has compacted the 100 tiny files into 5 larger files.
Let’s append some more data to the Delta table and see how we can selectively run optimize on the new data that’s added.
@@ -288,7 +289,9 @@ Let’s append another 24 hours of data to the Delta table:
}
```
We can use `get_add_actions()` to introspect the table state. We can see that `2021-01-06` has only a few hours of data so far, so we don't want to optimize that yet. But `2021-01-05` has all 24 hours of data, so it's ready to be optimized.
We can use `get_add_actions()` to introspect the table state. We can see that `2021-01-06`
has only a few hours of data so far, so we don't want to optimize that yet. But `2021-01-05`
has all 24 hours of data, so it's ready to be optimized.
=== "Python"
```python
@@ -311,10 +314,10 @@ We can use `get_add_actions()` to introspect the table state. We can see that `2
@@ -368,7 +371,9 @@ To optimize a single partition, you can pass in a `partition_filters` argument s
'preserveInsertionOrder': True}
```
This optimize operation tombstones 21 small data files and adds one file with all the existing data properly condensed. Let’s take a look a portion of the `_delta_log/00000000000000000125.json` file, which is the transaction log entry that corresponds with this incremental optimize command.
This optimize operation tombstones 21 small data files and adds one file with all the existing
data properly condensed. Let’s take a look at a portion of the `_delta_log/00000000000000000125.json`
file, which is the transaction log entry that corresponds with this incremental optimize command.
```python
{
@@ -416,9 +421,11 @@ This optimize operation tombstones 21 small data files and adds one file with al
}
```
The trasaction log indicates that many files have been tombstoned and one file is added, as expected.
The transaction log indicates that many files have been tombstoned and one file is added, as expected.
The Delta Lake optimize command “removes” data by marking the data files as removed in the transaction log. The optimize command doesn’t physically delete the Parquet file from storage. Optimize performs a “logical remove” not a “physical remove”.
The Delta Lake optimize command “removes” data by marking the data files as removed in the transaction log.
The optimize command doesn’t physically delete the Parquet file from storage.
Optimize performs a “logical remove” not a “physical remove”.
Delta Lake uses logical operations so you can time travel back to earlier versions of your data. You can vacuum your Delta table to physically remove Parquet files from storage if you don’t need to time travel and don’t want to pay to store the tombstoned files.
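
The difference between a logical and a physical remove can be illustrated with a toy model. This is a minimal sketch in plain Python, not the `deltalake` API: the `log` list, the file names, and the `active_files` helper are all hypothetical stand-ins for the transaction log and the Parquet files in storage.

```python
# Toy model of a transaction log (hypothetical, not the deltalake API).
log = [
    {"add": "part-0.parquet"},
    {"add": "part-1.parquet"},
    # An optimize commit: tombstone the small files, add one compacted file.
    {"remove": "part-0.parquet"},
    {"remove": "part-1.parquet"},
    {"add": "compacted.parquet"},
]

# Files physically present in storage. Optimize did not delete anything.
storage = {"part-0.parquet", "part-1.parquet", "compacted.parquet"}


def active_files(entries):
    """Replay log entries and return the set of files that are still live."""
    live = set()
    for entry in entries:
        if "add" in entry:
            live.add(entry["add"])
        else:
            live.discard(entry["remove"])
    return live


current = active_files(log)              # latest version of the table
before_optimize = active_files(log[:2])  # "time travel" to the earlier version

# Tombstoned files are still in storage, so the earlier version stays readable.
tombstoned = storage - current

# A vacuum-style cleanup physically drops the tombstoned files.
storage_after_vacuum = storage - tombstoned
```

In this sketch the earlier version can be read only while the tombstoned files remain in `storage`; once they are dropped, only the compacted file is left.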
@@ -493,7 +500,7 @@ Delta tables can accumulate small files for a variety of reasons:
* User error: users can accidentally write files that are too small. Users should sometimes repartition in memory before writing to disk to avoid appending files that are too small.
* Frequent appends: systems that append more often tend to append more smaller files. A pipeline that appends every minute will generally generate ten times as many small files compared to a system that appends every ten minutes.
* Appending to partitioned data lakes with high cardinality columns can also cause small files. If you append every hour to a table that’s partitioned on a column with 1,000 distinct values, then every append could create 1,000 new files. Partitioning by date avoids this problem because the data isn’t split up across partitions in this manner.
* Appending to partitioned data lakes with high cardinality columns can also cause small files. If you append every hour to a table that’s partitioned on a column with 1,000 distinct values, then every append could create 1,000 new files. Partitioning by date avoids this problem because the data isn’t split up across partitions in this manner.
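
The arithmetic behind the last two bullets can be checked with a short sketch. It assumes, for illustration only, that every append writes exactly one file per partition it touches; the `files_per_day` helper is hypothetical, and real writers may produce more or fewer files.

```python
def files_per_day(appends_per_day: int, partitions_touched: int = 1) -> int:
    """New files per day, assuming one file per partition touched per append."""
    return appends_per_day * partitions_touched


# Frequent appends: every minute vs. every ten minutes.
every_minute = files_per_day(24 * 60)
every_ten_minutes = files_per_day(24 * 6)

# Hourly appends to a table partitioned on a column with 1,000 distinct values.
hourly_high_cardinality = files_per_day(24, partitions_touched=1000)

print(every_minute, every_ten_minutes, hourly_high_cardinality)
```

Under that assumption, the per-minute pipeline writes ten times as many files as the ten-minute one, and the high cardinality table gains 1,000 files with every hourly append.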