chore: more typos

Signed-off-by: Robert Pack <robstar.pack@gmail.com>
Robert Pack
2025-05-25 23:09:40 +02:00
committed by Robert Pack
parent 784e65ed3a
commit 9b24a3d0b3
5 changed files with 43 additions and 13 deletions
+5
@@ -0,0 +1,5 @@
+repos:
+  - repo: https://github.com/crate-ci/typos
+    rev: v1.32.0
+    hooks:
+      - id: typos
+14 -3
@@ -80,8 +80,19 @@ tokio = { version = "1" }
num_cpus = { version = "1" }
[workspace.metadata.typos]
-default.extend-ignore-re = ["(?Rm)^.*(#|//)\\s*spellchecker:disable-line$"]
+default.extend-ignore-re = [
+# Custom ignore regex patterns: https://github.com/crate-ci/typos/blob/master/docs/reference.md#example-configurations
+"(?s)//\\s*spellchecker:ignore-next-line[^\\n]*\\n[^\\n]*",
+# Line block with # spellchecker:<on|off>
+"(?s)(#|//|<\\!--)\\s*spellchecker:off.*?\\n\\s*(#|//|<\\!--)\\s*spellchecker:on",
+"(?Rm)^.*(#|//)\\s*spellchecker:disable-line$",
+# workaround for: https://github.com/crate-ci/typos/issues/850
+"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}",
+]
[workspace.metadata.typos.default.extend-words]
-"arro3" = "arro3"
-"Arro3" = "Arro3"
+arro = "arro"
+Arro = "Arro"
+arro3 = "arro3"
+Arro3 = "Arro3"
+AKS = "AKS"
+4
@@ -129,6 +129,8 @@ which provide the list of files that are part of the table and metadata
about them, such as creation time, size, and statistics. You can get a
data frame of the add actions data using `DeltaTable.get_add_actions`:
+<!-- spellchecker:off --!>
=== "Python"
``` python
>>> from deltalake import DeltaTable
@@ -162,3 +164,5 @@ This works even with past versions of the table:
let actions = table.snapshot()?.add_actions_table(true)?;
println!("{}", pretty_format_batches(&vec![actions])?);
```
+<!-- spellchecker:on --!>
@@ -254,7 +254,8 @@ Heres the output of the command:
'preserveInsertionOrder': True}
```
-The optimize operation has added 5 new files and marked 100 exisitng files for removal (this is also known as “tombstoning” files). It has compacted the 100 tiny files into 5 larger files.
+The optimize operation has added 5 new files and marked 100 existing files for removal
+(this is also known as “tombstoning” files). It has compacted the 100 tiny files into 5 larger files.
Lets append some more data to the Delta table and see how we can selectively run optimize on the new data thats added.
@@ -288,7 +289,9 @@ Lets append another 24 hours of data to the Delta table:
}
```
-We can use `get_add_actions()` to introspect the table state. We can see that `2021-01-06` has only a few hours of data so far, so we don't want to optimize that yet. But `2021-01-05` has all 24 hours of data, so it's ready to be optimized.
+We can use `get_add_actions()` to introspect the table state. We can see that `2021-01-06`
+has only a few hours of data so far, so we don't want to optimize that yet. But `2021-01-05`
+has all 24 hours of data, so it's ready to be optimized.
=== "Python"
```python
@@ -311,10 +314,10 @@ We can use `get_add_actions()` to introspect the table state. We can see that `2
let ctx = SessionContext::new();
ctx.register_batch("observations", batch.clone())?;
let df = ctx.sql("
-SELECT \"partition.date\",
-COUNT(*)
-FROM observations
-GROUP BY \"partition.date\"
+SELECT \"partition.date\",
+COUNT(*)
+FROM observations
+GROUP BY \"partition.date\"
ORDER BY \"partition.date\"").await?;
df.show().await?;
@@ -368,7 +371,9 @@ To optimize a single partition, you can pass in a `partition_filters` argument s
'preserveInsertionOrder': True}
```
-This optimize operation tombstones 21 small data files and adds one file with all the existing data properly condensed. Lets take a look a portion of the `_delta_log/00000000000000000125.json` file, which is the transaction log entry that corresponds with this incremental optimize command.
+This optimize operation tombstones 21 small data files and adds one file with all the existing
+data properly condensed. Lets take a look a portion of the `_delta_log/00000000000000000125.json`
+file, which is the transaction log entry that corresponds with this incremental optimize command.
```python
{
@@ -416,9 +421,11 @@ This optimize operation tombstones 21 small data files and adds one file with al
}
```
-The trasaction log indicates that many files have been tombstoned and one file is added, as expected.
+The transaction log indicates that many files have been tombstoned and one file is added, as expected.
-The Delta Lake optimize command “removes” data by marking the data files as removed in the transaction log. The optimize command doesnt physically delete the Parquet file from storage. Optimize performs a “logical remove” not a “physical remove”.
+The Delta Lake optimize command “removes” data by marking the data files as removed in the transaction log.
+The optimize command doesnt physically delete the Parquet file from storage.
+Optimize performs a “logical remove” not a “physical remove”.
Delta Lake uses logical operations so you can time travel back to earlier versions of your data. You can vacuum your Delta table to physically remove Parquet files from storage if you dont need to time travel and dont want to pay to store the tombstoned files.
@@ -493,7 +500,7 @@ Delta tables can accumulate small files for a variety of reasons:
* User error: users can accidentally write files that are too small. Users should sometimes repartition in memory before writing to disk to avoid appending files that are too small.
* Frequent appends: systems that append more often tend to append more smaller files. A pipeline that appends every minute will generally generate ten times as many small files compared to a system that appends every ten minutes.
-* Appending to partitioned data lakes with high cardinality columns can also cause small files. If you append every hour to a table thats partitioned on a column with 1,000 distinct values, then every append could create 1,000 new files. Partitioning by date avoids this problem because the data isnt split up across partitions in this manner.
+* Appending to partitioned data lakes with high cardinality columns can also cause small files. If you append every hour to a table thats partitioned on a column with 1,000 distinct values, then every append could create 1,000 new files. Partitioning by date avoids this problem because the data isnt split up across partitions in this manner.
## Conclusion
+3
@@ -309,6 +309,8 @@ provide the list of files that are part of the table and metadata about them,
such as creation time, size, and statistics. You can get a data frame of
the add actions data using :meth:`DeltaTable.get_add_actions`:
+<!-- spellchecker:off --!>
.. code-block:: python
>>> from deltalake import DeltaTable
@@ -328,6 +330,7 @@ This works even with past versions of the table:
0 part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a... 440 2021-03-06 15:16:07 True 2 0 0 2
1 part-00001-911a94a2-43f6-4acb-8620-5e68c265498... 445 2021-03-06 15:16:07 True 3 0 2 4
+<!-- spellchecker:on --!>
Querying Delta Tables
---------------------