chore: more typos

Signed-off-by: Robert Pack <robstar.pack@gmail.com>
2026-07-01 20:34:35 -04:00 · 2025-05-25 23:09:40 +02:00
parent 784e65ed3a
commit 9b24a3d0b3
5 changed files with 43 additions and 13 deletions
@@ -0,0 +1,5 @@
+repos:
+  - repo: https://github.com/crate-ci/typos
+    rev: v1.32.0
+    hooks:
+      - id: typos
@@ -80,8 +80,19 @@ tokio = { version = "1" }
 num_cpus = { version = "1" }

 [workspace.metadata.typos]
-default.extend-ignore-re = ["(?Rm)^.*(#|//)\\s*spellchecker:disable-line$"]
+default.extend-ignore-re = [
+    # Custom ignore regex patterns: https://github.com/crate-ci/typos/blob/master/docs/reference.md#example-configurations
+    "(?s)//\\s*spellchecker:ignore-next-line[^\\n]*\\n[^\\n]*",
+    # Line block with # spellchecker:<on|off>
+    "(?s)(#|//|<\\!--)\\s*spellchecker:off.*?\\n\\s*(#|//|<\\!--)\\s*spellchecker:on",
+    "(?Rm)^.*(#|//)\\s*spellchecker:disable-line$",
+    # workaround for: https://github.com/crate-ci/typos/issues/850
+    "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}",
+]

 [workspace.metadata.typos.default.extend-words]
-"arro3" = "arro3"
-"Arro3" = "Arro3"
+arro = "arro"
+Arro = "Arro"
+arro3 = "arro3"
+Arro3 = "Arro3"
+AKS = "AKS"
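As a rough illustration of what two of the new `extend-ignore-re` entries are meant to skip, the sketch below runs them through Python's `re` module. This is only an approximation: typos evaluates these patterns with the Rust `regex` crate, the `(?Rm)` entry is left out because Python has no `R` flag, and both the sample text and the `strip_ignored` helper are made up for the example.

```python
import re

# Patterns copied from the TOML above, with the TOML string escaping undone.
BLOCK_OFF_ON = re.compile(
    r"(?s)(#|//|<\!--)\s*spellchecker:off.*?\n\s*(#|//|<\!--)\s*spellchecker:on"
)
UUID_LIKE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

# Hypothetical sample document, not taken from the repository.
sample = (
    "intro text\n"
    "<!-- spellchecker:off -->\n"
    "part-00000-c9b90f86 teh output\n"
    "<!-- spellchecker:on -->\n"
    "outro text\n"
)


def strip_ignored(text: str) -> str:
    """Drop every span a pattern matches, approximating what would be skipped."""
    for pattern in (BLOCK_OFF_ON, UUID_LIKE):
        text = pattern.sub("", text)
    return text


remaining = strip_ignored(sample)

# Hypothetical line containing a UUID-shaped token.
uuid_remaining = strip_ignored("id c9b90f86-73e6-46c8-93ba-ff6bfaf892a0 done")
```

The block pattern only anchors on the opening marker of each comment, so the text between `spellchecker:off` and `spellchecker:on` is dropped regardless of how the comments themselves are closed.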
@@ -129,6 +129,8 @@ which provide the list of files that are part of the table and metadata
 about them, such as creation time, size, and statistics. You can get a
 data frame of the add actions data using `DeltaTable.get_add_actions`:

+<!-- spellchecker:off -->
+
 === "Python"
    ``` python
    >>> from deltalake import DeltaTable
@@ -162,3 +164,5 @@ This works even with past versions of the table:
    let actions = table.snapshot()?.add_actions_table(true)?;
    println!("{}", pretty_format_batches(&vec![actions])?);
    ```
+
+<!-- spellchecker:on -->
@@ -254,7 +254,8 @@ Here’s the output of the command:
 'preserveInsertionOrder': True}
 ```

-The optimize operation has added 5 new files and marked 100 exisitng files for removal (this is also known as “tombstoning” files).  It has compacted the 100 tiny files into 5 larger files.
+The optimize operation has added 5 new files and marked 100 existing files for removal
+(this is also known as “tombstoning” files).  It has compacted the 100 tiny files into 5 larger files.

 Let’s append some more data to the Delta table and see how we can selectively run optimize on the new data that’s added.

@@ -288,7 +289,9 @@ Let’s append another 24 hours of data to the Delta table:
    }
    ```

-We can use `get_add_actions()` to introspect the table state. We can see that `2021-01-06` has only a few hours of data so far, so we don't want to optimize that yet. But `2021-01-05` has all 24 hours of data, so it's ready to be optimized.
+We can use `get_add_actions()` to introspect the table state. We can see that `2021-01-06`
+has only a few hours of data so far, so we don't want to optimize that yet. But `2021-01-05`
+has all 24 hours of data, so it's ready to be optimized.

 === "Python"
    ```python
@@ -311,10 +314,10 @@ We can use `get_add_actions()` to introspect the table state. We can see that `2
    let ctx = SessionContext::new();
    ctx.register_batch("observations", batch.clone())?;
    let df = ctx.sql("
-    SELECT \"partition.date\", 
-            COUNT(*) 
-    FROM observations 
-    GROUP BY \"partition.date\" 
+    SELECT \"partition.date\",
+            COUNT(*)
+    FROM observations
+    GROUP BY \"partition.date\"
    ORDER BY \"partition.date\"").await?;
    df.show().await?;

@@ -368,7 +371,9 @@ To optimize a single partition, you can pass in a `partition_filters` argument s
 'preserveInsertionOrder': True}
 ```

-This optimize operation tombstones 21 small data files and adds one file with all the existing data properly condensed.  Let’s take a look a portion of the `_delta_log/00000000000000000125.json` file, which is the transaction log entry that corresponds with this incremental optimize command.
+This optimize operation tombstones 21 small data files and adds one file with all the existing
+data properly condensed.  Let’s take a look at a portion of the `_delta_log/00000000000000000125.json`
+file, which is the transaction log entry that corresponds with this incremental optimize command.

 ```python
 {
@@ -416,9 +421,11 @@ This optimize operation tombstones 21 small data files and adds one file with al
 }
 ```

-The trasaction log indicates that many files have been tombstoned and one file is added, as expected.
+The transaction log indicates that many files have been tombstoned and one file is added, as expected.

-The Delta Lake optimize command “removes” data by marking the data files as removed in the transaction log.  The optimize command doesn’t physically delete the Parquet file from storage.  Optimize performs a “logical remove” not a “physical remove”.
+The Delta Lake optimize command “removes” data by marking the data files as removed in the transaction log.
+The optimize command doesn’t physically delete the Parquet file from storage.
+Optimize performs a “logical remove”, not a “physical remove”.

 Delta Lake uses logical operations so you can time travel back to earlier versions of your data.  You can vacuum your Delta table to physically remove Parquet files from storage if you don’t need to time travel and don’t want to pay to store the tombstoned files.
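To make the "logical remove" idea concrete, here is a minimal sketch that tallies the actions in one transaction log entry. It assumes the entry is newline-delimited JSON with one action per line (`add`, `remove`, `commitInfo`, and so on); the sample entry and the `count_actions` helper are hypothetical and far smaller than a real commit file.

```python
import json

# Hypothetical, heavily trimmed log entry: real entries carry many more fields.
sample_commit = "\n".join([
    json.dumps({"remove": {"path": "date=2021-01-05/a.parquet", "dataChange": False}}),
    json.dumps({"remove": {"path": "date=2021-01-05/b.parquet", "dataChange": False}}),
    json.dumps({"add": {"path": "date=2021-01-05/compacted.parquet", "dataChange": False}}),
    json.dumps({"commitInfo": {"operation": "OPTIMIZE"}}),
])


def count_actions(commit_text: str) -> dict:
    """Count how many actions of each kind appear in one log entry."""
    counts = {}
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        action = json.loads(line)
        for kind in action:  # each line holds a single top-level action key
            counts[kind] = counts.get(kind, 0) + 1
    return counts


counts = count_actions(sample_commit)
print(counts)  # {'remove': 2, 'add': 1, 'commitInfo': 1}
```

The tombstoned files show up only as `remove` records here; nothing in the log entry deletes the Parquet files themselves, which is why a separate vacuum is needed to reclaim storage.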

@@ -493,7 +500,7 @@ Delta tables can accumulate small files for a variety of reasons:

 * User error: users can accidentally write files that are too small.  Users should sometimes repartition in memory before writing to disk to avoid appending files that are too small.
 * Frequent appends: systems that append more often tend to append more smaller files.  A pipeline that appends every minute will generally generate ten times as many small files compared to a system that appends every ten minutes.
-* Appending to partitioned data lakes with high cardinality columns can also cause small files.  If you append every hour to a table that’s partitioned on a column with 1,000 distinct values, then every append could create 1,000 new files.  Partitioning by date avoids this problem because the data isn’t split up across partitions in this manner.  
+* Appending to partitioned data lakes with high cardinality columns can also cause small files.  If you append every hour to a table that’s partitioned on a column with 1,000 distinct values, then every append could create 1,000 new files.  Partitioning by date avoids this problem because the data isn’t split up across partitions in this manner.

 ## Conclusion

@@ -309,6 +309,8 @@ provide the list of files that are part of the table and metadata about them,
 such as creation time, size, and statistics. You can get a data frame of
 the add actions data using :meth:`DeltaTable.get_add_actions`:

+<!-- spellchecker:off -->
+
 .. code-block:: python

    >>> from deltalake import DeltaTable
@@ -328,6 +330,7 @@ This works even with past versions of the table:
    0  part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a...         440 2021-03-06 15:16:07         True            2                 0          0          2
    1  part-00001-911a94a2-43f6-4acb-8620-5e68c265498...         445 2021-03-06 15:16:07         True            3                 0          2          4

+<!-- spellchecker:on -->

 Querying Delta Tables
 ---------------------