use prettier to format md files (#367)

* use prettier to format md files * apply prettier * update ballista
2026-06-30 21:27:59 -04:00 · 2021-05-24 21:25:46 +08:00
parent 9fdc4fe7af
commit ee8b5bf9c4
31 changed files with 239 additions and 219 deletions
@@ -19,6 +19,6 @@

 # Code of Conduct

-* [Code of Conduct for The Apache Software Foundation][1]
+- [Code of Conduct for The Apache Software Foundation][1]

-[1]: https://www.apache.org/foundation/policies/conduct.html
+[1]: https://www.apache.org/foundation/policies/conduct.html
@@ -21,57 +21,57 @@

 This section describes how you can get started at developing DataFusion.

-For information on developing with Ballista, see the 
-[Ballista developer documentation](ballista/docs/README.md). 
+For information on developing with Ballista, see the
+[Ballista developer documentation](ballista/docs/README.md).

 ### Bootstrap environment

 DataFusion is written in Rust and it uses a standard rust toolkit:

-* `cargo build`
-* `cargo fmt` to format the code
-* `cargo test` to test
-* etc.
+- `cargo build`
+- `cargo fmt` to format the code
+- `cargo test` to test
+- etc.

 ## How to add a new scalar function

 Below is a checklist of what you need to do to add a new scalar function to DataFusion:

-* Add the actual implementation of the function:
-  * [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
-  * [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
-  * [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
-  * create a new module [here](datafusion/src/physical_plan) for other functions
-* In [src/physical_plan/functions](datafusion/src/physical_plan/functions.rs), add:
-  * a new variant to `BuiltinScalarFunction`
-  * a new entry to `FromStr` with the name of the function as called by SQL
-  * a new line in `return_type` with the expected return type of the function, given an incoming type
-  * a new line in `signature` with the signature of the function (number and types of its arguments)
-  * a new line in `create_physical_expr` mapping the built-in to the implementation
-  * tests to the function.
-* In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
-* In [src/logical_plan/expr](datafusion/src/logical_plan/expr.rs), add:
-  * a new entry of the `unary_scalar_expr!` macro for the new function.
-* In [src/logical_plan/mod](datafusion/src/logical_plan/mod.rs), add:
-  * a new entry in the `pub use expr::{}` set.
+- Add the actual implementation of the function:
+  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
+  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
+  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
+  - create a new module [here](datafusion/src/physical_plan) for other functions
+- In [src/physical_plan/functions](datafusion/src/physical_plan/functions.rs), add:
+  - a new variant to `BuiltinScalarFunction`
+  - a new entry to `FromStr` with the name of the function as called by SQL
+  - a new line in `return_type` with the expected return type of the function, given an incoming type
+  - a new line in `signature` with the signature of the function (number and types of its arguments)
+  - a new line in `create_physical_expr` mapping the built-in to the implementation
+  - tests to the function.
+- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
+- In [src/logical_plan/expr](datafusion/src/logical_plan/expr.rs), add:
+  - a new entry of the `unary_scalar_expr!` macro for the new function.
+- In [src/logical_plan/mod](datafusion/src/logical_plan/mod.rs), add:
+  - a new entry in the `pub use expr::{}` set.

 ## How to add a new aggregate function

 Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

-* Add the actual implementation of an `Accumulator` and `AggregateExpr`:
-  * [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
-  * [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
-  * [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
-  * create a new module [here](datafusion/src/physical_plan) for other functions
-* In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
-  * a new variant to `BuiltinAggregateFunction`
-  * a new entry to `FromStr` with the name of the function as called by SQL
-  * a new line in `return_type` with the expected return type of the function, given an incoming type
-  * a new line in `signature` with the signature of the function (number and types of its arguments)
-  * a new line in `create_aggregate_expr` mapping the built-in to the implementation
-  * tests to the function.
-* In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
+- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
+  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
+  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
+  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
+  - create a new module [here](datafusion/src/physical_plan) for other functions
+- In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
+  - a new variant to `BuiltinAggregateFunction`
+  - a new entry to `FromStr` with the name of the function as called by SQL
+  - a new line in `return_type` with the expected return type of the function, given an incoming type
+  - a new line in `signature` with the signature of the function (number and types of its arguments)
+  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
+  - tests to the function.
+- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.

 ## How to display plans graphically

@@ -30,7 +30,7 @@ logical query plans as well as a query optimizer and execution engine
 capable of parallel execution against partitioned data sources (CSV
 and Parquet) using threads.

-DataFusion also supports distributed query execution via the  
+DataFusion also supports distributed query execution via the
 [Ballista](ballista/README.md) crate.

 ## Use Cases
@@ -42,24 +42,24 @@ the convenience of an SQL interface or a DataFrame API.

 ## Why DataFusion?

-* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
-* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
-* *High Quality*:  Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
+- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

 ## Known Uses

 Here are some of the projects known to use DataFusion:

-* [Ballista](ballista) Distributed Compute Platform
-* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
-* [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
-* [datafusion-python](https://pypi.org/project/datafusion)
-* [delta-rs](https://github.com/delta-io/delta-rs)
-* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
-* [ROAPI](https://github.com/roapi/roapi)
-* [Tensorbase](https://github.com/tensorbase/tensorbase)
-* [Squirtle](https://github.com/DSLAM-UMD/Squirtle)
+- [Ballista](ballista) Distributed Compute Platform
+- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
+- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
+- [datafusion-python](https://pypi.org/project/datafusion)
+- [delta-rs](https://github.com/delta-io/delta-rs)
+- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
+- [ROAPI](https://github.com/roapi/roapi)
+- [Tensorbase](https://github.com/tensorbase/tensorbase)
+- [Squirtle](https://github.com/DSLAM-UMD/Squirtle)

 (if you know of another project, please submit a PR to add a link!)

@@ -122,8 +122,6 @@ Both of these examples will produce
 +---+--------+
 ```

-
-
 ## Using DataFusion as a library

 DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
@@ -230,7 +228,6 @@ DataFusion also includes a simple command-line interactive SQL utility. See the
 - [x] Parquet primitive types
 - [ ] Parquet nested types

-
 ## Extensibility

 DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
@@ -242,26 +239,24 @@ DataFusion is designed to be extensible at all points. To that end, you can prov
 - [x] User Defined `LogicalPlan` nodes
 - [x] User Defined `ExecutionPlan` nodes

-
 # Supported SQL

 This library currently supports many SQL constructs, including

-* `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
-* `SELECT ... FROM ...` together with any expression
-* `ALIAS` to name an expression
-* `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
-* most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`.
-* `WHERE` to filter
-* `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
-* `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`
-
+- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
+- `SELECT ... FROM ...` together with any expression
+- `ALIAS` to name an expression
+- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
+- most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`.
+- `WHERE` to filter
+- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
+- `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`

 ## Supported Functions

 DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https://www.postgresql.org/docs/current/functions.html) where possible. We explicitly choose a single dialect to maximize interoperability with other tools and allow reuse of the PostgreSQL documents and tutorials as much as possible.

-Currently, only a subset of the PosgreSQL dialect is implemented, and we will document any deviations.
+Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.

 ## Schema Metadata / Information Schema Support

@@ -269,8 +264,7 @@ DataFusion supports the showing metadata about the tables available. This inform

 More information can be found in the [Postgres docs](https://www.postgresql.org/docs/13/infoschema-schema.html)).

-
-To show tables available for use in DataFusion, use the `SHOW TABLES`  command or the `information_schema.tables` view:
+To show tables available for use in DataFusion, use the `SHOW TABLES` command or the `information_schema.tables` view:

 ```sql
 > show tables;
@@ -291,7 +285,7 @@ To show tables available for use in DataFusion, use the `SHOW TABLES`  command o
 +---------------+--------------------+------------+--------------+
 ```

-To show the schema of a table in DataFusion, use the `SHOW COLUMNS`  command or the or `information_schema.columns` view:
+To show the schema of a table in DataFusion, use the `SHOW COLUMNS` command or the or `information_schema.columns` view:

 ```sql
 > show columns from t;
@@ -313,8 +307,6 @@ To show the schema of a table in DataFusion, use the `SHOW COLUMNS`  command or
 +------------+-------------+------------------+-------------+-----------+
 ```

-
-
 ## Supported Data Types

 DataFusion uses Arrow, and thus the Arrow type system, for query
@@ -322,41 +314,38 @@ execution. The SQL types from
 [sqlparser-rs](https://github.com/ballista-compute/sqlparser-rs/blob/main/src/ast/data_type.rs#L57)
 are mapped to Arrow types according to the following table

-
-| SQL Data Type   | Arrow DataType                   |
-| --------------- | -------------------------------- |
-| `CHAR`          | `Utf8`                           |
-| `VARCHAR`       | `Utf8`                           |
-| `UUID`          | *Not yet supported*              |
-| `CLOB`          | *Not yet supported*              |
-| `BINARY`        | *Not yet supported*              |
-| `VARBINARY`     | *Not yet supported*              |
-| `DECIMAL`       | `Float64`                        |
-| `FLOAT`         | `Float32`                        |
-| `SMALLINT`      | `Int16`                          |
-| `INT`           | `Int32`                          |
-| `BIGINT`        | `Int64`                          |
-| `REAL`          | `Float64`                        |
-| `DOUBLE`        | `Float64`                        |
-| `BOOLEAN`       | `Boolean`                        |
-| `DATE`          | `Date32`                         |
-| `TIME`          | `Time64(TimeUnit::Millisecond)`  |
-| `TIMESTAMP`     | `Date64`                         |
-| `INTERVAL`      | *Not yet supported*              |
-| `REGCLASS`      | *Not yet supported*              |
-| `TEXT`          | *Not yet supported*              |
-| `BYTEA`         | *Not yet supported*              |
-| `CUSTOM`        | *Not yet supported*              |
-| `ARRAY`         | *Not yet supported*              |
-
+| SQL Data Type | Arrow DataType                  |
+| ------------- | ------------------------------- |
+| `CHAR`        | `Utf8`                          |
+| `VARCHAR`     | `Utf8`                          |
+| `UUID`        | _Not yet supported_             |
+| `CLOB`        | _Not yet supported_             |
+| `BINARY`      | _Not yet supported_             |
+| `VARBINARY`   | _Not yet supported_             |
+| `DECIMAL`     | `Float64`                       |
+| `FLOAT`       | `Float32`                       |
+| `SMALLINT`    | `Int16`                         |
+| `INT`         | `Int32`                         |
+| `BIGINT`      | `Int64`                         |
+| `REAL`        | `Float64`                       |
+| `DOUBLE`      | `Float64`                       |
+| `BOOLEAN`     | `Boolean`                       |
+| `DATE`        | `Date32`                        |
+| `TIME`        | `Time64(TimeUnit::Millisecond)` |
+| `TIMESTAMP`   | `Date64`                        |
+| `INTERVAL`    | _Not yet supported_             |
+| `REGCLASS`    | _Not yet supported_             |
+| `TEXT`        | _Not yet supported_             |
+| `BYTEA`       | _Not yet supported_             |
+| `CUSTOM`      | _Not yet supported_             |
+| `ARRAY`       | _Not yet supported_             |

 # Architecture Overview

 There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.

-* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
-
+- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+- (Feburary 2021): How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)

 # Developer's guide

@@ -19,14 +19,14 @@

 # Ballista: Distributed Compute with Apache Arrow

-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built 
-on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as 
+Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built
+on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as
 first-class citizens without paying a penalty for serialization costs.

 The foundational technologies in Ballista are:

 - [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient 
+- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
  data transfer between processes.
 - [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
 - [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
@@ -57,7 +57,6 @@ April 2021 and should be considered experimental.

 ## Getting Started

-The [Ballista Developer Documentation](docs/README.md) and the 
-[DataFusion User Guide](https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide) are currently the 
+The [Ballista Developer Documentation](docs/README.md) and the
+[DataFusion User Guide](https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide) are currently the
 best sources of information for getting started with Ballista.
-
@@ -16,19 +16,19 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Ballista Developer Documentation

-This directory contains documentation for developers that are contributing to Ballista. If you are looking for 
-end-user documentation for a published release, please start with the 
+This directory contains documentation for developers that are contributing to Ballista. If you are looking for
+end-user documentation for a published release, please start with the
 [DataFusion User Guide](../../docs/user-guide) instead.

 ## Architecture & Design

- Read the [Architecture Overview](architecture.md) to get an understanding of the scheduler and executor 
+- Read the [Architecture Overview](architecture.md) to get an understanding of the scheduler and executor
  processes and how distributed query execution works.

 ## Build, Test, Release

 - Setting up a [development environment](dev-env.md).
 - [Integration Testing](integration-testing.md)
-
@@ -16,37 +16,38 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Ballista Architecture

 ## Overview

-Ballista allows queries to be executed in a distributed cluster. A cluster consists of one or 
+Ballista allows queries to be executed in a distributed cluster. A cluster consists of one or
 more scheduler processes and one or more executor processes. See the following sections in this document for more
 details about these components.

-The scheduler accepts logical query plans and translates them into physical query plans using DataFusion and then 
-runs a secondary planning/optimization process to translate the physical query plan into a distributed physical 
-query plan. 
+The scheduler accepts logical query plans and translates them into physical query plans using DataFusion and then
+runs a secondary planning/optimization process to translate the physical query plan into a distributed physical
+query plan.

-This process breaks a query down into a number of query stages that can be executed independently. There are 
-dependencies between query stages and these dependencies form a directionally-acyclic graph (DAG) because a query 
+This process breaks a query down into a number of query stages that can be executed independently. There are
+dependencies between query stages and these dependencies form a directionally-acyclic graph (DAG) because a query
 stage cannot start until its child query stages have completed.

-Each query stage has one or more partitions that can be processed in parallel by the available 
+Each query stage has one or more partitions that can be processed in parallel by the available
 executors in the cluster. This is the basic unit of scalability in Ballista.

-The following diagram shows the flow of requests and responses between the client, scheduler, and executor 
-processes. 
+The following diagram shows the flow of requests and responses between the client, scheduler, and executor
+processes.

 ![Query Execution Flow](images/query-execution.png)

 ## Scheduler Process

-The scheduler process implements a gRPC interface (defined in 
+The scheduler process implements a gRPC interface (defined in
 [ballista.proto](../rust/ballista/proto/ballista.proto)). The interface provides the following methods:

 | Method               | Description                                                          |
-|----------------------|----------------------------------------------------------------------|
+| -------------------- | -------------------------------------------------------------------- |
 | ExecuteQuery         | Submit a logical query plan or SQL query for execution               |
 | GetExecutorsMetadata | Retrieves a list of executors that have registered with a scheduler  |
 | GetFileMetadata      | Retrieve metadata about files available in the cluster file system   |
@@ -60,7 +61,7 @@ The scheduler can run in standalone mode, or can be run in clustered mode using
 The executor process implements the Apache Arrow Flight gRPC interface and is responsible for:

 - Executing query stages and persisting the results to disk in Apache Arrow IPC Format
- Making query stage results available as Flights so that they can be retrieved by other executors as well as by 
+- Making query stage results available as Flights so that they can be retrieved by other executors as well as by
  clients

 ## Rust Client
@@ -69,7 +70,6 @@ The Rust client provides a DataFrame API that is a thin wrapper around the DataF
 the means for a client to build a query plan for execution.

 The client executes the query plan by submitting an `ExecuteLogicalPlan` request to the scheduler and then calls
-`GetJobStatus` to check for completion. On completion, the client receives a list of locations for the Flights 
-containing the results for the query and will then connect to the appropriate executor processes to retrieve 
+`GetJobStatus` to check for completion. On completion, the client receives a list of locations for the Flights
+containing the results for the query and will then connect to the appropriate executor processes to retrieve
 those results.
-
@@ -16,13 +16,14 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Setting up a Rust development environment

 You will need a standard Rust development environment. The easiest way to achieve this is by using rustup: https://rustup.rs/

 ## Install OpenSSL

-Follow instructions for [setting up OpenSSL](https://docs.rs/openssl/0.10.28/openssl/). For Ubuntu users, the following 
+Follow instructions for [setting up OpenSSL](https://docs.rs/openssl/0.10.28/openssl/). For Ubuntu users, the following
 command works.

 ```bash
@@ -35,4 +36,4 @@ You'll need cmake in order to compile some of ballista's dependencies. Ubuntu us

 ```bash
 sudo apt-get install cmake
-```
+```
@@ -16,10 +16,11 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Integration Testing

-We use the [DataFusion Benchmarks](https://github.com/apache/arrow-datafusion/tree/master/benchmarks) for integration 
-testing. 
+We use the [DataFusion Benchmarks](https://github.com/apache/arrow-datafusion/tree/master/benchmarks) for integration
+testing.

 The integration tests can be executed by running the following command from the root of the DataFusion repository.

@@ -18,5 +18,5 @@
 -->

 # Ballista - Rust
-This crate contains the Ballista client library. For an example usage, please refer [here](../benchmarks/tpch/README.md).

+This crate contains the Ballista client library. For an example usage, please refer [here](../benchmarks/tpch/README.md).
@@ -18,4 +18,5 @@
 -->

 # Ballista - Rust
+
 This crate contains the core Ballista types.
@@ -18,6 +18,7 @@
 -->

 # Ballista Executor - Rust
+
 This crate contains the Ballista Executor. It can be used both as a library or as a binary.

 ## Run
@@ -28,4 +29,4 @@ RUST_LOG=info cargo run --release
 [2021-02-11T05:30:13Z INFO  executor] Running with config: ExecutorConfig { host: "localhost", port: 50051, work_dir: "/var/folders/y8/fc61kyjd4n53tn444n72rjrm0000gn/T/.tmpv1LjN0", concurrent_tasks: 4 }
 ```

-By default, the executor will bind to `localhost` and listen on port `50051`.
+By default, the executor will bind to `localhost` and listen on port `50051`.
@@ -18,6 +18,7 @@
 -->

 # Ballista Scheduler
+
 This crate contains the Ballista Scheduler. It can be used both as a library or as a binary.

 ## Run
@@ -32,8 +33,9 @@ $ RUST_LOG=info cargo run --release
 By default, the scheduler will bind to `localhost` and listen on port `50051`.

 ## Connecting to Scheduler
-Scheduler supports REST model also using content negotiation. 
-For e.x if you want to get list of executors connected to the scheduler, 
+
+Scheduler supports REST model also using content negotiation.
+For e.x if you want to get list of executors connected to the scheduler,
 you can do (assuming you use default config)

 ```bash
@@ -43,7 +45,8 @@ curl --request GET \
 ```

 ## Scheduler UI
-A basic ui for the scheduler is in `ui/scheduler` of the ballista repo. 
+
+A basic ui for the scheduler is in `ui/scheduler` of the ballista repo.
 It can be started using the following [yarn](https://yarnpkg.com/) command

 ```bash
@@ -22,7 +22,9 @@
 ## Start project from source

 ### Run scheduler/executor
+
 First, run scheduler from project:
+
 ```shell
 $ cd rust/scheduler
 $ RUST_LOG=info cargo run --release
@@ -34,6 +36,7 @@ $ RUST_LOG=info cargo run --release
 ```

 and run executor in new terminal:
+
 ```shell
 $ cd rust/executor
 $ RUST_LOG=info cargo run --release
@@ -44,6 +47,7 @@ $ RUST_LOG=info cargo run --release
 ```

 ### Run Client project
+
 ```shell
 $ cd ui/scheduler
 $ yarn
@@ -45,16 +45,23 @@ docker run -it -v $(your_data_location):/data datafusion-cli
 ## Usage

 ```
+DataFusion 4.0.0-SNAPSHOT
+DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries
+against CSV and Parquet files as well as querying directly against in-memory data.
+
 USAGE:
-    datafusion-cli [OPTIONS]
+    datafusion-cli [FLAGS] [OPTIONS]

 FLAGS:
    -h, --help       Prints help information
+    -q, --quiet      Reduce printing other than the results and work quietly
    -V, --version    Prints version information

 OPTIONS:
-    -c, --batch-size <batch-size>    The batch size of each query, default value is 1048576
+    -c, --batch-size <batch-size>    The batch size of each query, or use DataFusion default
    -p, --data-path <data-path>      Path to your data, default to current directory
+    -f, --file <file>                Execute commands from file, then exit
+        --format <format>            Output format (possible values: table, csv, tsv, json) [default: table]
 ```

 Type `exit` or `quit` to exit the CLI.
@@ -64,7 +71,7 @@ Type `exit` or `quit` to exit the CLI.
 Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary to provide schema information for Parquet files.

 ```sql
-CREATE EXTERNAL TABLE taxi 
+CREATE EXTERNAL TABLE taxi
 STORED AS PARQUET
 LOCATION '/mnt/nyctaxi/tripdata.parquet';
 ```
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # DataFusion User Guide Source

 This directory contains the sources for the DataFusion user guide.
@@ -27,4 +28,4 @@ To generate the user guide in HTML format, run the following commands:
 ```bash
 cargo install mdbook
 mdbook build
-```
+```
@@ -16,12 +16,14 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Summary

 - [Introduction](introduction.md)
- [Example Usage](example-usage.md)  
+- [Example Usage](example-usage.md)
 - [Use as a Library](library.md)
 - [SQL Reference](sql/introduction.md)
+
  - [SELECT](sql/select.md)
  - [DDL](sql/ddl.md)
    - [CREATE EXTERNAL TABLE](sql/ddl.md)
@@ -36,4 +38,4 @@
  - [Clients](distributed/clients.md)
    - [Rust](distributed/client-rust.md)
    - [Python](distributed/client-python.md)
- [Frequently Asked Questions](faq.md)
+- [Frequently Asked Questions](faq.md)
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Python

-Coming soon.
+Coming soon.
@@ -16,7 +16,8 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 ## Ballista Rust Client

-The Rust client supports a `DataFrame` API as well as SQL. See the 
-[TPC-H Benchmark Client](https://github.com/ballista-compute/ballista/tree/main/rust/benchmarks/tpch) for an example.
+The Rust client supports a `DataFrame` API as well as SQL. See the
+[TPC-H Benchmark Client](https://github.com/ballista-compute/ballista/tree/main/rust/benchmarks/tpch) for an example.
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 ## Clients

 - [Rust](client-rust.md)
@@ -16,8 +16,10 @@
  specific language governing permissions and limitations
  under the License.
 -->
-# Configuration 
-The rust executor and scheduler can be configured using toml files, environment variables and command line arguments. The specification for config options can be found in `rust/ballista/src/bin/[executor|scheduler]_config_spec.toml`. 
+
+# Configuration
+
+The rust executor and scheduler can be configured using toml files, environment variables and command line arguments. The specification for config options can be found in `rust/ballista/src/bin/[executor|scheduler]_config_spec.toml`.

 Those files fully define Ballista's configuration. If there is a discrepancy between this documentation and the files, assume those files are correct.

@@ -25,8 +27,8 @@ To get a list of command line arguments, run the binary with `--help`

 There is an example config file at `ballista/rust/ballista/examples/example_executor_config.toml`

-The order of precedence for arguments is: default config file < environment variables < specified config file < command line arguments. 
+The order of precedence for arguments is: default config file < environment variables < specified config file < command line arguments.

-The executor and scheduler will look for the default config file at `/etc/ballista/[executor|scheduler].toml` To specify a config file use the `--config-file` argument. 
+The executor and scheduler will look for the default config file at `/etc/ballista/[executor|scheduler].toml` To specify a config file use the `--config-file` argument.

-Environment variables are prefixed by `BALLISTA_EXECUTOR` or `BALLISTA_SCHEDULER` for the executor and scheduler respectively. Hyphens in command line arguments become underscores. For example, the `--scheduler-host` argument for the executor becomes `BALLISTA_EXECUTOR_SCHEDULER_HOST`
+Environment variables are prefixed by `BALLISTA_EXECUTOR` or `BALLISTA_SCHEDULER` for the executor and scheduler respectively. Hyphens in command line arguments become underscores. For example, the `--scheduler-host` argument for the executor becomes `BALLISTA_EXECUTOR_SCHEDULER_HOST`
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Deployment

 Ballista is packaged as Docker images. Refer to the following guides to create a Ballista cluster:
@@ -23,4 +24,3 @@ Ballista is packaged as Docker images. Refer to the following guides to create a
 - [Create a cluster using Docker](standalone.md)
 - [Create a cluster using Docker Compose](docker-compose.md)
 - [Create a cluster using Kubernetes](kubernetes.md)
-
@@ -19,12 +19,12 @@

 # Installing Ballista with Docker Compose

-Docker Compose is a convenient way to launch a cluister when testing locally. The following Docker Compose example 
-demonstrates how to start a cluster using a single process that acts as both a scheduler and an executor, with a data 
+Docker Compose is a convenient way to launch a cluister when testing locally. The following Docker Compose example
+demonstrates how to start a cluster using a single process that acts as both a scheduler and an executor, with a data
 volume mounted into the container so that Ballista can access the host file system.

 ```yaml
-version: '2.0'
+version: "2.0"
 services:
  etcd:
    image: quay.io/coreos/etcd:v3.4.9
@@ -41,11 +41,9 @@ services:
      - "50051:50051"
    volumes:
      - ./data:/data
-
-
 ```

-With the above content saved to a `docker-compose.yaml` file, the following command can be used to start the single 
+With the above content saved to a `docker-compose.yaml` file, the following command can be used to start the single
 node cluster.

 ```bash
@@ -19,7 +19,7 @@

 ## Overview

-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is 
+Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is
 built on an architecture that allows other programming languages to be supported as first-class citizens without paying
 a penalty for serialization costs.

@@ -41,12 +41,12 @@ The following diagram highlights some of the integrations that will be possible
 Although Ballista is largely inspired by Apache Spark, there are some key differences.

 - The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized 
-processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still 
-largely row-based today.
+- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
+  processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
+  largely row-based today.
 - The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
 - The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
-  
+
 ## Status

-Ballista is at the proof-of-concept phase currently but is under active development by a growing community.
+Ballista is at the proof-of-concept phase currently but is under active development by a growing community.
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Deploying Ballista with Kubernetes

 Ballista can be deployed to any Kubernetes cluster using the following instructions. These instructions assume that
@@ -32,15 +33,15 @@ The k8s deployment consists of:

 Ballista is at an early stage of development and therefore has some significant limitations:

- There is no support for shared object stores such as S3. All data must exist locally on each node in the 
+- There is no support for shared object stores such as S3. All data must exist locally on each node in the
  cluster, including where any client process runs.
- Only a single scheduler instance is currently supported unless the scheduler is configured to use `etcd` as a 
+- Only a single scheduler instance is currently supported unless the scheduler is configured to use `etcd` as a
  backing store.

-## Create Persistent Volume and Persistent Volume Claim 
+## Create Persistent Volume and Persistent Volume Claim

-Copy the following yaml to a `pv.yaml` file and apply to the cluster to create a persistent volume and a persistent 
-volume claim so that the specified host directory is available to the containers. This is where any data should be 
+Copy the following yaml to a `pv.yaml` file and apply to the cluster to create a persistent volume and a persistent
+volume claim so that the specified host directory is available to the containers. This is where any data should be
 located so that Ballista can execute queries against it.

 ```yaml
@@ -121,20 +122,20 @@ spec:
        ballista-cluster: ballista
    spec:
      containers:
-      - name: ballista-scheduler
-        image: ballistacompute/ballista-rust:0.4.2-SNAPSHOT
-        command: ["/scheduler"]
-        args: ["--port=50050"]
-        ports:
-          - containerPort: 50050
-            name: flight
-        volumeMounts:
-          - mountPath: /mnt
-            name: data
+        - name: ballista-scheduler
+          image: ballistacompute/ballista-rust:0.4.2-SNAPSHOT
+          command: ["/scheduler"]
+          args: ["--port=50050"]
+          ports:
+            - containerPort: 50050
+              name: flight
+          volumeMounts:
+            - mountPath: /mnt
+              name: data
      volumes:
-      - name: data
-        persistentVolumeClaim:
-          claimName: data-pv-claim
+        - name: data
+          persistentVolumeClaim:
+            claimName: data-pv-claim
 ---
 apiVersion: apps/v1
 kind: StatefulSet
@@ -156,12 +157,18 @@ spec:
        - name: ballista-executor
          image: ballistacompute/ballista-rust:0.4.2-SNAPSHOT
          command: ["/executor"]
-          args: ["--port=50051", "--scheduler-host=ballista-scheduler", "--scheduler-port=50050", "--external-host=$(MY_POD_IP)"]
+          args:
+            [
+              "--port=50051",
+              "--scheduler-host=ballista-scheduler",
+              "--scheduler-port=50050",
+              "--external-host=$(MY_POD_IP)",
+            ]
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
-                  fieldPath: status.podIP            
+                  fieldPath: status.podIP
          ports:
            - containerPort: 50051
              name: flight
@@ -212,4 +219,4 @@ Run the following kubectl command to delete the cluster.

 ```bash
 kubectl delete -f cluster.yaml
-```
+```
@@ -16,31 +16,32 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Running Ballista on Raspberry Pi

 The Raspberry Pi single-board computer provides a fun and relatively inexpensive way to get started with distributed
 computing.

-These instructions have been tested using an Ubuntu Linux desktop as the host, and a 
+These instructions have been tested using an Ubuntu Linux desktop as the host, and a
 [Raspberry Pi 4 Model B](https://www.raspberrypi.org/products/raspberry-pi-4-model-b/) with 4 GB RAM as the target.

 ## Preparing the Raspberry Pi

 We recommend installing the 64-bit version of [Ubuntu for Raspberry Pi](https://ubuntu.com/raspberry-pi).

-The Rust implementation of Arrow does not work correctly on 32-bit ARM architectures 
+The Rust implementation of Arrow does not work correctly on 32-bit ARM architectures
 ([issue](https://github.com/apache/arrow-rs/issues/109)).

 ## Cross Compiling DataFusion for the Raspberry Pi

-We do not yet publish official Docker images as part of the release process, although we do plan to do this in the 
-future ([issue #228](https://github.com/apache/arrow-datafusion/issues/228)). 
+We do not yet publish official Docker images as part of the release process, although we do plan to do this in the
+future ([issue #228](https://github.com/apache/arrow-datafusion/issues/228)).

-Although it is technically possible to build DataFusion directly on a Raspberry Pi, it really isn't very practical. 
-It is much faster to use [cross](https://github.com/rust-embedded/cross) to cross-compile from a more powerful 
+Although it is technically possible to build DataFusion directly on a Raspberry Pi, it really isn't very practical.
+It is much faster to use [cross](https://github.com/rust-embedded/cross) to cross-compile from a more powerful
 desktop computer.

-Docker must be installed and the Docker daemon must be running before cross-compiling with cross. See the 
+Docker must be installed and the Docker daemon must be running before cross-compiling with cross. See the
 [cross](https://github.com/rust-embedded/cross) project for more detailed instructions.

 Run the following command to install cross.
@@ -63,9 +64,9 @@ cross test --target aarch64-unknown-linux-gnu

 ## Deploying the binaries to Raspberry Pi

-You should now be able to copy the executable to the Raspberry Pi using scp on Linux. You will need to change the IP 
-address in these commands to be the IP address for your Raspberry Pi. The easiest way to find this is to connect a 
-keyboard and monitor to the Pi and run `ifconfig`. 
+You should now be able to copy the executable to the Raspberry Pi using scp on Linux. You will need to change the IP
+address in these commands to be the IP address for your Raspberry Pi. The easiest way to find this is to connect a
+keyboard and monitor to the Pi and run `ifconfig`.

 ```bash
 scp ./target/aarch64-unknown-linux-gnu/release/ballista-scheduler ubuntu@10.0.0.186:
@@ -83,9 +84,9 @@ It is now possible to run the Ballista scheduler and executor natively on the Pi

 ## Docker

-Using Docker's `buildx` cross-platform functionality, we can also build a docker image targeting ARM64 
-from any desktop environment. This will require write access to a Docker repository 
-on [Docker Hub](https://hub.docker.com/) because the resulting Docker image will be pushed directly 
+Using Docker's `buildx` cross-platform functionality, we can also build a docker image targeting ARM64
+from any desktop environment. This will require write access to a Docker repository
+on [Docker Hub](https://hub.docker.com/) because the resulting Docker image will be pushed directly
 to the repo.

 ```bash
@@ -118,11 +119,11 @@ docker run -it myrepo/ballista-arm64 \
  --concurrency=24 --iterations=1 --debug --host=ballista-scheduler --port=50050
 ```

-Note that it will be necessary to mount appropriate volumes into the containers and also configure networking 
+Note that it will be necessary to mount appropriate volumes into the containers and also configure networking
 so that the Docker containers can communicate with each other. This can be achieved using Docker compose or Kubernetes.

 ## Kubernetes

 With Docker images built using the instructions above, it is now possible to deploy Ballista to a Kubernetes cluster
 running on one of more Raspberry Pi computers. Refer to the instructions in the [Kubernetes](kubernetes.md) chapter
-for more information, and remember to change the Docker image name to `myrepo/ballista-arm64`.
+for more information, and remember to change the Docker image name to `myrepo/ballista-arm64`.
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 ## Deploying a standalone Ballista cluster

 ### Start a Scheduler
@@ -50,7 +51,7 @@ Start one or more executor processes. Each executor process will need to listen
 ```bash
 docker run --network=host \
  -d ballistacompute/ballista-rust:0.4.2-SNAPSHOT \
-  /executor --external-host localhost --port 50051 
+  /executor --external-host localhost --port 50051
 ```

 Use `docker ps` to check that both the scheduer and executor(s) are now running:
@@ -71,14 +72,14 @@ $ docker logs 0746ce262a19
 [2021-02-14T18:36:25Z INFO  executor] Starting registration with scheduler
 ```

-The external host and port will be registered with the scheduler. The executors will discover other executors by 
+The external host and port will be registered with the scheduler. The executors will discover other executors by
 requesting a list of executors from the scheduler.

 ### Using etcd as backing store

 _NOTE: This functionality is currently experimental_

-Ballista can optionally use [etcd](https://etcd.io/) as a backing store for the scheduler. 
+Ballista can optionally use [etcd](https://etcd.io/) as a backing store for the scheduler.

 ```bash
 docker run --network=host \
@@ -88,5 +89,5 @@ docker run --network=host \
  --etcd-urls etcd:2379
 ```

-Please refer to the [etcd](https://etcd.io/) web site for installation instructions. Etcd version 3.4.9 or later is 
+Please refer to the [etcd](https://etcd.io/) web site for installation instructions. Etcd version 3.4.9 or later is
 recommended.
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Example Usage

 Run a SQL query against data stored in a CSV:
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Frequently Asked Questions

 ## What is the relationship between Apache Arrow, DataFusion, and Ballista?
@@ -23,9 +24,9 @@
 Apache Arrow is a library which provides a standardized memory representation for columnar data. It also provides
 "kernels" for performing common operations on this data.

-DataFusion is a library for executing queries in-process using the Apache Arrow memory 
-model and computational kernels. It is designed to run within a single process, using threads 
-for parallel query execution. 
+DataFusion is a library for executing queries in-process using the Apache Arrow memory
+model and computational kernels. It is designed to run within a single process, using threads
+for parallel query execution.

 Ballista is a distributed compute platform design to leverage DataFusion and other query
-execution libraries.
+execution libraries.
@@ -37,8 +37,7 @@ the convenience of an SQL interface or a DataFrame API.

 ## Why DataFusion?

-* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
-* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
-* *High Quality*:  Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
-
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
+- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
@@ -16,6 +16,7 @@
  specific language governing permissions and limitations
  under the License.
 -->
+
 # Using DataFusion as a library

 DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
@@ -20,7 +20,7 @@
 # SELECT syntax

 The queries in DataFusion scan data from tables and return 0 or more rows.
-In this documentation we describe the SQL syntax in DataFusion. 
+In this documentation we describe the SQL syntax in DataFusion.

 DataFusion supports the following syntax for queries:
 <code class="language-sql hljs">
@@ -32,7 +32,7 @@ DataFusion supports the following syntax for queries:
 [ [GROUP BY](#group-by-clause) grouping_element [, ...] ] <br/>
 [ [HAVING](#having-clause) condition] <br/>
 [ [UNION](#union-clause) [ ALL | select ] <br/>
-[ [ORDER BY](#order-by-clause) expression [ ASC | DESC ] [, ...] ] <br/>
+[ [ORDER BY](#order-by-clause) expression [ ASC | DESC ][, ...] ] <br/>
 [ [LIMIT](#limit-clause) count ] <br/>

 </code>
@@ -48,11 +48,10 @@ SELECT a, b FROM x;

 # SELECT clause

-
 Example:

 ```sql
-SELECT a, b, a + b FROM table 
+SELECT a, b, a + b FROM table
 ```

 The `DISTINCT` quantifier can be added to make the query return all distinct rows.
@@ -65,11 +64,11 @@ SELECT DISTINCT person, age FROM employees
 # FROM clause

 Example:
+
 ```sql
 SELECT t.a FROM table AS t
 ```

-
 # WHERE clause

 Example:
@@ -86,7 +85,6 @@ Example:
 SELECT a, b, MAX(c) FROM table GROUP BY a, b
 ```

-
 # HAVING clause

 Example:
@@ -126,7 +124,6 @@ SELECT age, person FROM table ORDER BY age DESC;
 SELECT age, person FROM table ORDER BY age, person DESC;
 ```

-
 # LIMIT clause

 Limits the number of rows to be a maximum of `count` rows. `count` should be a non-negative integer.
@@ -136,4 +133,4 @@ Example:
 ```sql
 SELECT age, person FROM table
 LIMIT 10
-```
+```