diff --git a/ballista/docs/user-guide/.gitignore b/ballista/docs/user-guide/.gitignore
deleted file mode 100644
index e662f99e3281a..0000000000000
--- a/ballista/docs/user-guide/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-ballista-book.tgz
-book
\ No newline at end of file
diff --git a/docs/user-guide/.gitignore b/docs/user-guide/.gitignore
new file mode 100644
index 0000000000000..e9c072897d554
--- /dev/null
+++ b/docs/user-guide/.gitignore
@@ -0,0 +1 @@
+book
\ No newline at end of file
diff --git a/ballista/docs/user-guide/README.md b/docs/user-guide/README.md
similarity index 78%
rename from ballista/docs/user-guide/README.md
rename to docs/user-guide/README.md
index 9ee3e90fcf6dd..0b9278c593b1e 100644
--- a/ballista/docs/user-guide/README.md
+++ b/docs/user-guide/README.md
@@ -16,21 +16,15 @@
 specific language governing permissions and limitations
 under the License.
 -->
-# Ballista User Guide Source
+# DataFusion User Guide Source

-This directory contains the sources for the user guide that is published at https://ballistacompute.org/docs/.
+This directory contains the sources for the DataFusion user guide.

 ## Generate HTML

+To generate the user guide in HTML format, run the following commands:
+
 ```bash
 cargo install mdbook
 mdbook build
-```
-
-## Deploy User Guide to Web Site
-
-Requires ssh certificate to be available.
-
-```bash
-./deploy.sh
 ```
\ No newline at end of file
diff --git a/ballista/docs/user-guide/book.toml b/docs/user-guide/book.toml
similarity index 93%
rename from ballista/docs/user-guide/book.toml
rename to docs/user-guide/book.toml
index cf1653d74554d..efb9212dfdfda 100644
--- a/ballista/docs/user-guide/book.toml
+++ b/docs/user-guide/book.toml
@@ -16,8 +16,8 @@
 # under the License.
 [book]
-authors = ["Andy Grove"]
+authors = ["Apache Arrow"]
 language = "en"
 multilingual = false
 src = "src"
-title = "Ballista User Guide"
+title = "DataFusion User Guide"
diff --git a/docs/user-guide/src/SUMMARY.md b/docs/user-guide/src/SUMMARY.md
new file mode 100644
index 0000000000000..e2ddcb0a4e89c
--- /dev/null
+++ b/docs/user-guide/src/SUMMARY.md
@@ -0,0 +1,33 @@
+
+# Summary
+
+- [Introduction](introduction.md)
+- [Example Usage](example-usage.md)
+- [Use as a Library](library.md)
+- [Distributed](distributed/introduction.md)
+  - [Create a Ballista Cluster](distributed/deployment.md)
+    - [Docker](distributed/standalone.md)
+    - [Docker Compose](distributed/docker-compose.md)
+    - [Kubernetes](distributed/kubernetes.md)
+    - [Ballista Configuration](distributed/configuration.md)
+  - [Clients](distributed/clients.md)
+    - [Rust](distributed/client-rust.md)
+    - [Python](distributed/client-python.md)
+- [Frequently Asked Questions](faq.md)
\ No newline at end of file
diff --git a/ballista/docs/user-guide/src/SUMMARY.md b/docs/user-guide/src/distributed/client-python.md
similarity index 69%
rename from ballista/docs/user-guide/src/SUMMARY.md
rename to docs/user-guide/src/distributed/client-python.md
index c8fc2c8bd6a67..7525c608ad233 100644
--- a/ballista/docs/user-guide/src/SUMMARY.md
+++ b/docs/user-guide/src/distributed/client-python.md
@@ -16,15 +16,6 @@
 specific language governing permissions and limitations
 under the License.
 -->
-# Summary
+# Python

-- [Introduction](introduction.md)
-- [Create a Ballista Cluster](deployment.md)
-  - [Docker](standalone.md)
-  - [Docker Compose](docker-compose.md)
-  - [Kubernetes](kubernetes.md)
-  - [Ballista Configuration](configuration.md)
-- [Clients](clients.md)
-  - [Rust](client-rust.md)
-  - [Python](client-python.md)
-- [Frequently Asked Questions](faq.md)
\ No newline at end of file
+Coming soon.
\ No newline at end of file
diff --git a/ballista/docs/user-guide/src/client-rust.md b/docs/user-guide/src/distributed/client-rust.md
similarity index 100%
rename from ballista/docs/user-guide/src/client-rust.md
rename to docs/user-guide/src/distributed/client-rust.md
diff --git a/ballista/docs/user-guide/src/clients.md b/docs/user-guide/src/distributed/clients.md
similarity index 100%
rename from ballista/docs/user-guide/src/clients.md
rename to docs/user-guide/src/distributed/clients.md
diff --git a/ballista/docs/user-guide/src/configuration.md b/docs/user-guide/src/distributed/configuration.md
similarity index 100%
rename from ballista/docs/user-guide/src/configuration.md
rename to docs/user-guide/src/distributed/configuration.md
diff --git a/ballista/docs/user-guide/src/deployment.md b/docs/user-guide/src/distributed/deployment.md
similarity index 100%
rename from ballista/docs/user-guide/src/deployment.md
rename to docs/user-guide/src/distributed/deployment.md
diff --git a/ballista/docs/user-guide/src/docker-compose.md b/docs/user-guide/src/distributed/docker-compose.md
similarity index 100%
rename from ballista/docs/user-guide/src/docker-compose.md
rename to docs/user-guide/src/distributed/docker-compose.md
diff --git a/ballista/docs/user-guide/src/introduction.md b/docs/user-guide/src/distributed/introduction.md
similarity index 100%
rename from ballista/docs/user-guide/src/introduction.md
rename to docs/user-guide/src/distributed/introduction.md
diff --git a/ballista/docs/user-guide/src/kubernetes.md b/docs/user-guide/src/distributed/kubernetes.md
similarity index 97%
rename from ballista/docs/user-guide/src/kubernetes.md
rename to docs/user-guide/src/distributed/kubernetes.md
index 8cd8beeb267e6..027a44d469682 100644
--- a/ballista/docs/user-guide/src/kubernetes.md
+++ b/docs/user-guide/src/distributed/kubernetes.md
@@ -33,8 +33,7 @@ The k8s deployment consists of:
 Ballista is at an early stage of development and therefore has some significant limitations:
 - There is no support for shared object stores such as S3. All data must exist locally on each node in the
-  cluster, including where any client process runs (until
-  [#473](https://github.com/ballista-compute/ballista/issues/473) is resolved).
+  cluster, including where any client process runs.
 - Only a single scheduler instance is currently supported unless the scheduler is configured to use `etcd` as a
   backing store.
diff --git a/ballista/docs/user-guide/src/standalone.md b/docs/user-guide/src/distributed/standalone.md
similarity index 100%
rename from ballista/docs/user-guide/src/standalone.md
rename to docs/user-guide/src/distributed/standalone.md
diff --git a/docs/user-guide/src/example-usage.md b/docs/user-guide/src/example-usage.md
new file mode 100644
index 0000000000000..ff23c96de362e
--- /dev/null
+++ b/docs/user-guide/src/example-usage.md
@@ -0,0 +1,76 @@
+
+# Example Usage
+
+Run a SQL query against data stored in a CSV:
+
+```rust
+use datafusion::prelude::*;
+use arrow::util::pretty::print_batches;
+use arrow::record_batch::RecordBatch;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    // register the table
+    let mut ctx = ExecutionContext::new();
+    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;
+
+    // create a plan to run a SQL query
+    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?;
+
+    // execute and print results
+    let results: Vec<RecordBatch> = df.collect().await?;
+    print_batches(&results)?;
+    Ok(())
+}
+```
+
+Use the DataFrame API to process data stored in a CSV:
+
+```rust
+use datafusion::prelude::*;
+use arrow::util::pretty::print_batches;
+use arrow::record_batch::RecordBatch;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    // create the dataframe
+    let mut ctx = ExecutionContext::new();
+    let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+
+    let df = df.filter(col("a").lt_eq(col("b")))?
+        .aggregate(vec![col("a")], vec![min(col("b"))])?
+        .limit(100)?;
+
+    // execute and print results
+    let results: Vec<RecordBatch> = df.collect().await?;
+    print_batches(&results)?;
+    Ok(())
+}
+```
+
+Both of these examples will produce
+
+```text
++---+--------+
+| a | MIN(b) |
++---+--------+
+| 1 | 2      |
++---+--------+
+```
diff --git a/ballista/docs/user-guide/src/faq.md b/docs/user-guide/src/faq.md
similarity index 100%
rename from ballista/docs/user-guide/src/faq.md
rename to docs/user-guide/src/faq.md
diff --git a/ballista/docs/user-guide/src/img/ballista-architecture.png b/docs/user-guide/src/img/ballista-architecture.png
similarity index 100%
rename from ballista/docs/user-guide/src/img/ballista-architecture.png
rename to docs/user-guide/src/img/ballista-architecture.png
diff --git a/docs/user-guide/src/introduction.md b/docs/user-guide/src/introduction.md
new file mode 100644
index 0000000000000..c67fb90103d88
--- /dev/null
+++ b/docs/user-guide/src/introduction.md
@@ -0,0 +1,44 @@
+
+
+# DataFusion
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+* *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+
diff --git a/docs/user-guide/src/library.md b/docs/user-guide/src/library.md
new file mode 100644
index 0000000000000..12879b160c8f1
--- /dev/null
+++ b/docs/user-guide/src/library.md
@@ -0,0 +1,28 @@
+
+# Using DataFusion as a library
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "4.0.0-SNAPSHOT"
+```