Merged
2 changes: 0 additions & 2 deletions ballista/docs/user-guide/.gitignore

This file was deleted.

1 change: 1 addition & 0 deletions docs/user-guide/.gitignore
@@ -0,0 +1 @@
book
Contributor

Should we add a newline at the end of the file?

14 changes: 4 additions & 10 deletions ballista/docs/user-guide/README.md → docs/user-guide/README.md
@@ -16,21 +16,15 @@
specific language governing permissions and limitations
under the License.
-->
# Ballista User Guide Source
# DataFusion User Guide Source

This directory contains the sources for the user guide that is published at https://ballistacompute.org/docs/.
This directory contains the sources for the DataFusion user guide.

## Generate HTML

To generate the user guide in HTML format, run the following commands:

```bash
cargo install mdbook
mdbook build
```

## Deploy User Guide to Web Site

Requires ssh certificate to be available.

```bash
./deploy.sh
```
@@ -16,8 +16,8 @@
# under the License.

[book]
authors = ["Andy Grove"]
authors = ["Apache Arrow"]
language = "en"
multilingual = false
src = "src"
title = "Ballista User Guide"
title = "DataFusion User Guide"
33 changes: 33 additions & 0 deletions docs/user-guide/src/SUMMARY.md
@@ -0,0 +1,33 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Summary

- [Introduction](introduction.md)
- [Example Usage](example-usage.md)
- [Use as a Library](library.md)
- [Distributed](distributed/introduction.md)
- [Create a Ballista Cluster](distributed/deployment.md)
- [Docker](distributed/standalone.md)
- [Docker Compose](distributed/docker-compose.md)
- [Kubernetes](distributed/kubernetes.md)
- [Ballista Configuration](distributed/configuration.md)
- [Clients](distributed/clients.md)
- [Rust](distributed/client-rust.md)
- [Python](distributed/client-python.md)
- [Frequently Asked Questions](faq.md)
@@ -16,15 +16,6 @@
specific language governing permissions and limitations
under the License.
-->
# Summary
# Python

- [Introduction](introduction.md)
- [Create a Ballista Cluster](deployment.md)
- [Docker](standalone.md)
- [Docker Compose](docker-compose.md)
- [Kubernetes](kubernetes.md)
- [Ballista Configuration](configuration.md)
- [Clients](clients.md)
- [Rust](client-rust.md)
- [Python](client-python.md)
- [Frequently Asked Questions](faq.md)
Coming soon.
@@ -33,8 +33,7 @@ The k8s deployment consists of:
Ballista is at an early stage of development and therefore has some significant limitations:

- There is no support for shared object stores such as S3. All data must exist locally on each node in the
cluster, including where any client process runs (until
[#473](https://github.com/ballista-compute/ballista/issues/473) is resolved).
cluster, including where any client process runs.
- Only a single scheduler instance is currently supported unless the scheduler is configured to use `etcd` as a
backing store.

76 changes: 76 additions & 0 deletions docs/user-guide/src/example-usage.md
@@ -0,0 +1,76 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Example Usage

Run a SQL query against data stored in a CSV:

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // register the table
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;

    // create a plan to run a SQL query
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?;

    // execute and print results
    let results: Vec<RecordBatch> = df.collect().await?;
    print_batches(&results)?;
    Ok(())
}
```

Use the DataFrame API to process data stored in a CSV:

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // create the dataframe
    let mut ctx = ExecutionContext::new();
    let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;

    // keep rows where a <= b, then compute MIN(b) per value of a
    let df = df.filter(col("a").lt_eq(col("b")))?
        .aggregate(vec![col("a")], vec![min(col("b"))])?
        .limit(100)?;

    // execute and print results
    let results: Vec<RecordBatch> = df.collect().await?;
    print_batches(&results)?;
    Ok(())
}
```

Both of these examples produce the following output:

```text
+---+--------+
| a | MIN(b) |
+---+--------+
| 1 | 2      |
+---+--------+
```
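
To make the semantics of the DataFrame example concrete, the following dependency-free sketch applies the same three steps (filter `a <= b`, group by `a` keeping the minimum `b`, then limit) to in-memory rows. It does not use DataFusion at all: the function name `min_b_by_a` and the sample rows are hypothetical, chosen only for illustration, and the rows are not the contents of `tests/example.csv`.

```rust
use std::collections::BTreeMap;

/// Keep rows where a <= b, group by `a`, keep the smallest `b` per group,
/// and return at most `limit` groups.
fn min_b_by_a(rows: &[(i64, i64)], limit: usize) -> Vec<(i64, i64)> {
    let mut groups: BTreeMap<i64, i64> = BTreeMap::new();
    for &(a, b) in rows.iter().filter(|&&(a, b)| a <= b) {
        groups
            .entry(a)
            .and_modify(|m| *m = (*m).min(b)) // smaller b wins
            .or_insert(b); // first row seen for this group
    }
    // BTreeMap iterates in key order; a SQL engine gives no such
    // guarantee unless the query has an ORDER BY clause.
    groups.into_iter().take(limit).collect()
}

fn main() {
    // Hypothetical sample rows (a, b).
    let rows = [(1, 2), (1, 5), (3, 1), (2, 7), (2, 4)];
    for (a, min_b) in min_b_by_a(&rows, 100) {
        println!("a = {}, MIN(b) = {}", a, min_b);
    }
}
```

The row `(3, 1)` is dropped by the filter, so only the groups for `a = 1` and `a = 2` remain. Note that the SQL example above has no `WHERE` clause; the two examples agree only when every row already satisfies `a <= b`.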
File renamed without changes.
44 changes: 44 additions & 0 deletions docs/user-guide/src/introduction.md
@@ -0,0 +1,44 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# DataFusion

DataFusion is an extensible query execution framework, written in
Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
in-memory format.

DataFusion provides both an SQL and a DataFrame API for building
logical query plans, along with a query optimizer and an execution
engine capable of parallel, multi-threaded execution against
partitioned data sources (CSV and Parquet).

## Use Cases

DataFusion is used to build modern, fast and efficient data
pipelines, ETL processes, and database systems that need the
performance of Rust and Apache Arrow and want to offer their users
the convenience of an SQL interface or a DataFrame API.

## Why DataFusion?

* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance.
* *Easy to Connect*: As part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem.
* *Easy to Embed*: Because it allows extension at almost any point in its design, DataFusion can be tailored to your specific use case.
* *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

28 changes: 28 additions & 0 deletions docs/user-guide/src/library.md
@@ -0,0 +1,28 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Using DataFusion as a library

DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).

To get started, add the following to your `Cargo.toml` file:

```toml
[dependencies]
datafusion = "4.0.0-SNAPSHOT"
```
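
The examples in the user guide use `#[tokio::main]` and import `arrow` directly, so a project that copies them will likely need more than the `datafusion` entry. The fragment below is a sketch under that assumption: the `tokio` version and feature list are illustrative choices rather than something this guide specifies, and an `arrow` dependency (not shown) would have to match the version that your `datafusion` release depends on.

```toml
[dependencies]
datafusion = "4.0.0-SNAPSHOT"
# Assumption: tokio 1.x; "macros" enables #[tokio::main] and
# "rt-multi-thread" provides the multi-threaded runtime it starts.
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
```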