22 changes: 15 additions & 7 deletions docs/modules/ROOT/pages/demos/airflow-scheduled-job.adoc
@@ -1,12 +1,5 @@
= airflow-scheduled-job

[NOTE]
====
This guide assumes that you already have the demo `airflow-scheduled-job` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you have to run `stackablectl demo install airflow-scheduled-job`.
====

This demo will

* Install the required Stackable operators
@@ -22,6 +15,21 @@ You can see the deployed products as well as their relationship in the following

image::demo-airflow-scheduled-job/overview.png[]

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 9GiB memory
* 24GiB disk storage

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install airflow-scheduled-job`.
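
For reference, run it in a console like this:

[source,console]
----
$ stackablectl demo install airflow-scheduled-job
----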

== List deployed Stackable services
To list the installed Stackable services, run the following command:
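
The command block itself is collapsed out of this diff hunk; judging from the identical section in `hbase-hdfs-load-cycling-data` below, it is presumably:

[source,console]
----
$ stackablectl services list --all-namespaces
----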

docs/modules/ROOT/pages/demos/data-lakehouse-iceberg-trino-spark.adoc
@@ -1,36 +1,20 @@
= data-lakehouse-iceberg-trino-spark

[WARNING]
[IMPORTANT]
====
This demo shows a data workload with real-world data volumes and uses a significant amount of resources to ensure acceptable response times.
It will most likely not run on your workstation.

There is also the smaller xref:demos/trino-iceberg.adoc[] demo, which focuses on the capabilities that a lakehouse using Apache Iceberg offers.
The `trino-iceberg` demo has no streaming-data part and can be executed on a local workstation.

The demo was developed and tested on a Kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD).
Instance types that loosely correspond to this on the hyperscalers are:

- *Google*: `e2-standard-8`
- *Azure*: `Standard_D4_v2`
- *AWS*: `m5.2xlarge`

In addition to these nodes, the operators will request multiple persistent volumes with a total capacity of about 1TB.
====

[WARNING]
[CAUTION]
====
This demo only runs in the `default` namespace, as a `ServiceAccount` will be created.
Additionally, we have to use the FQDN service names (including the namespace) so that the TLS certificates used are valid.
====

[NOTE]
====
This guide assumes that you already have the demo `data-lakehouse-iceberg-trino-spark` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you have to run `stackablectl demo install data-lakehouse-iceberg-trino-spark`.
====

This demo will

* Install the required Stackable operators
@@ -53,6 +37,24 @@ You can see the deployed products as well as their relationship in the following

image::demo-data-lakehouse-iceberg-trino-spark/overview.png[]

[#system-requirements]
== System requirements

The demo was developed and tested on a Kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD).
Instance types that loosely correspond to this on the hyperscalers are:

- *Google*: `e2-standard-8`
- *Azure*: `Standard_D4_v2`
- *AWS*: `m5.2xlarge`

In addition to these nodes, the operators will request multiple persistent volumes with a total capacity of about 1TB.

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install data-lakehouse-iceberg-trino-spark`.
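
In a console session (keep in mind the resource warning above; this demo targets a multi-node cluster rather than a workstation):

[source,console]
----
$ stackablectl demo install data-lakehouse-iceberg-trino-spark
----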

== Apache Iceberg
As Apache Iceberg states on their https://iceberg.apache.org/docs/latest/[website]:

22 changes: 15 additions & 7 deletions docs/modules/ROOT/pages/demos/hbase-hdfs-load-cycling-data.adoc
@@ -1,12 +1,5 @@
= hbase-hdfs-load-cycling-data

[NOTE]
====
This guide assumes that you already have the demo `hbase-hdfs-load-cycling-data` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you have to run `stackablectl demo install hbase-hdfs-load-cycling-data`.
====

This demo will

* Install the required Stackable operators
@@ -22,6 +15,21 @@ You can see the deployed products as well as their relationship in the following

image::demo-hbase-hdfs-load-cycling-data/overview.png[]

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 3 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 6GiB memory
* 16GiB disk storage

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install hbase-hdfs-load-cycling-data`.
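
As a one-liner in a console:

[source,console]
----
$ stackablectl demo install hbase-hdfs-load-cycling-data
----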

== List deployed Stackable services
To list the installed Stackable services, run the following command:
`stackablectl services list --all-namespaces`
docs/modules/ROOT/pages/demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc
@@ -2,30 +2,10 @@

This demo showcases the integration between https://jupyter.org[Jupyter] and https://hadoop.apache.org/[Apache Hadoop] deployed on the Stackable Data Platform (SDP) Kubernetes cluster. https://jupyterlab.readthedocs.io/en/stable/[JupyterLab] is deployed using the https://github.com/jupyterhub/zero-to-jupyterhub-k8s[pyspark-notebook stack] provided by the Jupyter community. The SDP makes this integration easy by publishing a discovery `ConfigMap` for the HDFS cluster. This `ConfigMap` is then mounted in all `Pods` running https://spark.apache.org/docs/latest/api/python/getting_started/index.html[PySpark] notebooks so that these have access to HDFS data. For this demo, the HDFS cluster is provisioned with a small sample of the https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page[NYC taxi trip dataset], which is analyzed with a notebook that is provisioned automatically in the JupyterLab interface.

This demo can be installed on most cloud-managed Kubernetes clusters as well as on-premises or on a reasonably provisioned laptop. Install this demo on an existing Kubernetes cluster:

[source,bash]
----
stackablectl demo install jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data
----

[WARNING]
====
This demo should not be run alongside other demos and requires a minimum of 32 GB RAM and 8 CPUs.
====

[NOTE]
====
Some container images used by this demo are quite large and some steps may take several minutes to complete. If you install this demo locally, on a developer laptop for example, this can lead to timeouts during the installation. If this happens, it's safe to rerun the `stackablectl` command from above.

For more details on how to install Stackable demos see the xref:commands/demo.adoc#_install_demo[documentation].
====

== Aim / Context

This demo does not use the Stackable spark-k8s-operator but rather delegates the creation of executor pods to JupyterHub. The intention is to demonstrate how to interact with SDP components when designing and testing Spark jobs: the resulting script and Spark job definition can then be transferred for use with a Stackable `SparkApplication` resource. When logging in to JupyterHub (described below), a pod will be created with the username as a suffix, e.g. `jupyter-admin`. This runs a container that hosts a Jupyter notebook with Spark, Java and Python pre-installed. When the user creates a `SparkSession`, temporary Spark executors are created that persist until the notebook kernel is shut down or restarted. The notebook can thus be used as a sandbox for writing, testing and benchmarking Spark jobs before they are moved into production.


== Overview

This demo will:
@@ -39,6 +19,27 @@ This demo will:
* Train an anomaly detection model using PySpark on the data available in HDFS
* Perform some predictions and visualize anomalies

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 8 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 32GiB memory
* 22GiB disk storage

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data`.
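
In a console session (note the image-size caveat below):

[source,console]
----
$ stackablectl demo install jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data
----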

[NOTE]
====
Some container images used by this demo are quite large and some steps may take several minutes to complete. If you install this demo locally, on a developer laptop for example, this can lead to timeouts during the installation. If this happens, it's safe to rerun the `stackablectl` command from above.

For more details on how to install Stackable demos see the xref:commands/demo.adoc#_install_demo[documentation].
====

== HDFS

18 changes: 12 additions & 6 deletions docs/modules/ROOT/pages/demos/logging.adoc
@@ -59,14 +59,20 @@ vm.max_map_count=262144

Then run `sudo sysctl --load` to reload.
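
A minimal sketch of the full sequence, assuming the setting is persisted in `/etc/sysctl.conf` (the default file that `sudo sysctl --load` reads):

[source,console]
----
$ echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf # persist the kernel setting
$ sudo sysctl --load # reload the sysctl configuration
----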

== Run the demo
[#system-requirements]
== System requirements

The following command creates a kind cluster and installs this demo:
To run this demo, your system needs at least:

[source,console]
----
$ stackablectl demo install logging --kind-cluster
----
* 6.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 5GiB memory
* 27GiB disk storage

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install logging`.
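
For reference, in a console; append `--kind-cluster` if you also want stackablectl to create a local kind cluster first, as the previous version of this section described:

[source,console]
----
$ stackablectl demo install logging
----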

== List deployed Stackable services

docs/modules/ROOT/pages/demos/nifi-kafka-druid-earthquake-data.adoc
@@ -1,18 +1,11 @@
= nifi-kafka-druid-earthquake-data

[WARNING]
[CAUTION]
====
This demo only runs in the `default` namespace, as a `ServiceAccount` will be created.
Additionally, we have to use the FQDN service names (including the namespace) so that the TLS certificates used are valid.
====

[NOTE]
====
This guide assumes that you already have the demo `nifi-kafka-druid-earthquake-data` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you have to run `stackablectl demo install nifi-kafka-druid-earthquake-data`.
====

This demo will

* Install the required Stackable operators
@@ -32,6 +25,21 @@ You can see the deployed products as well as their relationship in the following

image::demo-nifi-kafka-druid-earthquake-data/overview.png[]

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 9 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 28GiB memory
* 75GiB disk storage

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install nifi-kafka-druid-earthquake-data`.
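
In a console (remember the caution above: this demo must run in the `default` namespace):

[source,console]
----
$ stackablectl demo install nifi-kafka-druid-earthquake-data
----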

== List deployed Stackable services
To list the installed Stackable services, run the following command:

docs/modules/ROOT/pages/demos/nifi-kafka-druid-water-level-data.adoc
@@ -1,18 +1,11 @@
= nifi-kafka-druid-water-level-data

[WARNING]
[CAUTION]
====
This demo only runs in the `default` namespace, as a `ServiceAccount` will be created.
Additionally, we have to use the FQDN service names (including the namespace) so that the TLS certificates used are valid.
====

[NOTE]
====
This guide assumes that you already have the demo `nifi-kafka-druid-water-level-data` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you have to run `stackablectl demo install nifi-kafka-druid-water-level-data`.
====

This demo will

* Install the required Stackable operators
@@ -34,6 +27,21 @@ You can see the deployed products as well as their relationship in the following

image::demo-nifi-kafka-druid-water-level-data/overview.png[]

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 9 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 28GiB memory
* 75GiB disk storage

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install nifi-kafka-druid-water-level-data`.
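
Likewise, in a console:

[source,console]
----
$ stackablectl demo install nifi-kafka-druid-water-level-data
----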

== List deployed Stackable services
To list the installed Stackable services, run the following command:

docs/modules/ROOT/pages/demos/spark-k8s-anomaly-detection-taxi-data.adoc
@@ -1,16 +1,5 @@
= spark-k8s-anomaly-detection-taxi-data

[WARNING]
====
This demo should not be run alongside other demos and requires a minimum of 32 GB RAM and 8 CPUs.
====
[NOTE]
====
This guide assumes that you already have the demo `spark-k8s-anomaly-detection-taxi-data` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you have to run `stackablectl demo install spark-k8s-anomaly-detection-taxi-data`.
====

This demo will

* Install the required Stackable operators
@@ -29,6 +18,21 @@ You can see the deployed products as well as their relationship in the following

image::spark-k8s-anomaly-detection-taxi-data/overview.png[]

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 8 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 32GiB memory
* 35GiB disk storage

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install spark-k8s-anomaly-detection-taxi-data`.
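
As a console session:

[source,console]
----
$ stackablectl demo install spark-k8s-anomaly-detection-taxi-data
----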

== List deployed Stackable services
To list the installed Stackable services, run the following command:

22 changes: 15 additions & 7 deletions docs/modules/ROOT/pages/demos/trino-iceberg.adoc
@@ -7,13 +7,6 @@ It focuses on the Trino and Iceberg integration and should run on your local workstation
If you are interested in a more complex lakehouse setup, please have a look at the xref:demos/data-lakehouse-iceberg-trino-spark.adoc[] demo.
====

[NOTE]
====
This guide assumes that you already have the demo `trino-iceberg` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you have to run `stackablectl demo install trino-iceberg`.
====

This demo will

* Install the required Stackable operators
Expand All @@ -22,6 +15,21 @@ This demo will
* Create multiple data lakehouse tables using Apache Iceberg and data from the https://www.tpc.org/tpch/[TPC-H dataset].
* Run some queries to show the benefits of Iceberg

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 9 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
* 27GiB memory
* 110GiB disk storage

[#installation]
== Installation

Please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you just have to run `stackablectl demo install trino-iceberg`.
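
In a console:

[source,console]
----
$ stackablectl demo install trino-iceberg
----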

== List deployed Stackable services
To list the installed Stackable services, run the following command:
