Commit b218a88

vdk-gdp-execution-id: example added

I sketched a vdk-gdp-execution-id data job example to check locally using quickstart-vdk. This is a stacked PR to import fix #1961. Added the data job to the examples directory. Testing done: verified locally.

Signed-off-by: ivakoleva <iva.koleva@clearcode.bg>

1 parent 8fdde6f · commit b218a88

File tree

5 files changed: +135 −0 lines

Lines changed: 12 additions & 0 deletions

```sql
-- SQL scripts are standard SQL scripts. They are executed against the Platform OLAP database.
-- Refer to the platform documentation for more information.

-- Common uses of SQL steps are:
--   aggregating data from other tables into a new one
--   creating a table or a view that is needed for the Python steps

-- Queries in .sql files can be parametrised.
-- A valid query parameter looks like → {parameter}.
-- Parameters will be automatically replaced if a corresponding value exists in the IJobInput properties.

CREATE TABLE IF NOT EXISTS hello_world (id NVARCHAR, vdk_gdp_execution_id NVARCHAR);
```
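The `{parameter}` substitution described in the comments above can be illustrated with plain Python string formatting. This is a simplified sketch of the behaviour, not VDK's actual implementation, and the property name `target_id` is hypothetical:

```python
# Simplified illustration (not VDK's actual implementation) of how a
# {parameter} placeholder in a .sql step is replaced from job properties.
query = "SELECT * FROM hello_world WHERE id = '{target_id}'"

# Hypothetical properties, standing in for values held in IJobInput properties.
properties = {"target_id": "Hello World!"}

# Each placeholder with a matching property name is replaced by its value.
rendered = query.format(**properties)
print(rendered)
```

Placeholders without a matching property would raise a `KeyError` in this sketch; the real substitution logic lives inside VDK.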
Lines changed: 26 additions & 0 deletions

```python
# Copyright 2021-2023 VMware, Inc.
# SPDX-License-Identifier: Apache-2.0
import logging

from vdk.api.job_input import IJobInput

log = logging.getLogger(__name__)


def run(job_input: IJobInput):
    """
    A function named `run` is required in order for a Python script to be
    recognized as a Data Job Python step and executed.

    VDK provides every Python step with an object - job_input - that has methods for:

    * executing queries against the OLAP database;
    * ingesting data into a database;
    * processing data inside a database.
    See the IJobInput documentation for more details.
    """
    log.info(f"Starting job step {__name__}")

    # Write your python code inside here ... for example:
    job_input.send_object_for_ingestion(
        payload=dict(id="Hello World!"), destination_table="hello_world"
    )
```
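Conceptually, a payload preprocessor such as `vdk-gdp-execution-id` enriches each ingested payload with an execution id before it reaches the destination table. A minimal sketch of that idea (this is not the plugin's actual code; the id format merely mimics the one shown in the README output below):

```python
import time
import uuid


def add_execution_id(payload: dict, execution_id: str) -> dict:
    """Return a copy of the payload enriched with the execution-id micro-dimension."""
    enriched = dict(payload)
    enriched["vdk_gdp_execution_id"] = execution_id
    return enriched


# An id resembling "<uuid>-<unix timestamp>", as seen in the sample query output.
execution_id = f"{uuid.uuid4()}-{int(time.time())}"
row = add_execution_id({"id": "Hello World!"}, execution_id)
```

The original payload is left untouched; only the enriched copy is sent onward, which is why every ingested row ends up carrying the same execution id for a given job run.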
Lines changed: 28 additions & 0 deletions

# My shiny new job

Versatile Data Kit allows you to implement automated pull ingestion and batch data processing.

# Generative Data Packs

A GDP plugin automatically expands the data you ingest.

# Data expansion

The `vdk-gdp-execution-id` plugin used in [requirements.txt](./requirements.txt) and [config.ini](./config.ini)
automatically expands your dataset with the unique Data Job execution id.
As a result, the produced dataset can be correlated to a particular Data Job execution.

# Run the example

To run the data job locally:
```bash
vdk run gdp-execution-id-example
```

To check the expanded and ingested data:
```
% vdk sqlite-query -q "select * from hello_world"
Creating new connection against local file database located at: /var/folders/h3/9ns__d4945qcvkdm2m2vjvqh0000gq/T/vdk-sqlite.db
id            vdk_gdp_execution_id
------------  -----------------------------------------------
Hello World!  a17baca4-4780-4a60-b409-10e8b6fa90de-1682424042
```

Here `hello_world.id` is ingested in [20_python_step.py](./20_python_step.py),
and `vdk_gdp_execution_id` is added automatically for you.
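The correlation described above can be demonstrated with an in-memory SQLite database. This is a standalone sketch reusing the table definition from the SQL step and the execution id value from the sample output; it does not touch the actual vdk-sqlite.db file:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same table definition as in the example's SQL step.
conn.execute(
    "CREATE TABLE IF NOT EXISTS hello_world (id NVARCHAR, vdk_gdp_execution_id NVARCHAR)"
)
# One row per ingested payload; the id value is taken from the sample output above.
conn.execute(
    "INSERT INTO hello_world VALUES (?, ?)",
    ("Hello World!", "a17baca4-4780-4a60-b409-10e8b6fa90de-1682424042"),
)

# Because every row carries the execution id, results can be filtered
# or grouped per Data Job run.
rows = conn.execute("SELECT id, vdk_gdp_execution_id FROM hello_world").fetchall()
```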
Lines changed: 64 additions & 0 deletions

```ini
; Supported format: https://docs.python.org/3/library/configparser.html#supported-ini-file-structure

; This is the only file required to deploy a Data Job.
; Read more to understand what each option means:

; Information about the owner of the Data Job
[owner]

; Team is a way to group Data Jobs that belong to the same team.
team = taurus

; Configuration related to running data jobs
[job]
; For the format see https://en.wikipedia.org/wiki/Cron
; The cron expression is evaluated in UTC time.
; If it is time for a new job run and the previous job run hasn't finished yet,
; the cron job waits until the previous execution has finished.
schedule_cron = */2 * * * *

; Who will be contacted and on what occasion
[contacts]

; Specifies the time interval (in minutes) that a job execution is allowed to be delayed
; from its scheduled time before a notification email is sent. The default is 240.
; notification_delay_period_minutes=240

; Specifies whether to enable or disable the email notifications for each data job run attempt.
; The default value is true.
; enable_attempt_notifications=true

; Specifies whether to enable or disable email notifications per data job execution and execution delays.
; The default value is true.
; enable_execution_notifications=true

; The [contacts] properties below use a semicolon-separated list of email addresses that will be notified with an email message on a given condition.
; You can also provide an email address linked to your Slack account in order to receive Slack messages.
; To generate a Slack-linked email address follow the steps here:
; https://get.slack.help/hc/en-us/articles/206819278-Send-emails-to-Slack#connect-the-email-app-to-your-workspace

; Semicolon-separated list of email addresses to be notified on job execution failure caused by a user code or user configuration error.
; For example: if the job contains an SQL script with a syntax error.
; notified_on_job_failure_user_error=example@vmware.com
notified_on_job_failure_user_error=

; Semicolon-separated list of email addresses to be notified on job execution failure caused by a platform error.
; notified_on_job_failure_platform_error=example@example.com; example2@example.com
notified_on_job_failure_platform_error=

; Semicolon-separated list of email addresses to be notified on job execution success.
notified_on_job_success=

; Semicolon-separated list of email addresses to be notified of the job deployment outcome.
; Notice that if this file is malformed (the file structure is not as per https://docs.python.org/3/library/configparser.html#supported-ini-file-structure),
; then an email notification will NOT be sent to the recipients specified here.
notified_on_job_deploy=

[vdk]
; Key-value pairs of any configuration options that can be passed to vdk.
; For the possible options in your vdk installation, execute the command: vdk config-help
db_default_type=SQLITE
ingest_method_default=SQLITE
ingest_payload_preprocess_sequence=vdk-gdp-execution-id
; The name of the micro-dimension that is added to each payload sent for ingestion.
; gdp_execution_id_micro_dimension_name=vdk_gdp_execution_id
```
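Since config.ini follows the configparser INI format linked in its first line, the semicolon-separated contact lists can be read with Python's standard library. A sketch under that assumption; the addresses are placeholders:

```python
import configparser

# A minimal [contacts] fragment in the same format as the job's config.ini.
cfg = configparser.ConfigParser()
cfg.read_string(
    """
[contacts]
notified_on_job_failure_user_error = first@example.com; second@example.com
"""
)

# Split the semicolon-separated list and drop surrounding whitespace.
raw = cfg["contacts"]["notified_on_job_failure_user_error"]
emails = [address.strip() for address in raw.split(";") if address.strip()]
```

An empty value (as shipped in this example's config.ini) simply yields an empty list, i.e. nobody is notified.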
Lines changed: 5 additions & 0 deletions

```text
# Python jobs can specify extra library dependencies in a requirements.txt file.
# See https://pip.readthedocs.io/en/stable/user_guide/#requirements-files
# The file is optional and can be deleted if no extra library dependencies are necessary.

vdk-gdp-execution-id
```
