From 80b3a3780217be801877480924cb514ffe36745d Mon Sep 17 00:00:00 2001 From: yasha Date: Mon, 20 Jan 2020 17:21:25 +0000 Subject: [PATCH 01/32] added lgbm server --- serving/lightgbm/lgbm_serving.ipynb | 398 ++++++++++++++++++++++++++++ 1 file changed, 398 insertions(+) create mode 100644 serving/lightgbm/lgbm_serving.ipynb diff --git a/serving/lightgbm/lgbm_serving.ipynb b/serving/lightgbm/lgbm_serving.ipynb new file mode 100644 index 000000000..9ff22fff1 --- /dev/null +++ b/serving/lightgbm/lgbm_serving.ipynb @@ -0,0 +1,398 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Deploy a Serverless Model Server with Nuclio-KFServing\n", + " --------------------------------------------------------------------\n", + "\n", + "The following notebook demonstrates how to deploy a **[LightGBM](https://github.com/microsoft/LightGBM)** model using **[nuclio](https://github.com/nuclio/nuclio)** + **[KFServing](https://github.com/kubeflow/kfserving)** (a.k.a. Nuclio-serving)\n", + "\n", + "#### **notebook how-to's**\n", + "* Write and test a model serving (KFServing) class in a notebook.\n", + "* Deploy the model server as a Nuclio-serving function.\n", + "* Invoke and test the serving function." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### **steps**\n", + "**[define a new function and its dependencies](#define-function)**
\n", + "**[test the model serving class locally](#test-locally)**
\n", + "**[deploy our serving class as a serverless function](#deploy)**
\n", + "**[test our model server using an HTTP request](#test-model-server)**
" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "# nuclio: ignore\n", + "# if the nuclio-jupyter package is not installed run !pip install nuclio-jupyter\n", + "import nuclio" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### **define a new function and its dependencies**" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "%nuclio: setting kind to 'nuclio:serving'\n", + "%nuclio: setting 'MODEL_CLASS' environment variable\n" + ] + } + ], + "source": [ + "%nuclio config kind=\"nuclio:serving\"\n", + "%nuclio env MODEL_CLASS=ClassifierModel" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "%%nuclio cmd -c\n", + "pip install -U -q kfserving\n", + "pip install -U -q azure\n", + "pip install -U -q numpy\n", + "pip install -U -q xgboost\n", + "pip install -U -q lightgbm\n", + "pip install -U -q mlrun" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "import kfserving\n", + "import os\n", + "import numpy as np\n", + "from pickle import load\n", + "import lightgbm as lgb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**NOTE: bring your own pickled model by changing the following variables, or run the [LightGBM demo](https://github.com/mlrun/demos/tree/master/lightgbm#instructions-for-lightgbm-demo).**" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "model = ()\n", + "TARGET_PATH = '/User/mlrun/lightgbm'\n", + "MODEL_FILE = 'lightgbm_classifier.pkl'" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "class ClassifierModel(kfserving.KFModel):\n", + " def __init__(self, name: str, model_dir: 
str, model=None):\n", + " super().__init__(name)\n", + " self.name = name\n", + " self.model_dir = model_dir\n", + " if model is not None:\n", + " self.classifier = model\n", + " self.ready = True\n", + "\n", + " def load(self):\n", + " model_file = os.path.join(\n", + " kfserving.Storage.download(self.model_dir), MODEL_FILE)\n", + " self.classifier = load(open(model_file, 'rb'))\n", + " self.ready = True\n", + "\n", + " def predict(self, body):\n", + " try:\n", + " feats = np.asarray(body['instances'])\n", + " result: np.ndarray = self.classifier.predict(feats)\n", + " return result.tolist()\n", + " except Exception as e:\n", + " raise Exception(\"Failed to predict %s\" % e)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following end-code annotation tells ```nuclio``` to stop parsing the notebook from this cell. _**Please do not remove this cell**_:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# nuclio: end-code" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "______________________________________________" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### **test the model serving class locally**\n", + "The class above can be tested locally. 
Instantiate the class and call `.load()`, which copies the model to a local dir and loads it.\n", + "\n", + "> **Verify there is a `lightgbm_classifier.pkl` file (the `MODEL_FILE` defined above) in the model_dir path (generated by the training notebook)**" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "import pyarrow.parquet as pq\n", + "import pyarrow as pa\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[I 200120 17:03:17 storage:35] Copying contents of /User/mlrun/lightgbm to local\n" + ] + } + ], + "source": [ + "my_server = ClassifierModel('some-classifier-model', model_dir=TARGET_PATH)\n", + "my_server.load()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### _data_\n", + "Grab some data from the test set we prepared in the **[LightGBM demo](https://github.com/mlrun/demos/tree/master/lightgbm#instructions-for-lightgbm-demo)**:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "features = pq.read_table(os.path.join(TARGET_PATH, 'xtest.parquet')).to_pandas().iloc[:3, :]\n", + "labels = pq.read_table(os.path.join(TARGET_PATH, 'ytest.parquet')).to_pandas().iloc[:3, :]" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "event = {\"instances\": features.values.tolist()}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We can use the `.predict(body)` method to test the model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1.0, 1.0, 0.0]" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_server.predict(event)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### **deploy our serving class as a serverless function**\n", + "In the following section we create a new model serving function that wraps our class, and specify the model and other resources.\n", + "\n", + "The `models` dict stores model names and the associated model **dir** URL (the URL can start with `S3://` or another blob store prefix). The faster way is to use a shared file volume: we use `.apply(mount_v3io())` to attach a v3io (Iguazio data fabric) volume to our function. By default v3io mounts the current user's home into the `/User` function path.\n", + "\n", + "**verify the model dir contains a valid `lightgbm_classifier.pkl` file**" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "from mlrun import new_model_server, mount_v3io\n", + "import requests" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fn = new_model_server('some-classifier-model', \n", + " models={'classifier_gen': TARGET_PATH}, \n", + " model_class='ClassifierModel')\n", + "\n", + "fn.apply(mount_v3io()) " + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-20 17:05:07,328 deploy started\n", + "[nuclio] 2020-01-20 17:07:37,734 (info) Build complete\n", + "[nuclio] 2020-01-20 17:07:46,843 done creating some-classifier-model, function address: 
3.135.246.153:31529\n", + " ] + } + ], + "source": [ + "addr = fn.deploy()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### **test our model server using an HTTP request**\n", + "\n", + "\n", + "We invoke our model serving function using test data; the data vector is specified in the `instances` attribute." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "resp = requests.post(addr + '/classifier_gen/predict', json=event)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1.0, 1.0, 0.0]" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "json.loads(resp.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**[back to top](#top)**" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 1c51ea0836a828c74a7727a3fb54e200325f2db8 Mon Sep 17 00:00:00 2001 From: yasha Date: Mon, 20 Jan 2020 17:23:52 +0000 Subject: [PATCH 02/32] removed vscode --- .vscode/settings.json | 3 --- 1 file changed, 3 deletions(-) delete mode 100644 .vscode/settings.json diff --git a/.vscode/settings.json b/.vscode/settings.json deleted file mode 100644 index 0ff9cf764..000000000 --- a/.vscode/settings.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "python.pythonPath": "/home/yasha/anaconda3/bin/python" -} \ No newline at end of file From db8da20c32124015e26f0b6b9911814493c5d5de Mon Sep 17 00:00:00 2001 From: yasha Date: Mon, 20 Jan 2020 18:18:08 
+0000 Subject: [PATCH 03/32] add log header too to arc_2parq --- .gitignore.swp | Bin 0 -> 12288 bytes fileutils/arc_to_parquet/arc_to_parquet.py | 6 +++++- 2 files changed, 5 insertions(+), 1 deletion(-) create mode 100644 .gitignore.swp diff --git a/.gitignore.swp b/.gitignore.swp new file mode 100644 index 0000000000000000000000000000000000000000..ea205f57df386d1b6f3827c0f793b0dc8a8814e9 GIT binary patch literal 12288 zcmeI%y-LGS6u|LQp^G4jzCcx63Q1apB(s}?4niH3=A-de%?!oyY43s1QJ4GXhmP zvHYFU+3B!;)sK%3dk6cY&Ge=V0tg_000IagfB*srv_hcFI`V2J-D;-oHo5aFHzf%H z1Q0*~0R#|0009ILKmY**S|~7%MBaBr7A98z|9`&!zfApUVrk;TM7`g_`!O;C2q1s} x0tg_000IagfB*vjRiJN&#dG}->D$R%Y?aN@;4Zw6!-q_(v@GOFNAY9f$Tw{nIbr|+ literal 0 HcmV?d00001 diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py b/fileutils/arc_to_parquet/arc_to_parquet.py index 80e2a3b5a..98f521e1f 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/arc_to_parquet.py @@ -65,5 +65,9 @@ def arc_to_parquet( context.logger.info("destination file already exists") if log_data: - context.logger.info(f"assign data to {key} in artifact store") context.log_artifact(key, target_path=dest_path) + if header: + header = [x.replace(' ', '_') for x in header] + filepath = path.join(target_path, 'header.json') + json.dump(header, open(filepath, 'w')) + context.log_artifact('header', target_path=filepath) From 874e777a048c032c930be2090646fbfe03439d64 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 07:19:06 +0000 Subject: [PATCH 04/32] add header to artifacts in arc_to_parquet --- .../arc_to_parquet/arc_to_parquet-bak.yaml | 13 + fileutils/arc_to_parquet/arc_to_parquet.py | 18 +- fileutils/arc_to_parquet/arc_to_parquet.yaml | 26 +- ...{file-utils.ipynb => arc_to_parquet.ipynb} | 721 +++++++++--------- tests/open_archive.ipynb | 306 ++++++++ 5 files changed, 692 insertions(+), 392 deletions(-) create mode 100644 fileutils/arc_to_parquet/arc_to_parquet-bak.yaml rename tests/{file-utils.ipynb => 
arc_to_parquet.ipynb} (64%) create mode 100644 tests/open_archive.ipynb diff --git a/fileutils/arc_to_parquet/arc_to_parquet-bak.yaml b/fileutils/arc_to_parquet/arc_to_parquet-bak.yaml new file mode 100644 index 000000000..63c7e5d68 --- /dev/null +++ b/fileutils/arc_to_parquet/arc_to_parquet-bak.yaml @@ -0,0 +1,13 @@ +kind: job +metadata: + name: arc-to-parquet +spec: + description: '' + build: + functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA2OjE2CgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgdXJsbGliMwp1cmxsaWIzLmRpc2FibGVfd2FybmluZ3MoKQoKeGZuID0gbWxydW4uaW1wb3J0X2Z1bmN0aW9uKCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20veWpiLWRzL2Z1bmN0aW9ucy9hcmMycGFycS9maWxldXRpbHMvb3Blbl9hcmNoaXZlL2Z1bmN0aW9uLnlhbWwnKQoKeGZuLmFwcGx5KG1scnVuLm1vdW50X3YzaW8oKSkKeGZuLmludGVyYWN0aXZlID0gVHJ1ZQoKCmltYWdlc19wYXRoID0gJy9Vc2VyL21scnVuL2Z1bmN0aW9ucy9pbWFnZXMnCgpvcGVuX2FyY2hpdmVfdGFzayA9IG1scnVuLk5ld1Rhc2soCiAgICAnZG93bmxvYWQnLAogICAgaGFuZGxlcj0nb3Blbl9hcmNoaXZlJywgCiAgICBwYXJhbXM9eyd0YXJnZXRfcGF0aCc6IGltYWdlc19wYXRoLAogICAgICAgICAgICAna2V5JyAgICAgICAgOiAnY29udGVudHMnfSwKICAgIGlucHV0cz17J2FyY2hpdmVfdXJsJzogJ2h0dHA6Ly9pZ3VhemlvLXNhbXBsZS1kYXRhLnMzLmFtYXpvbmF3cy5jb20vY2F0c25kb2dzLnppcCd9CikKCnJ1biA9IHhmbi5ydW4ob3Blbl9hcmNoaXZlX3Rhc2spCgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgp4Zm4gPSBtbHJ1bi5jb2RlX3RvX2Z1bmN0aW9uKCcvVXNlci9yZXBvcy9mdW5jdGlvbnMvZmlsZXV0aWxzL2FyY190b19wYXJxdWV0L2FyY190b19wYXJxdWV0LnB5Jywga2luZD0nam9iJykKCnhmbi5leHBvcnQoJy9Vc2VyL3JlcG9zL2Z1bmN0aW9ucy9maWxldXRpbHMvYXJjX3RvX3BhcnF1ZXQvYXJjX3RvX3BhcnF1ZXQueWFtbCcpCgp4Zm4gPSBtbHJ1bi5pbXBvcnRfZnVuY3Rpb24oJy9Vc2VyL3JlcG9zL2Z1bmN0aW9ucy9maWxldXRpbHMvYXJjX3RvX3BhcnF1ZXQvYXJjX3RvX3BhcnF1ZXQueWFtbCcpCgp4Zm4uYXBwbHkobWxydW4ubW91bnRfdjNpbygpKQp4Zm4uaW50ZXJhY3RpdmUgPSBUcnVlCgp4Zm4uZGVwbG95KCkKCnRhcmdldF9wYXRoID0gJy9Vc2VyL21scnVuL2Z1bmN0aW9ucy9wYXJxdWV0JwphcmNoaXZlID0gJ2h0dHBzOi8vZnBzaWduYWxzLXB1YmxpYy5z
My5hbWF6b25hd3MuY29tL29uZV9jc3YudGFyLmd6JwpwYXJxdWV0X2ZpbGUgPSAneF90ZXN0XzUwLnBhcnF1ZXQnICMgdGhlIGZpbGUgZXh0ZW5zaW9uIGlzIG5vdCBuZWNlc3NhcnkKcGFycXVldF9maWxlX3BhdGggPSB0YXJnZXRfcGF0aCArICIvIiArIHBhcnF1ZXRfZmlsZQphcnRpZmFjdF9rZXkgPSAncmF3X2RhdGEnCgphcmNfdG9fcGFycV90YXNrID0gbWxydW4uTmV3VGFzaygKICAgICdhcmMycGFycScsIAogICAgaGFuZGxlcj0nYXJjX3RvX3BhcnF1ZXQnLCAgIyBhIHN0cmluZyBzaW5jZSB3ZSBhcmUgY2FsbGluZyB0aGlzICdyZW1vdGVseScsIG91dHNpZGUgdGhpcyBub3RlYm9vawogICAgcGFyYW1zPXsKICAgICAgICAndGFyZ2V0X3BhdGgnOiB0YXJnZXRfcGF0aCwKICAgICAgICAnbmFtZScgICAgICAgOiBwYXJxdWV0X2ZpbGUsIAogICAgICAgICdrZXknICAgICAgICA6IGFydGlmYWN0X2tleSwKICAgICAgICAnYXJjaGl2ZV91cmwnOiBhcmNoaXZlfSwKICAgIG91dHB1dHM9W2FydGlmYWN0X2tleV0pCgpydW4gPSB4Zm4ucnVuKGFyY190b19wYXJxX3Rhc2spCgppbXBvcnQgcHlhcnJvdy5wYXJxdWV0IGFzIHBxCmRmID0gcHEucmVhZF90YWJsZShwYXJxdWV0X2ZpbGVfcGF0aCkudG9fcGFuZGFzKCkKZGYuaGVhZCgpCgppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKCmFzc2VydCBhcnRpZmFjdF9rZXkgaW4gcnVuLm91dHB1dHMua2V5cygpLCBmIm1scnVuLmZ1bmN0aW9uczoga2V5IHthcnRpZmFjdF9rZXl9IG5vdCBmb25kIGluIG91dHB1dHMiCmFzc2VydCBvcy5wYXRoLmlzZmlsZShwYXJxdWV0X2ZpbGVfcGF0aCksICBmIm1scnVuLmZ1bmN0aW9uczogYXJ0aWZhY3Qgc291cmNlIG5vdCBmb3VuZCBhdCB7cGFycXVldF9maWxlX3BhdGh9IgoKb3JpZ2luYWwgPSBwZC5yZWFkX2NzdihhcmNoaXZlKQpjb3BpZWQgICA9IHBkLnJlYWRfcGFycXVldChwYXJxdWV0X2ZpbGVfcGF0aCwgZW5naW5lPSJweWFycm93IikKYXNzZXJ0IG5wLmFycmF5X2VxdWFsKG9yaWdpbmFsLCBjb3BpZWQpLCAgICJtbHJ1bi5mdW5jdGlvbnM6IG9yaWdpbmFsIGFuZCBjb3BpZWQgZGF0YSBub3QgZXF1YWwiCgpvcy5yZW1vdmUocGFycXVldF9maWxlX3BhdGgpCgo= + base_image: python:3.6-jessie + commands: + - pip install -q mlrun + - pip install -q pyarrow + - pip install -q numpy + - pip install -q pandas \ No newline at end of file diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py b/fileutils/arc_to_parquet/arc_to_parquet.py index 98f521e1f..64b889801 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/arc_to_parquet.py @@ -11,10 +11,10 @@ def arc_to_parquet( context: MLClientCtx, archive_url: 
Union[str, Path, IO[AnyStr]], - header: Optional[List[str]] = None, + header: Union[int, List[str], None] = 0, target_path: str = "", name: str = "", - chunksize: int = 10_000, + chunksize: int = 10_240, log_data: bool = True, add_uid: bool = False, key: str = "raw_data", @@ -45,7 +45,7 @@ def arc_to_parquet( dest_path = os.path.join(target_path, uid, name) os.makedirs(os.path.join(target_path, uid), exist_ok=True) - + if not os.path.isfile(dest_path): context.logger.info("destination file does not exist, downloading") pqwriter = None @@ -54,6 +54,8 @@ def arc_to_parquet( ): table = pa.Table.from_pandas(df) if i == 0: + header = list(df) + header = [x.replace(' ', '_') for x in header] pqwriter = pq.ParquetWriter(dest_path, table.schema) pqwriter.write_table(table) @@ -66,8 +68,8 @@ def arc_to_parquet( if log_data: context.log_artifact(key, target_path=dest_path) - if header: - header = [x.replace(' ', '_') for x in header] - filepath = path.join(target_path, 'header.json') - json.dump(header, open(filepath, 'w')) - context.log_artifact('header', target_path=filepath) + # log header + filepath = path.join(target_path, 'header.json') + json.dump(header, open(filepath, 'w')) + context.log_artifact('header', target_path=filepath) + diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index 28e73eca8..a1dd5e6be 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -1,13 +1,23 @@ kind: job metadata: - name: arc_to_parquet + name: arc-to-parquet + tag: '' + hash: 251c8e50eb09d09032b4b3accf3c3a3a3c1b467b + project: '' spec: - description: 'archive to parquet and log' + command: '' + args: [] + volumes: [] + volume_mounts: [] + env: [] + description: '' build: - functionSourceCode: 
IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTA5IDE3OjA3CgppbXBvcnQgb3MKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIHR5cGluZyBpbXBvcnQgSU8sIEFueVN0ciwgVW5pb24sIExpc3QKZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCgppbXBvcnQgcGFuZGFzIGFzIHBkCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKCmRlZiBhcmNfdG9fcGFycXVldCgKICAgIGNvbnRleHQ6IE1MQ2xpZW50Q3R4LAogICAgYXJjaGl2ZV91cmw6IFVuaW9uW3N0ciwgUGF0aCwgSU9bQW55U3RyXV0sCiAgICBoZWFkZXI6IFVuaW9uW05vbmUsIExpc3Rbc3RyXV0gPSBOb25lLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICIiLAogICAgbmFtZTogc3RyID0gIiIsCiAgICBjaHVua3NpemU6IGludCA9IDEwXzAwMCwKICAgIGxvZ19kYXRhOiBib29sID0gVHJ1ZSwKICAgIGtleTogc3RyID0gJ3Jhd19kYXRhJwopIC0+IE5vbmU6CiAgICAiIiJPcGVuIGEgZmlsZS9vYmplY3QgYXJjaGl2ZSBhbmQgc2F2ZSBhcyBhIHBhcnF1ZXQgZmlsZS4KICAgIAogICAgQXJnczoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIGFyY2hpdmVfdXJsOiBhbnkgdmFsaWQgc3RyaW5nIHBhdGggY29uc2lzdGVudCB3aXRoIHRoZSBwYXRoIHZhcmlhYmxlCiAgICAgICAgICAgICAgICAgICAgICAgIG9mIHBhbmRhcy5yZWFkX2Nzdi4gbmNsdWRpbmcgc3RyaW5ncyBhcyBmaWxlIHBhdGhzLCBhcyB1cmxzLCAKICAgICAgICAgICAgICAgICAgICAgICAgcGF0aGxpYi5QYXRoIG9iamVjdHMsIGV0Yy4uLgogICAgOnBhcmFtIGhlYWRlcjogICAgICBjb2x1bW4gbmFtZXMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogZGVzdGluYXRpb24gZm9sZGVyIG9mIHRhYmxlCiAgICA6cGFyYW0gbmFtZTogICAgICAgIG5hbWUgZmlsZSB0byBiZSBzYXZlZCBsb2NhbGx5LCBhbHNvCiAgICA6cGFyYW0gY2h1bmtzaXplOiAgICgwKSByb3cgc2l6ZSByZXRyaWV2ZWQgcGVyIGl0ZXJhdGlvbgogICAgOnBhcmFtIGxvZ19kYXRhOiAgICAoVHJ1ZSkgaWYgVHJ1ZSwgbG9nIHRoZSBkYXRhIHNvIHRoYXQgaXQgaXMgYXZhaWxhYmxlCiAgICAgICAgICAgICAgICAgICAgICAgIGF0IHRoZSBuZXh0IHN0ZXAKICAgICIiIgogICAgb3MubWFrZWRpcnModGFyZ2V0X3BhdGgsIGV4aXN0X29rPVRydWUpCgogICAgaWYgbm90IG5hbWUuZW5kc3dpdGgoIi5wYXJxdWV0Iik6CiAgICAgICAgbmFtZSArPSAiLnBhcnF1ZXQiCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQoKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5v
bmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKAogICAgICAgICAgICBwZC5yZWFkX2NzdihhcmNoaXZlX3VybCwgY2h1bmtzaXplPWNodW5rc2l6ZSwgbmFtZXM9aGVhZGVyKQogICAgICAgICk6CiAgICAgICAgICAgIHRhYmxlID0gcGEuVGFibGUuZnJvbV9wYW5kYXMoZGYpCiAgICAgICAgICAgIGlmIGkgPT0gMDoKICAgICAgICAgICAgICAgIHBxd3JpdGVyID0gcHEuUGFycXVldFdyaXRlcihkZXN0X3BhdGgsIHRhYmxlLnNjaGVtYSkKICAgICAgICAgICAgcHF3cml0ZXIud3JpdGVfdGFibGUodGFibGUpCgogICAgICAgIGlmIHBxd3JpdGVyOgogICAgICAgICAgICBwcXdyaXRlci5jbG9zZSgpCgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oZiJzYXZlZCB0YWJsZSB0byB7ZGVzdF9wYXRofSIpCiAgICBlbHNlOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZXhpc3RzIikKCiAgICBpZiBsb2dfZGF0YToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYibG9nZ2luZyB7ZGVzdF9wYXRofSB0byBjb250ZXh0IikKICAgICAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKCg== - base_image: python:3.6-jessie + functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA3OjExCgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBVbmlvbltpbnQsIExpc3Rbc3RyXSwgTm9uZV0gPSAwLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICIiLAogICAgbmFtZTogc3RyID0gIiIsCiAgICBjaHVua3NpemU6IGludCA9IDEwXzAwMCwKICAgIGxvZ19kYXRhOiBib29sID0gVHJ1ZSwKICAgIGFkZF91aWQ6IGJvb2wgPSBGYWxzZSwKICAgIGtleTogc3RyID0gInJhd19kYXRhIiwKKSAtPiBOb25lOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgYW5kIHNhdmUgYXMgYSBwYXJxdWV0IGZpbGUuCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIGFyY2hpdmVfdXJsOiBhbnkgdmFsaWQgc3RyaW5nIHBhdGggY29uc2lzdGVudCB3aXRoIHRoZSBwYXRoIHZhcmlhYmxlCiAgICAgICAgICAgICAgI
CAgICAgICAgIG9mIHBhbmRhcy5yZWFkX2NzdiwgaW5jbHVkaW5nIHN0cmluZ3MgYXMgZmlsZSBwYXRocywgYXMgdXJscywgCiAgICAgICAgICAgICAgICAgICAgICAgIHBhdGhsaWIuUGF0aCBvYmplY3RzLCBldGMuLi4KICAgIDpwYXJhbSBoZWFkZXI6ICAgICAgY29sdW1uIG5hbWVzCiAgICA6cGFyYW0gdGFyZ2V0X3BhdGg6IGRlc3RpbmF0aW9uIGZvbGRlciBvZiB0YWJsZQogICAgOnBhcmFtIG5hbWU6ICAgICAgICBuYW1lIGZpbGUgdG8gYmUgc2F2ZWQgbG9jYWxseSwgYWxzbwogICAgOnBhcmFtIGNodW5rc2l6ZTogICAoMCkgcm93IHNpemUgcmV0cmlldmVkIHBlciBpdGVyYXRpb24KICAgIDpwYXJhbSBsb2dfZGF0YTogICAgKFRydWUpIGlmIFRydWUsIGxvZyB0aGUgZGF0YSBzbyB0aGF0IGl0IGlzIGF2YWlsYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBhdCB0aGUgbmV4dCBzdGVwCiAgICA6cGFyYW0gYWRkX3VpZDogICAgIChGYWxzZSkgYWRkIHRoZSBtZXRhZGF0YSB1aWQgdG8gdGhlIHRhcmdldF9wYXRoIHNvIHRoYXQgCiAgICAgICAgICAgICAgICAgICAgICAgIHJ1bnMgY2FuIGJlIGlkZW50aWZpZWQKICAgIDpwYXJhbSBrZXk6ICAgICAgICAga2V5IGluIGFydGlmYWN0IHN0b3JlICh3aGVuIGxvZ19kYXRhPVRydWUpCiAgICAiIiIKICAgIGlmIG5vdCBuYW1lLmVuZHN3aXRoKCIucGFycXVldCIpOgogICAgICAgIG5hbWUgKz0gIi5wYXJxdWV0IgoKICAgIGlmIG5vdCBhZGRfdWlkOgogICAgICAgIHVpZCA9ICIiCiAgICBlbHNlOgogICAgICAgIHVpZCA9IGNvbnRleHQudWlkCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCB1aWQsIG5hbWUpCiAgICBvcy5tYWtlZGlycyhvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIHVpZCksIGV4aXN0X29rPVRydWUpCiAgICBpZiBoZWFkZXIgPT0gMDoKICAgICAgICBoZWFkZXIgPSBwZC5yZWFkX2NzdihhcmNoaXZlX3VybCwgaGVhZGVyPU5vbmUsIG5yb3dzPTEpLmlsb2NbMF0udmFsdWVzCiAgICBoZWFkZXIgPSBbeC5yZXBsYWNlKCcgJywgJ18nKSBmb3IgeCBpbiBoZWFkZXJdCiAgICBpZiBub3Qgb3MucGF0aC5pc2ZpbGUoZGVzdF9wYXRoKToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGRvZXMgbm90IGV4aXN0LCBkb3dubG9hZGluZyIpCiAgICAgICAgcHF3cml0ZXIgPSBOb25lCiAgICAgICAgZm9yIGksIGRmIGluIGVudW1lcmF0ZShwZC5yZWFkX2NzdihhcmNoaXZlX3VybCwgY2h1bmtzaXplPWNodW5rc2l6ZSwgbmFtZXM9aGVhZGVyKSk6CiAgICAgICAgICAgIHRhYmxlID0gcGEuVGFibGUuZnJvbV9wYW5kYXMoZGYpCiAgICAgICAgICAgIGlmIGkgPT0gMDoKICAgICAgICAgICAgICAgIHBxd3JpdGVyID0gcHEuUGFycXVldFdyaXRlcihkZXN0X3BhdGgsIHRhYmxlLnNjaGVtYSkKICAgICAgICAgICAgcHF3cml0ZXIud3JpdGVfdGFibGUodGFibGUpCgogICAgICAgIGlmIHBxd3JpdGVyOgogICAgICAgI
CAgICBwcXdyaXRlci5jbG9zZSgpCgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oZiJzYXZlZCB0YWJsZSB0byB7ZGVzdF9wYXRofSIpCiAgICBlbHNlOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgYWxyZWFkeSBleGlzdHMiKQoKICAgIGlmIGxvZ19kYXRhOgogICAgICAgIGNvbnRleHQubG9nX2FydGlmYWN0KGtleSwgdGFyZ2V0X3BhdGg9ZGVzdF9wYXRoKQogICAgICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLmpzb24nKQogICAgICAgIGpzb24uZHVtcChoZWFkZXIsIG9wZW4oZmlsZXBhdGgsICd3JykpCiAgICAgICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQoK commands: - - pip install -q mlrun - - pip install -q pyarrow - - pip install -q numpy - - pip install -q pandas \ No newline at end of file + - python -m pip uninstall mlrun + - python -m pip install -U -q mlrun + - python -m pip install -U -q pandas + - python -m pip install -U -q pyarrow + - python -m pip install -U -q numpy==1.17.4 + code_origin: https://github.com/yjb-ds/functions.git#db8da20c32124015e26f0b6b9911814493c5d5de:arc + to parquet.ipynb diff --git a/tests/file-utils.ipynb b/tests/arc_to_parquet.ipynb similarity index 64% rename from tests/file-utils.ipynb rename to tests/arc_to_parquet.ipynb index d3fdb407f..47bd4f0a5 100644 --- a/tests/file-utils.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -1,8 +1,15 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# archive to parquet" + ] + }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -11,309 +18,215 @@ ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# nuclio: ignore\n", + "import nuclio" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "%%nuclio cmd -c\n", + "python -m pip uninstall mlrun\n", + "python -m pip install -U -q mlrun\n", + "python -m pip install -U -q pandas\n", + "python -m pip install -U -q pyarrow\n", + "python -m pip install 
-U -q numpy==1.17.4" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import json\n", + "from pathlib import Path\n", + "import pandas as pd\n", + "import pyarrow.parquet as pq\n", + "import pyarrow as pa\n", + "\n", + "from mlrun.execution import MLClientCtx\n", + "from typing import IO, AnyStr, Union, List, Optional\n", + "\n", + "\n", + "def arc_to_parquet(\n", + " context: MLClientCtx,\n", + " archive_url: Union[str, Path, IO[AnyStr]],\n", + " header: Union[int, List[str], None] = 0,\n", + " target_path: str = \"\",\n", + " name: str = \"\",\n", + " chunksize: int = 10_000,\n", + " log_data: bool = True,\n", + " add_uid: bool = False,\n", + " key: str = \"raw_data\",\n", + ") -> None:\n", + " \"\"\"Open a file/object archive and save as a parquet file.\n", + " \n", + " :param context: function context\n", + " :param archive_url: any valid string path consistent with the path variable\n", + " of pandas.read_csv, including strings as file paths, as urls, \n", + " pathlib.Path objects, etc...\n", + " :param header: column names\n", + " :param target_path: destination folder of table\n", + " :param name: name file to be saved locally, also\n", + " :param chunksize: (0) row size retrieved per iteration\n", + " :param log_data: (True) if True, log the data so that it is available\n", + " at the next step\n", + " :param add_uid: (False) add the metadata uid to the target_path so that \n", + " runs can be identified\n", + " :param key: key in artifact store (when log_data=True)\n", + " \"\"\"\n", + " if not name.endswith(\".parquet\"):\n", + " name += \".parquet\"\n", + "\n", + " if not add_uid:\n", + " uid = \"\"\n", + " else:\n", + " uid = context.uid\n", + "\n", + " dest_path = os.path.join(target_path, uid, name)\n", + " os.makedirs(os.path.join(target_path, uid), exist_ok=True)\n", + " if header == 0:\n", + " header = pd.read_csv(archive_url, header=None, nrows=1).iloc[0].values\n", + 
" header = [x.replace(' ', '_') for x in header]\n", + " if not os.path.isfile(dest_path):\n", + " context.logger.info(\"destination file does not exist, downloading\")\n", + " pqwriter = None\n", + " for i, df in enumerate(pd.read_csv(archive_url, chunksize=chunksize, names=header)):\n", + " table = pa.Table.from_pandas(df)\n", + " if i == 0:\n", + " pqwriter = pq.ParquetWriter(dest_path, table.schema)\n", + " pqwriter.write_table(table)\n", + "\n", + " if pqwriter:\n", + " pqwriter.close()\n", + "\n", + " context.logger.info(f\"saved table to {dest_path}\")\n", + " else:\n", + " context.logger.info(\"destination file already exists\")\n", + "\n", + " if log_data:\n", + " context.log_artifact(key, target_path=dest_path)\n", + " # log header\n", + " filepath = os.path.join(target_path, 'header.json')\n", + " json.dump(header, open(filepath, 'w'))\n", + " context.log_artifact('header', target_path=filepath)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, "metadata": {}, + "outputs": [], "source": [ - "# archive to folder" + "# nuclio: end-code" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ - "import urllib3\n", - "urllib3.disable_warnings()" + "# create job function object from notebook code\n", + "fn = mlrun.code_to_function(\n", + " 'arc to parquet',\n", + " runtime='job', \n", + " handler=arc_to_parquet)" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-20 08:36:14,989 starting run download uid=79a5b0f103c24367961cf8c107126dd2 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-20 08:36:15,069 Job is running in the background, pod: download-6mg4q\n", - "[mlrun] 2020-01-20 08:36:19,610 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp\n", - "[mlrun] 2020-01-20 08:36:21,218 Verified directories\n", - "[mlrun] 2020-01-20 
08:36:21,218 Extracting zip\n", - "[mlrun] 2020-01-20 08:36:22,988 extracted archive to content\n", - "[mlrun] 2020-01-20 08:36:23,001 log artifact content at content, size: None, db: Y\n", - "\n", - "[mlrun] 2020-01-20 08:36:23,011 run executed, status=completed\n", - "final state: succeeded\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...126dd2
0Jan 20 08:36:19completedfile_utils
host=download-6mg4q
kind=job
owner=admin
archive_url
key=contents
target_path=/User/mlrun/functions/images
content
\n", - "
\n", - "
\n", - "
\n", - " Title\n", - " ×\n", - "
\n", - " \n", - "
\n", - "
\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 79a5b0f103c24367961cf8c107126dd2 , !mlrun logs 79a5b0f103c24367961cf8c107126dd2 \n", - "[mlrun] 2020-01-20 08:36:24,208 run executed, status=completed\n" + "[mlrun] 2020-01-21 07:11:35,663 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" ] } ], "source": [ - "# load function from Github\n", - "xfn = mlrun.import_function('https://raw.githubusercontent.com/yjb-ds/functions/arc2parq/fileutils/open_archive/function.yaml')\n", - "\n", - "# configute it: mount on iguazio fabric, set as interactive (return stdout)\n", - "xfn.apply(mlrun.mount_v3io())\n", - "xfn.interactive = True\n", - "\n", - "# create and run the task\n", - "\n", - "images_path = '/User/mlrun/functions/images'\n", - "\n", - "open_archive_task = mlrun.NewTask(\n", - " 'download',\n", - " handler='open_archive', \n", - " params={'target_path': images_path,\n", - " 'key' : 'contents'},\n", - " inputs={'archive_url': 'http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip'}\n", - ")\n", - "\n", - "# run\n", - "run = xfn.run(open_archive_task)" + "fn.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "_________" + "#### load and configure function" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 8, "metadata": {}, + "outputs": [], "source": [ - "# archive to parquet" + "# load function from a local Python file\n", + "# fn = mlrun.code_to_function('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py', kind='job')" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 9, "metadata": {}, + "outputs": [], "source": [ - "#### load and configure function" + "# export function 
yaml\n", + "# fn.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# import function yaml\n", + "fn = mlrun.import_function('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# push yaml to github" + ] + }, + { + "cell_type": "code", + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# load function from Github\n", - "xfn = mlrun.import_function('https://raw.githubusercontent.com/yjb-ds/functions/arc2parq/fileutils/arc_to_parquet/arc_to_parquet.yaml')\n", - "\n", + "# fn = mlrun.import_function(\n", + "# 'https://raw.githubusercontent.com/mlrun/functions/master/fileutils/arc_to_parquet/arc_to_parquet.yaml')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ "# configure function: mount on the Iguazio data fabric, set as interactive (return stdout)\n", - "xfn.apply(mlrun.mount_v3io())\n", - "xfn.interactive = True" + "fn.apply(mlrun.mount_v3io())\n", + "fn.interactive = True" ] }, { @@ -332,7 +245,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 14, "metadata": { "collapsed": true, "jupyter": { @@ -344,7 +257,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-20 05:35:07,015 starting remote build, image: .mlrun/func-default-arc_to_parquet-latest\n", + "[mlrun] 2020-01-21 07:11:35,802 starting remote build, image: .mlrun/func-default-arc-to-parquet-latest\n", "\u001b[36mINFO\u001b[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", @@ -354,146 +267,180 @@ 
"\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:0318d80cb241983eda20b905d77fa0bfb06e29e5aabf075c7941ea687f1c125a: no such file or directory \n", "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0001] Unpacking rootfs as cmd RUN pip install -q mlrun requires it. \n", + "\u001b[36mINFO\u001b[0m[0001] Unpacking rootfs as cmd RUN python -m pip uninstall mlrun requires it. \n", "\u001b[36mINFO\u001b[0m[0011] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0018] RUN pip install -q mlrun \n", + "\u001b[36mINFO\u001b[0m[0018] RUN python -m pip uninstall mlrun \n", "\u001b[36mINFO\u001b[0m[0018] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0018] args: [-c pip install -q mlrun] \n", + "\u001b[36mINFO\u001b[0m[0018] args: [-c python -m pip uninstall mlrun] \n", + "WARNING: Skipping mlrun as it is not installed.\n", + "\u001b[36mINFO\u001b[0m[0019] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0021] RUN python -m pip install -U -q mlrun \n", + "\u001b[36mINFO\u001b[0m[0021] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0021] args: [-c python -m pip install -U -q mlrun] \n", + "ERROR: kfp 0.2.0 has requirement urllib3<1.25,>=1.15, but you'll have urllib3 1.25.7 which is incompatible.\n", "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0065] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0082] RUN pip install -q pyarrow \n", - "\u001b[36mINFO\u001b[0m[0082] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0082] args: [-c pip install -q pyarrow] \n", + "\u001b[36mINFO\u001b[0m[0067] Taking snapshot of full filesystem... 
\n", + "\u001b[36mINFO\u001b[0m[0083] RUN python -m pip install -U -q pandas \n", + "\u001b[36mINFO\u001b[0m[0083] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0083] args: [-c python -m pip install -U -q pandas] \n", "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0086] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0095] RUN pip install -q numpy \n", - "\u001b[36mINFO\u001b[0m[0095] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0095] args: [-c pip install -q numpy] \n", + "\u001b[36mINFO\u001b[0m[0084] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0088] RUN python -m pip install -U -q pyarrow \n", + "\u001b[36mINFO\u001b[0m[0088] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0088] args: [-c python -m pip install -U -q pyarrow] \n", "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0096] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0099] RUN pip install -q pandas \n", - "\u001b[36mINFO\u001b[0m[0099] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0099] args: [-c pip install -q pandas] \n", + "\u001b[36mINFO\u001b[0m[0092] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0100] RUN python -m pip install -U -q numpy==1.17.4 \n", + "\u001b[36mINFO\u001b[0m[0100] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0100] args: [-c python -m pip install -U -q numpy==1.17.4] \n", "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0100] Taking snapshot of full filesystem... 
\n", - "\u001b[36mINFO\u001b[0m[0102] RUN pip install mlrun \n", - "\u001b[36mINFO\u001b[0m[0102] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0102] args: [-c pip install mlrun] \n", + "\u001b[36mINFO\u001b[0m[0103] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_examples \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_mt19937.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/include/numpy/random/distributions.h \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/__pycache__/test_extending.cpython-36.pyc \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.pxd \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test_issue14735.py \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy-1.18.1.dist-info \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/__init__.pxd \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.pxd \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_pcg64.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for 
/usr/local/lib/python3.6/site-packages/numpy/random/_common.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test__exceptions.cpython-36.pyc \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test__exceptions.py \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.pxd \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_sfc64.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_generator.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test_issue14735.cpython-36.pyc \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_philox.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/test_extending.py \n", + "\u001b[36mINFO\u001b[0m[0108] RUN pip install mlrun \n", + "\u001b[36mINFO\u001b[0m[0108] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0108] args: [-c pip install mlrun] \n", "Requirement already satisfied: mlrun in /usr/local/lib/python3.6/site-packages (0.4.3)\n", - "Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (7.0)\n", - "Requirement already satisfied: gunicorn==19.9.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (19.9.0)\n", - "Requirement already satisfied: requests>=2.20.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (2.22.0)\n", + "Requirement already satisfied: nest-asyncio>=1.0.0 in /usr/local/lib/python3.6/site-packages (from mlrun) 
(1.2.2)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.0)\n", "Requirement already satisfied: aiohttp>=3.5.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.6.2)\n", + "Requirement already satisfied: kfp>=0.1.29 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.2.0)\n", "Requirement already satisfied: sqlalchemy==1.3.11 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.3.11)\n", - "Requirement already satisfied: gevent==1.4.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.4.0)\n", - "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.0)\n", - "Requirement already satisfied: kfp>=0.1.29 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.1.40)\n", - "Requirement already satisfied: GitPython>=2.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.0.5)\n", "Requirement already satisfied: Flask>=1.1.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.1.1)\n", + "Requirement already satisfied: croniter==0.3.31 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: gunicorn==19.9.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (19.9.0)\n", "Requirement already satisfied: pandas>=0.23.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.25.3)\n", + "Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (7.0)\n", "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.3)\n", - "Requirement already satisfied: boto3>=1.9 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.11.5)\n", "Requirement already satisfied: pyyaml>=5.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (5.3)\n", - "Requirement already satisfied: nest-asyncio>=1.0.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.2.2)\n", + 
"Requirement already satisfied: GitPython>=2.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.0.5)\n", + "Requirement already satisfied: gevent==1.4.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.4.0)\n", "Requirement already satisfied: nuclio-sdk>=0.0.3 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.0.5)\n", - "Requirement already satisfied: croniter==0.3.31 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.3.31)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2019.11.28)\n", - "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", - "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2.8)\n", - "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (1.24.3)\n", - "Requirement already satisfied: async-timeout<4.0,>=3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", + "Requirement already satisfied: requests>=2.20.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (2.22.0)\n", + "Requirement already satisfied: boto3>=1.9 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.11.6)\n", + "Requirement already satisfied: tornado<6,>=5 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", + "Requirement already satisfied: nbconvert>=5.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", + "Requirement already satisfied: notebook>=5.7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.2)\n", + "Requirement already 
satisfied: ipython>=7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", "Requirement already satisfied: idna-ssl>=1.0; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.1.0)\n", - "Requirement already satisfied: multidict<5.0,>=4.5 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", - "Requirement already satisfied: typing-extensions>=3.6.5; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.7.4.1)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: typing-extensions>=3.6.5; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.7.4.1)\n", "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", - "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /usr/local/lib/python3.6/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", - "Requirement already satisfied: notebook>=5.7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.2)\n", - "Requirement already satisfied: jupyterlab>=0.35.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", - "Requirement already satisfied: tornado<6,>=5 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", - "Requirement already satisfied: ipython>=7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", - "Requirement already satisfied: nbconvert>=5.4 in /usr/local/lib/python3.6/site-packages (from 
nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", - "Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.8.1)\n", - "Requirement already satisfied: argo-models==2.2.1a in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", - "Requirement already satisfied: cloudpickle==1.1.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: chardet<4.0,>=2.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.4)\n", + "Requirement already satisfied: multidict<5.0,>=4.5 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.14.0)\n", "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", + "Requirement already satisfied: PyJWT>=1.6.4 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", + "Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.8.1)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2019.11.28)\n", "Requirement already satisfied: Deprecated in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: requests-toolbelt>=0.8.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", "Requirement already satisfied: google-cloud-storage>=1.13.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", + "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", + "Collecting urllib3<1.25,>=1.15 (from kfp>=0.1.29->mlrun)\n", + " Downloading 
https://files.pythonhosted.org/packages/01/11/525b02e4acc0c747de8b6ccdab376331597c569c42ea66ab0a1dbd36eca2/urllib3-1.24.3-py2.py3-none-any.whl (118kB)\n", + "Requirement already satisfied: cryptography>=2.4.2 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.8)\n", "Requirement already satisfied: google-auth>=1.6.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", - "Requirement already satisfied: requests-toolbelt>=0.8.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", - "Requirement already satisfied: PyJWT>=1.6.4 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", "Requirement already satisfied: jsonschema>=3.0.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", - "Requirement already satisfied: cryptography>=2.4.2 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.8)\n", - "Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.14.0)\n", - "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", - "Requirement already satisfied: gitdb2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", - "Requirement already satisfied: Werkzeug>=0.15 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", + "Requirement already satisfied: cloudpickle==1.1.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: argo-models==2.2.1a in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", "Requirement already satisfied: itsdangerous>=0.24 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", + "Requirement already satisfied: Werkzeug>=0.15 in /usr/local/lib/python3.6/site-packages (from 
Flask>=1.1.1->mlrun) (0.16.0)\n", "Requirement already satisfied: Jinja2>=2.10.1 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", - "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (1.18.1)\n", + "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2019.3)\n", - "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: gitdb2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /usr/local/lib/python3.6/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2.8)\n", + "Requirement already satisfied: botocore<1.15.0,>=1.14.6 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (1.14.6)\n", "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", - "Requirement already satisfied: botocore<1.15.0,>=1.14.5 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (1.14.5)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) 
(0.3)\n", + "Requirement already satisfied: bleach in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: jupyter-core in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", + "Requirement already satisfied: testpath in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: nbformat>=4.4 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", + "Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", + "Requirement already satisfied: pygments in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", + "Requirement already satisfied: jupyterlab-server~=1.0.0 in /usr/local/lib/python3.6/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", "Requirement already satisfied: jupyter-client>=5.3.4 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", - "Requirement already satisfied: traitlets>=4.2.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", - "Requirement already satisfied: Send2Trash in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", - "Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.6/site-packages (from 
notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", "Requirement already satisfied: pyzmq>=17 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", - "Requirement already satisfied: prometheus-client in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", - "Requirement already satisfied: jupyter-core>=4.6.0 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", - "Requirement already satisfied: nbformat in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", "Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: Send2Trash in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", "Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", - "Requirement already satisfied: jupyterlab-server~=1.0.0 in /usr/local/lib/python3.6/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", - "Requirement already satisfied: jedi>=0.10 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", - "Requirement already satisfied: pygments in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", + "Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", + "Requirement already satisfied: prometheus-client in 
/usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: decorator in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", "Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (41.0.1)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", "Requirement already satisfied: backcall in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", - "Requirement already satisfied: decorator in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", + "Requirement already satisfied: jedi>=0.10 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.7.0)\n", - "Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", - "Requirement already satisfied: testpath in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", - "Requirement already 
satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", - "Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: bleach in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", - "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", "Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", "Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.6/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", - "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib/python3.6/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.13.2)\n", "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", "Requirement already 
satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", "Requirement already satisfied: pyrsistent>=0.14.0 in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", - "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib/python3.6/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.13.2)\n", - "Requirement already satisfied: smmap2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", "Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", - "Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/site-packages (from botocore<1.15.0,>=1.14.5->boto3>=1.9->mlrun) (0.15.2)\n", - "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /usr/local/lib/python3.6/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/site-packages (from botocore<1.15.0,>=1.14.6->boto3>=1.9->mlrun) (0.15.2)\n", + "Requirement already satisfied: webencodings in /usr/local/lib/python3.6/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", "Requirement already satisfied: json5 in /usr/local/lib/python3.6/site-packages (from 
jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", - "Requirement already satisfied: parso>=0.5.2 in /usr/local/lib/python3.6/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /usr/local/lib/python3.6/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", "Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", - "Requirement already satisfied: webencodings in /usr/local/lib/python3.6/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", + "Requirement already satisfied: parso>=0.5.2 in /usr/local/lib/python3.6/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", "Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", + "Requirement already satisfied: pycparser in /usr/local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.19)\n", "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", - "Requirement already satisfied: pycparser in /usr/local/lib/python3.6/site-packages (from 
cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.19)\n", - "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", "Requirement already satisfied: protobuf>=3.4.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", "Requirement already satisfied: more-itertools in /usr/local/lib/python3.6/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", + "Installing collected packages: urllib3\n", + " Found existing installation: urllib3 1.25.7\n", + " Uninstalling urllib3-1.25.7:\n", + " Successfully uninstalled urllib3-1.25.7\n", + "Successfully installed urllib3-1.24.3\n", "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0103] Taking snapshot of full filesystem... \n" + "\u001b[36mINFO\u001b[0m[0110] Taking snapshot of full filesystem... 
\n", + "\u001b[36mINFO\u001b[0m[0110] Adding whiteout for /usr/local/lib/python3.6/site-packages/urllib3-1.25.7.dist-info \n" ] }, { @@ -502,31 +449,40 @@ "True" ] }, - "execution_count": 7, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "xfn.deploy()" + "fn.deploy()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "# fn.with_code()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Also note that the build time can be reduced if you specifiy a pre-built image with all required packages." + "Also note that the build time can be reduced if you specifiy a pre-built image with all required packages pre-installed." ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# useful constants\n", "target_path = '/User/mlrun/functions/parquet'\n", - "archive = 'https://fpsignals-public.s3.amazonaws.com/x_test_50.csv.gz'\n", + "archive = 'https://fpsignals-public.s3.amazonaws.com/one_csv.tar.gz'\n", "parquet_file = 'x_test_50.parquet' # the file extension is not necessary\n", "parquet_file_path = target_path + \"/\" + parquet_file\n", "artifact_key = 'raw_data'" @@ -534,20 +490,21 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-20 05:38:21,743 starting run arc2parq uid=42af41d93f294cd09aace4942d25b106 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-20 05:38:21,823 Job is running in the background, pod: arc2parq-96gmq\n", - "[mlrun] 2020-01-20 05:38:37,072 destination file exists\n", - "[mlrun] 2020-01-20 05:38:37,072 logging /User/mlrun/functions/parquet/x_test_50.parquet to context\n", - "[mlrun] 2020-01-20 05:38:37,083 log artifact raw_data at /User/mlrun/functions/parquet/x_test_50.parquet, size: None, db: Y\n", + "[mlrun] 2020-01-21 07:17:44,068 starting run 
arc2parq uid=2a211d65872442cf85e745bde5c81392 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 07:17:44,153 Job is running in the background, pod: arc2parq-vjhbh\n", + "[mlrun] 2020-01-21 07:17:50,433 destination file does not exist, downloading\n", + "[mlrun] 2020-01-21 07:17:50,536 saved table to /User/mlrun/functions/parquet/x_test_50.parquet\n", + "[mlrun] 2020-01-21 07:17:50,549 log artifact raw_data at /User/mlrun/functions/parquet/x_test_50.parquet, size: None, db: Y\n", + "[mlrun] 2020-01-21 07:17:50,561 log artifact header at /User/mlrun/functions/parquet/header.json, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-20 05:38:37,094 run executed, status=completed\n", + "[mlrun] 2020-01-21 07:17:50,571 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -720,26 +677,26 @@ " \n", " \n", " \n", - "
...25b106
\n", + "
...c81392
\n", " 0\n", - " Jan 20 05:38:37\n", + " Jan 21 07:17:50\n", " completed\n", - " arc_to_parquet\n", - "
host=arc2parq-96gmq
kind=job
owner=admin
\n", + " arc-to-parquet\n", + "
host=arc2parq-vjhbh
kind=job
owner=admin
\n", " \n", - "
archive_url=https://fpsignals-public.s3.amazonaws.com/x_test_50.csv.gz
key=raw_data
name=x_test_50.parquet
target_path=/User/mlrun/functions/parquet
\n", + "
archive_url=https://fpsignals-public.s3.amazonaws.com/one_csv.tar.gz
key=raw_data
name=x_test_50.parquet
target_path=/User/mlrun/functions/parquet
\n", " \n", - "
raw_data
\n", + "
raw_data
header
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -755,8 +712,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 42af41d93f294cd09aace4942d25b106 , !mlrun logs 42af41d93f294cd09aace4942d25b106 \n", - "[mlrun] 2020-01-20 05:38:40,974 run executed, status=completed\n" + "!mlrun get run 2a211d65872442cf85e745bde5c81392 , !mlrun logs 2a211d65872442cf85e745bde5c81392 \n", + "[mlrun] 2020-01-21 07:17:53,318 run executed, status=completed\n" ] } ], @@ -773,7 +730,7 @@ " outputs=[artifact_key])\n", "\n", "# run\n", - "run = xfn.run(arc_to_parq_task)" + "run = fn.run(arc_to_parq_task)" ] }, { @@ -792,7 +749,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -803,7 +760,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -813,22 +770,34 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ - "assert artifact_key in run.outputs.keys(), f\"mlrun.functions: key {artifact_key} not fond in outputs\"\n", + "assert artifact_key in run.outputs.keys(), f\"mlrun.functions: key {artifact_key} not found in outputs\"\n", "assert os.path.isfile(parquet_file_path), f\"mlrun.functions: artifact source not found at {parquet_file_path}\"" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 9, "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "AssertionError", + "evalue": "mlrun.functions: original and copied data not equal", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0moriginal\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marchive\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mcopied\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_parquet\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparquet_file_path\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"pyarrow\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray_equal\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moriginal\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopied\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"mlrun.functions: original and copied data not equal\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mAssertionError\u001b[0m: mlrun.functions: original and copied data not equal" + ] + } + ], "source": [ - "original = pd.read_csv(archive)\n", - "copied = pd.read_parquet(parquet_file_path, engine=\"pyarrow\")\n", + "original = pd.read_csv(archive).values\n", + "copied = pd.read_parquet(parquet_file_path, engine=\"pyarrow\").values\n", "assert np.array_equal(original, copied), \"mlrun.functions: original and copied data not equal\"" ] }, @@ -841,7 +810,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ diff --git a/tests/open_archive.ipynb b/tests/open_archive.ipynb new file mode 100644 index 000000000..71eb35f01 --- /dev/null +++ b/tests/open_archive.ipynb @@ -0,0 +1,306 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + 
"mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# archive to folder" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import urllib3\n", + "urllib3.disable_warnings()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-20 08:36:14,989 starting run download uid=79a5b0f103c24367961cf8c107126dd2 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-20 08:36:15,069 Job is running in the background, pod: download-6mg4q\n", + "[mlrun] 2020-01-20 08:36:19,610 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp\n", + "[mlrun] 2020-01-20 08:36:21,218 Verified directories\n", + "[mlrun] 2020-01-20 08:36:21,218 Extracting zip\n", + "[mlrun] 2020-01-20 08:36:22,988 extracted archive to content\n", + "[mlrun] 2020-01-20 08:36:23,001 log artifact content at content, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-20 08:36:23,011 run executed, status=completed\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...126dd2
0Jan 20 08:36:19completedfile_utils
host=download-6mg4q
kind=job
owner=admin
archive_url
key=contents
target_path=/User/mlrun/functions/images
content
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run 79a5b0f103c24367961cf8c107126dd2 , !mlrun logs 79a5b0f103c24367961cf8c107126dd2 \n", + "[mlrun] 2020-01-20 08:36:24,208 run executed, status=completed\n" + ] + } + ], + "source": [ + "# load function from Github\n", + "xfn = mlrun.import_function('https://raw.githubusercontent.com/yjb-ds/functions/arc2parq/fileutils/open_archive/function.yaml')\n", + "\n", + "# configute it: mount on iguazio fabric, set as interactive (return stdout)\n", + "xfn.apply(mlrun.mount_v3io())\n", + "xfn.interactive = True\n", + "\n", + "# create and run the task\n", + "\n", + "images_path = '/User/mlrun/functions/images'\n", + "\n", + "open_archive_task = mlrun.NewTask(\n", + " 'download',\n", + " handler='open_archive', \n", + " params={'target_path': images_path,\n", + " 'key' : 'contents'},\n", + " inputs={'archive_url': 'http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip'}\n", + ")\n", + "\n", + "# run\n", + "run = xfn.run(open_archive_task)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 5a28adbd7643e2a27fcfed41e081a6ffaffcfd08 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 07:21:45 +0000 Subject: [PATCH 05/32] mf --- fileutils/arc_to_parquet/arc_to_parquet-bak.yaml | 13 ------------- 1 file changed, 13 deletions(-) delete mode 100644 fileutils/arc_to_parquet/arc_to_parquet-bak.yaml diff --git a/fileutils/arc_to_parquet/arc_to_parquet-bak.yaml 
b/fileutils/arc_to_parquet/arc_to_parquet-bak.yaml deleted file mode 100644 index 63c7e5d68..000000000 --- a/fileutils/arc_to_parquet/arc_to_parquet-bak.yaml +++ /dev/null @@ -1,13 +0,0 @@ -kind: job -metadata: - name: arc-to-parquet -spec: - description: '' - build: - functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA2OjE2CgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgdXJsbGliMwp1cmxsaWIzLmRpc2FibGVfd2FybmluZ3MoKQoKeGZuID0gbWxydW4uaW1wb3J0X2Z1bmN0aW9uKCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20veWpiLWRzL2Z1bmN0aW9ucy9hcmMycGFycS9maWxldXRpbHMvb3Blbl9hcmNoaXZlL2Z1bmN0aW9uLnlhbWwnKQoKeGZuLmFwcGx5KG1scnVuLm1vdW50X3YzaW8oKSkKeGZuLmludGVyYWN0aXZlID0gVHJ1ZQoKCmltYWdlc19wYXRoID0gJy9Vc2VyL21scnVuL2Z1bmN0aW9ucy9pbWFnZXMnCgpvcGVuX2FyY2hpdmVfdGFzayA9IG1scnVuLk5ld1Rhc2soCiAgICAnZG93bmxvYWQnLAogICAgaGFuZGxlcj0nb3Blbl9hcmNoaXZlJywgCiAgICBwYXJhbXM9eyd0YXJnZXRfcGF0aCc6IGltYWdlc19wYXRoLAogICAgICAgICAgICAna2V5JyAgICAgICAgOiAnY29udGVudHMnfSwKICAgIGlucHV0cz17J2FyY2hpdmVfdXJsJzogJ2h0dHA6Ly9pZ3VhemlvLXNhbXBsZS1kYXRhLnMzLmFtYXpvbmF3cy5jb20vY2F0c25kb2dzLnppcCd9CikKCnJ1biA9IHhmbi5ydW4ob3Blbl9hcmNoaXZlX3Rhc2spCgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgp4Zm4gPSBtbHJ1bi5jb2RlX3RvX2Z1bmN0aW9uKCcvVXNlci9yZXBvcy9mdW5jdGlvbnMvZmlsZXV0aWxzL2FyY190b19wYXJxdWV0L2FyY190b19wYXJxdWV0LnB5Jywga2luZD0nam9iJykKCnhmbi5leHBvcnQoJy9Vc2VyL3JlcG9zL2Z1bmN0aW9ucy9maWxldXRpbHMvYXJjX3RvX3BhcnF1ZXQvYXJjX3RvX3BhcnF1ZXQueWFtbCcpCgp4Zm4gPSBtbHJ1bi5pbXBvcnRfZnVuY3Rpb24oJy9Vc2VyL3JlcG9zL2Z1bmN0aW9ucy9maWxldXRpbHMvYXJjX3RvX3BhcnF1ZXQvYXJjX3RvX3BhcnF1ZXQueWFtbCcpCgp4Zm4uYXBwbHkobWxydW4ubW91bnRfdjNpbygpKQp4Zm4uaW50ZXJhY3RpdmUgPSBUcnVlCgp4Zm4uZGVwbG95KCkKCnRhcmdldF9wYXRoID0gJy9Vc2VyL21scnVuL2Z1bmN0aW9ucy9wYXJxdWV0JwphcmNoaXZlID0gJ2h0dHBzOi8vZnBzaWduYWxzLXB1YmxpYy5zMy5hbWF6b25hd3MuY29tL29uZV9jc3YudGFyLmd6JwpwYXJxdWV0X2ZpbGUgPSAneF90ZXN0XzUwLnBhcnF1ZXQnICMgdGhlIGZpbGUgZXh0ZW5zaW9uIGlzIG5vdCBuZW
Nlc3NhcnkKcGFycXVldF9maWxlX3BhdGggPSB0YXJnZXRfcGF0aCArICIvIiArIHBhcnF1ZXRfZmlsZQphcnRpZmFjdF9rZXkgPSAncmF3X2RhdGEnCgphcmNfdG9fcGFycV90YXNrID0gbWxydW4uTmV3VGFzaygKICAgICdhcmMycGFycScsIAogICAgaGFuZGxlcj0nYXJjX3RvX3BhcnF1ZXQnLCAgIyBhIHN0cmluZyBzaW5jZSB3ZSBhcmUgY2FsbGluZyB0aGlzICdyZW1vdGVseScsIG91dHNpZGUgdGhpcyBub3RlYm9vawogICAgcGFyYW1zPXsKICAgICAgICAndGFyZ2V0X3BhdGgnOiB0YXJnZXRfcGF0aCwKICAgICAgICAnbmFtZScgICAgICAgOiBwYXJxdWV0X2ZpbGUsIAogICAgICAgICdrZXknICAgICAgICA6IGFydGlmYWN0X2tleSwKICAgICAgICAnYXJjaGl2ZV91cmwnOiBhcmNoaXZlfSwKICAgIG91dHB1dHM9W2FydGlmYWN0X2tleV0pCgpydW4gPSB4Zm4ucnVuKGFyY190b19wYXJxX3Rhc2spCgppbXBvcnQgcHlhcnJvdy5wYXJxdWV0IGFzIHBxCmRmID0gcHEucmVhZF90YWJsZShwYXJxdWV0X2ZpbGVfcGF0aCkudG9fcGFuZGFzKCkKZGYuaGVhZCgpCgppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKCmFzc2VydCBhcnRpZmFjdF9rZXkgaW4gcnVuLm91dHB1dHMua2V5cygpLCBmIm1scnVuLmZ1bmN0aW9uczoga2V5IHthcnRpZmFjdF9rZXl9IG5vdCBmb25kIGluIG91dHB1dHMiCmFzc2VydCBvcy5wYXRoLmlzZmlsZShwYXJxdWV0X2ZpbGVfcGF0aCksICBmIm1scnVuLmZ1bmN0aW9uczogYXJ0aWZhY3Qgc291cmNlIG5vdCBmb3VuZCBhdCB7cGFycXVldF9maWxlX3BhdGh9IgoKb3JpZ2luYWwgPSBwZC5yZWFkX2NzdihhcmNoaXZlKQpjb3BpZWQgICA9IHBkLnJlYWRfcGFycXVldChwYXJxdWV0X2ZpbGVfcGF0aCwgZW5naW5lPSJweWFycm93IikKYXNzZXJ0IG5wLmFycmF5X2VxdWFsKG9yaWdpbmFsLCBjb3BpZWQpLCAgICJtbHJ1bi5mdW5jdGlvbnM6IG9yaWdpbmFsIGFuZCBjb3BpZWQgZGF0YSBub3QgZXF1YWwiCgpvcy5yZW1vdmUocGFycXVldF9maWxlX3BhdGgpCgo= - base_image: python:3.6-jessie - commands: - - pip install -q mlrun - - pip install -q pyarrow - - pip install -q numpy - - pip install -q pandas \ No newline at end of file From 0aba6b352e741d9bf036c9b8a945088772e118c4 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 09:37:44 +0000 Subject: [PATCH 06/32] buggy open_archive, implicit download and name mangling issue --- tests/open_archive.ipynb | 580 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 542 insertions(+), 38 deletions(-) diff --git a/tests/open_archive.ipynb b/tests/open_archive.ipynb index 71eb35f01..72513ba26 100644 --- 
a/tests/open_archive.ipynb +++ b/tests/open_archive.ipynb @@ -5,6 +5,16 @@ "execution_count": 1, "metadata": {}, "outputs": [], + "source": [ + "# !python -m pip uninstall -y mlrun\n", + "# !python -m pip install mlrun" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], "source": [ "import mlrun\n", "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" @@ -19,32 +29,200 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# nuclio: ignore\n", + "import nuclio" + ] + }, + { + "cell_type": "code", + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ - "import urllib3\n", - "urllib3.disable_warnings()" + "import urllib.request\n", + "# urllib.disable_warnings()" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import zipfile\n", + "import urllib\n", + "import tarfile\n", + "import json\n", + "\n", + "from mlrun.execution import MLClientCtx\n", + "\n", + "def open_archive(context: MLClientCtx, \n", + " target_dir: str = 'content',\n", + " archive_url: str = '',\n", + " archive_tar: str = '', # fudge parameter\n", + " archive_TMP: str = '' # fudge parameter\n", + "):\n", + " \"\"\"Open a file/object archive into a target directory\n", + " \n", + " Currently supports zip and tar.gz\n", + " \"\"\"\n", + " # Define locations\n", + " os.makedirs(target_dir, exist_ok=True)\n", + " context.logger.info('Verified directories')\n", + " \n", + " # performs an implicit download to /tmp at this point and MANGLES name:\n", + " print('archive url', archive_url)\n", + " print('archive tar', archive_tar)\n", + " print('archive TMP', archive_TMP)\n", + "# assert archive_tar == archive_url \n", + "\n", + " splits = archive_url.split('.')\n", + " print(splits)\n", + " if (splits[-1] == 'gz'):\n", + " # Extract dataset from tar\n", + " context.logger.info('opening 
tar_gz')\n", + " ref = tarfile.open(fileobj=urllib.request.urlopen(archive_url), mode='r|gz')\n", + " elif splits[-1] == 'zip':\n", + " # Extract dataset from zip\n", + " context.logger.info('opening zip')\n", + " ref = zipfile.ZipFile(archive_url, 'r')\n", + "\n", + " ref.extractall(target_dir)\n", + " ref.close()\n", + "\n", + " context.log_artifact('content', target_path=target_dir)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# nuclio: end-code" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# create job function object from notebook code\n", + "fn = mlrun.code_to_function(\n", + " 'open_archive', \n", + " runtime='job', \n", + " handler=open_archive, \n", + " image='mlrun/mlrun:latest')" + ] + }, + { + "cell_type": "code", + "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-20 08:36:14,989 starting run download uid=79a5b0f103c24367961cf8c107126dd2 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-20 08:36:15,069 Job is running in the background, pod: download-6mg4q\n", - "[mlrun] 2020-01-20 08:36:19,610 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp\n", - "[mlrun] 2020-01-20 08:36:21,218 Verified directories\n", - "[mlrun] 2020-01-20 08:36:21,218 Extracting zip\n", - "[mlrun] 2020-01-20 08:36:22,988 extracted archive to content\n", - "[mlrun] 2020-01-20 08:36:23,001 log artifact content at content, size: None, db: Y\n", + "[mlrun] 2020-01-21 09:34:20,162 function spec saved to path: /User/repos/functions/fileutils/open_archive/function.yaml\n" + ] + } + ], + "source": [ + "# export function yaml\n", + "fn.export('/User/repos/functions/fileutils/open_archive/function.yaml')" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# import function yaml\n", + "fn = 
mlrun.import_function('/User/repos/functions/fileutils/open_archive/function.yaml')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# load function from Github\n", + "# fn = mlrun.import_function('https://raw.githubusercontent.com/yjb-ds/functions/arc2parq/fileutils/open_archive/function.yaml')" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# configute it: mount on iguazio fabric, set as interactive (return stdout)\n", + "fn.apply(mlrun.mount_v3io())\n", + "fn.interactive = True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### zip file" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# create and run the task\n", + "\n", + "images_path = '/User/mlrun/functions/images'" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "open_archive_task = mlrun.NewTask(\n", + " 'download',\n", + " handler='open_archive', \n", + " params={'target_dir' : images_path,\n", + " 'key' : 'contents'},\n", + " inputs={'archive_url': 'http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip'})" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-21 09:34:20,222 starting run download uid=f867d42711c346128fe5fb8abf152b75 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 09:34:20,303 Job is running in the background, pod: download-ffkxl\n", + "[mlrun] 2020-01-21 09:34:26,693 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp\n", + "[mlrun] 2020-01-21 09:34:27,552 Verified directories\n", + "archive url /tmp/tmp05kj3avl.zip\n", + "archive tar \n", + "archive TMP \n", + "['/tmp/tmp05kj3avl', 'zip']\n", + "[mlrun] 2020-01-21 09:34:27,552 opening zip\n", + 
"[mlrun] 2020-01-21 09:34:34,430 log artifact content at /User/mlrun/functions/images, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-20 08:36:23,011 run executed, status=completed\n", + "[mlrun] 2020-01-21 09:34:34,443 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -217,26 +395,26 @@ " \n", " \n", " \n", - "
...126dd2
\n", + "
...152b75
\n", " 0\n", - " Jan 20 08:36:19\n", + " Jan 21 09:34:26\n", " completed\n", - " file_utils\n", - "
host=download-6mg4q
kind=job
owner=admin
\n", + " open-archive\n", + "
host=download-ffkxl
kind=job
owner=admin
\n", "
archive_url
\n", - "
key=contents
target_path=/User/mlrun/functions/images
\n", + "
key=contents
target_dir=/User/mlrun/functions/images
\n", " \n", - "
content
\n", + "
content
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -252,34 +430,360 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 79a5b0f103c24367961cf8c107126dd2 , !mlrun logs 79a5b0f103c24367961cf8c107126dd2 \n", - "[mlrun] 2020-01-20 08:36:24,208 run executed, status=completed\n" + "!mlrun get run f867d42711c346128fe5fb8abf152b75 , !mlrun logs f867d42711c346128fe5fb8abf152b75 \n", + "[mlrun] 2020-01-21 09:34:35,542 run executed, status=completed\n" ] } ], "source": [ - "# load function from Github\n", - "xfn = mlrun.import_function('https://raw.githubusercontent.com/yjb-ds/functions/arc2parq/fileutils/open_archive/function.yaml')\n", - "\n", - "# configute it: mount on iguazio fabric, set as interactive (return stdout)\n", - "xfn.apply(mlrun.mount_v3io())\n", - "xfn.interactive = True\n", - "\n", + "# run\n", + "run = fn.run(open_archive_task)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### tar.gz" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ "# create and run the task\n", - "\n", - "images_path = '/User/mlrun/functions/images'\n", + "images_path = '/User/mlrun/functions/t00'\n", "\n", "open_archive_task = mlrun.NewTask(\n", " 'download',\n", " handler='open_archive', \n", - " params={'target_path': images_path,\n", + " params={'target_dir' : images_path,\n", " 'key' : 'contents'},\n", - " inputs={'archive_url': 'http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip'}\n", - ")\n", - "\n", + " inputs={\n", + " 'archive_url': 'https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz',\n", + " 'archive_tar': 'https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz',\n", + " 'archive_TMP': 'https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz'})" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-21 09:34:35,557 starting 
run download uid=053019b4cdd34f1f9aad440d3423e400 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 09:34:35,625 Job is running in the background, pod: download-hhzsd\n", + "[mlrun] 2020-01-21 09:34:39,966 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", + "[mlrun] 2020-01-21 09:34:40,790 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", + "[mlrun] 2020-01-21 09:34:41,615 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", + "[mlrun] 2020-01-21 09:34:42,575 Verified directories\n", + "archive url /tmp/tmp_a8c0lt3.gz\n", + "archive tar /tmp/tmp7pmvyi7g.gz\n", + "archive TMP /tmp/tmpw_5blqyr.gz\n", + "['/tmp/tmp_a8c0lt3', 'gz']\n", + "[mlrun] 2020-01-21 09:34:42,575 opening tar_gz\n", + "[mlrun] 2020-01-21 09:34:42,578 Traceback (most recent call last):\n", + " File \"/usr/local/lib/python3.6/site-packages/mlrun-0.4.3-py3.6.egg/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", + " val = handler(*args_list)\n", + " File \"main.py\", line 37, in open_archive\n", + " ref = tarfile.open(fileobj=urllib.request.urlopen(archive_url), mode='r|gz')\n", + " File \"/usr/local/lib/python3.6/urllib/request.py\", line 223, in urlopen\n", + " return opener.open(url, data, timeout)\n", + " File \"/usr/local/lib/python3.6/urllib/request.py\", line 511, in open\n", + " req = Request(fullurl, data)\n", + " File \"/usr/local/lib/python3.6/urllib/request.py\", line 329, in __init__\n", + " self.full_url = url\n", + " File \"/usr/local/lib/python3.6/urllib/request.py\", line 355, in full_url\n", + " self._parse()\n", + " File \"/usr/local/lib/python3.6/urllib/request.py\", line 384, in _parse\n", + " raise ValueError(\"unknown url type: %r\" % self.full_url)\n", + "ValueError: unknown url type: '/tmp/tmp_a8c0lt3.gz'\n", + "\n", + "\n", + "[mlrun] 2020-01-21 09:34:42,587 exec error - unknown url type: '/tmp/tmp_a8c0lt3.gz'\n", + "[mlrun] 2020-01-21 09:34:42,612 run 
executed, status=error\n", + "runtime error: unknown url type: '/tmp/tmp_a8c0lt3.gz'\n", + "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", + " InsecureRequestWarning)\n", + "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", + " InsecureRequestWarning)\n", + "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", + " InsecureRequestWarning)\n", + "unknown url type: '/tmp/tmp_a8c0lt3.gz'\n", + "final state: failed\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...23e400
0Jan 21 09:34:39
error
open-archive
host=download-hhzsd
kind=job
owner=admin
archive_TMP
archive_tar
archive_url
key=contents
target_dir=/User/mlrun/functions/t00
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run 053019b4cdd34f1f9aad440d3423e400 , !mlrun logs 053019b4cdd34f1f9aad440d3423e400 \n", + "[mlrun] 2020-01-21 09:34:44,783 run executed, status=error\n", + "runtime error: unknown url type: '/tmp/tmp_a8c0lt3.gz'\n" + ] + }, + { + "ename": "RunError", + "evalue": "unknown url type: '/tmp/tmp_a8c0lt3.gz'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mrun\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mopen_archive_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: 
{}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mRunError\u001b[0m: unknown url type: '/tmp/tmp_a8c0lt3.gz'" + ] + } + ], + "source": [ "# run\n", - "run = xfn.run(open_archive_task)" + "run = fn.run(open_archive_task)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "______" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### test outside mlrun" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ref = tarfile.open(fileobj=urllib.request.urlopen('https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz'), mode='r|gz')" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ref.extractall('/User/test25')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { From 6deeed18dd44b0d9e5c4d67fdca1696b1eaecbac Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 09:38:24 +0000 Subject: [PATCH 07/32] buggy open_archive, implicit download and name mangling issue --- fileutils/open_archive/file_utils.py | 27 +++++++++++++++++++-------- fileutils/open_archive/function.yaml | 16 
+++++++++++++--- 2 files changed, 32 insertions(+), 11 deletions(-) diff --git a/fileutils/open_archive/file_utils.py b/fileutils/open_archive/file_utils.py index b8cae15f1..a7f52a032 100644 --- a/fileutils/open_archive/file_utils.py +++ b/fileutils/open_archive/file_utils.py @@ -1,24 +1,35 @@ import os import zipfile +import urllib +import tarfile import json from tempfile import mktemp - def open_archive(context, target_dir='content', archive_url=''): - """Open a file/object archive into a target directory""" + """Open a file/object archive into a target directory + + Currently supports zip and tar.gz + """ # Define locations os.makedirs(target_dir, exist_ok=True) context.logger.info('Verified directories') - # Extract dataset from zip - context.logger.info('Extracting zip') - zip_ref = zipfile.ZipFile(archive_url, 'r') - zip_ref.extractall(target_dir) - zip_ref.close() + splits = archive_url.split('.') + if ('.'.join(splits[-2:]) == 'tar.gz'): + # Extract dataset from tar + context.logger.info('opening tar_gz') + ftpstream = urllib.request.urlopen(archive_url) + ref = tarfile.open(fileobj=ftpstream, mode="r|gz") + elif splits[-1] == 'zip': + # Extract dataset from zip + context.logger.info('opening zip') + ref = zipfile.ZipFile(archive_url, 'r') + + ref.extractall(target_dir) + ref.close() context.logger.info(f'extracted archive to {target_dir}') context.log_artifact('content', target_path=target_dir) - \ No newline at end of file diff --git a/fileutils/open_archive/function.yaml b/fileutils/open_archive/function.yaml index 5a4547ece..6c19ee35a 100644 --- a/fileutils/open_archive/function.yaml +++ b/fileutils/open_archive/function.yaml @@ -1,8 +1,18 @@ kind: job metadata: - name: file_utils + name: open-archive + tag: '' + hash: 3157c6b15e60e50b07c6289483c5fc90aa4702eb + project: '' spec: + command: '' + args: [] image: mlrun/mlrun:latest - description: 'file utilities' + volumes: [] + volume_mounts: [] + env: [] + description: '' build: - functionSourceCode: 
IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDE5LTEwLTI4IDIyOjAzCgppbXBvcnQgb3MKaW1wb3J0IHppcGZpbGUKCmRlZiBvcGVuX2FyY2hpdmUoY29udGV4dCwgCiAgICAgICAgICAgICAgICAgdGFyZ2V0X2Rpcj0nY29udGVudCcsCiAgICAgICAgICAgICAgICAgYXJjaGl2ZV91cmw9JycpOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgaW50byBhIHRhcmdldCBkaXJlY3RvcnkiIiIKICAgICAgICAKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9kaXIsIGV4aXN0X29rPVRydWUpCiAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdWZXJpZmllZCBkaXJlY3RvcmllcycpCiAgICAKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ0V4dHJhY3RpbmcgemlwJykKICAgIHppcF9yZWYgPSB6aXBmaWxlLlppcEZpbGUoYXJjaGl2ZV91cmwsICdyJykKICAgIHppcF9yZWYuZXh0cmFjdGFsbCh0YXJnZXRfZGlyKQogICAgemlwX3JlZi5jbG9zZSgpCiAgICAKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oZidleHRyYWN0ZWQgYXJjaGl2ZSB0byB7dGFyZ2V0X2Rpcn0nKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2NvbnRlbnQnLCB0YXJnZXRfcGF0aD10YXJnZXRfZGlyKQoK + functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA5OjM0CgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgdXJsbGliLnJlcXVlc3QKCmltcG9ydCBvcwppbXBvcnQgemlwZmlsZQppbXBvcnQgdXJsbGliCmltcG9ydCB0YXJmaWxlCmltcG9ydCBqc29uCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKCmRlZiBvcGVuX2FyY2hpdmUoY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgICAgICAgICAgICAgIHRhcmdldF9kaXI6IHN0ciA9ICdjb250ZW50JywKICAgICAgICAgICAgICAgICBhcmNoaXZlX3VybDogc3RyID0gJycsCiAgICAgICAgICAgICAgICAgYXJjaGl2ZV90YXI6IHN0ciA9ICcnLCAjICBmdWRnZSBwYXJhbWV0ZXIKICAgICAgICAgICAgICAgICBhcmNoaXZlX1RNUDogc3RyID0gJycgICMgIGZ1ZGdlIHBhcmFtZXRlcgopOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgaW50byBhIHRhcmdldCBkaXJlY3RvcnkKICAgIAogICAgQ3VycmVudGx5IHN1cHBvcnRzIHppcCBhbmQgdGFyLmd6CiAgICAiIiIKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9kaXIsIGV4aXN0X29rPVRydWUpCiAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdWZXJpZmllZCBkaXJlY3RvcmllcycpCiAgICAKICAgIHByaW50KCdhcmNoaXZlIHVybCcsIGFyY2hpdmVfdXJsKQogICAgcHJpbnQoJ2FyY2hpdmUgdGFyJywgYXJjaGl2ZV90YXIpCiAgICBwcmludCgnYXJjaGl2ZSBUTVAnLCBhcmNoaXZlX1RNUCkKCiAgICBzcGxpdHMgPSBhcmNoaXZlX3VybC5zcGxpd
CgnLicpCiAgICBwcmludChzcGxpdHMpCiAgICBpZiAoc3BsaXRzWy0xXSA9PSAnZ3onKToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdvcGVuaW5nIHRhcl9neicpCiAgICAgICAgcmVmID0gdGFyZmlsZS5vcGVuKGZpbGVvYmo9dXJsbGliLnJlcXVlc3QudXJsb3BlbihhcmNoaXZlX3VybCksIG1vZGU9J3J8Z3onKQogICAgZWxpZiBzcGxpdHNbLTFdID09ICd6aXAnOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ29wZW5pbmcgemlwJykKICAgICAgICByZWYgPSB6aXBmaWxlLlppcEZpbGUoYXJjaGl2ZV91cmwsICdyJykKCiAgICByZWYuZXh0cmFjdGFsbCh0YXJnZXRfZGlyKQogICAgcmVmLmNsb3NlKCkKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnY29udGVudCcsIHRhcmdldF9wYXRoPXRhcmdldF9kaXIpCgo= + commands: [] + code_origin: https://github.com/yjb-ds/functions.git#5a28adbd7643e2a27fcfed41e081a6ffaffcfd08:open_archive.ipynb From 5cd132298ff59c7549c37af0184056b72b177e19 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 09:51:02 +0000 Subject: [PATCH 08/32] name mangling restricted to 'inputs' paramater, running --- fileutils/open_archive/function.yaml | 6 +- tests/open_archive.ipynb | 428 +++++++++++++++++++-------- 2 files changed, 312 insertions(+), 122 deletions(-) diff --git a/fileutils/open_archive/function.yaml b/fileutils/open_archive/function.yaml index 6c19ee35a..dfc733e4f 100644 --- a/fileutils/open_archive/function.yaml +++ b/fileutils/open_archive/function.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: open-archive tag: '' - hash: 3157c6b15e60e50b07c6289483c5fc90aa4702eb + hash: f636d58e75f2044e010c7bfedc2ce0720eb207c5 project: '' spec: command: '' @@ -13,6 +13,6 @@ spec: env: [] description: '' build: - functionSourceCode: 
IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA5OjM0CgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgdXJsbGliLnJlcXVlc3QKCmltcG9ydCBvcwppbXBvcnQgemlwZmlsZQppbXBvcnQgdXJsbGliCmltcG9ydCB0YXJmaWxlCmltcG9ydCBqc29uCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKCmRlZiBvcGVuX2FyY2hpdmUoY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgICAgICAgICAgICAgIHRhcmdldF9kaXI6IHN0ciA9ICdjb250ZW50JywKICAgICAgICAgICAgICAgICBhcmNoaXZlX3VybDogc3RyID0gJycsCiAgICAgICAgICAgICAgICAgYXJjaGl2ZV90YXI6IHN0ciA9ICcnLCAjICBmdWRnZSBwYXJhbWV0ZXIKICAgICAgICAgICAgICAgICBhcmNoaXZlX1RNUDogc3RyID0gJycgICMgIGZ1ZGdlIHBhcmFtZXRlcgopOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgaW50byBhIHRhcmdldCBkaXJlY3RvcnkKICAgIAogICAgQ3VycmVudGx5IHN1cHBvcnRzIHppcCBhbmQgdGFyLmd6CiAgICAiIiIKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9kaXIsIGV4aXN0X29rPVRydWUpCiAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdWZXJpZmllZCBkaXJlY3RvcmllcycpCiAgICAKICAgIHByaW50KCdhcmNoaXZlIHVybCcsIGFyY2hpdmVfdXJsKQogICAgcHJpbnQoJ2FyY2hpdmUgdGFyJywgYXJjaGl2ZV90YXIpCiAgICBwcmludCgnYXJjaGl2ZSBUTVAnLCBhcmNoaXZlX1RNUCkKCiAgICBzcGxpdHMgPSBhcmNoaXZlX3VybC5zcGxpdCgnLicpCiAgICBwcmludChzcGxpdHMpCiAgICBpZiAoc3BsaXRzWy0xXSA9PSAnZ3onKToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdvcGVuaW5nIHRhcl9neicpCiAgICAgICAgcmVmID0gdGFyZmlsZS5vcGVuKGZpbGVvYmo9dXJsbGliLnJlcXVlc3QudXJsb3BlbihhcmNoaXZlX3VybCksIG1vZGU9J3J8Z3onKQogICAgZWxpZiBzcGxpdHNbLTFdID09ICd6aXAnOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ29wZW5pbmcgemlwJykKICAgICAgICByZWYgPSB6aXBmaWxlLlppcEZpbGUoYXJjaGl2ZV91cmwsICdyJykKCiAgICByZWYuZXh0cmFjdGFsbCh0YXJnZXRfZGlyKQogICAgcmVmLmNsb3NlKCkKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnY29udGVudCcsIHRhcmdldF9wYXRoPXRhcmdldF9kaXIpCgo= + functionSourceCode: 
IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA5OjQ3CgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgdXJsbGliLnJlcXVlc3QKCmltcG9ydCBvcwppbXBvcnQgemlwZmlsZQppbXBvcnQgdXJsbGliCmltcG9ydCB0YXJmaWxlCmltcG9ydCBqc29uCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKCmRlZiBvcGVuX2FyY2hpdmUoY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgICAgICAgICAgICAgIHRhcmdldF9kaXI6IHN0ciA9ICdjb250ZW50JywKICAgICAgICAgICAgICAgICBhcmNoaXZlX3VybDogc3RyID0gJycpOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgaW50byBhIHRhcmdldCBkaXJlY3RvcnkKICAgIAogICAgQ3VycmVudGx5IHN1cHBvcnRzIHppcCBhbmQgdGFyLmd6CiAgICAiIiIKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9kaXIsIGV4aXN0X29rPVRydWUpCiAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdWZXJpZmllZCBkaXJlY3RvcmllcycpCiAgICBwcmludChhcmNoaXZlX3VybCkKICAgIHNwbGl0cyA9IGFyY2hpdmVfdXJsLnNwbGl0KCcuJykKICAgIHByaW50KHNwbGl0cykKICAgIGlmIChzcGxpdHNbLTFdID09ICdneicpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ29wZW5pbmcgdGFyX2d6JykKICAgICAgICByZWYgPSB0YXJmaWxlLm9wZW4oZmlsZW9iaj11cmxsaWIucmVxdWVzdC51cmxvcGVuKGFyY2hpdmVfdXJsKSwgbW9kZT0ncnxneicpCiAgICBlbGlmIHNwbGl0c1stMV0gPT0gJ3ppcCc6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygnb3BlbmluZyB6aXAnKQogICAgICAgIHJlZiA9IHppcGZpbGUuWmlwRmlsZShhcmNoaXZlX3VybCwgJ3InKQoKICAgIHJlZi5leHRyYWN0YWxsKHRhcmdldF9kaXIpCiAgICByZWYuY2xvc2UoKQoKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCdjb250ZW50JywgdGFyZ2V0X3BhdGg9dGFyZ2V0X2RpcikKCg== commands: [] - code_origin: https://github.com/yjb-ds/functions.git#5a28adbd7643e2a27fcfed41e081a6ffaffcfd08:open_archive.ipynb + code_origin: https://github.com/yjb-ds/functions.git#6deeed18dd44b0d9e5c4d67fdca1696b1eaecbac:open_archive.ipynb diff --git a/tests/open_archive.ipynb b/tests/open_archive.ipynb index 72513ba26..4a8e0a706 100644 --- a/tests/open_archive.ipynb +++ b/tests/open_archive.ipynb @@ -63,10 +63,7 @@ "\n", "def open_archive(context: MLClientCtx, \n", " target_dir: str = 'content',\n", - " archive_url: str = '',\n", - " archive_tar: str = '', # fudge parameter\n", - " 
archive_TMP: str = '' # fudge parameter\n", - "):\n", + " archive_url: str = ''):\n", " \"\"\"Open a file/object archive into a target directory\n", " \n", " Currently supports zip and tar.gz\n", @@ -74,13 +71,7 @@ " # Define locations\n", " os.makedirs(target_dir, exist_ok=True)\n", " context.logger.info('Verified directories')\n", - " \n", - " # performs an implicit download to /tmp at this point and MANGLES name:\n", - " print('archive url', archive_url)\n", - " print('archive tar', archive_tar)\n", - " print('archive TMP', archive_TMP)\n", - "# assert archive_tar == archive_url \n", - "\n", + " print(archive_url)\n", " splits = archive_url.split('.')\n", " print(splits)\n", " if (splits[-1] == 'gz'):\n", @@ -130,7 +121,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 09:34:20,162 function spec saved to path: /User/repos/functions/fileutils/open_archive/function.yaml\n" + "[mlrun] 2020-01-21 09:47:47,371 function spec saved to path: /User/repos/functions/fileutils/open_archive/function.yaml\n" ] } ], @@ -184,16 +175,8 @@ "outputs": [], "source": [ "# create and run the task\n", + "images_path = '/User/mlrun/functions/images'\n", "\n", - "images_path = '/User/mlrun/functions/images'" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ "open_archive_task = mlrun.NewTask(\n", " 'download',\n", " handler='open_archive', \n", @@ -204,25 +187,23 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 09:34:20,222 starting run download uid=f867d42711c346128fe5fb8abf152b75 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-21 09:34:20,303 Job is running in the background, pod: download-ffkxl\n", - "[mlrun] 2020-01-21 09:34:26,693 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp\n", - "[mlrun] 2020-01-21 09:34:27,552 Verified 
directories\n", - "archive url /tmp/tmp05kj3avl.zip\n", - "archive tar \n", - "archive TMP \n", - "['/tmp/tmp05kj3avl', 'zip']\n", - "[mlrun] 2020-01-21 09:34:27,552 opening zip\n", - "[mlrun] 2020-01-21 09:34:34,430 log artifact content at /User/mlrun/functions/images, size: None, db: Y\n", + "[mlrun] 2020-01-21 09:47:47,427 starting run download uid=299b648c59294e9891334ded6159d8aa -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 09:47:47,497 Job is running in the background, pod: download-92bzm\n", + "[mlrun] 2020-01-21 09:47:51,918 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp\n", + "[mlrun] 2020-01-21 09:47:52,812 Verified directories\n", + "/tmp/tmpxqpdi5zq.zip\n", + "['/tmp/tmpxqpdi5zq', 'zip']\n", + "[mlrun] 2020-01-21 09:47:52,812 opening zip\n", + "[mlrun] 2020-01-21 09:47:59,625 log artifact content at /User/mlrun/functions/images, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-21 09:34:34,443 run executed, status=completed\n", + "[mlrun] 2020-01-21 09:47:59,635 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -395,12 +376,12 @@ " \n", " \n", " \n", - "
...152b75
\n", + "
...59d8aa
\n", " 0\n", - " Jan 21 09:34:26\n", + " Jan 21 09:47:51\n", " completed\n", " open-archive\n", - "
host=download-ffkxl
kind=job
owner=admin
\n", + "
host=download-92bzm
kind=job
owner=admin
\n", "
archive_url
\n", "
key=contents
target_dir=/User/mlrun/functions/images
\n", " \n", @@ -409,12 +390,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -430,8 +411,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run f867d42711c346128fe5fb8abf152b75 , !mlrun logs f867d42711c346128fe5fb8abf152b75 \n", - "[mlrun] 2020-01-21 09:34:35,542 run executed, status=completed\n" + "!mlrun get run 299b648c59294e9891334ded6159d8aa , !mlrun logs 299b648c59294e9891334ded6159d8aa \n", + "[mlrun] 2020-01-21 09:48:02,711 run executed, status=completed\n" ] } ], @@ -449,48 +430,293 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# create and run the task\n", - "images_path = '/User/mlrun/functions/t00'\n", + "images_path = '/User/mlrun/functions/t000'\n", + "\n", + "open_archive_task = mlrun.NewTask(\n", + " 'download',\n", + " handler='open_archive', \n", + " params={'target_dir' : images_path,\n", + " 'key' : 'contents',\n", + " 'archive_url': 'https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz'})" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-21 09:48:45,223 starting run download uid=e5df4261e94847c999df30bbe88fe6c8 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 09:48:45,298 Job is running in the background, pod: download-sr2pp\n", + "[mlrun] 2020-01-21 09:48:49,674 Verified directories\n", + "https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz\n", + "['https://fpsignals-public', 's3', 'amazonaws', 'com/catsndogs', 'tar', 'gz']\n", + "[mlrun] 2020-01-21 09:48:49,674 opening tar_gz\n", + "[mlrun] 2020-01-21 09:49:03,258 log artifact content at /User/mlrun/functions/t000, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-21 09:49:03,273 run executed, status=completed\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uid | iter | start | state | name | labels | inputs | parameters | results | artifacts
...8fe6c8 | 0 | Jan 21 09:48:49 | completed | open-archive | host=download-sr2pp, kind=job, owner=admin | | archive_url=https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz, key=contents, target_dir=/User/mlrun/functions/t000 | | content
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run e5df4261e94847c999df30bbe88fe6c8 , !mlrun logs e5df4261e94847c999df30bbe88fe6c8 \n", + "[mlrun] 2020-01-21 09:49:04,471 run executed, status=completed\n" + ] + } + ], + "source": [ + "# run\n", + "run = fn.run(open_archive_task)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# create and run the task\n", + "images_path = '/User/mlrun/functions/t0000'\n", "\n", "open_archive_task = mlrun.NewTask(\n", " 'download',\n", " handler='open_archive', \n", " params={'target_dir' : images_path,\n", " 'key' : 'contents'},\n", - " inputs={\n", - " 'archive_url': 'https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz',\n", - " 'archive_tar': 'https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz',\n", - " 'archive_TMP': 'https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz'})" + " inputs={'archive_url': 'https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz'})" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 09:34:35,557 starting run download uid=053019b4cdd34f1f9aad440d3423e400 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-21 09:34:35,625 Job is running in the background, pod: download-hhzsd\n", - "[mlrun] 2020-01-21 09:34:39,966 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", - "[mlrun] 2020-01-21 09:34:40,790 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", - "[mlrun] 2020-01-21 09:34:41,615 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", - "[mlrun] 2020-01-21 09:34:42,575 Verified directories\n", - 
"archive url /tmp/tmp_a8c0lt3.gz\n", - "archive tar /tmp/tmp7pmvyi7g.gz\n", - "archive TMP /tmp/tmpw_5blqyr.gz\n", - "['/tmp/tmp_a8c0lt3', 'gz']\n", - "[mlrun] 2020-01-21 09:34:42,575 opening tar_gz\n", - "[mlrun] 2020-01-21 09:34:42,578 Traceback (most recent call last):\n", + "[mlrun] 2020-01-21 09:50:08,023 starting run download uid=74f0391c2da04aabb3f0735bfa977b17 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 09:50:08,112 Job is running in the background, pod: download-8wt5x\n", + "[mlrun] 2020-01-21 09:50:14,529 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", + "[mlrun] 2020-01-21 09:50:15,651 Verified directories\n", + "/tmp/tmp14moqiew.gz\n", + "['/tmp/tmp14moqiew', 'gz']\n", + "[mlrun] 2020-01-21 09:50:15,651 opening tar_gz\n", + "[mlrun] 2020-01-21 09:50:15,653 Traceback (most recent call last):\n", " File \"/usr/local/lib/python3.6/site-packages/mlrun-0.4.3-py3.6.egg/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", " val = handler(*args_list)\n", - " File \"main.py\", line 37, in open_archive\n", + " File \"main.py\", line 30, in open_archive\n", " ref = tarfile.open(fileobj=urllib.request.urlopen(archive_url), mode='r|gz')\n", " File \"/usr/local/lib/python3.6/urllib/request.py\", line 223, in urlopen\n", " return opener.open(url, data, timeout)\n", @@ -502,19 +728,15 @@ " self._parse()\n", " File \"/usr/local/lib/python3.6/urllib/request.py\", line 384, in _parse\n", " raise ValueError(\"unknown url type: %r\" % self.full_url)\n", - "ValueError: unknown url type: '/tmp/tmp_a8c0lt3.gz'\n", + "ValueError: unknown url type: '/tmp/tmp14moqiew.gz'\n", "\n", "\n", - "[mlrun] 2020-01-21 09:34:42,587 exec error - unknown url type: '/tmp/tmp_a8c0lt3.gz'\n", - "[mlrun] 2020-01-21 09:34:42,612 run executed, status=error\n", - "runtime error: unknown url type: '/tmp/tmp_a8c0lt3.gz'\n", + "[mlrun] 2020-01-21 09:50:15,663 exec error - unknown url type: '/tmp/tmp14moqiew.gz'\n", + "[mlrun] 2020-01-21 
09:50:15,689 run executed, status=error\n", "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", + "runtime error: unknown url type: '/tmp/tmp14moqiew.gz'\n", " InsecureRequestWarning)\n", - "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", - " InsecureRequestWarning)\n", - "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", - " InsecureRequestWarning)\n", - "unknown url type: '/tmp/tmp_a8c0lt3.gz'\n", + "unknown url type: '/tmp/tmp14moqiew.gz'\n", "final state: failed\n" ] }, @@ -687,26 +909,26 @@ " \n", " \n", " \n", - "
...23e400
\n", + "
...977b17
\n", " 0\n", - " Jan 21 09:34:39\n", - "
error
\n", + " Jan 21 09:50:14\n", + "
error
\n", " open-archive\n", - "
host=download-hhzsd
kind=job
owner=admin
\n", - "
archive_TMP
archive_tar
archive_url
\n", - "
key=contents
target_dir=/User/mlrun/functions/t00
\n", + "
host=download-8wt5x
kind=job
owner=admin
\n", + "
archive_url
\n", + "
key=contents
target_dir=/User/mlrun/functions/t0000
\n", " \n", " \n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -722,22 +944,22 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 053019b4cdd34f1f9aad440d3423e400 , !mlrun logs 053019b4cdd34f1f9aad440d3423e400 \n", - "[mlrun] 2020-01-21 09:34:44,783 run executed, status=error\n", - "runtime error: unknown url type: '/tmp/tmp_a8c0lt3.gz'\n" + "!mlrun get run 74f0391c2da04aabb3f0735bfa977b17 , !mlrun logs 74f0391c2da04aabb3f0735bfa977b17 \n", + "[mlrun] 2020-01-21 09:50:17,234 run executed, status=error\n", + "runtime error: unknown url type: '/tmp/tmp14moqiew.gz'\n" ] }, { "ename": "RunError", - "evalue": "unknown url type: '/tmp/tmp_a8c0lt3.gz'", + "evalue": "unknown url type: '/tmp/tmp14moqiew.gz'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mrun\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mopen_archive_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mrun\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mopen_archive_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in 
\u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mRunError\u001b[0m: unknown url type: '/tmp/tmp_a8c0lt3.gz'" + "\u001b[0;31mRunError\u001b[0m: unknown url type: '/tmp/tmp14moqiew.gz'" ] } ], @@ -746,38 +968,6 @@ "run = fn.run(open_archive_task)" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "______" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### test outside mlrun" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ref = tarfile.open(fileobj=urllib.request.urlopen('https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz'), mode='r|gz')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ref.extractall('/User/test25')" - ] - }, { "cell_type": "code", "execution_count": null, From 
208ab7f82c1def90cd5911679a933aa787454f88 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 09:53:04 +0000 Subject: [PATCH 09/32] adjusted yaml link --- tests/open_archive.ipynb | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/tests/open_archive.ipynb b/tests/open_archive.ipynb index 4a8e0a706..650c551d9 100644 --- a/tests/open_archive.ipynb +++ b/tests/open_archive.ipynb @@ -114,40 +114,41 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 20, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-21 09:47:47,371 function spec saved to path: /User/repos/functions/fileutils/open_archive/function.yaml\n" - ] - } - ], + "outputs": [], "source": [ "# export function yaml\n", - "fn.export('/User/repos/functions/fileutils/open_archive/function.yaml')" + "# fn.export('/User/repos/functions/fileutils/open_archive/function.yaml')" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# import function yaml\n", - "fn = mlrun.import_function('/User/repos/functions/fileutils/open_archive/function.yaml')" + "# fn = mlrun.import_function('/User/repos/functions/fileutils/open_archive/function.yaml')" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 22, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/User/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. 
See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", + " InsecureRequestWarning)\n" + ] + } + ], "source": [ "# load function from Github\n", - "# fn = mlrun.import_function('https://raw.githubusercontent.com/yjb-ds/functions/arc2parq/fileutils/open_archive/function.yaml')" + "fn = mlrun.import_function('https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils/open_archive/function.yaml')" ] }, { From 94a4aab5417c8979c32226656885a18cdb26cc28 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 15:51:05 +0000 Subject: [PATCH 10/32] arc-to-parq fixes --- fileutils/arc_to_parquet/arc_to_parquet.py | 51 ++- fileutils/arc_to_parquet/arc_to_parquet.yaml | 6 +- fileutils/open_archive/file_utils.py | 13 + serving/classifier_server.ipynb | 418 +++++++++++++++++++ tests/arc_to_parquet.ipynb | 414 +++++++++--------- tests/generate-some-classifiers.ipynb | 311 ++++++++++++++ 6 files changed, 973 insertions(+), 240 deletions(-) create mode 100644 serving/classifier_server.ipynb create mode 100644 tests/generate-some-classifiers.ipynb diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py b/fileutils/arc_to_parquet/arc_to_parquet.py index 64b889801..56cf8f5a5 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/arc_to_parquet.py @@ -1,8 +1,23 @@ +# Copyright 2018 Iguazio +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os +import json from pathlib import Path import pandas as pd import pyarrow.parquet as pq import pyarrow as pa +from pickle import dump, load from mlrun.execution import MLClientCtx from typing import IO, AnyStr, Union, List, Optional @@ -11,10 +26,10 @@ def arc_to_parquet( context: MLClientCtx, archive_url: Union[str, Path, IO[AnyStr]], - header: Union[int, List[str], None] = 0, + header: Optional[List[str]] = None, target_path: str = "", name: str = "", - chunksize: int = 10_240, + chunksize: int = 10_000, log_data: bool = True, add_uid: bool = False, key: str = "raw_data", @@ -29,33 +44,19 @@ def arc_to_parquet( :param target_path: destination folder of table :param name: name file to be saved locally, also :param chunksize: (0) row size retrieved per iteration - :param log_data: (True) if True, log the data so that it is available - at the next step - :param add_uid: (False) add the metadata uid to the target_path so that - runs can be identified :param key: key in artifact store (when log_data=True) """ if not name.endswith(".parquet"): name += ".parquet" - if not add_uid: - uid = "" - else: - uid = context.uid - - dest_path = os.path.join(target_path, uid, name) - os.makedirs(os.path.join(target_path, uid), exist_ok=True) - + dest_path = os.path.join(target_path, name) + os.makedirs(os.path.join(target_path), exist_ok=True) if not os.path.isfile(dest_path): context.logger.info("destination file does not exist, downloading") pqwriter = None - for i, df in enumerate( - pd.read_csv(archive_url, chunksize=chunksize, names=header) - ): + for i, df in enumerate(pd.read_csv(archive_url, chunksize=chunksize, names=header)): table = pa.Table.from_pandas(df) if i == 0: - header = list(df) - header = [x.replace(' ', '_') for x in header] pqwriter = pq.ParquetWriter(dest_path, table.schema) pqwriter.write_table(table) @@ -66,10 +67,8 @@ def arc_to_parquet( else: context.logger.info("destination file already exists") - if log_data: - context.log_artifact(key, 
target_path=dest_path) - # log header - filepath = path.join(target_path, 'header.json') - json.dump(header, open(filepath, 'w')) - context.log_artifact('header', target_path=filepath) - + context.log_artifact(key, target_path=dest_path) + # log header + filepath = os.path.join(target_path, 'header.pkl') + dump(header, open(filepath, 'wb')) + context.log_artifact('header', target_path=filepath) diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index a1dd5e6be..610b7e025 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: 251c8e50eb09d09032b4b3accf3c3a3a3c1b467b + hash: 41cf4cd59460123f71a3c50f4d399dfb84b54e3c project: '' spec: command: '' @@ -12,12 +12,12 @@ spec: env: [] description: '' build: - functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA3OjExCgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBVbmlvbltpbnQsIExpc3Rbc3RyXSwgTm9uZV0gPSAwLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICIiLAogICAgbmFtZTogc3RyID0gIiIsCiAgICBjaHVua3NpemU6IGludCA9IDEwXzAwMCwKICAgIGxvZ19kYXRhOiBib29sID0gVHJ1ZSwKICAgIGFkZF91aWQ6IGJvb2wgPSBGYWxzZSwKICAgIGtleTogc3RyID0gInJhd19kYXRhIiwKKSAtPiBOb25lOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgYW5kIHNhdmUgYXMgYSBwYXJxdWV0IGZpbGUuCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIGFyY2hpdmVfdXJsOiBhbnkgdmFsaWQgc3RyaW5nIHBhdGggY29uc2lzdGVudCB3aXRoIHRoZ
SBwYXRoIHZhcmlhYmxlCiAgICAgICAgICAgICAgICAgICAgICAgIG9mIHBhbmRhcy5yZWFkX2NzdiwgaW5jbHVkaW5nIHN0cmluZ3MgYXMgZmlsZSBwYXRocywgYXMgdXJscywgCiAgICAgICAgICAgICAgICAgICAgICAgIHBhdGhsaWIuUGF0aCBvYmplY3RzLCBldGMuLi4KICAgIDpwYXJhbSBoZWFkZXI6ICAgICAgY29sdW1uIG5hbWVzCiAgICA6cGFyYW0gdGFyZ2V0X3BhdGg6IGRlc3RpbmF0aW9uIGZvbGRlciBvZiB0YWJsZQogICAgOnBhcmFtIG5hbWU6ICAgICAgICBuYW1lIGZpbGUgdG8gYmUgc2F2ZWQgbG9jYWxseSwgYWxzbwogICAgOnBhcmFtIGNodW5rc2l6ZTogICAoMCkgcm93IHNpemUgcmV0cmlldmVkIHBlciBpdGVyYXRpb24KICAgIDpwYXJhbSBsb2dfZGF0YTogICAgKFRydWUpIGlmIFRydWUsIGxvZyB0aGUgZGF0YSBzbyB0aGF0IGl0IGlzIGF2YWlsYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBhdCB0aGUgbmV4dCBzdGVwCiAgICA6cGFyYW0gYWRkX3VpZDogICAgIChGYWxzZSkgYWRkIHRoZSBtZXRhZGF0YSB1aWQgdG8gdGhlIHRhcmdldF9wYXRoIHNvIHRoYXQgCiAgICAgICAgICAgICAgICAgICAgICAgIHJ1bnMgY2FuIGJlIGlkZW50aWZpZWQKICAgIDpwYXJhbSBrZXk6ICAgICAgICAga2V5IGluIGFydGlmYWN0IHN0b3JlICh3aGVuIGxvZ19kYXRhPVRydWUpCiAgICAiIiIKICAgIGlmIG5vdCBuYW1lLmVuZHN3aXRoKCIucGFycXVldCIpOgogICAgICAgIG5hbWUgKz0gIi5wYXJxdWV0IgoKICAgIGlmIG5vdCBhZGRfdWlkOgogICAgICAgIHVpZCA9ICIiCiAgICBlbHNlOgogICAgICAgIHVpZCA9IGNvbnRleHQudWlkCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCB1aWQsIG5hbWUpCiAgICBvcy5tYWtlZGlycyhvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIHVpZCksIGV4aXN0X29rPVRydWUpCiAgICBpZiBoZWFkZXIgPT0gMDoKICAgICAgICBoZWFkZXIgPSBwZC5yZWFkX2NzdihhcmNoaXZlX3VybCwgaGVhZGVyPU5vbmUsIG5yb3dzPTEpLmlsb2NbMF0udmFsdWVzCiAgICBoZWFkZXIgPSBbeC5yZXBsYWNlKCcgJywgJ18nKSBmb3IgeCBpbiBoZWFkZXJdCiAgICBpZiBub3Qgb3MucGF0aC5pc2ZpbGUoZGVzdF9wYXRoKToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGRvZXMgbm90IGV4aXN0LCBkb3dubG9hZGluZyIpCiAgICAgICAgcHF3cml0ZXIgPSBOb25lCiAgICAgICAgZm9yIGksIGRmIGluIGVudW1lcmF0ZShwZC5yZWFkX2NzdihhcmNoaXZlX3VybCwgY2h1bmtzaXplPWNodW5rc2l6ZSwgbmFtZXM9aGVhZGVyKSk6CiAgICAgICAgICAgIHRhYmxlID0gcGEuVGFibGUuZnJvbV9wYW5kYXMoZGYpCiAgICAgICAgICAgIGlmIGkgPT0gMDoKICAgICAgICAgICAgICAgIHBxd3JpdGVyID0gcHEuUGFycXVldFdyaXRlcihkZXN0X3BhdGgsIHRhYmxlLnNjaGVtYSkKICAgICAgICAgICAgcHF3cml0ZXIud3JpdGVfdGFibGUodGFibGUpC
gogICAgICAgIGlmIHBxd3JpdGVyOgogICAgICAgICAgICBwcXdyaXRlci5jbG9zZSgpCgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oZiJzYXZlZCB0YWJsZSB0byB7ZGVzdF9wYXRofSIpCiAgICBlbHNlOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgYWxyZWFkeSBleGlzdHMiKQoKICAgIGlmIGxvZ19kYXRhOgogICAgICAgIGNvbnRleHQubG9nX2FydGlmYWN0KGtleSwgdGFyZ2V0X3BhdGg9ZGVzdF9wYXRoKQogICAgICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLmpzb24nKQogICAgICAgIGpzb24uZHVtcChoZWFkZXIsIG9wZW4oZmlsZXBhdGgsICd3JykpCiAgICAgICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQoK + functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDE1OjEyCgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYX
RoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBhcnF1ZXQiKToKICAgICAgICBuYW1lICs9ICIucGFycXVldCIKCiAgICBkZXN0X3BhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBvcy5tYWtlZGlycyhvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgpLCBleGlzdF9vaz1UcnVlKQogICAgaWYgbm90IG9zLnBhdGguaXNmaWxlKGRlc3RfcGF0aCk6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBkb2VzIG5vdCBleGlzdCwgZG93bmxvYWRpbmciKQogICAgICAgIHBxd3JpdGVyID0gTm9uZQogICAgICAgIGZvciBpLCBkZiBpbiBlbnVtZXJhdGUocGQucmVhZF9jc3YoYXJjaGl2ZV91cmwsIGNodW5rc2l6ZT1jaHVua3NpemUsIG5hbWVzPWhlYWRlcikpOgogICAgICAgICAgICB0YWJsZSA9IHBhLlRhYmxlLmZyb21fcGFuZGFzKGRmKQogICAgICAgICAgICBpZiBpID09IDA6CiAgICAgICAgICAgICAgICBwcXdyaXRlciA9IHBxLlBhcnF1ZXRXcml0ZXIoZGVzdF9wYXRoLCB0YWJsZS5zY2hlbWEpCiAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRhYmxlKQoKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpICAgICAgIAoK commands: - python -m pip uninstall mlrun - python -m pip install -U -q mlrun - python -m pip install -U -q pandas - python -m pip install -U -q pyarrow - python -m pip install -U -q numpy==1.17.4 - code_origin: https://github.com/yjb-ds/functions.git#db8da20c32124015e26f0b6b9911814493c5d5de:arc + code_origin: 
https://github.com/yjb-ds/functions.git#208ab7f82c1def90cd5911679a933aa787454f88:arc to parquet.ipynb diff --git a/fileutils/open_archive/file_utils.py b/fileutils/open_archive/file_utils.py index a7f52a032..10c8afbe3 100644 --- a/fileutils/open_archive/file_utils.py +++ b/fileutils/open_archive/file_utils.py @@ -1,3 +1,16 @@ +# Copyright 2018 Iguazio +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os import zipfile import urllib diff --git a/serving/classifier_server.ipynb b/serving/classifier_server.ipynb new file mode 100644 index 000000000..3622f3e13 --- /dev/null +++ b/serving/classifier_server.ipynb @@ -0,0 +1,418 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Deploy a Serverless Model Server with Nuclio-KFServing\n", + " --------------------------------------------------------------------\n", + "\n", + "The following notebook demonstrates how to deploy **any pickled model** using **[nuclio](https://github.com/nuclio/nuclio)** + **[KFServing](https://github.com/kubeflow/kfserving)** (a.k.a Nuclio-serving)\n", + "\n", + "#### **notebook how-to's**\n", + "* Write and test model serving (KFServing) class in a notebook.\n", + "* Deploy the model server as a Nuclio-serving function.\n", + "* Invoke and test the serving function." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### **steps**\n", + "**[define a new function and its dependencies](#define-function)**
\n", + "**[test the model serving class locally](#test-locally)**
\n", + "**[deploy our serving class as a serverless function](#deploy)**
\n", + "**[test our model server using HTTP request](#test-model-server)**
" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# nuclio: ignore\n", + "# if the nuclio-jupyter package is not installed run !pip install nuclio-jupyter\n", + "import nuclio" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### **define a new function and its dependencies**" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "%nuclio: setting kind to 'nuclio:serving'\n", + "%nuclio: setting 'MODEL_CLASS' environment variable\n" + ] + } + ], + "source": [ + "%nuclio config kind=\"nuclio:serving\"\n", + "%nuclio env MODEL_CLASS=ClassifierModel" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "%%nuclio cmd -c\n", + "pip install -U -q kfserving\n", + "pip install -U -q azure\n", + "pip install -U -q numpy\n", + "pip install -U -q scikit-learn==0.21.3\n", + "pip install -U -q mlrun" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import kfserving\n", + "import os\n", + "import numpy as np\n", + "from cloudpickle import load as cload" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "TARGET_PATH = '/User/mlrun/models'\n", + "MODEL_FILE = 'ada-classifier.cpkl'" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "class ClassifierModel(kfserving.KFModel):\n", + " def __init__(self, name: str, model_dir: str, model = None):\n", + " super().__init__(name)\n", + " self.name = name\n", + " self.model_dir = model_dir\n", + " if not model is None:\n", + " self.classifier = model\n", + " self.ready = True\n", + "\n", + " def load(self):\n", + " model_file = os.path.join(\n", + " kfserving.Storage.download(self.model_dir), 
MODEL_FILE)\n", + " self.classifier = cload(open(model_file, 'rb'))\n", + " self.ready = True\n", + "\n", + " def predict(self, body):\n", + " try:\n", + " feats = np.asarray(body['instances'])\n", + " result: np.ndarray = self.classifier.predict(feats)\n", + " return result.tolist()\n", + " except Exception as e:\n", + " raise Exception(\"Failed to predict %s\" % e)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following end-code annotation tells ```nuclio``` to stop parsing the notebook from this cell. _**Please do not remove this cell**_:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# nuclio: end-code" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "______________________________________________" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### **test the model serving class locally**\n", + "The class above can be tested locally. 
Instantiate the class and call `.load()` to load the model into a local dir.\n", + "\n", + "> **Verify there is an `ada-classifier.cpkl` file (the `MODEL_FILE` set above) in the model_dir path (generated by the training notebook)**" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[I 200121 11:55:01 storage:35] Copying contents of /User/mlrun/models to local\n" + ] + } + ], + "source": [ + "my_server = ClassifierModel('some-classifier-model', model_dir='/User/mlrun/models')\n", + "my_server.load()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### _data_\n", + "Make some classification data using scikit-learn's `make_classification`:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import make_classification\n", + "n_samples = 10\n", + "train_size = 0.7\n", + "X, y = make_classification(\n", + " n_samples=n_samples,\n", + " n_features=28, \n", + " random_state = np.random.RandomState(1))" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "event = {\"instances\": X.tolist()}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We can use the `.predict(body)` method to test the model."
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0, 0, 1, 1, 1, 0, 0, 1, 1, 1]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_server.predict(event)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### **deploy our serving class as a serverless function**\n", + "In the following section we create a new model serving function that wraps our class, and specify the model and other resources.\n", + "\n", + "The `models` dict stores model names and the associated model **dir** URL (the URL can start with `S3://` or another blob store prefix). The faster way is to use a shared file volume: we use `.apply(mount_v3io())` to attach a v3io (iguazio data fabric) volume to our function. By default v3io will mount the current user home into the `/User` function path.\n", + "\n", + "**Verify the model dir contains a valid `ada-classifier.cpkl` file (the `MODEL_FILE` set above).**" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "from mlrun import new_model_server, mount_v3io\n", + "import requests" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fn = new_model_server('some-classifier-model', \n", + " models={'classifier_gen': TARGET_PATH}, \n", + " model_class='ClassifierModel')\n", + "\n", + "fn.apply(mount_v3io()) " + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "fn.spec.no_cache = True" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-21 11:55:06,216 deploy started\n", + "[nuclio]
2020-01-21 11:57:28,926 (info) Build complete\n", + "[nuclio] 2020-01-21 11:57:35,016 (info) Function deploy complete\n", + "[nuclio] 2020-01-21 11:57:35,022 done updating some-classifier-model, function address: 3.135.246.153:31529\n" + ] + } + ], + "source": [ + "addr = fn.deploy()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### **test our model server using HTTP request**\n", + "\n", + "\n", + "We invoke our model serving function using test data, the data vector is specified in the `instances` attribute." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import requests\n", + "\n", + "resp = requests.post(addr + '/classifier_gen/predict', json=event)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "b'Exception caught in handler \"No module named \\'sklearn.gaussian_process._gpc\\'\": Traceback (most recent call last):\\n File \"/opt/nuclio/_nuclio_wrapper.py\", line 176, in serve_requests\\n entrypoint_output = self._entrypoint(self._context, event)\\n File \"/opt/nuclio/classifier_server.py\", line 40, in handler\\n return context.mlrun_handler(context, event)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 67, in nuclio_serving_handler\\n return route(context, model_name, event)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 132, in post\\n model = self.get_model(name)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 88, in get_model\\n model.load()\\n File \"/opt/nuclio/classifier_server.py\", line 23, in load\\n self.classifier = cload(open(model_file, \\'rb\\'))\\nModuleNotFoundError: No module named \\'sklearn.gaussian_process._gpc\\'\\n'" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"resp.__dict__['_content'] " + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "ename": "JSONDecodeError", + "evalue": "Expecting value: line 1 column 1 (char 0)", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mJSONDecodeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mjson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloads\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontent\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/conda/lib/python3.6/json/__init__.py\u001b[0m in \u001b[0;36mloads\u001b[0;34m(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)\u001b[0m\n\u001b[1;32m 352\u001b[0m \u001b[0mparse_int\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mparse_float\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 353\u001b[0m parse_constant is None and object_pairs_hook is None and not kw):\n\u001b[0;32m--> 354\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_default_decoder\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 355\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcls\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0mcls\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mJSONDecoder\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + 
"\u001b[0;32m/conda/lib/python3.6/json/decoder.py\u001b[0m in \u001b[0;36mdecode\u001b[0;34m(self, s, _w)\u001b[0m\n\u001b[1;32m 337\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \"\"\"\n\u001b[0;32m--> 339\u001b[0;31m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraw_decode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0midx\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_w\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 340\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_w\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 341\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/json/decoder.py\u001b[0m in \u001b[0;36mraw_decode\u001b[0;34m(self, s, idx)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscan_once\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0midx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mStopIteration\u001b[0m \u001b[0;32mas\u001b[0m 
\u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 357\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mJSONDecodeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Expecting value\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 358\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mJSONDecodeError\u001b[0m: Expecting value: line 1 column 1 (char 0)" + ] + } + ], + "source": [ + "json.loads(resp.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**[back to top](#top)**" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index 47bd4f0a5..b9bbaaca4 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -41,6 +41,23 @@ "python -m pip install -U -q numpy==1.17.4" ] }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'\n" + ] + } + ], + "source": [ + "%nuclio config spec.build.baseImage = \"python:3.6-jessie\"" + ] + }, { "cell_type": "code", "execution_count": 4, @@ -53,6 +70,7 @@ "import pandas as pd\n", 
"import pyarrow.parquet as pq\n", "import pyarrow as pa\n", + "from pickle import dump, load\n", "\n", "from mlrun.execution import MLClientCtx\n", "from typing import IO, AnyStr, Union, List, Optional\n", @@ -61,7 +79,7 @@ "def arc_to_parquet(\n", " context: MLClientCtx,\n", " archive_url: Union[str, Path, IO[AnyStr]],\n", - " header: Union[int, List[str], None] = 0,\n", + " header: Optional[List[str]] = None,\n", " target_path: str = \"\",\n", " name: str = \"\",\n", " chunksize: int = 10_000,\n", @@ -79,25 +97,13 @@ " :param target_path: destination folder of table\n", " :param name: name file to be saved locally, also\n", " :param chunksize: (0) row size retrieved per iteration\n", - " :param log_data: (True) if True, log the data so that it is available\n", - " at the next step\n", - " :param add_uid: (False) add the metadata uid to the target_path so that \n", - " runs can be identified\n", " :param key: key in artifact store (when log_data=True)\n", " \"\"\"\n", " if not name.endswith(\".parquet\"):\n", " name += \".parquet\"\n", "\n", - " if not add_uid:\n", - " uid = \"\"\n", - " else:\n", - " uid = context.uid\n", - "\n", - " dest_path = os.path.join(target_path, uid, name)\n", - " os.makedirs(os.path.join(target_path, uid), exist_ok=True)\n", - " if header == 0:\n", - " header = pd.read_csv(archive_url, header=None, nrows=1).iloc[0].values\n", - " header = [x.replace(' ', '_') for x in header]\n", + " dest_path = os.path.join(target_path, name)\n", + " os.makedirs(os.path.join(target_path), exist_ok=True)\n", " if not os.path.isfile(dest_path):\n", " context.logger.info(\"destination file does not exist, downloading\")\n", " pqwriter = None\n", @@ -114,12 +120,11 @@ " else:\n", " context.logger.info(\"destination file already exists\")\n", "\n", - " if log_data:\n", - " context.log_artifact(key, target_path=dest_path)\n", - " # log header\n", - " filepath = os.path.join(target_path, 'header.json')\n", - " json.dump(header, open(filepath, 'w'))\n", - " 
context.log_artifact('header', target_path=filepath)\n" + " context.log_artifact(key, target_path=dest_path)\n", + " # log header\n", + " filepath = os.path.join(target_path, 'header.pkl')\n", + " dump(header, open(filepath, 'wb'))\n", + " context.log_artifact('header', target_path=filepath) " ] }, { @@ -144,23 +149,6 @@ " handler=arc_to_parquet)" ] }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-21 07:11:35,663 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" - ] - } - ], - "source": [ - "fn.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -182,20 +170,28 @@ "cell_type": "code", "execution_count": 9, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-21 15:12:18,489 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + ] + } + ], "source": [ "# export function yaml\n", - "# fn.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" + "fn.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# import function yaml\n", - "fn = mlrun.import_function('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" + "# fn = mlrun.import_function('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" ] }, { @@ -220,7 +216,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ @@ -245,7 +241,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 15, "metadata": { "collapsed": true, "jupyter": { @@ -257,7 +253,7 @@ "name": 
"stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 07:11:35,802 starting remote build, image: .mlrun/func-default-arc-to-parquet-latest\n", + "[mlrun] 2020-01-21 15:12:46,452 starting remote build, image: .mlrun/func-default-arc-to-parquet-latest\n", "\u001b[36mINFO\u001b[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", @@ -267,180 +263,172 @@ "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:0318d80cb241983eda20b905d77fa0bfb06e29e5aabf075c7941ea687f1c125a: no such file or directory \n", "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0001] Unpacking rootfs as cmd RUN python -m pip uninstall mlrun requires it. \n", + "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN python -m pip uninstall mlrun requires it. \n", "\u001b[36mINFO\u001b[0m[0011] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0018] RUN python -m pip uninstall mlrun \n", - "\u001b[36mINFO\u001b[0m[0018] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0018] args: [-c python -m pip uninstall mlrun] \n", + "\u001b[36mINFO\u001b[0m[0017] RUN python -m pip uninstall mlrun \n", + "\u001b[36mINFO\u001b[0m[0017] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0017] args: [-c python -m pip uninstall mlrun] \n", "WARNING: Skipping mlrun as it is not installed.\n", - "\u001b[36mINFO\u001b[0m[0019] Taking snapshot of full filesystem... 
\n", - "\u001b[36mINFO\u001b[0m[0021] RUN python -m pip install -U -q mlrun \n", - "\u001b[36mINFO\u001b[0m[0021] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0021] args: [-c python -m pip install -U -q mlrun] \n", - "ERROR: kfp 0.2.0 has requirement urllib3<1.25,>=1.15, but you'll have urllib3 1.25.7 which is incompatible.\n", - "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", + "\u001b[36mINFO\u001b[0m[0018] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0020] RUN python -m pip install -U -q mlrun \n", + "\u001b[36mINFO\u001b[0m[0020] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0020] args: [-c python -m pip install -U -q mlrun] \n", + "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0067] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0083] RUN python -m pip install -U -q pandas \n", - "\u001b[36mINFO\u001b[0m[0083] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0083] args: [-c python -m pip install -U -q pandas] \n", - "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", + "\u001b[36mINFO\u001b[0m[0068] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0084] RUN python -m pip install -U -q pandas \n", + "\u001b[36mINFO\u001b[0m[0084] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0084] args: [-c python -m pip install -U -q pandas] \n", + "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0084] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0085] Taking snapshot of full filesystem... 
\n", "\u001b[36mINFO\u001b[0m[0088] RUN python -m pip install -U -q pyarrow \n", "\u001b[36mINFO\u001b[0m[0088] cmd: /bin/sh \n", "\u001b[36mINFO\u001b[0m[0088] args: [-c python -m pip install -U -q pyarrow] \n", - "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", + "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", "\u001b[36mINFO\u001b[0m[0092] Taking snapshot of full filesystem... \n", "\u001b[36mINFO\u001b[0m[0100] RUN python -m pip install -U -q numpy==1.17.4 \n", "\u001b[36mINFO\u001b[0m[0100] cmd: /bin/sh \n", "\u001b[36mINFO\u001b[0m[0100] args: [-c python -m pip install -U -q numpy==1.17.4] \n", - "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", + "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0103] Taking snapshot of full filesystem... 
\n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_examples \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_mt19937.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/include/numpy/random/distributions.h \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/__pycache__/test_extending.cpython-36.pyc \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.pxd \n", + "\u001b[36mINFO\u001b[0m[0104] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/test_extending.py \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test_issue14735.cpython-36.pyc \n", "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test_issue14735.py \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy-1.18.1.dist-info \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/__init__.pxd \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.pxd \n", "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for 
/usr/local/lib/python3.6/site-packages/numpy/random/_pcg64.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.pxd \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_mt19937.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/__pycache__/test_extending.cpython-36.pyc \n", "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test__exceptions.cpython-36.pyc \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test__exceptions.py \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/__init__.pxd \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.cpython-36m-x86_64-linux-gnu.so \n", "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.pxd \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_sfc64.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy-1.18.1.dist-info \n", "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_generator.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] 
Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test_issue14735.cpython-36.pyc \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_examples \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.pxd \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_sfc64.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test__exceptions.py \n", + "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/include/numpy/random/distributions.h \n", "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_philox.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/test_extending.py \n", "\u001b[36mINFO\u001b[0m[0108] RUN pip install mlrun \n", "\u001b[36mINFO\u001b[0m[0108] cmd: /bin/sh \n", "\u001b[36mINFO\u001b[0m[0108] args: [-c pip install mlrun] \n", "Requirement already satisfied: mlrun in /usr/local/lib/python3.6/site-packages (0.4.3)\n", - "Requirement already satisfied: nest-asyncio>=1.0.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.2.2)\n", - "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.0)\n", - "Requirement already satisfied: aiohttp>=3.5.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.6.2)\n", - "Requirement already satisfied: kfp>=0.1.29 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.2.0)\n", - "Requirement already 
satisfied: sqlalchemy==1.3.11 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.3.11)\n", "Requirement already satisfied: Flask>=1.1.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.1.1)\n", - "Requirement already satisfied: croniter==0.3.31 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.3.31)\n", - "Requirement already satisfied: gunicorn==19.9.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (19.9.0)\n", "Requirement already satisfied: pandas>=0.23.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.25.3)\n", - "Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (7.0)\n", "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.3)\n", - "Requirement already satisfied: pyyaml>=5.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (5.3)\n", + "Requirement already satisfied: sqlalchemy==1.3.11 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.3.11)\n", + "Requirement already satisfied: croniter==0.3.31 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: boto3>=1.9 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.11.6)\n", "Requirement already satisfied: GitPython>=2.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.0.5)\n", + "Requirement already satisfied: nest-asyncio>=1.0.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.2.2)\n", "Requirement already satisfied: gevent==1.4.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.4.0)\n", + "Requirement already satisfied: pyyaml>=5.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (5.3)\n", + "Requirement already satisfied: kfp>=0.1.29 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.2.0)\n", + "Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: aiohttp>=3.5.0 in 
/usr/local/lib/python3.6/site-packages (from mlrun) (3.6.2)\n", + "Requirement already satisfied: gunicorn==19.9.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (19.9.0)\n", "Requirement already satisfied: nuclio-sdk>=0.0.3 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.0.5)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.0)\n", "Requirement already satisfied: requests>=2.20.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (2.22.0)\n", - "Requirement already satisfied: boto3>=1.9 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.11.6)\n", - "Requirement already satisfied: tornado<6,>=5 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", - "Requirement already satisfied: nbconvert>=5.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", - "Requirement already satisfied: jupyterlab>=0.35.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", - "Requirement already satisfied: notebook>=5.7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.2)\n", - "Requirement already satisfied: ipython>=7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", - "Requirement already satisfied: idna-ssl>=1.0; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.1.0)\n", - "Requirement already satisfied: async-timeout<4.0,>=3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: typing-extensions>=3.6.5; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.7.4.1)\n", - "Requirement already satisfied: attrs>=17.3.0 
in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", - "Requirement already satisfied: chardet<4.0,>=2.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.4)\n", - "Requirement already satisfied: multidict<5.0,>=4.5 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: Jinja2>=2.10.1 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", + "Requirement already satisfied: itsdangerous>=0.24 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", + "Requirement already satisfied: Werkzeug>=0.15 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", + "Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2.8.1)\n", + "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", + "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2019.3)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: botocore<1.15.0,>=1.14.6 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (1.14.6)\n", + "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", + "Requirement already satisfied: gitdb2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /usr/local/lib/python3.6/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: Deprecated in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) 
(1.2.7)\n", "Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.14.0)\n", "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", - "Requirement already satisfied: PyJWT>=1.6.4 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", - "Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.8.1)\n", - "Requirement already satisfied: certifi in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2019.11.28)\n", - "Requirement already satisfied: Deprecated in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", "Requirement already satisfied: requests-toolbelt>=0.8.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", + "Requirement already satisfied: urllib3<1.25,>=1.15 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.24.3)\n", + "Requirement already satisfied: cloudpickle==1.1.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", "Requirement already satisfied: google-cloud-storage>=1.13.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", - "Collecting urllib3<1.25,>=1.15 (from kfp>=0.1.29->mlrun)\n", - " Downloading https://files.pythonhosted.org/packages/01/11/525b02e4acc0c747de8b6ccdab376331597c569c42ea66ab0a1dbd36eca2/urllib3-1.24.3-py2.py3-none-any.whl (118kB)\n", "Requirement already satisfied: cryptography>=2.4.2 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.8)\n", - "Requirement already satisfied: google-auth>=1.6.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", - "Requirement already 
satisfied: jsonschema>=3.0.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", - "Requirement already satisfied: cloudpickle==1.1.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2019.11.28)\n", + "Requirement already satisfied: PyJWT>=1.6.4 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", "Requirement already satisfied: argo-models==2.2.1a in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", - "Requirement already satisfied: itsdangerous>=0.24 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", - "Requirement already satisfied: Werkzeug>=0.15 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", - "Requirement already satisfied: Jinja2>=2.10.1 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", - "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", - "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2019.3)\n", - "Requirement already satisfied: gitdb2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", - "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /usr/local/lib/python3.6/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: jsonschema>=3.0.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", + "Requirement already satisfied: google-auth>=1.6.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", + "Requirement already satisfied: idna-ssl>=1.0; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from 
aiohttp>=3.5.0->mlrun) (1.1.0)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", + "Requirement already satisfied: multidict<5.0,>=4.5 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: chardet<4.0,>=2.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.4)\n", + "Requirement already satisfied: typing-extensions>=3.6.5; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.7.4.1)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", + "Requirement already satisfied: notebook>=5.7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.2)\n", + "Requirement already satisfied: ipython>=7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", + "Requirement already satisfied: nbconvert>=5.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", + "Requirement already satisfied: tornado<6,>=5 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2.8)\n", - "Requirement already satisfied: botocore<1.15.0,>=1.14.6 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (1.14.6)\n", - "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", - 
"Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", - "Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", - "Requirement already satisfied: bleach in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: jupyter-core in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", - "Requirement already satisfied: testpath in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", - "Requirement already satisfied: nbformat>=4.4 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", - "Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", - "Requirement already satisfied: pygments in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", - "Requirement already satisfied: jupyterlab-server~=1.0.0 in /usr/local/lib/python3.6/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", + "Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", + "Requirement 
already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/site-packages (from botocore<1.15.0,>=1.14.6->boto3>=1.9->mlrun) (0.15.2)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.6/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", + "Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already satisfied: setuptools>=21.0.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (41.0.1)\n", + "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", + "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib/python3.6/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.13.2)\n", + "Requirement already satisfied: pyrsistent>=0.14.0 in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", + "Requirement already satisfied: 
cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", + "Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", "Requirement already satisfied: jupyter-client>=5.3.4 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", - "Requirement already satisfied: pyzmq>=17 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", - "Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: nbformat in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", + "Requirement already satisfied: jupyter-core>=4.6.0 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", "Requirement already satisfied: Send2Trash in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", - "Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: traitlets>=4.2.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", "Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", "Requirement already satisfied: prometheus-client in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already 
satisfied: ipython-genutils in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: pyzmq>=17 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", + "Requirement already satisfied: jedi>=0.10 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", "Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: pygments in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", "Requirement already satisfied: decorator in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", - "Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (41.0.1)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", "Requirement already satisfied: backcall in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", - "Requirement already satisfied: jedi>=0.10 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.7.0)\n", - "Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.6/site-packages (from 
kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", - "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", - "Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.6/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", - "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", - "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", - "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib/python3.6/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.13.2)\n", - "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", - "Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", - "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", - "Requirement already satisfied: pyrsistent>=0.14.0 in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", - "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", - "Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", - "Requirement already satisfied: smmap2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", 
- "Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/site-packages (from botocore<1.15.0,>=1.14.6->boto3>=1.9->mlrun) (0.15.2)\n", - "Requirement already satisfied: webencodings in /usr/local/lib/python3.6/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", - "Requirement already satisfied: json5 in /usr/local/lib/python3.6/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", - "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /usr/local/lib/python3.6/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", - "Requirement already satisfied: parso>=0.5.2 in /usr/local/lib/python3.6/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", + "Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: testpath in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: bleach in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: 
jupyterlab-server~=1.0.0 in /usr/local/lib/python3.6/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", "Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", "Requirement already satisfied: pycparser in /usr/local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.19)\n", - "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", - "Requirement already satisfied: protobuf>=3.4.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", + "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /usr/local/lib/python3.6/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: parso>=0.5.2 in /usr/local/lib/python3.6/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", 
+ "Requirement already satisfied: webencodings in /usr/local/lib/python3.6/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", + "Requirement already satisfied: json5 in /usr/local/lib/python3.6/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", + "Requirement already satisfied: protobuf>=3.4.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", "Requirement already satisfied: more-itertools in /usr/local/lib/python3.6/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", - "Installing collected packages: urllib3\n", - " Found existing installation: urllib3 1.25.7\n", - " Uninstalling urllib3-1.25.7:\n", - " Successfully uninstalled urllib3-1.25.7\n", - "Successfully installed urllib3-1.24.3\n", - "WARNING: You are using pip version 19.1.1, however version 19.3.1 is available.\n", + "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0110] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0110] Adding whiteout for /usr/local/lib/python3.6/site-packages/urllib3-1.25.7.dist-info \n" + "\u001b[36mINFO\u001b[0m[0109] Taking snapshot of full filesystem... 
\n" ] }, { @@ -449,7 +437,7 @@ "True" ] }, - "execution_count": 14, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } @@ -460,7 +448,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -476,35 +464,50 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# useful constants\n", - "target_path = '/User/mlrun/functions/parquet'\n", - "archive = 'https://fpsignals-public.s3.amazonaws.com/one_csv.tar.gz'\n", - "parquet_file = 'x_test_50.parquet' # the file extension is not necessary\n", + "target_path = '/User/mlrun/models'\n", + "archive = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz\"\n", + "parquet_file = 'higgs.parquet' # the file extension is not necessary\n", "parquet_file_path = target_path + \"/\" + parquet_file\n", "artifact_key = 'raw_data'" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "higgs_header = ['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ',\n", + " 'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ',\n", + " 'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ',\n", + " 'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ',\n", + " 'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ',\n", + " 'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj',\n", + " 'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']" + ] + }, + { + "cell_type": "code", + "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 07:17:44,068 starting run arc2parq uid=2a211d65872442cf85e745bde5c81392 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-21 07:17:44,153 Job is running in the background, pod: arc2parq-vjhbh\n", - "[mlrun] 2020-01-21 07:17:50,433 destination file does not exist, downloading\n", - 
"[mlrun] 2020-01-21 07:17:50,536 saved table to /User/mlrun/functions/parquet/x_test_50.parquet\n", - "[mlrun] 2020-01-21 07:17:50,549 log artifact raw_data at /User/mlrun/functions/parquet/x_test_50.parquet, size: None, db: Y\n", - "[mlrun] 2020-01-21 07:17:50,561 log artifact header at /User/mlrun/functions/parquet/header.json, size: None, db: Y\n", + "[mlrun] 2020-01-21 15:15:40,261 starting run arc2parq uid=06e28db485184e7093e1222235199a29 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 15:15:40,356 Job is running in the background, pod: arc2parq-zllcw\n", + "[mlrun] 2020-01-21 15:15:54,505 destination file does not exist, downloading\n", + "[mlrun] 2020-01-21 15:20:53,806 saved table to /User/mlrun/models/higgs.parquet\n", + "[mlrun] 2020-01-21 15:20:53,820 log artifact raw_data at /User/mlrun/models/higgs.parquet, size: None, db: Y\n", + "[mlrun] 2020-01-21 15:20:53,833 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-21 07:17:50,571 run executed, status=completed\n", + "[mlrun] 2020-01-21 15:20:53,860 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -677,26 +680,26 @@ " \n", " \n", " \n", - "
...c81392
\n", + "
...199a29
\n", " 0\n", - " Jan 21 07:17:50\n", + " Jan 21 15:15:54\n", " completed\n", " arc-to-parquet\n", - "
host=arc2parq-vjhbh
kind=job
owner=admin
\n", + "
host=arc2parq-zllcw
kind=job
owner=admin
\n", " \n", - "
archive_url=https://fpsignals-public.s3.amazonaws.com/one_csv.tar.gz
key=raw_data
name=x_test_50.parquet
target_path=/User/mlrun/functions/parquet
\n", + "
archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
header=['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ', 'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ', 'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ', 'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ', 'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ', 'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']
key=raw_data
name=higgs.parquet
target_path=/User/mlrun/models
\n", " \n", - "
raw_data
header
\n", + "
raw_data
header
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -712,8 +715,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 2a211d65872442cf85e745bde5c81392 , !mlrun logs 2a211d65872442cf85e745bde5c81392 \n", - "[mlrun] 2020-01-21 07:17:53,318 run executed, status=completed\n" + "!mlrun get run 06e28db485184e7093e1222235199a29 , !mlrun logs 06e28db485184e7093e1222235199a29 \n", + "[mlrun] 2020-01-21 15:21:00,699 run executed, status=completed\n" ] } ], @@ -721,12 +724,13 @@ "# create and run the task\n", "arc_to_parq_task = mlrun.NewTask(\n", " 'arc2parq', \n", - " handler='arc_to_parquet', # a string since we are calling this 'remotely', outside this notebook\n", + " handler='arc_to_parquet', \n", " params={\n", " 'target_path': target_path,\n", " 'name' : parquet_file, \n", " 'key' : artifact_key,\n", - " 'archive_url': archive},\n", + " 'archive_url': archive,\n", + " 'header' : higgs_header},\n", " outputs=[artifact_key])\n", "\n", "# run\n", @@ -749,7 +753,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -760,7 +764,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -770,7 +774,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -780,21 +784,9 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "ename": "AssertionError", - "evalue": "mlrun.functions: original and copied data not equal", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0moriginal\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marchive\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mcopied\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_parquet\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparquet_file_path\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"pyarrow\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray_equal\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moriginal\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopied\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"mlrun.functions: original and copied data not equal\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;31mAssertionError\u001b[0m: mlrun.functions: original and copied data not equal" - ] - } - ], + "outputs": [], "source": [ "original = pd.read_csv(archive).values\n", "copied = pd.read_parquet(parquet_file_path, engine=\"pyarrow\").values\n", @@ -810,7 +802,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ diff --git a/tests/generate-some-classifiers.ipynb b/tests/generate-some-classifiers.ipynb new file mode 100644 index 000000000..4f6ef55f8 --- /dev/null +++ b/tests/generate-some-classifiers.ipynb @@ -0,0 +1,311 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# generate multiple classifier models for testing\n", + "\n", + "Generate classifier models and pickle them under `/User/mlrun/models/-classifier.cpkl`.\n", + "\n", + "In principle, the pickle `load` method should give us a class 
instance that we can predict with. This may not work in practice; the purpose of this notebook is to figure out which models work and which don't. Several pickling packages will also be tested in case there are differences.\n", + "\n", + "### _extensions_\n", + "* cpkl for `cloudpickle`\n", + "* pkl for `pickle`\n", + "* dpkl for `dill`...\n", + "\n", + "**gpc model:** adapted from **[Probabilistic predictions with Gaussian process classification (GPC)](https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpc.html#sphx-glr-auto-examples-gaussian-process-plot-gpc-py)**\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from cloudpickle import dump as cdump, load as cload\n", + "from pickle import dump as pdump, load as pload\n", + "from dill import dump as ddump, load as dload" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "from matplotlib import pyplot as plt\n", + "\n", + "from sklearn.metrics import accuracy_score, log_loss\n", + "from sklearn.datasets import make_classification\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.gaussian_process import GaussianProcessClassifier\n", + "from sklearn.gaussian_process.kernels import RBF" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "n_samples = 1000\n", + "train_size = 0.7\n", + "\n", + "X, y = make_classification(\n", + " n_samples=n_samples,\n", + " n_features=28, \n", + " random_state = 1)\n", + "\n", + "xtrain, xtest, ytrain, ytest = train_test_split(X, y, 
test_size=1-train_size)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ + "kernel = 1.0*RBF(length_scale=1.0)\n", + "\n", + "clf = GaussianProcessClassifier(kernel=kernel, random_state=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "GaussianProcessClassifier(copy_X_train=True, kernel=1**2 * RBF(length_scale=1),\n", + " max_iter_predict=100, multi_class='one_vs_rest',\n", + " n_jobs=None, n_restarts_optimizer=0,\n", + " optimizer='fmin_l_bfgs_b', random_state=1,\n", + " warm_start=False)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "clf.fit(xtrain, ytrain)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ + "accuracy = accuracy_score(ytest, clf.predict(xtest))\n", + "\n", + "logloss = log_loss(ytest, clf.predict_proba(xtest)[:, 1])" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "cdump(clf, open('/User/mlrun/models/gpc-classifier.cpkl', 'wb'))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "clf_loaded = cload(open('/User/mlrun/models/gpc-classifier.cpkl', 'rb'))" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "GaussianProcessClassifier(copy_X_train=True, kernel=1**2 * RBF(length_scale=1),\n", + " max_iter_predict=100, multi_class='one_vs_rest',\n", + " n_jobs=None, n_restarts_optimizer=0,\n", + " optimizer='fmin_l_bfgs_b', random_state=1,\n", + " warm_start=False)" + ] + }, + 
"execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "clf_loaded" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "assert accuracy == accuracy_score(ytest, clf_loaded.predict(xtest))\n", + "assert logloss == log_loss(ytest, clf_loaded.predict_proba(xtest)[:, 1])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.ensemble import AdaBoostClassifier" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "clf = AdaBoostClassifier(n_estimators=100, random_state=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,\n", + " n_estimators=100, random_state=1)" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "clf.fit(xtrain, ytrain)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "cdump(clf, open('/User/mlrun/models/ada-classifier.cpkl', 'wb'))" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "clf_loaded = cload(open('/User/mlrun/models/ada-classifier.cpkl', 'rb'))" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ + "accuracy = accuracy_score(ytest, clf.predict(xtest))\n", + "\n", + "logloss = log_loss(ytest, clf.predict_proba(xtest)[:, 1])" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "assert accuracy == accuracy_score(ytest, clf_loaded.predict(xtest))\n", + "assert logloss == log_loss(ytest, 
clf_loaded.predict_proba(xtest)[:, 1])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 464551f7ad086b193c5d5ae9245f24e8435cca56 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 18:56:12 +0000 Subject: [PATCH 11/32] refactor, incomplete --- serving/classifier_server.ipynb | 86 +++++++++++++++++++++------------ 1 file changed, 54 insertions(+), 32 deletions(-) diff --git a/serving/classifier_server.ipynb b/serving/classifier_server.ipynb index 3622f3e13..7fa3ff9bc 100644 --- a/serving/classifier_server.ipynb +++ b/serving/classifier_server.ipynb @@ -74,8 +74,6 @@ "%%nuclio cmd -c\n", "pip install -U -q kfserving\n", "pip install -U -q azure\n", - "pip install -U -q numpy\n", - "pip install -U -q scikit-learn==0.21.3\n", "pip install -U -q mlrun" ] }, @@ -83,27 +81,44 @@ "cell_type": "code", "execution_count": 4, "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'\n" + ] + } + ], + "source": [ + "%nuclio config spec.build.baseImage = \"python:3.6-jessie\"" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, "outputs": [], "source": [ "import kfserving\n", "import os\n", "import numpy as np\n", - "from cloudpickle import load as cload" + "from pickle import load as pload" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "TARGET_PATH = '/User/mlrun/models'\n", - "MODEL_FILE = 
'ada-classifier.cpkl'" + "MODEL_FILE = 'lgb-classifier.pkl'" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 28, "metadata": {}, "outputs": [], "source": [ @@ -119,7 +134,7 @@ " def load(self):\n", " model_file = os.path.join(\n", " kfserving.Storage.download(self.model_dir), MODEL_FILE)\n", - " self.classifier = cload(open(model_file, 'rb'))\n", + " self.classifier = pload(open(model_file, 'rb'))\n", " self.ready = True\n", "\n", " def predict(self, body):\n", @@ -140,7 +155,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 29, "metadata": {}, "outputs": [], "source": [ @@ -167,19 +182,26 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "[I 200121 11:55:01 storage:35] Copying contents of /User/mlrun/models to local\n" + "[I 200121 17:59:48 storage:35] Copying contents of /User/mlrun/models to local\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/User/mlrun/models/lgb-classifier.pkl\n" ] } ], "source": [ - "my_server = ClassifierModel('some-classifier-model', model_dir='/User/mlrun/models')\n", + "my_server = ClassifierModel('classifier', model_dir='/User/mlrun/models')\n", "my_server.load()" ] }, @@ -193,7 +215,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 31, "metadata": {}, "outputs": [], "source": [ @@ -208,7 +230,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 32, "metadata": {}, "outputs": [], "source": [ @@ -225,16 +247,16 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[0, 0, 1, 1, 1, 0, 0, 1, 1, 1]" + "[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]" ] }, - "execution_count": 11, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } @@ -258,7 +280,7 @@ }, { "cell_type": "code", - 
"execution_count": 12, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -268,16 +290,16 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 13, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -292,7 +314,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -301,17 +323,17 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 11:55:06,216 deploy started\n", - "[nuclio] 2020-01-21 11:57:28,926 (info) Build complete\n", - "[nuclio] 2020-01-21 11:57:35,016 (info) Function deploy complete\n", - "[nuclio] 2020-01-21 11:57:35,022 done updating some-classifier-model, function address: 3.135.246.153:31529\n" + "[mlrun] 2020-01-21 17:56:08,987 deploy started\n", + "[nuclio] 2020-01-21 17:58:21,215 (info) Build complete\n", + "[nuclio] 2020-01-21 17:58:28,339 (info) Function deploy complete\n", + "[nuclio] 2020-01-21 17:58:28,346 done updating some-classifier-model, function address: 3.135.246.153:31127\n" ] } ], @@ -332,7 +354,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -344,16 +366,16 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "b'Exception caught in handler \"No module named \\'sklearn.gaussian_process._gpc\\'\": Traceback (most recent call last):\\n File \"/opt/nuclio/_nuclio_wrapper.py\", line 176, in serve_requests\\n entrypoint_output = self._entrypoint(self._context, event)\\n File \"/opt/nuclio/classifier_server.py\", line 40, in handler\\n return context.mlrun_handler(context, event)\\n File 
\"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 67, in nuclio_serving_handler\\n return route(context, model_name, event)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 132, in post\\n model = self.get_model(name)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 88, in get_model\\n model.load()\\n File \"/opt/nuclio/classifier_server.py\", line 23, in load\\n self.classifier = cload(open(model_file, \\'rb\\'))\\nModuleNotFoundError: No module named \\'sklearn.gaussian_process._gpc\\'\\n'" + "b'Exception caught in handler \"No module named \\'lightgbm\\'\": Traceback (most recent call last):\\n File \"/opt/nuclio/_nuclio_wrapper.py\", line 176, in serve_requests\\n entrypoint_output = self._entrypoint(self._context, event)\\n File \"/opt/nuclio/classifier_server.py\", line 40, in handler\\n return context.mlrun_handler(context, event)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 67, in nuclio_serving_handler\\n return route(context, model_name, event)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 132, in post\\n model = self.get_model(name)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 88, in get_model\\n model.load()\\n File \"/opt/nuclio/classifier_server.py\", line 23, in load\\n self.classifier = pload(open(model_file, \\'rb\\'))\\nModuleNotFoundError: No module named \\'lightgbm\\'\\n'" ] }, - "execution_count": 17, + "execution_count": 26, "metadata": {}, "output_type": "execute_result" } @@ -364,7 +386,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 27, "metadata": {}, "outputs": [ { @@ -374,7 +396,7 @@ "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mJSONDecodeError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in 
\u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mjson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloads\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontent\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mjson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloads\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontent\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/conda/lib/python3.6/json/__init__.py\u001b[0m in \u001b[0;36mloads\u001b[0;34m(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)\u001b[0m\n\u001b[1;32m 352\u001b[0m \u001b[0mparse_int\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mparse_float\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 353\u001b[0m parse_constant is None and object_pairs_hook is None and not kw):\n\u001b[0;32m--> 354\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_default_decoder\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 355\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcls\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0mcls\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mJSONDecoder\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/conda/lib/python3.6/json/decoder.py\u001b[0m in \u001b[0;36mdecode\u001b[0;34m(self, s, _w)\u001b[0m\n\u001b[1;32m 337\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \"\"\"\n\u001b[0;32m--> 339\u001b[0;31m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraw_decode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0midx\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_w\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 340\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_w\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 341\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/conda/lib/python3.6/json/decoder.py\u001b[0m in \u001b[0;36mraw_decode\u001b[0;34m(self, s, idx)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscan_once\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0midx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mStopIteration\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 357\u001b[0;31m \u001b[0;32mraise\u001b[0m 
\u001b[0mJSONDecodeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Expecting value\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 358\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", From 3f7e0c78313c0f8da3f2ae8535b625f06f5c3ee4 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 21 Jan 2020 22:03:41 +0000 Subject: [PATCH 12/32] eod, backup, refactor incomplete --- serving/classifier_server.ipynb | 21 +- serving/lightgbm/train.yaml | 19 ++ tests/arc_to_parquet.ipynb | 381 +++++++++++++++++--------------- tests/open_archive.ipynb | 162 +++++++------- 4 files changed, 320 insertions(+), 263 deletions(-) create mode 100644 serving/lightgbm/train.yaml diff --git a/serving/classifier_server.ipynb b/serving/classifier_server.ipynb index 7fa3ff9bc..c1a6b2abf 100644 --- a/serving/classifier_server.ipynb +++ b/serving/classifier_server.ipynb @@ -118,7 +118,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -155,7 +155,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -182,21 +182,26 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "[I 200121 17:59:48 storage:35] Copying contents of /User/mlrun/models to local\n" + "[I 200121 20:35:56 storage:35] Copying contents of /User/mlrun/models to local\n" ] }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "/User/mlrun/models/lgb-classifier.pkl\n" + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file 
or directory: '/User/mlrun/models/lgb-classifier.pkl'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mmy_server\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mClassifierModel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'classifier'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmodel_dir\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'/User/mlrun/models'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mmy_server\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mload\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 11\u001b[0m model_file = os.path.join(\n\u001b[1;32m 12\u001b[0m kfserving.Storage.download(self.model_dir), MODEL_FILE)\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclassifier\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel_file\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'rb'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mready\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '/User/mlrun/models/lgb-classifier.pkl'" ] } ], diff --git a/serving/lightgbm/train.yaml b/serving/lightgbm/train.yaml new file mode 100644 index 
000000000..b6a2c7fbf --- /dev/null +++ b/serving/lightgbm/train.yaml @@ -0,0 +1,19 @@ +kind: job +metadata: + name: lgbm-job +spec: + description: 'train an LGBMClassifier' + build: + functionSourceCode: # Generated by nuclio.export.NuclioExporter on 2020-01-21 21:41

from io import BytesIO
from os import path, makedirs
import json
from cloudpickle import load, dump
from pathlib import Path
from urllib.request import urlretrieve
from typing import IO, AnyStr, TypeVar, Union, List, Optional

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import (roc_curve, confusion_matrix)
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
from matplotlib.figure import Figure
import seaborn as sns

import pyarrow.parquet as pq
import pyarrow as pa
from pyarrow import Table

from mlrun.artifacts import TableArtifact, PlotArtifact
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

def get_context_table(ctxtable: DataItem) -> pd.DataFrame:
    """Deserialize a parquet table from the artifact store into a DataFrame.
    
    :param ctxtable:  table in the artifact store
    """
    blob = BytesIO(ctxtable.get())
    return pd.read_parquet(blob, engine='pyarrow')

def log_context_table(
    context: MLClientCtx,
    target_path: str,
    key: str,
    table: pd.DataFrame
) -> None:
    """Log a table in the artifact store.
    
    The table is written as a parquet file, and its target
    path is saved in the context.
    
    :param context:      the function context
    :param target_path:  location (folder) of our DataItem
    :param key:          name of the object in the artifact store
    :param table:        the object we wish to store
    """
    filepath = path.join(target_path, key + '.parquet')
    pq.write_table(pa.Table.from_pandas(table), filepath)    
    context.log_artifact(key, target_path=filepath)

def log_lgbm_model(
    context: MLClientCtx,
    model,
    data,
    header: List = [],
    target_path: str = '',
    name: str = '',  # with file extension
    key: str = 'model',
    exp_labels: dict = {}
):
    """Log a classifier model to the artifact store.
    
    :param context:       function context
    :param model:         estimated model
    :param data:          test data, a dict with keys 'xtest' and 'ytest'
    :param header:        feature labels
    :param target_path:   destination folder for file artifacts
    :param name:          name of model file (or, prefix to model files)
    :param key:           key of model in artifact store
    :param exp_labels:    model artifact labels
    
    Save an estimated model along with its metadata, its training-validation
    metrics history and plots: ROC curve, confusion matrix and feature
    importances.
    """
    def _gcf_clear(plt):
        plt.cla()
        plt.clf()
        plt.close()        
    
    def plot_validation(train_metric, valid_metric):
        """Plot train and validation loss curves from a metrics table in an
        artifact store.

        These curves represent the training round losses from the training
        and validation sets.
        :param train_metric:    train metric
        :param valid_metric:    validation metric
        """
        plt.plot(train_metric)
        plt.plot(valid_metric)
        plt.title("training validation results")
        plt.xlabel("epoch")
        plt.ylabel("")
        plt.legend(["train", "valid"])
        fig = plt.gcf()

        plotpath = path.join(target_path, "history.png")
        plt.savefig(plotpath)
        context.log_artifact(PlotArtifact('training-validation-plot', body=fig, target_path=plotpath))

        _gcf_clear(plt)

    def plot_roc(y_labels, y_probs):
        """Plot an ROC curve from test data saved in an artifact store.

        :param y_labels:        test data labels
        :param y_probs:         predicted positive-class probabilities for the test data
        """
        fpr_xg, tpr_xg, _ = roc_curve(y_labels, y_probs)
        plt.plot([0, 1], [0, 1], "k--")
        plt.plot(fpr_xg, tpr_xg, label="roc")
        plt.xlabel("false positive rate")
        plt.ylabel("true positive rate")
        plt.title("roc curve")
        plt.legend(loc="best")
        fig = plt.gcf()

        plotpath = path.join(target_path, "roc.png")
        # `fmt` was never defined; the format is inferred from the ".png" extension
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('roc', body=fig))

        _gcf_clear(plt)

    def plot_confusion_matrix(labels, predictions):
        """Create a confusion matrix.

        Plot and save a confusion matrix using test data from a
        pipeline step.  The plot is generated using default arguments.
        The present example could be extended by including a parameters `dict`
        that is passed through to sklearn's `confusion_matrix`,
        `ConfusionMatrixDisplay`, and matplotlib `plot`.

        :param labels:          test data labels
        :param predictions:     test data predictions
        """
        # `axislabels` was never defined; let sklearn infer the class labels
        cm = confusion_matrix(labels,
                              predictions,
                              sample_weight=None,
                              normalize='all')
        sns.heatmap(cm, annot=True, cmap="Blues")
        plotpath = path.join(target_path, "confusion.png")
        fig = plt.gcf()
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('confusion_matrix', body=fig))

        _gcf_clear(plt)

    def plot_importance(model, header: List = []):
        """Display estimated feature importances.

        :param model:       fitted lightgbm model
        :param header:      list of feature names
        """
        zipped = zip(model.feature_importances_, header)

        feature_imp = pd.DataFrame(sorted(zipped), columns=['freq','feature']
                                  ).sort_values(by="freq", ascending=False)

        plt.figure(figsize=(20, 10))
        sns.barplot(x="freq", y="feature", data=feature_imp)
        plt.title('LightGBM Features')
        plt.tight_layout()
        fig = plt.gcf()
        plotpath = path.join(target_path, "feature-importances.png")
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('feature-importances-plot', body=fig))

        tablepath = path.join(target_path, "feature-importances-table.csv")
        feature_imp.to_csv(tablepath)
        context.log_artifact(TableArtifact('feature-importances-table', target_path=tablepath))

        _gcf_clear(plt)

    # fall back to hard predictions when the model has no predict_proba
    if callable(getattr(model, 'predict_proba', None)):
        ypred_probs = model.predict_proba(data['xtest'])[:, 1]
        ypred = np.where(ypred_probs >= 0.5, 1, 0)
    else:
        ypred = model.predict(data['xtest'])
        ypred_probs = None

    context.log_result("test_accuracy", float(model.score(data['xtest'], data['ytest'])))

    loss = np.asarray(model.evals_result_['train']['binary_logloss'], dtype=float)
    val_loss = np.asarray(model.evals_result_['valid']['binary_logloss'], dtype=float)

    plot_validation(loss, val_loss)
    # arrays have no unambiguous truth value, so compare against None explicitly
    if ypred_probs is not None:
        plot_roc(data['ytest'], ypred_probs)
    if ypred is not None:
        plot_confusion_matrix(data['ytest'], ypred)
    if hasattr(model, 'feature_importances_'):
        plot_importance(model, header)
   
    filepath = path.join(target_path, name)
    # close the file handle once the model has been written
    with open(filepath, 'wb') as model_file:
        dump(model, model_file)
    context.log_artifact(key,
                         target_path=filepath,
                         labels=exp_labels)    

def train(
    context: MLClientCtx,
    src_file: str,
    header: DataItem,
    test_size: float = 0.1,
    train_val_split: float = 0.75,
    sample: int = -1,
    target_path: str = '',
    name: str = '',
    key: str = '',
    exp_labels = {},  # 'lightgbm_sklearn' if this were a pipeline
    verbose: bool = False,
    random_state = np.random.RandomState(1),
    **sklearn_params
) -> None:
    """Train and save a LightGBM model.
    
    :param context:         the function context
    :param src_file:        name of the raw data file (parquet), relative to target_path
    :param header:          header artifact (pickled list of feature names)
    :param test_size:       (0.1) test set size
    :param train_val_split: (0.75) once the test set has been removed, the
                            training set gets this proportion of the remaining rows
    :param sample:          (-1) use all rows; a positive n selects the first n rows;
                            a negative n < -1 selects a random sample of |n| rows
                            from the entire file
    :param target_path:     folder location of files
    :param name:            destination name for model file
    :param key:             key for model artifact
    :param exp_labels:      metadata dict, some keys are required (type, framework). 'type'
                            is either classifier or regressor, 'framework' can be sklearn or not
                            (sklearn models have a generic interface)
    :param verbose:         (False) show metrics for training/validation steps
    :param random_state:    (RandomState(1)) sklearn rng seed
    :param sklearn_params:  sklearn keyword params
    """
    srcfilepath = path.join(target_path, src_file)
    raw = pq.read_table(srcfilepath).to_pandas()
    if sample >= 1:
        # keep the first `sample` rows
        raw = raw.iloc[:sample, :]
    elif sample < -1:
        # draw a random sample of |sample| rows from the entire file
        raw = raw.sample(-sample)
    # otherwise (sample == -1) keep every row; `iloc[:-1]` would drop the last one
    labels = raw.pop('labels')

    x, xtest, y, ytest = train_test_split(raw, labels, train_size=1-test_size, 
                                          random_state=random_state)
   
    xtrain, xvalid, ytrain, yvalid = train_test_split(x, y, 
                                                      train_size=train_val_split, 
                                                      random_state=random_state)        
    
    clf = lgb.LGBMClassifier(random_state=random_state,
                             verbose=int(verbose))

    eval_results = dict()

    clf.fit(xtrain, 
            ytrain,
            eval_set=[(xvalid, yvalid), (xtrain, ytrain)],
            eval_names=['valid', 'train'],
            callbacks=[lgb.record_evaluation(eval_results)],
            verbose=verbose)
    
    context.log_result("train_accuracy", float(clf.score(xtrain, ytrain)))
    
    log_lgbm_model(
        context, 
        clf, 
        data = {'xtest':xtest, 'ytest':ytest},
        target_path=target_path,
        header=load(open(str(header), 'rb')),
        name=name, 
        key=key,
        exp_labels=exp_labels)

 + base_image: python:3.6-jessie + commands: + - rm /conda/lib/python3.6/site-packages/seaborn* -rf + - pip uninstall -y mlrun + - pip install -U -q mlrun + - pip install -U -q kfp + - pip install -U -q pyarrow + - pip install -U -q pandas + - pip install -U -q matplotlib + - pip install -U -q seaborn + - pip install -U -q scikit-learn + - pip install -U -q lightgbm diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index b9bbaaca4..8272c98ad 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -37,13 +37,12 @@ "python -m pip uninstall mlrun\n", "python -m pip install -U -q mlrun\n", "python -m pip install -U -q pandas\n", - "python -m pip install -U -q pyarrow\n", - "python -m pip install -U -q numpy==1.17.4" + "python -m pip install -U -q pyarrow" ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -60,7 +59,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -129,7 +128,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -138,7 +137,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -181,7 +180,7 @@ ], "source": [ "# export function yaml\n", - "fn.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" + "f# n.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" ] }, { @@ -205,18 +204,27 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 10, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/User/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. 
See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", + " InsecureRequestWarning)\n" + ] + } + ], "source": [ "# load function from Github\n", - "# fn = mlrun.import_function(\n", - "# 'https://raw.githubusercontent.com/mlrun/functions/master/fileutils/arc_to_parquet/arc_to_parquet.yaml')" + "fn = mlrun.import_function(\n", + " 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils/arc_to_parquet/arc_to_parquet.yaml')" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -241,7 +249,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 16, "metadata": { "collapsed": true, "jupyter": { @@ -253,7 +261,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 15:12:46,452 starting remote build, image: .mlrun/func-default-arc-to-parquet-latest\n", + "[mlrun] 2020-01-21 19:27:06,066 starting remote build, image: .mlrun/func-default-arc-to-parquet-latest\n", "\u001b[36mINFO\u001b[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", @@ -263,172 +271,173 @@ "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:0318d80cb241983eda20b905d77fa0bfb06e29e5aabf075c7941ea687f1c125a: no such file or directory \n", "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN python -m pip uninstall mlrun requires it. \n", + "\u001b[36mINFO\u001b[0m[0001] Unpacking rootfs as cmd RUN python -m pip uninstall mlrun requires it. \n", "\u001b[36mINFO\u001b[0m[0011] Taking snapshot of full filesystem... 
\n", - "\u001b[36mINFO\u001b[0m[0017] RUN python -m pip uninstall mlrun \n", - "\u001b[36mINFO\u001b[0m[0017] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0017] args: [-c python -m pip uninstall mlrun] \n", + "\u001b[36mINFO\u001b[0m[0036] RUN python -m pip uninstall mlrun \n", + "\u001b[36mINFO\u001b[0m[0036] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0036] args: [-c python -m pip uninstall mlrun] \n", "WARNING: Skipping mlrun as it is not installed.\n", - "\u001b[36mINFO\u001b[0m[0018] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0020] RUN python -m pip install -U -q mlrun \n", - "\u001b[36mINFO\u001b[0m[0020] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0020] args: [-c python -m pip install -U -q mlrun] \n", + "\u001b[36mINFO\u001b[0m[0037] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0039] RUN python -m pip install -U -q mlrun \n", + "\u001b[36mINFO\u001b[0m[0039] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0039] args: [-c python -m pip install -U -q mlrun] \n", + "ERROR: kfp 0.2.0 has requirement urllib3<1.25,>=1.15, but you'll have urllib3 1.25.7 which is incompatible.\n", "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0068] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0084] RUN python -m pip install -U -q pandas \n", - "\u001b[36mINFO\u001b[0m[0084] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0084] args: [-c python -m pip install -U -q pandas] \n", + "\u001b[36mINFO\u001b[0m[0087] Taking snapshot of full filesystem... 
\n", + "\u001b[36mINFO\u001b[0m[0112] RUN python -m pip install -U -q pandas \n", + "\u001b[36mINFO\u001b[0m[0112] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0112] args: [-c python -m pip install -U -q pandas] \n", "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0085] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0088] RUN python -m pip install -U -q pyarrow \n", - "\u001b[36mINFO\u001b[0m[0088] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0088] args: [-c python -m pip install -U -q pyarrow] \n", + "\u001b[36mINFO\u001b[0m[0113] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0122] RUN python -m pip install -U -q pyarrow \n", + "\u001b[36mINFO\u001b[0m[0122] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0122] args: [-c python -m pip install -U -q pyarrow] \n", "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0092] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0100] RUN python -m pip install -U -q numpy==1.17.4 \n", - "\u001b[36mINFO\u001b[0m[0100] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0100] args: [-c python -m pip install -U -q numpy==1.17.4] \n", + "\u001b[36mINFO\u001b[0m[0125] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0136] RUN python -m pip install -U -q numpy==1.17.4 \n", + "\u001b[36mINFO\u001b[0m[0136] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0136] args: [-c python -m pip install -U -q numpy==1.17.4] \n", "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0104] Taking snapshot of full filesystem... 
\n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/test_extending.py \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test_issue14735.cpython-36.pyc \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test_issue14735.py \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_pcg64.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.pxd \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_mt19937.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/__pycache__/test_extending.cpython-36.pyc \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test__exceptions.cpython-36.pyc \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/__init__.pxd \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.pxd \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy-1.18.1.dist-info \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for 
/usr/local/lib/python3.6/site-packages/numpy/random/_generator.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_examples \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.pxd \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_sfc64.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test__exceptions.py \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/include/numpy/random/distributions.h \n", - "\u001b[36mINFO\u001b[0m[0104] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_philox.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0108] RUN pip install mlrun \n", - "\u001b[36mINFO\u001b[0m[0108] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0108] args: [-c pip install mlrun] \n", + "\u001b[36mINFO\u001b[0m[0139] Taking snapshot of full filesystem... 
\n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_mt19937.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_generator.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_sfc64.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/__init__.pxd \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/__pycache__/test_extending.cpython-36.pyc \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_pcg64.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_examples \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/test_extending.py \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_philox.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.pxd \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test__exceptions.cpython-36.pyc \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test__exceptions.py \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test_issue14735.py \n", 
+ "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/include/numpy/random/distributions.h \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.pxd \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test_issue14735.cpython-36.pyc \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy-1.18.1.dist-info \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.pxd \n", + "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.cpython-36m-x86_64-linux-gnu.so \n", + "\u001b[36mINFO\u001b[0m[0149] RUN pip install mlrun \n", + "\u001b[36mINFO\u001b[0m[0149] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0149] args: [-c pip install mlrun] \n", "Requirement already satisfied: mlrun in /usr/local/lib/python3.6/site-packages (0.4.3)\n", - "Requirement already satisfied: Flask>=1.1.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.1.1)\n", - "Requirement already satisfied: pandas>=0.23.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.25.3)\n", - "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.3)\n", + "Requirement already satisfied: requests>=2.20.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (2.22.0)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.0)\n", "Requirement already satisfied: sqlalchemy==1.3.11 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.3.11)\n", - "Requirement already 
satisfied: croniter==0.3.31 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.3.31)\n", - "Requirement already satisfied: boto3>=1.9 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.11.6)\n", - "Requirement already satisfied: GitPython>=2.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.0.5)\n", "Requirement already satisfied: nest-asyncio>=1.0.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.2.2)\n", + "Requirement already satisfied: nuclio-sdk>=0.0.3 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.0.5)\n", + "Requirement already satisfied: Flask>=1.1.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.1.1)\n", "Requirement already satisfied: gevent==1.4.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.4.0)\n", - "Requirement already satisfied: pyyaml>=5.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (5.3)\n", - "Requirement already satisfied: kfp>=0.1.29 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.2.0)\n", - "Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: pandas>=0.23.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.25.3)\n", "Requirement already satisfied: aiohttp>=3.5.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.6.2)\n", "Requirement already satisfied: gunicorn==19.9.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (19.9.0)\n", - "Requirement already satisfied: nuclio-sdk>=0.0.3 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.0.5)\n", - "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.0)\n", - "Requirement already satisfied: requests>=2.20.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (2.22.0)\n", + "Requirement already satisfied: boto3>=1.9 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.11.6)\n", + "Requirement already satisfied: click>=7.0 in 
/usr/local/lib/python3.6/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: kfp>=0.1.29 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.2.0)\n", + "Requirement already satisfied: croniter==0.3.31 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: GitPython>=2.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.0.5)\n", + "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.3)\n", + "Requirement already satisfied: pyyaml>=5.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (5.3)\n", + "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2.8)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2019.11.28)\n", + "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (1.25.7)\n", + "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", + "Requirement already satisfied: ipython>=7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", + "Requirement already satisfied: notebook>=5.7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", + "Requirement already satisfied: nbconvert>=5.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: tornado<6,>=5 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", "Requirement already satisfied: Jinja2>=2.10.1 in 
/usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", "Requirement already satisfied: itsdangerous>=0.24 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", "Requirement already satisfied: Werkzeug>=0.15 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", - "Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2.8.1)\n", - "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /usr/local/lib/python3.6/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2019.3)\n", + "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", + "Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2.8.1)\n", + "Requirement already satisfied: typing-extensions>=3.6.5; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.7.4.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", + "Requirement already satisfied: idna-ssl>=1.0; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.1.0)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: multidict<5.0,>=4.5 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in 
/usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", - "Requirement already satisfied: botocore<1.15.0,>=1.14.6 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (1.14.6)\n", "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", - "Requirement already satisfied: gitdb2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", - "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /usr/local/lib/python3.6/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", - "Requirement already satisfied: Deprecated in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: botocore<1.15.0,>=1.14.6 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (1.14.6)\n", "Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.14.0)\n", - "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", "Requirement already satisfied: requests-toolbelt>=0.8.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", - "Requirement already satisfied: urllib3<1.25,>=1.15 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.24.3)\n", + "Requirement already satisfied: argo-models==2.2.1a in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", + "Requirement already satisfied: Deprecated in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) 
(0.1.40)\n", "Requirement already satisfied: cloudpickle==1.1.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", "Requirement already satisfied: google-cloud-storage>=1.13.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", - "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", + "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", + "Requirement already satisfied: google-auth>=1.6.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", "Requirement already satisfied: cryptography>=2.4.2 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.8)\n", - "Requirement already satisfied: certifi in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2019.11.28)\n", - "Requirement already satisfied: PyJWT>=1.6.4 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", - "Requirement already satisfied: argo-models==2.2.1a in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", "Requirement already satisfied: jsonschema>=3.0.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", - "Requirement already satisfied: google-auth>=1.6.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", - "Requirement already satisfied: idna-ssl>=1.0; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.1.0)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", - "Requirement already satisfied: multidict<5.0,>=4.5 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in 
/usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: chardet<4.0,>=2.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.4)\n", - "Requirement already satisfied: typing-extensions>=3.6.5; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.7.4.1)\n", - "Requirement already satisfied: async-timeout<4.0,>=3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", - "Requirement already satisfied: notebook>=5.7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.2)\n", - "Requirement already satisfied: ipython>=7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", - "Requirement already satisfied: nbconvert>=5.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", - "Requirement already satisfied: jupyterlab>=0.35.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", - "Requirement already satisfied: tornado<6,>=5 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", - "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2.8)\n", + "Requirement already satisfied: PyJWT>=1.6.4 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", + "Requirement already satisfied: gitdb2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", + "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", + "Requirement already 
satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (41.0.1)\n", + "Requirement already satisfied: backcall in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", + "Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: jedi>=0.10 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", + "Requirement already satisfied: decorator in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", + "Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", + "Requirement already satisfied: pygments in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", + "Requirement already satisfied: jupyterlab-server~=1.0.0 in /usr/local/lib/python3.6/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", + "Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: prometheus-client in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already satisfied: pyzmq>=17 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", + "Requirement already satisfied: nbformat in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", + "Requirement already 
satisfied: jupyter-core>=4.6.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", + "Requirement already satisfied: jupyter-client>=5.3.4 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", + "Requirement already satisfied: Send2Trash in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", + "Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", + "Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: bleach in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", + "Requirement already satisfied: testpath in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", "Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", "Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/site-packages (from botocore<1.15.0,>=1.14.6->boto3>=1.9->mlrun) (0.15.2)\n", - "Requirement already satisfied: smmap2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", "Requirement 
already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.6/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", - "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", - "Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", - "Requirement already satisfied: setuptools>=21.0.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (41.0.1)\n", "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", + "Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", + "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib/python3.6/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.13.2)\n", "Requirement already satisfied: 
pyrsistent>=0.14.0 in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", - "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", - "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", - "Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", - "Requirement already satisfied: jupyter-client>=5.3.4 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", - "Requirement already satisfied: nbformat in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", - "Requirement already satisfied: jupyter-core>=4.6.0 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", - "Requirement already satisfied: Send2Trash in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", - "Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", - "Requirement already satisfied: traitlets>=4.2.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", - "Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", - "Requirement already satisfied: prometheus-client in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) 
(0.7.1)\n", - "Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", - "Requirement already satisfied: pyzmq>=17 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", - "Requirement already satisfied: jedi>=0.10 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", - "Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", - "Requirement already satisfied: pygments in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", - "Requirement already satisfied: decorator in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", - "Requirement already satisfied: backcall in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", - "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.7.0)\n", - "Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", - "Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", - "Requirement already satisfied: testpath in /usr/local/lib/python3.6/site-packages (from 
nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: bleach in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", - "Requirement already satisfied: jupyterlab-server~=1.0.0 in /usr/local/lib/python3.6/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", - "Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", + "Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.6/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: parso>=0.5.2 in /usr/local/lib/python3.6/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: json5 in /usr/local/lib/python3.6/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", + "Requirement already satisfied: webencodings in /usr/local/lib/python3.6/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", + "Requirement already satisfied: 
oauthlib>=3.0.0 in /usr/local/lib/python3.6/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", + "Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/lib/python3.6/site-packages (from rsa<4.1,>=3.1.4->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", "Requirement already satisfied: pycparser in /usr/local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.19)\n", "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", - "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", - "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /usr/local/lib/python3.6/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: parso>=0.5.2 in /usr/local/lib/python3.6/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", - "Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", - "Requirement already satisfied: webencodings in /usr/local/lib/python3.6/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", - "Requirement already satisfied: json5 in /usr/local/lib/python3.6/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", - "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", "Requirement already 
satisfied: protobuf>=3.4.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", "Requirement already satisfied: more-itertools in /usr/local/lib/python3.6/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0109] Taking snapshot of full filesystem... \n" + "\u001b[36mINFO\u001b[0m[0150] Taking snapshot of full filesystem... \n" ] }, { @@ -437,7 +446,7 @@ "True" ] }, - "execution_count": 15, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -448,9 +457,20 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# fn.with_code()" ] @@ -464,50 +484,46 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# useful constants\n", "target_path = '/User/mlrun/models'\n", - "archive = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz\"\n", + "archive = \"https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz\"\n", "parquet_file = 'higgs.parquet' # the file extension is not necessary\n", "parquet_file_path = target_path + \"/\" + parquet_file\n", - "artifact_key = 'raw_data'" + "artifact_key = 
'higgs_small'" ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ - "higgs_header = ['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ',\n", - " 'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ',\n", - " 'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ',\n", - " 'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ',\n", - " 'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ',\n", - " 'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj',\n", - " 'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']" + "HIGGS_HEADER = ['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi',\n", + " 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt',\n", + " 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv',\n", + " 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']" ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 15:15:40,261 starting run arc2parq uid=06e28db485184e7093e1222235199a29 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-21 15:15:40,356 Job is running in the background, pod: arc2parq-zllcw\n", - "[mlrun] 2020-01-21 15:15:54,505 destination file does not exist, downloading\n", - "[mlrun] 2020-01-21 15:20:53,806 saved table to /User/mlrun/models/higgs.parquet\n", - "[mlrun] 2020-01-21 15:20:53,820 log artifact raw_data at /User/mlrun/models/higgs.parquet, size: None, db: Y\n", - "[mlrun] 2020-01-21 15:20:53,833 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-21 20:23:50,660 starting run arc2parq uid=e20e88ae28a545da90e7ded360b78d6d -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 20:23:50,882 Job is running in the background, pod: 
arc2parq-c65q8\n", + "[mlrun] 2020-01-21 20:24:05,984 destination file already exists\n", + "[mlrun] 2020-01-21 20:24:06,002 log artifact higgs_small at /User/mlrun/models/higgs.parquet, size: None, db: Y\n", + "[mlrun] 2020-01-21 20:24:06,017 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-21 15:20:53,860 run executed, status=completed\n", + "[mlrun] 2020-01-21 20:24:06,029 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -680,26 +696,26 @@ " \n", " \n", " \n", - "
...199a29
\n", + "
...b78d6d
\n", " 0\n", - " Jan 21 15:15:54\n", + " Jan 21 20:24:05\n", " completed\n", " arc-to-parquet\n", - "
host=arc2parq-zllcw
kind=job
owner=admin
\n", + "
host=arc2parq-c65q8
kind=job
owner=admin
\n", " \n", - "
archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
header=['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ', 'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ', 'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ', 'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ', 'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ', 'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']
key=raw_data
name=higgs.parquet
target_path=/User/mlrun/models
\n", + "
archive_url=https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz
header=['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ', 'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ', 'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ', 'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ', 'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ', 'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']
key=higgs_small
name=higgs.parquet
target_path=/User/mlrun/models
\n", " \n", - "
raw_data
header
\n", + "
higgs_small
header
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -715,8 +731,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 06e28db485184e7093e1222235199a29 , !mlrun logs 06e28db485184e7093e1222235199a29 \n", - "[mlrun] 2020-01-21 15:21:00,699 run executed, status=completed\n" + "!mlrun get run e20e88ae28a545da90e7ded360b78d6d , !mlrun logs e20e88ae28a545da90e7ded360b78d6d \n", + "[mlrun] 2020-01-21 20:24:10,173 run executed, status=completed\n" ] } ], @@ -764,7 +780,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -774,7 +790,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ @@ -784,9 +800,26 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "TypeError", + "evalue": "unhashable type: 'dict'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0moriginal\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marchive\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mcopied\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_parquet\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparquet_file_path\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"pyarrow\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray_equal\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moriginal\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopied\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"mlrun.functions: original and copied data not equal\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/pandas/io/parquet.py\u001b[0m in \u001b[0;36mread_parquet\u001b[0;34m(path, engine, columns, **kwargs)\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/pandas/io/parquet.py\u001b[0m in \u001b[0;36mread\u001b[0;34m(self, path, columns, **kwargs)\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/array.pxi\u001b[0m in \u001b[0;36mpyarrow.lib._PandasConvertible.to_pandas\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/table.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.Table._to_pandas\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/pyarrow/pandas_compat.py\u001b[0m in \u001b[0;36mtable_to_blockmanager\u001b[0;34m(options, table, categories, ignore_metadata)\u001b[0m\n", + "\u001b[0;31mTypeError\u001b[0m: unhashable type: 'dict'" + ] + } + ], "source": [ "original = pd.read_csv(archive).values\n", "copied = pd.read_parquet(parquet_file_path, engine=\"pyarrow\").values\n", diff --git a/tests/open_archive.ipynb b/tests/open_archive.ipynb index 650c551d9..42ff00d0e 100644 --- a/tests/open_archive.ipynb +++ b/tests/open_archive.ipynb @@ -12,7 +12,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -29,7 +29,7 @@ }, { "cell_type": 
"code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -39,7 +39,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -49,7 +49,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -91,7 +91,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -100,7 +100,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -134,7 +134,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -153,7 +153,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -171,7 +171,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -179,7 +179,7 @@ "images_path = '/User/mlrun/functions/images'\n", "\n", "open_archive_task = mlrun.NewTask(\n", - " 'download',\n", + " 'download-zip',\n", " handler='open_archive', \n", " params={'target_dir' : images_path,\n", " 'key' : 'contents'},\n", @@ -188,23 +188,23 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 09:47:47,427 starting run download uid=299b648c59294e9891334ded6159d8aa -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-21 09:47:47,497 Job is running in the background, pod: download-92bzm\n", - "[mlrun] 2020-01-21 09:47:51,918 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp\n", - "[mlrun] 2020-01-21 09:47:52,812 Verified directories\n", - "/tmp/tmpxqpdi5zq.zip\n", - "['/tmp/tmpxqpdi5zq', 'zip']\n", - "[mlrun] 2020-01-21 09:47:52,812 opening zip\n", - 
"[mlrun] 2020-01-21 09:47:59,625 log artifact content at /User/mlrun/functions/images, size: None, db: Y\n", + "[mlrun] 2020-01-21 19:19:43,612 starting run download uid=31c5db9ef8174d40ac94c6dad0258069 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 19:19:43,808 Job is running in the background, pod: download-tcrfc\n", + "[mlrun] 2020-01-21 19:20:04,079 downloading http://iguazio-sample-data.s3.amazonaws.com/catsndogs.zip to local tmp\n", + "[mlrun] 2020-01-21 19:20:05,501 Verified directories\n", + "/tmp/tmp4_eoapfc.zip\n", + "['/tmp/tmp4_eoapfc', 'zip']\n", + "[mlrun] 2020-01-21 19:20:05,501 opening zip\n", + "[mlrun] 2020-01-21 19:20:13,406 log artifact content at /User/mlrun/functions/images, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-21 09:47:59,635 run executed, status=completed\n", + "[mlrun] 2020-01-21 19:20:13,416 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -377,12 +377,12 @@ " \n", " \n", " \n", - "
...59d8aa
\n", + "
...258069
\n", " 0\n", - " Jan 21 09:47:51\n", + " Jan 21 19:20:04\n", " completed\n", " open-archive\n", - "
host=download-92bzm
kind=job
owner=admin
\n", + "
host=download-tcrfc
kind=job
owner=admin
\n", "
archive_url
\n", "
key=contents
target_dir=/User/mlrun/functions/images
\n", " \n", @@ -391,12 +391,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -412,8 +412,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 299b648c59294e9891334ded6159d8aa , !mlrun logs 299b648c59294e9891334ded6159d8aa \n", - "[mlrun] 2020-01-21 09:48:02,711 run executed, status=completed\n" + "!mlrun get run 31c5db9ef8174d40ac94c6dad0258069 , !mlrun logs 31c5db9ef8174d40ac94c6dad0258069 \n", + "[mlrun] 2020-01-21 19:20:16,127 run executed, status=completed\n" ] } ], @@ -431,15 +431,15 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# create and run the task\n", - "images_path = '/User/mlrun/functions/t000'\n", + "images_path = '/User/mlrun/functions/images-from-tar'\n", "\n", "open_archive_task = mlrun.NewTask(\n", - " 'download',\n", + " 'download-tar',\n", " handler='open_archive', \n", " params={'target_dir' : images_path,\n", " 'key' : 'contents',\n", @@ -448,22 +448,22 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 09:48:45,223 starting run download uid=e5df4261e94847c999df30bbe88fe6c8 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-21 09:48:45,298 Job is running in the background, pod: download-sr2pp\n", - "[mlrun] 2020-01-21 09:48:49,674 Verified directories\n", + "[mlrun] 2020-01-21 19:22:37,587 starting run download-tar uid=500c634fd1c546c5a58292d37f50320f -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 19:22:37,659 Job is running in the background, pod: download-tar-zh72r\n", + "[mlrun] 2020-01-21 19:22:42,412 Verified directories\n", "https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz\n", "['https://fpsignals-public', 's3', 'amazonaws', 'com/catsndogs', 'tar', 'gz']\n", - "[mlrun] 2020-01-21 09:48:49,674 opening tar_gz\n", - "[mlrun] 2020-01-21 09:49:03,258 log artifact content at /User/mlrun/functions/t000, size: None, db: Y\n", + 
"[mlrun] 2020-01-21 19:22:42,412 opening tar_gz\n", + "[mlrun] 2020-01-21 19:22:57,936 log artifact content at /User/mlrun/functions/images-from-tar, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-21 09:49:03,273 run executed, status=completed\n", + "[mlrun] 2020-01-21 19:22:57,948 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -636,26 +636,26 @@ " \n", " \n", " \n", - "
...8fe6c8
\n", + "
...50320f
\n", " 0\n", - " Jan 21 09:48:49\n", + " Jan 21 19:22:42\n", " completed\n", " open-archive\n", - "
host=download-sr2pp
kind=job
owner=admin
\n", + "
host=download-tar-zh72r
kind=job
owner=admin
\n", " \n", - "
archive_url=https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz
key=contents
target_dir=/User/mlrun/functions/t000
\n", + "
archive_url=https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz
key=contents
target_dir=/User/mlrun/functions/images-from-tar
\n", " \n", - "
content
\n", + "
content
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -671,8 +671,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run e5df4261e94847c999df30bbe88fe6c8 , !mlrun logs e5df4261e94847c999df30bbe88fe6c8 \n", - "[mlrun] 2020-01-21 09:49:04,471 run executed, status=completed\n" + "!mlrun get run 500c634fd1c546c5a58292d37f50320f , !mlrun logs 500c634fd1c546c5a58292d37f50320f \n", + "[mlrun] 2020-01-21 19:23:06,873 run executed, status=completed\n" ] } ], @@ -683,12 +683,12 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# create and run the task\n", - "images_path = '/User/mlrun/functions/t0000'\n", + "images_path = '/User/mlrun/functions/images-from-tar-as-inputs'\n", "\n", "open_archive_task = mlrun.NewTask(\n", " 'download',\n", @@ -700,21 +700,21 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 09:50:08,023 starting run download uid=74f0391c2da04aabb3f0735bfa977b17 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-21 09:50:08,112 Job is running in the background, pod: download-8wt5x\n", - "[mlrun] 2020-01-21 09:50:14,529 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", - "[mlrun] 2020-01-21 09:50:15,651 Verified directories\n", - "/tmp/tmp14moqiew.gz\n", - "['/tmp/tmp14moqiew', 'gz']\n", - "[mlrun] 2020-01-21 09:50:15,651 opening tar_gz\n", - "[mlrun] 2020-01-21 09:50:15,653 Traceback (most recent call last):\n", + "[mlrun] 2020-01-21 19:23:39,448 starting run download uid=c163869b83cd49cc888f5e9126301911 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-21 19:23:39,535 Job is running in the background, pod: download-7qf2w\n", + "[mlrun] 2020-01-21 19:23:44,057 downloading https://fpsignals-public.s3.amazonaws.com/catsndogs.tar.gz to local tmp\n", + "[mlrun] 2020-01-21 19:23:44,877 Verified directories\n", + 
"/tmp/tmptshxsk7d.gz\n", + "['/tmp/tmptshxsk7d', 'gz']\n", + "[mlrun] 2020-01-21 19:23:44,877 opening tar_gz\n", + "[mlrun] 2020-01-21 19:23:44,879 Traceback (most recent call last):\n", " File \"/usr/local/lib/python3.6/site-packages/mlrun-0.4.3-py3.6.egg/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", " val = handler(*args_list)\n", " File \"main.py\", line 30, in open_archive\n", @@ -729,15 +729,15 @@ " self._parse()\n", " File \"/usr/local/lib/python3.6/urllib/request.py\", line 384, in _parse\n", " raise ValueError(\"unknown url type: %r\" % self.full_url)\n", - "ValueError: unknown url type: '/tmp/tmp14moqiew.gz'\n", + "ValueError: unknown url type: '/tmp/tmptshxsk7d.gz'\n", "\n", "\n", - "[mlrun] 2020-01-21 09:50:15,663 exec error - unknown url type: '/tmp/tmp14moqiew.gz'\n", - "[mlrun] 2020-01-21 09:50:15,689 run executed, status=error\n", + "[mlrun] 2020-01-21 19:23:44,891 exec error - unknown url type: '/tmp/tmptshxsk7d.gz'\n", + "[mlrun] 2020-01-21 19:23:44,917 run executed, status=error\n", "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", - "runtime error: unknown url type: '/tmp/tmp14moqiew.gz'\n", " InsecureRequestWarning)\n", - "unknown url type: '/tmp/tmp14moqiew.gz'\n", + "unknown url type: '/tmp/tmptshxsk7d.gz'\n", + "runtime error: unknown url type: '/tmp/tmptshxsk7d.gz'\n", "final state: failed\n" ] }, @@ -910,26 +910,26 @@ " \n", " \n", " \n", - "
...977b17
\n", + "
...301911
\n", " 0\n", - " Jan 21 09:50:14\n", - "
error
\n", + " Jan 21 19:23:44\n", + "
error
\n", " open-archive\n", - "
host=download-8wt5x
kind=job
owner=admin
\n", + "
host=download-7qf2w
kind=job
owner=admin
\n", "
archive_url
\n", - "
key=contents
target_dir=/User/mlrun/functions/t0000
\n", + "
key=contents
target_dir=/User/mlrun/functions/images-from-tar-as-inputs
\n", " \n", " \n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -945,22 +945,22 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 74f0391c2da04aabb3f0735bfa977b17 , !mlrun logs 74f0391c2da04aabb3f0735bfa977b17 \n", - "[mlrun] 2020-01-21 09:50:17,234 run executed, status=error\n", - "runtime error: unknown url type: '/tmp/tmp14moqiew.gz'\n" + "!mlrun get run c163869b83cd49cc888f5e9126301911 , !mlrun logs c163869b83cd49cc888f5e9126301911 \n", + "[mlrun] 2020-01-21 19:23:48,687 run executed, status=error\n", + "runtime error: unknown url type: '/tmp/tmptshxsk7d.gz'\n" ] }, { "ename": "RunError", - "evalue": "unknown url type: '/tmp/tmp14moqiew.gz'", + "evalue": "unknown url type: '/tmp/tmptshxsk7d.gz'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mrun\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mopen_archive_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mrun\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mopen_archive_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in 
\u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mRunError\u001b[0m: unknown url type: '/tmp/tmp14moqiew.gz'" + "\u001b[0;31mRunError\u001b[0m: unknown url type: '/tmp/tmptshxsk7d.gz'" ] } ], From 8814f73311ce0df8d0f7b0390dbb588a8942b9ee Mon Sep 17 00:00:00 2001 From: yasha Date: Wed, 22 Jan 2020 22:40:52 +0000 Subject: [PATCH 13/32] lightgbm/sklearn classifier running, load yaml only --- .../binary.py | 30 +- datagen/classification/binary.yaml | 13 + fileutils/arc_to_parquet/arc_to_parquet.yaml | 14 +- fileutils/open_archive/function.yaml | 14 +- serving/classifier_server.ipynb | 114 +- serving/lightgbm/lgbm_serving.ipynb | 398 ------ serving/lightgbm/train.yaml | 19 - tests/arc_to_parquet.ipynb | 783 ++++------- tests/create_binary_data.ipynb | 359 ++++++ tests/generate-some-classifiers.ipynb | 311 ----- tests/train_classifier.ipynb | 1144 +++++++++++++++++ 
train/sklearn-classifier.py | 104 ++ train/sklearn-classifier.yaml | 9 + 13 files changed, 1945 insertions(+), 1367 deletions(-) rename datagen/{binary_classes => classification}/binary.py (84%) create mode 100644 datagen/classification/binary.yaml delete mode 100644 serving/lightgbm/lgbm_serving.ipynb delete mode 100644 serving/lightgbm/train.yaml create mode 100644 tests/create_binary_data.ipynb delete mode 100644 tests/generate-some-classifiers.ipynb create mode 100644 tests/train_classifier.ipynb create mode 100644 train/sklearn-classifier.py create mode 100644 train/sklearn-classifier.yaml diff --git a/datagen/binary_classes/binary.py b/datagen/classification/binary.py similarity index 84% rename from datagen/binary_classes/binary.py rename to datagen/classification/binary.py index 715ab3d4c..895fd8702 100644 --- a/datagen/binary_classes/binary.py +++ b/datagen/classification/binary.py @@ -1,4 +1,4 @@ -n_samp# Copyright 2019 Iguazio +# Copyright 2019 Iguazio # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -22,16 +22,15 @@ def create_binary_classification( - context: MLClientCtx = None, - n_samples: int = 100_000, - m_features: int = 20, - features_hdr: Optional[List[str]] = None, - weight: float = 0.50, - random_state=1, - filename: Optional[str] = None, - target_path: str = "", - key: str = "", - **sk_params, + context : MLClientCtx = None, + n_samples : int = 100_000, + m_features : int = 20, + features_hdr : Optional[List[str]] = None, + weight : float = 0.50, + random_state : int =1, + filename : Optional[str] = None, + target_path : str = "", + key : str = "" ): """Create a binary classification sample dataset and save. 
If no filename is given it will default to: @@ -46,7 +45,6 @@ def create_binary_classification( :param filename: optional name for stored data file :param target_path: destimation for file :param key: key of data in artifact store - :param sk_params: keyword arguments for scikit-learn's 'make_classification' Returns filename of created data (includes path). """ # check directories exist and create filename if None: @@ -54,15 +52,15 @@ def create_binary_classification( if not filename: name = f"simdata-{n_samples:0.0e}X{m_features}.parquet".replace("+", "") filename = os.path.join(target_path, name) - + else: + filename = os.path.join(target_path, filename) + features, labels = make_classification( n_samples=n_samples, n_features=m_features, weights=[weight], # False n_classes=2, - random_state=random_state, - **sk_params, - ) + random_state=random_state) # make dataframes, add column names, concatenate (X, y) X = pd.DataFrame(features) diff --git a/datagen/classification/binary.yaml b/datagen/classification/binary.yaml new file mode 100644 index 000000000..e95e8c57d --- /dev/null +++ b/datagen/classification/binary.yaml @@ -0,0 +1,13 @@ +kind: job +metadata: + name: binary + tag: '' + hash: 35ea8daeef209d353ee6b75d3cca2b61b16b8e6a + project: '' +spec: + description: 'create binary classification data' + build: + functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOSBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKaW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgcHlhcnJvdyBhcyBwYQppbXBvcnQgcHlhcnJvdy5wYXJxdWV0IGFzIHBxCmZyb20gdHlwaW5nIGltcG9ydCBPcHRpb25hbCwgTGlzdCwgQW55CmZyb20gc2tsZWFybi5kYXRhc2V0cyBpbXBvcnQgbWFrZV9jbGFzc2lmaWNhdGlvbgoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CgoKZGVmIGNyZWF0ZV9iaW5hcnlfY2xhc3NpZmljYXRpb24oCiAgICBjb250ZXh0IDogTUxDbGllbnRDdHggPSBOb25lLAogICAgbl9zYW1wbGVzIDogaW50ID0gMTAwXzAwMCwKICAgIG1fZmVhdHVyZXMgOiBpbnQgPSAyMCwKICAgIGZlYXR1cmVzX2hkciA6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lLAogICAgd2VpZ2h0IDogZmxvYXQgPSAwLjUwLAogICAgcmFuZG9tX3N0YXRlIDogaW50ID0xLAogICAgZmlsZW5hbWUgOiBPcHRpb25hbFtzdHJdID0gTm9uZSwKICAgIHRhcmdldF9wYXRoIDogc3RyID0gIiIsCiAgICBrZXkgOiBzdHIgPSAiIgopOgogICAgIiIiQ3JlYXRlIGEgYmluYXJ5IGNsYXNzaWZpY2F0aW9uIHNhbXBsZSBkYXRhc2V0IGFuZCBzYXZlLgogICAgSWYgbm8gZmlsZW5hbWUgaXMgZ2l2ZW4gaXQgd2lsbCBkZWZhdWx0IHRvOgogICAgJ3NpbWRhdGEte25fc2FtcGxlc31Ye21fZmVhdHVyZXN9LnBhcnF1ZXQnLgogICAgQWxsIG9mIHRoZSBzY2lraXQtbGVhcm4gcGFyYW1ldGVycyBjYW4gYmUgc2V0IHVzaW5nICoqc2tfcGFyYW1zCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIG5fc2FtcGxlczogICAgIG51bWJlciBvZiByb3dzL3NhbXBsZXMKICAgIDpwYXJhbSBtX2ZlYXR1cmVzOiAgICBudW1iZXIgb2YgY29scy9mZWF0dXJlcwogICAgOnBhcmFtIGZlYXR1cmVzX2hkcjogIGhlYWRlciBmb3Ig
ZmVhdHVyZXMgYXJyYXkKICAgIDpwYXJhbSB3ZWlnaHQ6ICAgICAgICBmcmFjdGlvbiBvZiBzYW1wbGUgKG5lZykKICAgIDpwYXJhbSByYW5kb21fc3RhdGU6ICBybmcgc2VlZCAoc2VlIGh0dHBzOi8vc2Npa2l0LWxlYXJuLm9yZy9zdGFibGUvZ2xvc3NhcnkuaHRtbCN0ZXJtLXJhbmRvbS1zdGF0ZSkKICAgIDpwYXJhbSBmaWxlbmFtZTogICAgICBvcHRpb25hbCBuYW1lIGZvciBzdG9yZWQgZGF0YSBmaWxlCiAgICA6cGFyYW0gdGFyZ2V0X3BhdGg6ICAgZGVzdGltYXRpb24gZm9yIGZpbGUKICAgIDpwYXJhbSBrZXk6ICAgICAgICAgICBrZXkgb2YgZGF0YSBpbiBhcnRpZmFjdCBzdG9yZQogICAgUmV0dXJucyBmaWxlbmFtZSBvZiBjcmVhdGVkIGRhdGEgKGluY2x1ZGVzIHBhdGgpLgogICAgIiIiCiAgICAjIGNoZWNrIGRpcmVjdG9yaWVzIGV4aXN0IGFuZCBjcmVhdGUgZmlsZW5hbWUgaWYgTm9uZToKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9wYXRoLCBleGlzdF9vaz1UcnVlKQogICAgaWYgbm90IGZpbGVuYW1lOgogICAgICAgIG5hbWUgPSBmInNpbWRhdGEte25fc2FtcGxlczowLjBlfVh7bV9mZWF0dXJlc30ucGFycXVldCIucmVwbGFjZSgiKyIsICIiKQogICAgICAgIGZpbGVuYW1lID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZWxzZToKICAgICAgICBmaWxlbmFtZSA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgZmlsZW5hbWUpCiAgICAKICAgIGZlYXR1cmVzLCBsYWJlbHMgPSBtYWtlX2NsYXNzaWZpY2F0aW9uKAogICAgICAgIG5fc2FtcGxlcz1uX3NhbXBsZXMsCiAgICAgICAgbl9mZWF0dXJlcz1tX2ZlYXR1cmVzLAogICAgICAgIHdlaWdodHM9W3dlaWdodF0sICAjIEZhbHNlCiAgICAgICAgbl9jbGFzc2VzPTIsCiAgICAgICAgcmFuZG9tX3N0YXRlPXJhbmRvbV9zdGF0ZSkKCiAgICAjIG1ha2UgZGF0YWZyYW1lcywgYWRkIGNvbHVtbiBuYW1lcywgY29uY2F0ZW5hdGUgKFgsIHkpCiAgICBYID0gcGQuRGF0YUZyYW1lKGZlYXR1cmVzKQogICAgaWYgbm90IGZlYXR1cmVzX2hkcjoKICAgICAgICBYLmNvbHVtbnMgPSBbImZlYXRfIiArIHN0cih4KSBmb3IgeCBpbiByYW5nZShtX2ZlYXR1cmVzKV0KICAgIGVsc2U6CiAgICAgICAgWC5jb2x1bW5zID0gZmVhdHVyZXNfaGRyCgogICAgeSA9IHBkLkRhdGFGcmFtZShsYWJlbHMsIGNvbHVtbnM9WyJsYWJlbHMiXSkKICAgIGRhdGEgPSBwZC5jb25jYXQoW1gsIHldLCBheGlzPTEpCgogICAgcHEud3JpdGVfdGFibGUocGEuVGFibGUuZnJvbV9wYW5kYXMoZGF0YSksIGZpbGVuYW1lKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1maWxlbmFtZSkK + base_image: yjbds/mlrun-ds:latest + commands: [] + code_origin: https://github.com/yjb-ds/functions.git#3f7e0c78313c0f8da3f2ae8535b625f06f5c3ee4:/User/repos/functions/datagen/classification/binary.py diff 
--git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index 610b7e025..0233eb15a 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: 41cf4cd59460123f71a3c50f4d399dfb84b54e3c + hash: c6826488913674ec359334111ac9612c79881a2e project: '' spec: command: '' @@ -12,12 +12,8 @@ spec: env: [] description: '' build: - functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDE1OjEyCgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlI
HRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBhcnF1ZXQiKToKICAgICAgICBuYW1lICs9ICIucGFycXVldCIKCiAgICBkZXN0X3BhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBvcy5tYWtlZGlycyhvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgpLCBleGlzdF9vaz1UcnVlKQogICAgaWYgbm90IG9zLnBhdGguaXNmaWxlKGRlc3RfcGF0aCk6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBkb2VzIG5vdCBleGlzdCwgZG93bmxvYWRpbmciKQogICAgICAgIHBxd3JpdGVyID0gTm9uZQogICAgICAgIGZvciBpLCBkZiBpbiBlbnVtZXJhdGUocGQucmVhZF9jc3YoYXJjaGl2ZV91cmwsIGNodW5rc2l6ZT1jaHVua3NpemUsIG5hbWVzPWhlYWRlcikpOgogICAgICAgICAgICB0YWJsZSA9IHBhLlRhYmxlLmZyb21fcGFuZGFzKGRmKQogICAgICAgICAgICBpZiBpID09IDA6CiAgICAgICAgICAgICAgICBwcXdyaXRlciA9IHBxLlBhcnF1ZXRXcml0ZXIoZGVzdF9wYXRoLCB0YWJsZS5zY2hlbWEpCiAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRhYmxlKQoKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpICAgICAgIAoK - commands: - - python -m pip uninstall mlrun - - python -m pip install -U -q mlrun - - python -m pip install -U -q pandas - - python -m pip install -U -q pyarrow - - python -m pip install -U -q numpy==1.17.4 - code_origin: https://github.com/yjb-ds/functions.git#208ab7f82c1def90cd5911679a933aa787454f88:arc + functionSourceCode: 
IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIyIDE3OjQyCgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBhcnF1ZXQiKToKICAgICAgICBuYW1lICs9ICIucGFycXVldCIKCiAgICBkZXN0X3BhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBvcy5tYWtlZGlycyhvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgpLCBleGlzdF9vaz1UcnVlKQogICAgaWYgbm90IG9zLnBhdGguaXNmaWxlKGRlc3RfcGF0aCk6CiAgICAgICAg
Y29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBkb2VzIG5vdCBleGlzdCwgZG93bmxvYWRpbmciKQogICAgICAgIHBxd3JpdGVyID0gTm9uZQogICAgICAgIGZvciBpLCBkZiBpbiBlbnVtZXJhdGUocGQucmVhZF9jc3YoYXJjaGl2ZV91cmwsIGNodW5rc2l6ZT1jaHVua3NpemUsIG5hbWVzPWhlYWRlcikpOgogICAgICAgICAgICBwYXJxdWV0X3NjaGVtYSA9IHBhLlRhYmxlLmZyb21fcGFuZGFzKGRmPWRmKS5zY2hlbWEKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgcGFycXVldF9zY2hlbWEpCiAgICAgICAgICAgIHRhYmxlID0gcGEuVGFibGUuZnJvbV9wYW5kYXMoZGYsIHBhcnF1ZXRfc2NoZW1hKQogICAgICAgICAgICBwcXdyaXRlci53cml0ZV90YWJsZSh0YWJsZSkKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpICAgICAgIAoK + base_image: yjbds/mlrun-files:latest + commands: [] + code_origin: https://github.com/yjb-ds/functions.git#3f7e0c78313c0f8da3f2ae8535b625f06f5c3ee4:arc to parquet.ipynb diff --git a/fileutils/open_archive/function.yaml b/fileutils/open_archive/function.yaml index dfc733e4f..b751e79a6 100644 --- a/fileutils/open_archive/function.yaml +++ b/fileutils/open_archive/function.yaml @@ -1,18 +1,8 @@ kind: job metadata: name: open-archive - tag: '' - hash: f636d58e75f2044e010c7bfedc2ce0720eb207c5 - project: '' spec: - command: '' - args: [] - image: mlrun/mlrun:latest - volumes: [] - volume_mounts: [] - env: [] - description: '' + image: yjbds/mlrun-files:latest + description: 'retrieve archive and extract all' build: functionSourceCode: 
IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA5OjQ3CgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgdXJsbGliLnJlcXVlc3QKCmltcG9ydCBvcwppbXBvcnQgemlwZmlsZQppbXBvcnQgdXJsbGliCmltcG9ydCB0YXJmaWxlCmltcG9ydCBqc29uCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKCmRlZiBvcGVuX2FyY2hpdmUoY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgICAgICAgICAgICAgIHRhcmdldF9kaXI6IHN0ciA9ICdjb250ZW50JywKICAgICAgICAgICAgICAgICBhcmNoaXZlX3VybDogc3RyID0gJycpOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgaW50byBhIHRhcmdldCBkaXJlY3RvcnkKICAgIAogICAgQ3VycmVudGx5IHN1cHBvcnRzIHppcCBhbmQgdGFyLmd6CiAgICAiIiIKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9kaXIsIGV4aXN0X29rPVRydWUpCiAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdWZXJpZmllZCBkaXJlY3RvcmllcycpCiAgICBwcmludChhcmNoaXZlX3VybCkKICAgIHNwbGl0cyA9IGFyY2hpdmVfdXJsLnNwbGl0KCcuJykKICAgIHByaW50KHNwbGl0cykKICAgIGlmIChzcGxpdHNbLTFdID09ICdneicpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ29wZW5pbmcgdGFyX2d6JykKICAgICAgICByZWYgPSB0YXJmaWxlLm9wZW4oZmlsZW9iaj11cmxsaWIucmVxdWVzdC51cmxvcGVuKGFyY2hpdmVfdXJsKSwgbW9kZT0ncnxneicpCiAgICBlbGlmIHNwbGl0c1stMV0gPT0gJ3ppcCc6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygnb3BlbmluZyB6aXAnKQogICAgICAgIHJlZiA9IHppcGZpbGUuWmlwRmlsZShhcmNoaXZlX3VybCwgJ3InKQoKICAgIHJlZi5leHRyYWN0YWxsKHRhcmdldF9kaXIpCiAgICByZWYuY2xvc2UoKQoKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCdjb250ZW50JywgdGFyZ2V0X3BhdGg9dGFyZ2V0X2RpcikKCg== - commands: [] - code_origin: https://github.com/yjb-ds/functions.git#6deeed18dd44b0d9e5c4d67fdca1696b1eaecbac:open_archive.ipynb diff --git a/serving/classifier_server.ipynb b/serving/classifier_server.ipynb index c1a6b2abf..71e835ceb 100644 --- a/serving/classifier_server.ipynb +++ b/serving/classifier_server.ipynb @@ -81,17 +81,9 @@ "cell_type": "code", "execution_count": 4, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'\n" - ] - } - ], + "outputs": [], "source": [ - "%nuclio 
config spec.build.baseImage = \"python:3.6-jessie\"" + "# %nuclio config spec.build.baseImage = \"yjbds/mlrun-files:latest\"" ] }, { @@ -103,7 +95,7 @@ "import kfserving\n", "import os\n", "import numpy as np\n", - "from pickle import load as pload" + "from cloudpickle import load as pload" ] }, { @@ -189,19 +181,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "[I 200121 20:35:56 storage:35] Copying contents of /User/mlrun/models to local\n" - ] - }, - { - "ename": "FileNotFoundError", - "evalue": "[Errno 2] No such file or directory: '/User/mlrun/models/lgb-classifier.pkl'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mmy_server\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mClassifierModel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'classifier'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmodel_dir\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'/User/mlrun/models'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mmy_server\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m\u001b[0m in \u001b[0;36mload\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 11\u001b[0m model_file = os.path.join(\n\u001b[1;32m 12\u001b[0m kfserving.Storage.download(self.model_dir), MODEL_FILE)\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclassifier\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel_file\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m'rb'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mready\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '/User/mlrun/models/lgb-classifier.pkl'" + "[I 200122 18:51:48 storage:35] Copying contents of /User/mlrun/models to local\n" ] } ], @@ -220,7 +200,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -235,7 +215,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -252,7 +232,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -261,7 +241,7 @@ "[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]" ] }, - "execution_count": 33, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -275,17 +255,12 @@ "metadata": {}, "source": [ "\n", - "### **deploy our serving class using as a serverless function**\n", - "in the following section we create a new model serving function which wraps our class , and specify model and other resources.\n", - "\n", - "the `models` dict store model names and the assosiated model **dir** URL (the URL can start with `S3://` and other blob store options), the faster way is to use a shared file volume, we use `.apply(mount_v3io())` to attach a v3io (iguazio data fabric) volume to our function. 
By default v3io will mount the current user home into the `\\User` function path.\n", - "\n", - "**verify the model dir does contain a valid `model.bst` file**" + "### **deploy our serving class using as a serverless function**" ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ @@ -295,26 +270,13 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 14, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "fn = new_model_server('some-classifier-model', \n", + "fn = new_model_server('generic', \n", " models={'classifier_gen': TARGET_PATH}, \n", - " model_class='ClassifierModel')\n", - "\n", - "fn.apply(mount_v3io()) " + " model_class='ClassifierModel').apply(mount_v3io())" ] }, { @@ -328,17 +290,14 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 17:56:08,987 deploy started\n", - "[nuclio] 2020-01-21 17:58:21,215 (info) Build complete\n", - "[nuclio] 2020-01-21 17:58:28,339 (info) Function deploy complete\n", - "[nuclio] 2020-01-21 17:58:28,346 done updating some-classifier-model, function address: 3.135.246.153:31127\n" + "[mlrun] 2020-01-22 18:51:52,843 deploy started\n" ] } ], @@ -351,15 +310,12 @@ "metadata": {}, "source": [ "\n", - "### **test our model server using HTTP request**\n", - "\n", - "\n", - "We invoke our model serving function using test data, the data vector is specified in the `instances` attribute." 
+ "### **test our model server using HTTP request**" ] }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -371,44 +327,18 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "b'Exception caught in handler \"No module named \\'lightgbm\\'\": Traceback (most recent call last):\\n File \"/opt/nuclio/_nuclio_wrapper.py\", line 176, in serve_requests\\n entrypoint_output = self._entrypoint(self._context, event)\\n File \"/opt/nuclio/classifier_server.py\", line 40, in handler\\n return context.mlrun_handler(context, event)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 67, in nuclio_serving_handler\\n return route(context, model_name, event)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 132, in post\\n model = self.get_model(name)\\n File \"/usr/local/lib/python3.6/site-packages/mlrun/runtimes/serving.py\", line 88, in get_model\\n model.load()\\n File \"/opt/nuclio/classifier_server.py\", line 23, in load\\n self.classifier = pload(open(model_file, \\'rb\\'))\\nModuleNotFoundError: No module named \\'lightgbm\\'\\n'" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "resp.__dict__['_content'] " ] }, { "cell_type": "code", - "execution_count": 27, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "ename": "JSONDecodeError", - "evalue": "Expecting value: line 1 column 1 (char 0)", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mJSONDecodeError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m 
\u001b[0mjson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloads\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontent\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m/conda/lib/python3.6/json/__init__.py\u001b[0m in \u001b[0;36mloads\u001b[0;34m(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)\u001b[0m\n\u001b[1;32m 352\u001b[0m \u001b[0mparse_int\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mparse_float\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 353\u001b[0m parse_constant is None and object_pairs_hook is None and not kw):\n\u001b[0;32m--> 354\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_default_decoder\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 355\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcls\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0mcls\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mJSONDecoder\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/json/decoder.py\u001b[0m in \u001b[0;36mdecode\u001b[0;34m(self, s, _w)\u001b[0m\n\u001b[1;32m 337\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \"\"\"\n\u001b[0;32m--> 339\u001b[0;31m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraw_decode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0midx\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_w\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 340\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_w\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 341\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/json/decoder.py\u001b[0m in \u001b[0;36mraw_decode\u001b[0;34m(self, s, idx)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscan_once\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0midx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mStopIteration\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 357\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mJSONDecodeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Expecting value\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m 
\u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 358\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mJSONDecodeError\u001b[0m: Expecting value: line 1 column 1 (char 0)" - ] - } - ], + "outputs": [], "source": [ "json.loads(resp.content)" ] diff --git a/serving/lightgbm/lgbm_serving.ipynb b/serving/lightgbm/lgbm_serving.ipynb deleted file mode 100644 index 9ff22fff1..000000000 --- a/serving/lightgbm/lgbm_serving.ipynb +++ /dev/null @@ -1,398 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Deploy a Serverless Model Server with Nuclio-KFServing\n", - " --------------------------------------------------------------------\n", - "\n", - "The following notebook demonstrates how to deploy a **[LighGBM](https://github.com/microsoft/LightGBM)** model using **[nuclio](https://github.com/nuclio/nuclio)** + **[KFServing](https://github.com/kubeflow/kfserving)** (a.k.a Nuclio-serving)\n", - "\n", - "#### **notebook how-to's**\n", - "* Write and test model serving (KFServing) class in a notebook.\n", - "* Deploy the model server as a Nuclio-serving function.\n", - "* Invoke and test the serving function." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "#### **steps**\n", - "**[define a new function and its dependencies](#define-function)**
\n", - "**[test the model serving class locally](#test-locally)**
\n", - "**[deploy our serving class using as a serverless function](#deploy)**
\n", - "**[test our model server using HTTP request](#test-model-server)**
" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [], - "source": [ - "# nuclio: ignore\n", - "# if the nuclio-jupyter package is not installed run !pip install nuclio-jupyter\n", - "import nuclio" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "### **define a new function and its dependencies**" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "%nuclio: setting kind to 'nuclio:serving'\n", - "%nuclio: setting 'MODEL_CLASS' environment variable\n" - ] - } - ], - "source": [ - "%nuclio config kind=\"nuclio:serving\"\n", - "%nuclio env MODEL_CLASS=ClassifierModel" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [], - "source": [ - "%%nuclio cmd -c\n", - "pip install -U -q kfserving\n", - "pip install -U -q azure\n", - "pip install -U -q numpy\n", - "pip install -U -q xgboost\n", - "pip install -U -q lightgbm\n", - "pip install -U -q mlrun" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [], - "source": [ - "import kfserving\n", - "import os\n", - "import numpy as np\n", - "from pickle import load\n", - "import lightgbm as lgb" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**NOTE: bring your own pickled model by changing the following variables, or run the [LightGBM demo](https://github.com/mlrun/demos/tree/master/lightgbm#instructions-for-lightgbm-demo).**" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [], - "source": [ - "model = ()\n", - "TARGET_PATH = '/User/mlrun/lightgbm'\n", - "MODEL_FILE = 'lightgbm_classifier.pkl'" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "class ClassifierModel(kfserving.KFModel):\n", - " def __init__(self, name: str, model_dir: 
str, model = None):\n", - " super().__init__(name)\n", - " self.name = name\n", - " self.model_dir = model_dir\n", - " if not model is None:\n", - " self.classifier = model\n", - " self.ready = True\n", - "\n", - " def load(self):\n", - " model_file = os.path.join(\n", - " kfserving.Storage.download(self.model_dir), MODEL_FILE)\n", - " self.classifier = load(open(model_file, 'rb'))\n", - " self.ready = True\n", - "\n", - " def predict(self, body):\n", - " try:\n", - " feats = np.asarray(body['instances'])\n", - " result: np.ndarray = self.classifier.predict(feats)\n", - " return result.tolist()\n", - " except Exception as e:\n", - " raise Exception(\"Failed to predict %s\" % e)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following end-code annotation tells ```nuclio``` to stop parsing the notebook from this cell. _**Please do not remove this cell**_:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "# nuclio: end-code" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "______________________________________________" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "### **test the model serving class locally**\n", - "The class above can be tested locally. 
Just instantiate the class, `.load()` will load the model to a local dir.\n", - "\n", - "> **Verify there is a `model.bst` file in the model_dir path (generated by the training notebook)**" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "import pyarrow.parquet as pq\n", - "import pyarrow as pa\n", - "import pandas as pd" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[I 200120 17:03:17 storage:35] Copying contents of /User/mlrun/lightgbm to local\n" - ] - } - ], - "source": [ - "my_server = ClassifierModel('some-classifier-model', model_dir=TARGET_PATH)\n", - "my_server.load()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### _data_\n", - "Grab some data from the test set we prepared in the **[LightGBM demo](https://github.com/mlrun/demos/tree/master/lightgbm#instructions-for-lightgbm-demo)**:" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [], - "source": [ - "features = pq.read_table(os.path.join(TARGET_PATH, 'xtest.parquet')).to_pandas().iloc[:3, :]\n", - "labels = pq.read_table(os.path.join(TARGET_PATH, 'ytest.parquet')).to_pandas().iloc[:3, :]" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [], - "source": [ - "event = {\"instances\": features.values.tolist()}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "We can use the `.predict(body)` method to test the model." 
- ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[1.0, 1.0, 0.0]" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "my_server.predict(event)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "### **deploy our serving class using as a serverless function**\n", - "in the following section we create a new model serving function which wraps our class , and specify model and other resources.\n", - "\n", - "the `models` dict store model names and the assosiated model **dir** URL (the URL can start with `S3://` and other blob store options), the faster way is to use a shared file volume, we use `.apply(mount_v3io())` to attach a v3io (iguazio data fabric) volume to our function. By default v3io will mount the current user home into the `\\User` function path.\n", - "\n", - "**verify the model dir does contain a valid `model.bst` file**" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [], - "source": [ - "from mlrun import new_model_server, mount_v3io\n", - "import requests" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fn = new_model_server('some-classifier-model', \n", - " models={'classifier_gen': TARGET_PATH}, \n", - " model_class='ClassifierModel')\n", - "\n", - "fn.apply(mount_v3io()) " - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-20 17:05:07,328 deploy started\n", - "[nuclio] 2020-01-20 17:07:37,734 (info) Build complete\n", - "[nuclio] 2020-01-20 17:07:46,843 done creating some-classifier-model, function address: 
3.135.246.153:31529\n" - ] - } - ], - "source": [ - "addr = fn.deploy()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "### **test our model server using HTTP request**\n", - "\n", - "\n", - "We invoke our model serving function using test data, the data vector is specified in the `instances` attribute." - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "resp = requests.post(addr + '/classifier_gen/predict', json=event)" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[1.0, 1.0, 0.0]" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "json.loads(resp.content)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**[back to top](#top)**" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.8" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/serving/lightgbm/train.yaml b/serving/lightgbm/train.yaml deleted file mode 100644 index b6a2c7fbf..000000000 --- a/serving/lightgbm/train.yaml +++ /dev/null @@ -1,19 +0,0 @@ -kind: job -metadata: - name: lgbm-job -spec: - description: 'train an LGBMClassifier' - build: - functionSourceCode: # Generated by nuclio.export.NuclioExporter on 2020-01-21 21:41

from io import BytesIO
from os import path, makedirs
import json
from cloudpickle import load, dump
from pathlib import Path
from urllib.request import urlretrieve
from typing import IO, AnyStr, TypeVar, Union, List, Optional

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import (roc_curve, confusion_matrix)
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
from matplotlib.figure import Figure
import seaborn as sns

import pyarrow.parquet as pq
import pyarrow as pa
from pyarrow import Table

from mlrun.artifacts import TableArtifact, PlotArtifact
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

def get_context_table(ctxtable: DataItem) -> pd.DataFrame:
    """Deserialize a parquet table from the artifact store.
    
    :param ctxtable:  table in the artifact store
    :returns: the table as a pandas DataFrame
    """
    blob = BytesIO(ctxtable.get())
    return pd.read_parquet(blob, engine='pyarrow')

def log_context_table(
    context: MLClientCtx,
    target_path: str,
    key: str,
    table: pd.DataFrame
) -> None:
    """Log a table in the artifact store.
    
    The table is written as a parquet file, and its target
    path is saved in the context.
    
    :param context:      the function context
    :param target_path:  location (folder) of our DataItem
    :param key:          name of the object in the artifact store
    :param table:        the object we wish to store
    """
    filepath = path.join(target_path, key + '.parquet')
    pq.write_table(pa.Table.from_pandas(table), filepath)    
    context.log_artifact(key, target_path=filepath)

def log_lgbm_model(
    context: MLClientCtx,
    model,
    data,
    header: List = [],
    target_path: str = '',
    name: str = '',  # with file extension
    key: str = 'model',
    exp_labels: dict = {}
):
    """Log a classifier model to the artifact store.
    
    :param context:       function context
    :param model:         estimated model
    :param data:          test data, a dict with keys 'xtest' and 'ytest'
    :param header:        feature labels
    :param target_path:   destination folder for file artifacts
    :param name:          name of model file (or, prefix to model files)
    :param key:           key of model in artifact store
    :param exp_labels:    model artifact labels
    
    Save an estimated model along with its metadata, its training-validation
    metrics history and plots, ROC curve, confusion matrix and feature importances.
    """
    def _gcf_clear(plt):
        plt.cla()
        plt.clf()
        plt.close()        
    
    def plot_validation(train_metric, valid_metric):
        """Plot train and validation loss curves from a metrics table in an
        artifact store.

        These curves represent the training round losses from the training
        and validation sets.
        :param train_metric:    train metric
        :param valid_metric:    validation metric
        """
        plt.plot(train_metric)
        plt.plot(valid_metric)
        plt.title("training validation results")
        plt.xlabel("epoch")
        plt.ylabel("")
        plt.legend(["train", "valid"])
        fig = plt.gcf()

        plotpath = path.join(target_path, "history.png")
        plt.savefig(plotpath)
        context.log_artifact(PlotArtifact('training-validation-plot', body=fig, target_path=plotpath))

        _gcf_clear(plt)

    def plot_roc(y_labels, y_probs):
        """Plot an ROC curve from test data saved in an artifact store.
        :param y_labels:        test data labels
        :param y_probs:         predicted probabilities of the positive class
        """
        fpr_xg, tpr_xg, _ = roc_curve(y_labels, y_probs)
        plt.plot([0, 1], [0, 1], "k--")
        plt.plot(fpr_xg, tpr_xg, label="roc")
        plt.xlabel("false positive rate")
        plt.ylabel("true positive rate")
        plt.title("roc curve")
        plt.legend(loc="best")
        fig = plt.gcf()

        plotpath = path.join(target_path, "roc.png")
        # the image format is inferred from the .png extension
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('roc', body=fig))

        _gcf_clear(plt)

    def plot_confusion_matrix(labels, predictions):
        """Create a confusion matrix.
        Plot and save a confusion matrix using test data from a
        pipeline step.  The plot is generated using default arguments.
        The present example could be extended by including a parameters `dict`
        that is passed through to sklearn's `confusion_matrix`,
        `ConfusionMatrixDisplay`, and matplotlib `plot`.
        :param labels:          test data labels
        :param predictions:     test data predictions
        """
        # class labels are inferred from the data (no explicit `labels` list)
        cm = confusion_matrix(labels,
                              predictions,
                              sample_weight=None,
                              normalize='all')
        sns.heatmap(cm, annot=True, cmap="Blues")
        plotpath = path.join(target_path, "confusion.png")
        fig = plt.gcf()
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('confusion_matrix', body=fig))

        _gcf_clear(plt)

    def plot_importance(model, header: List = []):
        """Display estimated feature importances.

        :param model:       fitted lightgbm model
        :param header:      list of feature names
        """
        zipped = zip(model.feature_importances_, header)

        feature_imp = pd.DataFrame(sorted(zipped), columns=['freq','feature']
                                  ).sort_values(by="freq", ascending=False)

        plt.figure(figsize=(20, 10))
        sns.barplot(x="freq", y="feature", data=feature_imp)
        plt.title('LightGBM Features')
        plt.tight_layout()
        fig = plt.gcf()
        plotpath = path.join(target_path, "feature-importances.png")
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('feature-importances-plot', body=fig))

        tablepath = path.join(target_path, "feature-importances-table.csv")
        feature_imp.to_csv(tablepath)
        context.log_artifact(TableArtifact('feature-importances-table', target_path=tablepath))

        _gcf_clear(plt)

    # getattr with a default avoids an AttributeError on models without predict_proba
    if callable(getattr(model, 'predict_proba', None)):
        ypred_probs = model.predict_proba(data['xtest'])[:, 1]
        ypred = np.where(ypred_probs >= 0.5, 1, 0)
    else:
        ypred = model.predict(data['xtest'])
        ypred_probs = None

    context.log_result("test_accuracy", float(model.score(data['xtest'], data['ytest'])))

    loss = np.asarray(model.evals_result_['train']['binary_logloss'], dtype=float)
    val_loss = np.asarray(model.evals_result_['valid']['binary_logloss'], dtype=float)

    plot_validation(loss, val_loss)
    # numpy arrays have no unambiguous truth value, so compare against None
    if ypred_probs is not None:
        plot_roc(data['ytest'], ypred_probs)
    if ypred is not None:
        plot_confusion_matrix(data['ytest'], ypred)
    if hasattr(model, 'feature_importances_'):
        plot_importance(model, header)
   
    filepath = path.join(target_path, name)
    # close the file handle deterministically once the model is written
    with open(filepath, 'wb') as model_fh:
        dump(model, model_fh)
    context.log_artifact(key,
                         target_path=filepath,
                         labels=exp_labels)    

def train(
    context: MLClientCtx,
    src_file: str,
    header: DataItem,
    test_size: float = 0.1,
    train_val_split: float = 0.75,
    sample: int = -1,
    target_path: str = '',
    name: str = '',
    key: str = '',
    exp_labels = {},  # 'lightgbm_sklearn' if this were a pipeline
    verbose: bool = False,
    random_state = np.random.RandomState(1),
    **sklearn_params
) -> None:
    """Train and save a LightGBM model.
    
    :param context:         the function context
    :param src_file:        ('raw') name of raw data file
    :param header:          header artifact
    :param test_size:       (0.1) test set size
    :param train_val_split: (0.75) Once the test set has been removed the 
                            training set gets this proportion.
    :param sample:          (-1) number of rows to use: -1 selects all rows,
                            n >= 1 selects the first n rows, and n < -1 selects
                            a random sample of |n| rows from the entire file
    :param target_path:     folder location of files
    :param name:            destination name for model file
    :param key:             key for model artifact
    :param exp_labels:      metadata dict, some keys are required (type, framework). 'type'
                            is either classifier or regressor, 'framework' can be sklearn or not
                            (sklearn models have a generic interface)
    :param verbose:         (False) show metrics for training/validation steps.
    :param random_state:    (1) sklearn rng seed
    :param sklearn_params:  sklearn keyword params 
    """
    srcfilepath = path.join(target_path, src_file)
    raw = pq.read_table(srcfilepath).to_pandas()
    if sample >= 1:
        # keep the first `sample` rows
        raw = raw.iloc[:sample, :]
    elif sample < -1:
        # draw a random sample of |sample| rows from the entire file
        raw = raw.sample(-sample)
    # sample == -1 (the default) keeps every row
    labels = raw.pop('labels')

    x, xtest, y, ytest = train_test_split(raw, labels, train_size=1-test_size, 
                                          random_state=random_state)
   
    xtrain, xvalid, ytrain, yvalid = train_test_split(x, y, 
                                                      train_size=train_val_split, 
                                                      random_state=random_state)        
    
    clf = lgb.LGBMClassifier(random_state=random_state,
                             verbose=int(verbose))

    eval_results = dict()

    clf.fit(xtrain, 
            ytrain,
            eval_set=[(xvalid, yvalid), (xtrain, ytrain)],
            eval_names=['valid', 'train'],
            callbacks=[lgb.record_evaluation(eval_results)],
            verbose=verbose)
    
    context.log_result("train_accuracy", float(clf.score(xtrain, ytrain)))
    
    log_lgbm_model(
        context, 
        clf, 
        data = {'xtest':xtest, 'ytest':ytest},
        target_path=target_path,
        header=load(open(str(header), 'rb')),
        name=name, 
        key=key,
        exp_labels=exp_labels)

 - base_image: python:3.6-jessie - commands: - - rm /conda/lib/python3.6/site-packages/seaborn* -rf - - pip uninstall -y mlrun - - pip install -U -q mlrun - - pip install -U -q kfp - - pip install -U -q pyarrow - - pip install -U -q pandas - - pip install -U -q matplotlib - - pip install -U -q seaborn - - pip install -U -q scikit-learn - - pip install -U -q lightgbm diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index 8272c98ad..354f52e7d 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -31,35 +31,22 @@ "cell_type": "code", "execution_count": 3, "metadata": {}, - "outputs": [], - "source": [ - "%%nuclio cmd -c\n", - "python -m pip uninstall mlrun\n", - "python -m pip install -U -q mlrun\n", - "python -m pip install -U -q pandas\n", - "python -m pip install -U -q pyarrow" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'\n" + "%nuclio: setting spec.build.baseImage to 'yjbds/mlrun-files:latest'\n" ] } ], "source": [ - "%nuclio config spec.build.baseImage = \"python:3.6-jessie\"" + "%nuclio config spec.build.baseImage = \"yjbds/mlrun-files:latest\"" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -69,7 +56,7 @@ "import pandas as pd\n", "import pyarrow.parquet as pq\n", "import pyarrow as pa\n", - "from pickle import dump, load\n", + "from cloudpickle import dump, load\n", "\n", "from mlrun.execution import MLClientCtx\n", "from typing import IO, AnyStr, Union, List, Optional\n", @@ -107,11 +94,11 @@ " context.logger.info(\"destination file does not exist, downloading\")\n", " pqwriter = None\n", " for i, df in enumerate(pd.read_csv(archive_url, chunksize=chunksize, names=header)):\n", - " table = pa.Table.from_pandas(df)\n", + " parquet_schema = pa.Table.from_pandas(df=df).schema\n", " if i 
== 0:\n", - " pqwriter = pq.ParquetWriter(dest_path, table.schema)\n", + " pqwriter = pq.ParquetWriter(dest_path, parquet_schema)\n", + " table = pa.Table.from_pandas(df, parquet_schema)\n", " pqwriter.write_table(table)\n", - "\n", " if pqwriter:\n", " pqwriter.close()\n", "\n", @@ -128,7 +115,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -137,7 +124,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -157,7 +144,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -167,25 +154,25 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-21 15:12:18,489 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + "[mlrun] 2020-01-22 17:42:17,438 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" ] } ], "source": [ "# export function yaml\n", - "f# n.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" + "fn.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -195,7 +182,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -204,27 +191,18 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/User/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. 
Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n", - " InsecureRequestWarning)\n" - ] - } - ], + "outputs": [], "source": [ "# load function from Github\n", - "fn = mlrun.import_function(\n", - " 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils/arc_to_parquet/arc_to_parquet.yaml')" + "# fn = mlrun.import_function(\n", + "# 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils/arc_to_parquet/arc_to_parquet.yaml')" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -249,228 +227,18 @@ }, { "cell_type": "code", - "execution_count": 16, - "metadata": { - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-21 19:27:06,066 starting remote build, image: .mlrun/func-default-arc-to-parquet-latest\n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name python:3.6-jessie to python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:0318d80cb241983eda20b905d77fa0bfb06e29e5aabf075c7941ea687f1c125a: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:0318d80cb241983eda20b905d77fa0bfb06e29e5aabf075c7941ea687f1c125a: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image 
python:3.6-jessie \n", - "\u001b[36mINFO\u001b[0m[0001] Unpacking rootfs as cmd RUN python -m pip uninstall mlrun requires it. \n", - "\u001b[36mINFO\u001b[0m[0011] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0036] RUN python -m pip uninstall mlrun \n", - "\u001b[36mINFO\u001b[0m[0036] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0036] args: [-c python -m pip uninstall mlrun] \n", - "WARNING: Skipping mlrun as it is not installed.\n", - "\u001b[36mINFO\u001b[0m[0037] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0039] RUN python -m pip install -U -q mlrun \n", - "\u001b[36mINFO\u001b[0m[0039] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0039] args: [-c python -m pip install -U -q mlrun] \n", - "ERROR: kfp 0.2.0 has requirement urllib3<1.25,>=1.15, but you'll have urllib3 1.25.7 which is incompatible.\n", - "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", - "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0087] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0112] RUN python -m pip install -U -q pandas \n", - "\u001b[36mINFO\u001b[0m[0112] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0112] args: [-c python -m pip install -U -q pandas] \n", - "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", - "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0113] Taking snapshot of full filesystem... 
\n", - "\u001b[36mINFO\u001b[0m[0122] RUN python -m pip install -U -q pyarrow \n", - "\u001b[36mINFO\u001b[0m[0122] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0122] args: [-c python -m pip install -U -q pyarrow] \n", - "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", - "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0125] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0136] RUN python -m pip install -U -q numpy==1.17.4 \n", - "\u001b[36mINFO\u001b[0m[0136] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0136] args: [-c python -m pip install -U -q numpy==1.17.4] \n", - "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", - "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0139] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_mt19937.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_generator.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_sfc64.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/__init__.pxd \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/tests/__pycache__/test_extending.cpython-36.pyc \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_pcg64.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_examples \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for 
/usr/local/lib/python3.6/site-packages/numpy/random/tests/test_extending.py \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_philox.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.pxd \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test__exceptions.cpython-36.pyc \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test__exceptions.py \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/test_issue14735.py \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/include/numpy/random/distributions.h \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.pxd \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/core/tests/__pycache__/test_issue14735.cpython-36.pyc \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy-1.18.1.dist-info \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bounded_integers.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_common.pxd \n", - "\u001b[36mINFO\u001b[0m[0144] Adding whiteout for /usr/local/lib/python3.6/site-packages/numpy/random/_bit_generator.cpython-36m-x86_64-linux-gnu.so \n", - "\u001b[36mINFO\u001b[0m[0149] RUN pip install mlrun \n", - 
"\u001b[36mINFO\u001b[0m[0149] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0149] args: [-c pip install mlrun] \n", - "Requirement already satisfied: mlrun in /usr/local/lib/python3.6/site-packages (0.4.3)\n", - "Requirement already satisfied: requests>=2.20.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (2.22.0)\n", - "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.0)\n", - "Requirement already satisfied: sqlalchemy==1.3.11 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.3.11)\n", - "Requirement already satisfied: nest-asyncio>=1.0.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.2.2)\n", - "Requirement already satisfied: nuclio-sdk>=0.0.3 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.0.5)\n", - "Requirement already satisfied: Flask>=1.1.1 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.1.1)\n", - "Requirement already satisfied: gevent==1.4.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.4.0)\n", - "Requirement already satisfied: pandas>=0.23.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.25.3)\n", - "Requirement already satisfied: aiohttp>=3.5.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.6.2)\n", - "Requirement already satisfied: gunicorn==19.9.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (19.9.0)\n", - "Requirement already satisfied: boto3>=1.9 in /usr/local/lib/python3.6/site-packages (from mlrun) (1.11.6)\n", - "Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (7.0)\n", - "Requirement already satisfied: kfp>=0.1.29 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.2.0)\n", - "Requirement already satisfied: croniter==0.3.31 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.3.31)\n", - "Requirement already satisfied: GitPython>=2.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (3.0.5)\n", - "Requirement already 
satisfied: tabulate<=0.8.3,>=0.8.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (0.8.3)\n", - "Requirement already satisfied: pyyaml>=5.1.0 in /usr/local/lib/python3.6/site-packages (from mlrun) (5.3)\n", - "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2.8)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (2019.11.28)\n", - "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (1.25.7)\n", - "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", - "Requirement already satisfied: ipython>=7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", - "Requirement already satisfied: jupyterlab>=0.35.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", - "Requirement already satisfied: notebook>=5.7.2 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", - "Requirement already satisfied: nbconvert>=5.4 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", - "Requirement already satisfied: tornado<6,>=5 in /usr/local/lib/python3.6/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", - "Requirement already satisfied: Jinja2>=2.10.1 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", - "Requirement already satisfied: itsdangerous>=0.24 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", - "Requirement already satisfied: Werkzeug>=0.15 in /usr/local/lib/python3.6/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", - "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in 
/usr/local/lib/python3.6/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", - "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2019.3)\n", - "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", - "Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/site-packages (from pandas>=0.23.0->mlrun) (2.8.1)\n", - "Requirement already satisfied: typing-extensions>=3.6.5; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.7.4.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", - "Requirement already satisfied: idna-ssl>=1.0; python_version < \"3.7\" in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.1.0)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: multidict<5.0,>=4.5 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", - "Requirement already satisfied: async-timeout<4.0,>=3.0 in /usr/local/lib/python3.6/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", - "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", - "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", - "Requirement already satisfied: botocore<1.15.0,>=1.14.6 in /usr/local/lib/python3.6/site-packages (from boto3>=1.9->mlrun) (1.14.6)\n", - "Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.14.0)\n", - "Requirement already satisfied: requests-toolbelt>=0.8.0 in /usr/local/lib/python3.6/site-packages (from 
kfp>=0.1.29->mlrun) (0.9.1)\n", - "Requirement already satisfied: argo-models==2.2.1a in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", - "Requirement already satisfied: Deprecated in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", - "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", - "Requirement already satisfied: cloudpickle==1.1.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", - "Requirement already satisfied: google-cloud-storage>=1.13.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", - "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", - "Requirement already satisfied: google-auth>=1.6.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", - "Requirement already satisfied: cryptography>=2.4.2 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (2.8)\n", - "Requirement already satisfied: jsonschema>=3.0.1 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", - "Requirement already satisfied: PyJWT>=1.6.4 in /usr/local/lib/python3.6/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", - "Requirement already satisfied: gitdb2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", - "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", - "Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/site-packages (from 
ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (41.0.1)\n", - "Requirement already satisfied: backcall in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", - "Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", - "Requirement already satisfied: jedi>=0.10 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", - "Requirement already satisfied: decorator in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", - "Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", - "Requirement already satisfied: pygments in /usr/local/lib/python3.6/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", - "Requirement already satisfied: jupyterlab-server~=1.0.0 in /usr/local/lib/python3.6/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", - "Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", - "Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", - "Requirement already satisfied: prometheus-client in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", - "Requirement already satisfied: pyzmq>=17 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", - "Requirement already satisfied: nbformat in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", - "Requirement already satisfied: jupyter-core>=4.6.1 in /usr/local/lib/python3.6/site-packages (from 
notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", - "Requirement already satisfied: jupyter-client>=5.3.4 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", - "Requirement already satisfied: Send2Trash in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", - "Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.6/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", - "Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", - "Requirement already satisfied: bleach in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", - "Requirement already satisfied: testpath in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", - "Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", - "Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/site-packages (from botocore<1.15.0,>=1.14.6->boto3>=1.9->mlrun) (0.15.2)\n", - "Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.6/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", - "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in 
/usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", - "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", - "Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", - "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", - "Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", - "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", - "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", - "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib/python3.6/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.13.2)\n", - "Requirement already satisfied: pyrsistent>=0.14.0 in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", - "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /usr/local/lib/python3.6/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", - "Requirement already satisfied: smmap2>=2.0.0 in /usr/local/lib/python3.6/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", - "Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", - "Requirement already satisfied: ptyprocess>=0.5 in 
/usr/local/lib/python3.6/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: parso>=0.5.2 in /usr/local/lib/python3.6/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", - "Requirement already satisfied: json5 in /usr/local/lib/python3.6/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", - "Requirement already satisfied: webencodings in /usr/local/lib/python3.6/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", - "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /usr/local/lib/python3.6/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", - "Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", - "Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/lib/python3.6/site-packages (from rsa<4.1,>=3.1.4->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", - "Requirement already satisfied: pycparser in /usr/local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.19)\n", - "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", - "Requirement already satisfied: protobuf>=3.4.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", - "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.6/site-packages (from 
google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", - "Requirement already satisfied: more-itertools in /usr/local/lib/python3.6/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", - "WARNING: You are using pip version 19.1.1, however version 20.0.1 is available.\n", - "You should consider upgrading via the 'pip install --upgrade pip' command.\n", - "\u001b[36mINFO\u001b[0m[0150] Taking snapshot of full filesystem... \n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "fn.deploy()" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# fn.with_code()" ] @@ -484,21 +252,22 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# useful constants\n", "target_path = '/User/mlrun/models'\n", - "archive = \"https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz\"\n", + "# archive = \"https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz\"\n", + "archive = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz\"\n", "parquet_file = 'higgs.parquet' # the file extension is not necessary\n", "parquet_file_path = target_path + \"/\" + parquet_file\n", - "artifact_key = 'higgs_small'" + "artifact_key = 'higgs_large'" ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ @@ -510,232 +279,9 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 
null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-21 20:23:50,660 starting run arc2parq uid=e20e88ae28a545da90e7ded360b78d6d -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-21 20:23:50,882 Job is running in the background, pod: arc2parq-c65q8\n", - "[mlrun] 2020-01-21 20:24:05,984 destination file already exists\n", - "[mlrun] 2020-01-21 20:24:06,002 log artifact higgs_small at /User/mlrun/models/higgs.parquet, size: None, db: Y\n", - "[mlrun] 2020-01-21 20:24:06,017 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", - "\n", - "[mlrun] 2020-01-21 20:24:06,029 run executed, status=completed\n", - "final state: succeeded\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...b78d6d | 0 | Jan 21 20:24:05 | completed | arc-to-parquet | host=arc2parq-c65q8; kind=job; owner=admin | | archive_url=https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz; header=['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ', 'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ', 'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ', 'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ', 'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ', 'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']; key=higgs_small; name=higgs.parquet; target_path=/User/mlrun/models | | higgs_small; header |
\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run e20e88ae28a545da90e7ded360b78d6d , !mlrun logs e20e88ae28a545da90e7ded360b78d6d \n", - "[mlrun] 2020-01-21 20:24:10,173 run executed, status=completed\n" - ] - } - ], + "outputs": [], "source": [ "# create and run the task\n", "arc_to_parq_task = mlrun.NewTask(\n", @@ -746,7 +292,7 @@ " 'name' : parquet_file, \n", " 'key' : artifact_key,\n", " 'archive_url': archive,\n", - " 'header' : higgs_header},\n", + " 'header' : HIGGS_HEADER},\n", " outputs=[artifact_key])\n", "\n", "# run\n", @@ -769,7 +315,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -780,7 +326,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -790,7 +336,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ @@ -800,30 +346,247 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "copied = pd.read_parquet(parquet_file_path, engine=\"pyarrow\")" + ] + }, + { + "cell_type": "code", + "execution_count": 21, "metadata": {}, "outputs": [ { - "ename": "TypeError", - "evalue": "unhashable type: 'dict'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0moriginal\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marchive\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mcopied\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_parquet\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparquet_file_path\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"pyarrow\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray_equal\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moriginal\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopied\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"mlrun.functions: original and copied data not equal\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/pandas/io/parquet.py\u001b[0m in \u001b[0;36mread_parquet\u001b[0;34m(path, engine, columns, **kwargs)\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/pandas/io/parquet.py\u001b[0m in \u001b[0;36mread\u001b[0;34m(self, path, columns, **kwargs)\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/array.pxi\u001b[0m in \u001b[0;36mpyarrow.lib._PandasConvertible.to_pandas\u001b[0;34m()\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/table.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.Table._to_pandas\u001b[0;34m()\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/pyarrow/pandas_compat.py\u001b[0m in \u001b[0;36mtable_to_blockmanager\u001b[0;34m(options, table, categories, ignore_metadata)\u001b[0m\n", - "\u001b[0;31mTypeError\u001b[0m: unhashable type: 'dict'" - ] + "data": 
{ + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
labelslepton_pTlepton_etalepton_phimissing_energy_magnitudemissing_energy_phijet_1_ptjet_1_etajet_1_phijet_1_b-tag...jet_4_etajet_4_phijet_4_b-tagm_jjm_jjjm_lvm_jlvm_bbm_wbbm_wwbb
01.00.869293-0.6350820.2256900.327470-0.6899930.754202-0.248573-1.0920640.000000...-0.010455-0.0457673.1019611.3537600.9795630.9780760.9200050.7216570.9887510.876678
11.00.9075420.3291470.3594121.497970-0.3130101.095531-0.557525-1.5882302.173076...-1.138930-0.0008190.0000000.3022200.8330480.9857000.9780980.7797320.9923560.798343
21.00.7988351.470639-1.6359750.4537730.4256291.1048751.2823221.3816640.000000...1.1288480.9004610.0000000.9097531.1083300.9856920.9513310.8032520.8659240.780118
30.01.344385-0.8766260.9359131.9920500.8824541.786066-1.646778-0.9423830.000000...-0.678379-1.3603560.0000000.9466521.0287040.9986560.7282810.8692001.0267360.957904
41.01.1050090.3213561.5224010.882808-1.2053490.681466-1.070464-0.9218710.000000...-0.3735660.1130410.0000000.7558561.3610570.9866100.8380851.1332950.8722450.808487
\n", + "

5 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " labels lepton_pT lepton_eta lepton_phi missing_energy_magnitude \\\n", + "0 1.0 0.869293 -0.635082 0.225690 0.327470 \n", + "1 1.0 0.907542 0.329147 0.359412 1.497970 \n", + "2 1.0 0.798835 1.470639 -1.635975 0.453773 \n", + "3 0.0 1.344385 -0.876626 0.935913 1.992050 \n", + "4 1.0 1.105009 0.321356 1.522401 0.882808 \n", + "\n", + " missing_energy_phi jet_1_pt jet_1_eta jet_1_phi jet_1_b-tag ... \\\n", + "0 -0.689993 0.754202 -0.248573 -1.092064 0.000000 ... \n", + "1 -0.313010 1.095531 -0.557525 -1.588230 2.173076 ... \n", + "2 0.425629 1.104875 1.282322 1.381664 0.000000 ... \n", + "3 0.882454 1.786066 -1.646778 -0.942383 0.000000 ... \n", + "4 -1.205349 0.681466 -1.070464 -0.921871 0.000000 ... \n", + "\n", + " jet_4_eta jet_4_phi jet_4_b-tag m_jj m_jjj m_lv m_jlv \\\n", + "0 -0.010455 -0.045767 3.101961 1.353760 0.979563 0.978076 0.920005 \n", + "1 -1.138930 -0.000819 0.000000 0.302220 0.833048 0.985700 0.978098 \n", + "2 1.128848 0.900461 0.000000 0.909753 1.108330 0.985692 0.951331 \n", + "3 -0.678379 -1.360356 0.000000 0.946652 1.028704 0.998656 0.728281 \n", + "4 -0.373566 0.113041 0.000000 0.755856 1.361057 0.986610 0.838085 \n", + "\n", + " m_bb m_wbb m_wwbb \n", + "0 0.721657 0.988751 0.876678 \n", + "1 0.779732 0.992356 0.798343 \n", + "2 0.803252 0.865924 0.780118 \n", + "3 0.869200 1.026736 0.957904 \n", + "4 1.133295 0.872245 0.808487 \n", + "\n", + "[5 rows x 29 columns]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "original = pd.read_csv(archive).values\n", - "copied = pd.read_parquet(parquet_file_path, engine=\"pyarrow\").values\n", - "assert np.array_equal(original, copied), \"mlrun.functions: original and copied data not equal\"" + "copied.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(11000000, 29)" + ] + }, + "execution_count": 22, + "metadata": {}, + 
"output_type": "execute_result" + } + ], + "source": [ + "copied.shape" ] }, { @@ -835,11 +598,11 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ - "os.remove(parquet_file_path)" + "# os.remove(parquet_file_path)" ] }, { diff --git a/tests/create_binary_data.ipynb b/tests/create_binary_data.ipynb new file mode 100644 index 000000000..146dbbf47 --- /dev/null +++ b/tests/create_binary_data.ipynb @@ -0,0 +1,359 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "import os\n", + "\n", + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "TARGET_CODE_PATH = '/User/repos/functions/datagen/classification'\n", + "N_SAMPLES = 10_000\n", + "M_FEATURES = 20\n", + "NEG_WEIGHT = 0.5\n", + "TARGET_DATA_PATH = '/User/mlrun/datagen'\n", + "KEY = 'bindata'" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# mlrun.code_to_function(\n", + "# filename=os.path.join(TARGET_CODE_PATH, 'binary.py'), \n", + "# kind='job'\n", + "# ).export(os.path.join(TARGET_CODE_PATH, 'binary.yaml'))" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "binarydatagen = mlrun.import_function(\n", + " os.path.join(TARGET_CODE_PATH, 'binary.yaml')\n", + ").apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# binarydatagen.deploy()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-22 20:01:50,178 starting run create_binary_classification uid=b52bfe49d1644faf806a3f90c288012d -> http://mlrun-api:8080\n", + 
"[mlrun] 2020-01-22 20:01:50,258 Job is running in the background, pod: create-binary-classification-fcvdh\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", + " result = infer_dtype(pandas_collection)\n", + "[mlrun] 2020-01-22 20:02:02,615 log artifact bindata at /User/mlrun/datagen/simdata-1e04X20.parquet, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-22 20:02:02,629 run executed, status=completed\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...88012d
0Jan 22 20:02:02completedbinary
host=create-binary-classification-fcvdh
kind=job
owner=admin
key=bindata
m_features=20
n_samples=10000
target_path=/User/mlrun/datagen
weight=0.5
bindata
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run b52bfe49d1644faf806a3f90c288012d , !mlrun logs b52bfe49d1644faf806a3f90c288012d \n", + "[mlrun] 2020-01-22 20:02:09,459 run executed, status=completed\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "task = mlrun.NewTask()\n", + "task.with_params(\n", + " n_samples=N_SAMPLES,\n", + " m_features=M_FEATURES,\n", + " weight=NEG_WEIGHT,\n", + " target_path=TARGET_DATA_PATH,\n", + " key=KEY)\n", + "\n", + "binarydatagen.run(task, handler='create_binary_classification')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# tests" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "fn = f\"simdata-{N_SAMPLES:0.0e}X{M_FEATURES}.parquet\".replace(\"+\", \"\")\n", + "df = pd.read_parquet(os.path.join(TARGET_DATA_PATH, fn), engine='pyarrow')\n", + "assert df.shape == (N_SAMPLES, M_FEATURES + 1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/tests/generate-some-classifiers.ipynb b/tests/generate-some-classifiers.ipynb deleted file mode 100644 index 4f6ef55f8..000000000 --- a/tests/generate-some-classifiers.ipynb +++ /dev/null @@ 
-1,311 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# generate multiple classifier models for testing\n", - "\n", - "Generate classifier models and pickle them under `/User/mlrun/models/-classifier.cpkl`.\n", - "\n", - "In principle, the pickle `load` method should give us a class instance that we can predict with. This may not work in practice, and that is the purpose of this notebook, to figure out which models work, and which don't. Several pickling packages will also be tested in case there are differences.\n", - "\n", - "### _extensions_\n", - "* cpkl for `cloudpickle`\n", - "* pkl for `pickle`\n", - "* dpkl for `dill`...\n", - "\n", - "**gbc model:** adapted from **[Probabilistic predictions with Gaussian process classification (GPC)](https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpc.html#sphx-glr-auto-examples-gaussian-process-plot-gpc-py)**\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "outputs": [], - "source": [ - "%matplotlib inline" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "from cloudpickle import dump as cdump, load as cload\n", - "from pickle import dump as pdump, load as pload\n", - "from dill import dump as ddump, load as dload" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "from matplotlib import pyplot as plt\n", - "\n", - "from sklearn.metrics import accuracy_score, log_loss\n", - "from sklearn.datasets import make_classification\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.gaussian_process import GaussianProcessClassifier\n", - "from sklearn.gaussian_process.kernels import RBF" - ] - }, - { - 
"cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "n_samples = 1000\n", - "train_size = 0.7\n", - "\n", - "X, y = make_classification(\n", - " n_samples=n_samples,\n", - " n_features=28, \n", - " random_state = 1)\n", - "\n", - "xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1-train_size)" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "outputs": [], - "source": [ - "kernel = 1.0*RBF(length_scale=1.0)\n", - "\n", - "clf = GaussianProcessClassifier(kernel=kernel, random_state=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "GaussianProcessClassifier(copy_X_train=True, kernel=1**2 * RBF(length_scale=1),\n", - " max_iter_predict=100, multi_class='one_vs_rest',\n", - " n_jobs=None, n_restarts_optimizer=0,\n", - " optimizer='fmin_l_bfgs_b', random_state=1,\n", - " warm_start=False)" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "clf.fit(xtrain, ytrain)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "outputs": [], - "source": [ - "accuracy = accuracy_score(ytest, clf.predict(xtest))\n", - "\n", - "logloss = log_loss(ytest, clf.predict_proba(xtest)[:, 1])" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "cdump(clf, open('/User/mlrun/models/gpc-classifier.cpkl', 'wb'))" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "clf_loaded = cload(open('/User/mlrun/models/gpc-classifier.cpkl', 'rb'))" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - 
"outputs": [ - { - "data": { - "text/plain": [ - "GaussianProcessClassifier(copy_X_train=True, kernel=1**2 * RBF(length_scale=1),\n", - " max_iter_predict=100, multi_class='one_vs_rest',\n", - " n_jobs=None, n_restarts_optimizer=0,\n", - " optimizer='fmin_l_bfgs_b', random_state=1,\n", - " warm_start=False)" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "clf_loaded" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "assert accuracy == accuracy_score(ytest, clf_loaded.predict(xtest))\n", - "assert logloss == log_loss(ytest, clf_loaded.predict_proba(xtest)[:, 1])" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.ensemble import AdaBoostClassifier" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [], - "source": [ - "clf = AdaBoostClassifier(n_estimators=100, random_state=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,\n", - " n_estimators=100, random_state=1)" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "clf.fit(xtrain, ytrain)" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [], - "source": [ - "cdump(clf, open('/User/mlrun/models/ada-classifier.cpkl', 'wb'))" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [], - "source": [ - "clf_loaded = cload(open('/User/mlrun/models/ada-classifier.cpkl', 'rb'))" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } - }, - "outputs": [], - "source": [ - "accuracy = 
accuracy_score(ytest, clf.predict(xtest))\n", - "\n", - "logloss = log_loss(ytest, clf.predict_proba(xtest)[:, 1])" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [], - "source": [ - "assert accuracy == accuracy_score(ytest, clf_loaded.predict(xtest))\n", - "assert logloss == log_loss(ytest, clf_loaded.predict_proba(xtest)[:, 1])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.8" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/tests/train_classifier.ipynb b/tests/train_classifier.ipynb new file mode 100644 index 000000000..7ea3de621 --- /dev/null +++ b/tests/train_classifier.ipynb @@ -0,0 +1,1144 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# deploying yaml on optimized python images" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## imports" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "import os\n", + "import numpy as np\n", + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "TARGET_CODE_BASE = '/User/repos/functions/' \n", + "N_SAMPLES = 10_000_000\n", + "M_FEATURES = 20\n", + "NEG_WEIGHT = 0.5\n", + "TARGET_DATA_PATH = '/User/mlrun/sklearn-classifier'\n", + "FILE_NAME = 'simdata.pqt'\n", + "KEY = 'simdata'\n", + "RNG = 1\n", + 
"SKLEARN_CLASSIFIER = 'lightgbm.sklearn.LGBMClassifier'\n", + "MODEL_KEY = 'model'\n", + "MODEL_NAME = MODEL_KEY\n", + "VERBOSE = True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## generate some binary classifiaction data" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "binarydatagen = mlrun.import_function(\n", + " os.path.join(TARGET_CODE_BASE+'datagen/classification', 'binary.yaml')\n", + ").apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-22 22:34:26,684 starting remote build, image: .mlrun/func-default-binary-latest\n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. 
\n", + "\u001b[36mINFO\u001b[0m[0046] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0066] RUN pip install mlrun \n", + "\u001b[36mINFO\u001b[0m[0066] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0066] args: [-c pip install mlrun] \n", + "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", + "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", + "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.7)\n", + "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", + "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", + "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", + "Requirement already satisfied: kfp>=0.1.29 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", + "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", + "Requirement already satisfied: nest-asyncio>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", + "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.1)\n", + "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", + "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", + 
"Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", + "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", + "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", + "Requirement already satisfied: botocore<1.15.0,>=1.14.7 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.7)\n", + "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", + "Requirement already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", + "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", + "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", + "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", + "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", + "Requirement already satisfied: certifi in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) 
(2019.9.11)\n", + "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", + "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", + "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", + "Requirement already satisfied: urllib3<1.25,>=1.15 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.24.1)\n", + "Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.8.0)\n", + "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", + "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", + "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", + "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", + "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", + "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", + "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in 
/opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", + "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", + "Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", + "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", + "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", + "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", + "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.7->boto3>=1.9->mlrun) (0.15.2)\n", + "Requirement already satisfied: setuptools>=21.0.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (41.0.1.post20191122)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", + "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already 
satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", + "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", + "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", + "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", + "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", + "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", + "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", + "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", + "Requirement already satisfied: nbformat in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", + "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from 
notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", + "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", + "Requirement already satisfied: jupyter-core>=4.6.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", + "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", + "Requirement already satisfied: traitlets>=4.2.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", + "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", + "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", + "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in 
/opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", + "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", + "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", + "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", + "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", + "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", + "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", + "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < 
\"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", + "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", + "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", + "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", + "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", + "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /opt/conda/lib/python3.7/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", + "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", + "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: more-itertools in /opt/conda/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", + "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + "Requirement already satisfied: 
googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", + "\u001b[36mINFO\u001b[0m[0068] Taking snapshot of full filesystem... \n" + ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "binarydatagen.deploy()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "task1 = mlrun.NewTask()\n", + "task1.with_params(\n", + " n_samples=N_SAMPLES,\n", + " m_features=M_FEATURES,\n", + " weight=NEG_WEIGHT,\n", + " target_path=TARGET_DATA_PATH,\n", + " filename=FILE_NAME,\n", + " key=KEY,\n", + " random_state=RNG)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-22 22:35:44,742 starting run create_binary_classification uid=9330db1734df40afabbaf41cd386930c -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-22 22:35:44,823 Job is running in the background, pod: create-binary-classification-gcwdk\n", + "[mlrun] 2020-01-22 22:36:38,971 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-22 22:36:39,218 run executed, status=completed\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", + " result = infer_dtype(pandas_collection)\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uid | iter | start | state | name | labels | inputs | parameters | results | artifacts
...86930c | 0 | Jan 22 22:36:01 | completed | binary | host=create-binary-classification-gcwdk, kind=job, owner=admin | | filename=simdata.pqt, key=simdata, m_features=20, n_samples=10000000, random_state=1, target_path=/User/mlrun/sklearn-classifier, weight=0.5 | | simdata
\n", + "
\n", + "
\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run 9330db1734df40afabbaf41cd386930c , !mlrun logs 9330db1734df40afabbaf41cd386930c \n", + "[mlrun] 2020-01-22 22:36:45,628 run executed, status=completed\n" + ] + } + ], + "source": [ + "tsk1 = binarydatagen.run(task1, handler='create_binary_classification')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "____\n", + "# tests" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.read_parquet(os.path.join(TARGET_DATA_PATH, FILE_NAME), engine='pyarrow')" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "assert tsk1.output(KEY) == os.path.join(TARGET_DATA_PATH, FILE_NAME), \"binary.yaml failed to create a file\"\n", + "assert df.shape== (N_SAMPLES, M_FEATURES+1), \"simulation data artifact is not of the correct dimensions\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_____\n", + "## train a classifier" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "trainfn = mlrun.import_function(\n", + " os.path.join(TARGET_CODE_BASE+'train/sklearn-classifier.yaml')\n", + ").apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-22 22:36:49,836 starting remote build, image: .mlrun/func-default-sklearn-classifier-latest\n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Resolved 
base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", + "\u001b[36mINFO\u001b[0m[0044] Taking snapshot of full filesystem... 
\n", + "\u001b[36mINFO\u001b[0m[0063] RUN pip install mlrun \n", + "\u001b[36mINFO\u001b[0m[0063] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0063] args: [-c pip install mlrun] \n", + "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", + "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", + "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", + "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", + "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.1)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", + "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", + "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", + "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.7)\n", + "Requirement already satisfied: nest-asyncio>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", + "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", + "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", + "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", + "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", + "Requirement already satisfied: kfp>=0.1.29 in 
/opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", + "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", + "Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2.8.0)\n", + "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", + "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", + "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", + "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", + "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", + 
"Requirement already satisfied: botocore<1.15.0,>=1.14.7 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.7)\n", + "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", + "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", + "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", + "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", + "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", + "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (1.24.1)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2019.9.11)\n", + "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", + "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", + "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", + "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", + "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", + "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", + "Requirement already satisfied: google-auth>=1.6.1 in 
/opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", + "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", + "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", + "Requirement already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", + "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", + "Requirement already satisfied: traitlets>=4.2.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", + "Requirement already satisfied: nbformat in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", + "Requirement already satisfied: jupyter-core>=4.6.1 in /opt/conda/lib/python3.7/site-packages (from 
notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", + "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", + "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", + "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", + "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", + "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", + "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", + "Requirement already satisfied: bleach in 
/opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", + "Requirement already satisfied: setuptools>=18.5 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (41.0.1.post20191122)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", + "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", + "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", + "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", + "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.7->boto3>=1.9->mlrun) (0.15.2)\n", + "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", + "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from 
google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", + "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", + "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", + "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", + "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", + "Requirement already satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", + "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /opt/conda/lib/python3.7/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from 
jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", + "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", + "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", + "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", + "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", + "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", + "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", + "Requirement already satisfied: pyasn1>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", + "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", + "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + 
"Requirement already satisfied: more-itertools in /opt/conda/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", + "\u001b[36mINFO\u001b[0m[0065] Taking snapshot of full filesystem... \n" + ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainfn.deploy()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "task2 = mlrun.NewTask()\n", + "task2.with_params(\n", + " src_file=tsk1.output(KEY),\n", + " SKClassifier=SKLEARN_CLASSIFIER,\n", + " name=MODEL_NAME,\n", + " key=MODEL_KEY,\n", + " verbose=VERBOSE,\n", + " random_state=RNG,\n", + " callbacks = [])" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-22 22:38:03,841 starting run train uid=fcb2e3cad46c42648f8e08b5a834dc49 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-22 22:38:03,933 Job is running in the background, pod: train-p29xk\n", + "[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the \"boost_from_average\" parameter in \"binary\" objective is true.\n", + "This may cause significantly different results comparing to the previous versions of LightGBM.\n", + "Try to set boost_from_average=false, if your old models produce bad results\n", + "[LightGBM] [Info] Number of positive: 3375747, number of negative: 3374252\n", + "[LightGBM] [Info] Total Bins 5120\n", + "[LightGBM] [Info] Number of data: 6749999, number of used features: 20\n", + "[LightGBM] [Warning] Cannot change bin_construct_sample_cnt after constructed Dataset handle.\n", + "[LightGBM] [Info] 
[binary:BoostFromScore]: pavg=0.500111 -> initscore=0.000443\n", + "[LightGBM] [Info] Start training from score 0.000443\n", + "[1]\ttrain's binary_logloss: 0.60911\tvalid's binary_logloss: 0.609096\n", + "[2]\ttrain's binary_logloss: 0.540195\tvalid's binary_logloss: 0.540177\n", + "[3]\ttrain's binary_logloss: 0.482774\tvalid's binary_logloss: 0.482749\n", + "[4]\ttrain's binary_logloss: 0.434212\tvalid's binary_logloss: 0.434186\n", + "[5]\ttrain's binary_logloss: 0.392909\tvalid's binary_logloss: 0.392881\n", + "[6]\ttrain's binary_logloss: 0.357139\tvalid's binary_logloss: 0.357108\n", + "[7]\ttrain's binary_logloss: 0.326295\tvalid's binary_logloss: 0.326261\n", + "[8]\ttrain's binary_logloss: 0.299304\tvalid's binary_logloss: 0.299273\n", + "[9]\ttrain's binary_logloss: 0.275675\tvalid's binary_logloss: 0.275645\n", + "[10]\ttrain's binary_logloss: 0.254966\tvalid's binary_logloss: 0.254934\n", + "[11]\ttrain's binary_logloss: 0.236862\tvalid's binary_logloss: 0.236828\n", + "[12]\ttrain's binary_logloss: 0.220836\tvalid's binary_logloss: 0.220798\n", + "[13]\ttrain's binary_logloss: 0.206668\tvalid's binary_logloss: 0.206627\n", + "[14]\ttrain's binary_logloss: 0.194208\tvalid's binary_logloss: 0.19416\n", + "[15]\ttrain's binary_logloss: 0.183073\tvalid's binary_logloss: 0.183023\n", + "[16]\ttrain's binary_logloss: 0.17317\tvalid's binary_logloss: 0.173116\n", + "[17]\ttrain's binary_logloss: 0.164385\tvalid's binary_logloss: 0.164327\n", + "[18]\ttrain's binary_logloss: 0.156572\tvalid's binary_logloss: 0.156507\n", + "[19]\ttrain's binary_logloss: 0.149617\tvalid's binary_logloss: 0.14955\n", + "[20]\ttrain's binary_logloss: 0.14341\tvalid's binary_logloss: 0.143343\n", + "[21]\ttrain's binary_logloss: 0.137916\tvalid's binary_logloss: 0.137848\n", + "[22]\ttrain's binary_logloss: 0.132986\tvalid's binary_logloss: 0.13292\n", + "[23]\ttrain's binary_logloss: 0.128592\tvalid's binary_logloss: 0.128526\n", + "[24]\ttrain's binary_logloss: 
0.124697\tvalid's binary_logloss: 0.124637\n", + "[25]\ttrain's binary_logloss: 0.121201\tvalid's binary_logloss: 0.121145\n", + "[26]\ttrain's binary_logloss: 0.118092\tvalid's binary_logloss: 0.118033\n", + "[27]\ttrain's binary_logloss: 0.115328\tvalid's binary_logloss: 0.115271\n", + "[28]\ttrain's binary_logloss: 0.112852\tvalid's binary_logloss: 0.112799\n", + "[29]\ttrain's binary_logloss: 0.110664\tvalid's binary_logloss: 0.110613\n", + "[30]\ttrain's binary_logloss: 0.108725\tvalid's binary_logloss: 0.108676\n", + "[31]\ttrain's binary_logloss: 0.107009\tvalid's binary_logloss: 0.106961\n", + "[32]\ttrain's binary_logloss: 0.105486\tvalid's binary_logloss: 0.105438\n", + "[33]\ttrain's binary_logloss: 0.104131\tvalid's binary_logloss: 0.104087\n", + "[34]\ttrain's binary_logloss: 0.102953\tvalid's binary_logloss: 0.102909\n", + "[35]\ttrain's binary_logloss: 0.101899\tvalid's binary_logloss: 0.101858\n", + "[36]\ttrain's binary_logloss: 0.100973\tvalid's binary_logloss: 0.100934\n", + "[37]\ttrain's binary_logloss: 0.100167\tvalid's binary_logloss: 0.10013\n", + "[38]\ttrain's binary_logloss: 0.0994484\tvalid's binary_logloss: 0.0994155\n", + "[39]\ttrain's binary_logloss: 0.0987949\tvalid's binary_logloss: 0.0987651\n", + "[40]\ttrain's binary_logloss: 0.0982119\tvalid's binary_logloss: 0.0981823\n", + "[41]\ttrain's binary_logloss: 0.0976529\tvalid's binary_logloss: 0.0976525\n", + "[42]\ttrain's binary_logloss: 0.0972022\tvalid's binary_logloss: 0.0972023\n", + "[43]\ttrain's binary_logloss: 0.0968125\tvalid's binary_logloss: 0.0968164\n", + "[44]\ttrain's binary_logloss: 0.0964793\tvalid's binary_logloss: 0.0964842\n", + "[45]\ttrain's binary_logloss: 0.0961801\tvalid's binary_logloss: 0.0961877\n", + "[46]\ttrain's binary_logloss: 0.0959085\tvalid's binary_logloss: 0.0959191\n", + "[47]\ttrain's binary_logloss: 0.0956876\tvalid's binary_logloss: 0.0956995\n", + "[48]\ttrain's binary_logloss: 0.0954728\tvalid's binary_logloss: 0.095488\n", + 
"[49]\ttrain's binary_logloss: 0.0952993\tvalid's binary_logloss: 0.0953199\n", + "[50]\ttrain's binary_logloss: 0.095152\tvalid's binary_logloss: 0.0951747\n", + "[51]\ttrain's binary_logloss: 0.0950115\tvalid's binary_logloss: 0.0950384\n", + "[52]\ttrain's binary_logloss: 0.0948914\tvalid's binary_logloss: 0.0949213\n", + "[53]\ttrain's binary_logloss: 0.0947885\tvalid's binary_logloss: 0.0948198\n", + "[54]\ttrain's binary_logloss: 0.0946978\tvalid's binary_logloss: 0.0947301\n", + "[55]\ttrain's binary_logloss: 0.0946152\tvalid's binary_logloss: 0.0946518\n", + "[56]\ttrain's binary_logloss: 0.0945364\tvalid's binary_logloss: 0.0945734\n", + "[57]\ttrain's binary_logloss: 0.0944624\tvalid's binary_logloss: 0.0945029\n", + "[58]\ttrain's binary_logloss: 0.0944047\tvalid's binary_logloss: 0.0944471\n", + "[59]\ttrain's binary_logloss: 0.0943546\tvalid's binary_logloss: 0.0944008\n", + "[60]\ttrain's binary_logloss: 0.094306\tvalid's binary_logloss: 0.094354\n", + "[61]\ttrain's binary_logloss: 0.0942551\tvalid's binary_logloss: 0.0943054\n", + "[62]\ttrain's binary_logloss: 0.0942173\tvalid's binary_logloss: 0.0942695\n", + "[63]\ttrain's binary_logloss: 0.0941833\tvalid's binary_logloss: 0.0942386\n", + "[64]\ttrain's binary_logloss: 0.0941437\tvalid's binary_logloss: 0.0942044\n", + "[65]\ttrain's binary_logloss: 0.0941104\tvalid's binary_logloss: 0.0941727\n", + "[66]\ttrain's binary_logloss: 0.0940892\tvalid's binary_logloss: 0.0941538\n", + "[67]\ttrain's binary_logloss: 0.0940532\tvalid's binary_logloss: 0.0941245\n", + "[68]\ttrain's binary_logloss: 0.0940219\tvalid's binary_logloss: 0.0940966\n", + "[69]\ttrain's binary_logloss: 0.0939963\tvalid's binary_logloss: 0.0940732\n", + "[70]\ttrain's binary_logloss: 0.093974\tvalid's binary_logloss: 0.0940533\n", + "[71]\ttrain's binary_logloss: 0.093948\tvalid's binary_logloss: 0.0940288\n", + "[72]\ttrain's binary_logloss: 0.0939228\tvalid's binary_logloss: 0.0940075\n", + "[73]\ttrain's binary_logloss: 
0.0939065\tvalid's binary_logloss: 0.0939924\n", + "[74]\ttrain's binary_logloss: 0.0938874\tvalid's binary_logloss: 0.093976\n", + "[75]\ttrain's binary_logloss: 0.093874\tvalid's binary_logloss: 0.093965\n", + "[76]\ttrain's binary_logloss: 0.0938594\tvalid's binary_logloss: 0.0939508\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", + " labels = getattr(columns, 'labels', None) or [\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", + " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", + " labels, = index.labels\n", + "[77]\ttrain's binary_logloss: 0.0938408\tvalid's binary_logloss: 0.0939326\n", + "[78]\ttrain's binary_logloss: 0.093827\tvalid's binary_logloss: 0.0939201\n", + "[79]\ttrain's binary_logloss: 0.0938091\tvalid's binary_logloss: 0.0939058\n", + "[80]\ttrain's binary_logloss: 0.0937968\tvalid's binary_logloss: 0.0938946\n", + "[81]\ttrain's binary_logloss: 0.0937822\tvalid's binary_logloss: 0.0938844\n", + "[82]\ttrain's binary_logloss: 0.0937715\tvalid's binary_logloss: 0.0938753\n", + "[83]\ttrain's binary_logloss: 0.093761\tvalid's binary_logloss: 0.0938765\n", + "[84]\ttrain's binary_logloss: 0.0937495\tvalid's binary_logloss: 0.0938739\n", + "[85]\ttrain's binary_logloss: 0.0937354\tvalid's binary_logloss: 0.0938619\n", + "[86]\ttrain's binary_logloss: 0.0937245\tvalid's binary_logloss: 0.0938619\n", + "[87]\ttrain's binary_logloss: 0.0937136\tvalid's binary_logloss: 0.0938643\n", + "[88]\ttrain's binary_logloss: 0.0937044\tvalid's binary_logloss: 0.0938647\n", + "[89]\ttrain's binary_logloss: 0.0936888\tvalid's binary_logloss: 0.0938547\n", + "[90]\ttrain's 
binary_logloss: 0.0936775\tvalid's binary_logloss: 0.0938443\n", + "[91]\ttrain's binary_logloss: 0.0936707\tvalid's binary_logloss: 0.0938451\n", + "[92]\ttrain's binary_logloss: 0.0936562\tvalid's binary_logloss: 0.0938326\n", + "[93]\ttrain's binary_logloss: 0.0936513\tvalid's binary_logloss: 0.0938331\n", + "[94]\ttrain's binary_logloss: 0.0936458\tvalid's binary_logloss: 0.093835\n", + "[95]\ttrain's binary_logloss: 0.0936355\tvalid's binary_logloss: 0.0938356\n", + "[96]\ttrain's binary_logloss: 0.0936261\tvalid's binary_logloss: 0.0938356\n", + "[97]\ttrain's binary_logloss: 0.0936146\tvalid's binary_logloss: 0.0938254\n", + "[98]\ttrain's binary_logloss: 0.0936085\tvalid's binary_logloss: 0.0938261\n", + "[99]\ttrain's binary_logloss: 0.0936021\tvalid's binary_logloss: 0.0938265\n", + "[100]\ttrain's binary_logloss: 0.0935964\tvalid's binary_logloss: 0.0938274\n", + "[mlrun] 2020-01-22 22:39:24,741 log artifact model at model, size: None, db: Y\n", + "[mlrun] 2020-01-22 22:39:24,884 log artifact xtest at xtest.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-22 22:39:25,012 log artifact ytest at ytest.pkl, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-22 22:39:25,042 run executed, status=completed\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uid | iter | start | state | name | labels | inputs | parameters | results | artifacts
...34dc49 | 0 | Jan 22 22:38:10 | completed | sklearn-classifier | host=train-p29xk, kind=job, owner=admin | | SKClassifier=lightgbm.sklearn.LGBMClassifier, callbacks=[], key=model, name=model, random_state=1, src_file=/User/mlrun/sklearn-classifier/simdata.pqt, verbose=True | train_accuracy=0.9671342173532174 | model, xtest, ytest
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run fcb2e3cad46c42648f8e08b5a834dc49 , !mlrun logs fcb2e3cad46c42648f8e08b5a834dc49 \n", + "[mlrun] 2020-01-22 22:39:32,871 run executed, status=completed\n" + ] + } + ], + "source": [ + "tsk2 = trainfn.run(task2, handler='train')" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'train_accuracy': 0.9671342173532174,\n", + " 'model': 'model',\n", + " 'xtest': 'xtest.pkl',\n", + " 'ytest': 'ytest.pkl'}" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tsk2.outputs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## evaluation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "run plots here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## model optimization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "onnx here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/train/sklearn-classifier.py b/train/sklearn-classifier.py new file mode 100644 index 000000000..c103af7f4 --- /dev/null +++ b/train/sklearn-classifier.py @@ -0,0 +1,104 @@ +from mlrun.execution import MLClientCtx +from mlrun.datastore 
import DataItem +import pandas as pd +import lightgbm as lgb +from typing import Optional, Union +import os +from sklearn.model_selection import train_test_split +import importlib +from cloudpickle import dump + +def train( + context: Optional[MLClientCtx] = None, + src_file: Union[DataItem, str] = '', + SKClassifier: str = '', + callbacks = [], + test_size: float = 0.1, + train_val_split: float = 0.75, + sample: int = -1, + target_path: str = '', + name: str = '', + key: str = '', + verbose: bool = False, + random_state = 1 +) -> None: + """Train and save a scikit-learn model. + + The data source can either be a string file name or an artifact item. + + The header is either a list of column names, an artifact header item, or None. + + + :param context: the function context + :param src_file: ('raw') name of raw data file + :param sample: (-1) -1 selects all rows; n >= 1 selects the first n rows + from the top of the file; n < -1 selects a random sample of + |n| rows from the entire file + :param header: (None) header artifact or list of column names. + :param SKClassifier: string module and classname of classifier + :param callbacks: ([]) callbacks passed to the classifier's fit method + :param test_size: (0.1) test set size + :param train_val_split: (0.75) Once the test set has been removed the + training set gets this proportion. + :param target_path: folder location of files + :param name: destination name for model file + :param key: key for model artifact + :param verbose: (False) show metrics for training/validation steps. + :param random_state: (1) sklearn rng seed + + example callbacks: + ``` + from lightgbm import record_evaluation + eval_results = dict() + callbacks = [record_evaluation(eval_results)] + ``` + """ + # load data + if isinstance(src_file, DataItem): + src_file = str(src_file) + srcfilepath = os.path.join(target_path, src_file) + + # optionally use only a sample, intended for debugging + if (sample == -1) or (sample >= 1): + # get all rows, or contiguous sample starting at row 1.
+ raw = pd.read_parquet(srcfilepath, engine='pyarrow') + labels = raw.pop('labels') + raw = raw if sample == -1 else raw.iloc[:sample, :] + labels = labels if sample == -1 else labels.iloc[:sample] + else: + # grab a random sample + raw = pd.read_parquet(srcfilepath, engine='pyarrow').sample(sample*-1) + labels = raw.pop('labels') + + # double split to generate 3 data sets: train, validation and test + x, xtest, y, ytest = train_test_split(raw, labels, train_size=1-test_size, + random_state=random_state) + + xtrain, xvalid, ytrain, yvalid = train_test_split(x, y, + train_size=train_val_split, + random_state=random_state) + + # create classifier class from string and instantiate + splits = SKClassifier.split(".") + clfclass = getattr(importlib.import_module(".".join(splits[:-1])), splits[-1]) + clf = clfclass(random_state=random_state, verbose=int(verbose == True)) + + clf.fit(xtrain, + ytrain, + eval_set=[(xvalid, yvalid), (xtrain, ytrain)], + eval_names=['valid', 'train'], + callbacks=callbacks, + verbose=verbose) + + context.log_result("train_accuracy", float(clf.score(xtrain, ytrain))) + + # save model + filepath = os.path.join(target_path, name) + dump(clf, open(filepath, 'wb')) + context.log_artifact(key, target_path=filepath) #, labels=exp_labels) + # save test data (features and labels go to separate files) + for t, data in [('x', xtest), ('y', ytest)]: + fname = t + 'test.pkl' + filepath = os.path.join(target_path, fname) + dump(data, open(filepath, 'wb')) + context.log_artifact(t+'test', target_path=filepath) \ No newline at end of file diff --git a/train/sklearn-classifier.yaml b/train/sklearn-classifier.yaml new file mode 100644 index 000000000..77bfd49b7 --- /dev/null +++ b/train/sklearn-classifier.yaml @@ -0,0 +1,9 @@ +kind: job +metadata: + name: sklearn-classifier +spec: + build: + functionSourceCode: 
ZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gbWxydW4uZGF0YXN0b3JlIGltcG9ydCBEYXRhSXRlbQppbXBvcnQgcGFuZGFzIGFzIHBkCmltcG9ydCBsaWdodGdibSBhcyBsZ2IKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbgppbXBvcnQgb3MKZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdAppbXBvcnQgaW1wb3J0bGliCmZyb20gY2xvdWRwaWNrbGUgaW1wb3J0IGR1bXAKCmRlZiB0cmFpbigKICAgIGNvbnRleHQ6IE9wdGlvbmFsW01MQ2xpZW50Q3R4XSA9IE5vbmUsCiAgICBzcmNfZmlsZTogVW5pb25bRGF0YUl0ZW0sIHN0cl0gPSAnJywKICAgIFNLQ2xhc3NpZmllcjogc3RyICA9ICcnLAogICAgY2FsbGJhY2tzICA9IFtdLAogICAgdGVzdF9zaXplOiBmbG9hdCA9IDAuMSwKICAgIHRyYWluX3ZhbF9zcGxpdDogZmxvYXQgPSAwLjc1LAogICAgc2FtcGxlOiBpbnQgPSAtMSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG5hbWU6IHN0ciA9ICcnLAogICAga2V5OiBzdHIgPSAnJywKICAgIHZlcmJvc2U6IGJvb2wgPSBGYWxzZSwKICAgIHJhbmRvbV9zdGF0ZSA9IDEKKSAtPiBOb25lOgogICAgIiIiVHJhaW4gYW5kIHNhdmUgYW4gU2Npa2l0bGVhcm4gbW9kZWwuCiAgICAKICAgIFRoZSBkYXRhIHNvdXJjZSBjYW4gZWl0aGVyIGJlIGEgc3RyaW5nIGZpbGUgbmFtZSBvciBhbiBhcnRpZmFjdCBpdGVtLgogICAgCiAgICBUaGUgaGVhZGVyIGlzIGVpdGggYSBsaXN0IG9mIGNvbHVtbiBuYW1lcywgYW4gYXJ0aWZhY3QgaGVhZGVyIGl0ZW0sIG9yIE5vbmUuCiAgICAKICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICAgICAgdGhlIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBzcmNfZmlsZTogICAgICAgICgncmF3JykgbmFtZSBvZiByYXcgZGF0YSBmaWxlCiAgICA6cGFyYW0gc2FtcGxlOiAgICAgICAgICAoLTEpLiBTZWxlY3RzIHRoZSBmaXJzdCBuIHJvd3MsIG9yIHNlbGVjdCBhIHNhbXBsZSBzdGFydGluZwogICAgICAgICAgICAgICAgICAgICAgICAgICAgZnJvbSB0aGUgZmlyc3QuIElmIG5lZ2F0aXZlIDwtMSwgc2VsZWN0IGEgcmFuZG9tIHNhbXBsZSBmcm9tIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgdGhlIGVudGlyZSBmaWxlCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgICAgICAoTm9uZSkgaGVhZGVyIGFydGlmYWN0IG9yIGxpc3Qgb2YgY29sdW1uIG5hbWVzLgogICAgOnBhcmFtIFNLQ2xhc3NpZmllcjogICAgc3RyaW5nIG1vZHVsZSBhbmQgY2xhc3NuYW1lIG9mIGNsYXNzaWZpZXIKICAgIDpwYXJhbSBjYWxsYmFja3MKICAgIDpwYXJhbSB0ZXN0X3NpemU6ICAgICAgICgwLjEpIHRlc3Qgc2V0IHNpemUKICAgIDpwYXJhbSB0cmFpbl92YWxfc3BsaXQ6ICgwLjc1KSBPbmNlIHRoZSB0ZXN0IHNldCBoYXMgYmVlbiByZW1vdmVkIHRoZSAKICAgICAgICAgICAgICAgICAgICAgICAgICAgIHRyYWluaW5nIHNldCBn
ZXRzIHRoaXMgcHJvcG9ydGlvbi4KICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGZvbGRlciBsb2NhdGlvbiBvZiBmaWxlcwogICAgOnBhcmFtIG5hbWU6ICAgICAgICAgICAgZGVzdGluYXRpb24gbmFtZSBmb3IgbW9kZWwgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgICAga2V5IGZvciBtb2RlbCBhcnRpZmFjdAogICAgOnBhcmFtIHZlcmJvc2UgOiAgICAgICAgKEZhbHNlKSBzaG93IG1ldHJpY3MgZm9yIHRyYWluaW5nL3ZhbGlkYXRpb24gc3RlcHMuCiAgICA6cGFyYW0gcmFuZG9tX3N0YXRlOiAgICAoMSkgc2tsZWFybiBybmcgc2VlZAogICAgCiAgICBleGFtcGxlIGNhbGxiYWNrczoKICAgIGBgYAogICAgZnJvbSBsaWdodGdibSBpbXBvcnQgcmVjb3JkX2V2YWx1YXRpb24KICAgIGV2YWxfcmVzdWx0cyA9IGRpY3QoKQogICAgY2FsbGJhY2tzID0gW3JlY29yZF9ldmFsdWF0aW9uKGV2YWxfcmVzdWx0cyldCiAgICBgYGAKICAgICIiIgogICAgIyBsb2FkIGRhdGEKICAgIGlmIGlzaW5zdGFuY2Uoc3JjX2ZpbGUsIERhdGFJdGVtKToKICAgICAgICBzcmNfZmlsZSA9IHN0cihzcmNfZmlsZSkKICAgIHNyY2ZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBzcmNfZmlsZSkKCiAgICAjIHNhdmUgb25seSBhIHNhbXBsZSwgaW50ZW5kZWQgZm9yIGRlYnVnZ2luZwogICAgaWYgKHNhbXBsZSA9PSAtMSkgb3IgKHNhbXBsZSA+PSAxKToKICAgICAgICAjIGdldCBhbGwgcm93cywgb3IgY29udGlndW91cyBzYW1wbGUgc3RhcnRpbmcgYXQgcm93IDEuCiAgICAgICAgcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKQogICAgICAgIGxhYmVscyA9IHJhdy5wb3AoJ2xhYmVscycpCiAgICAgICAgcmF3ID0gcmF3Lmlsb2NbOnNhbXBsZSwgOl0KICAgICAgICBsYWJlbHMgPSBsYWJlbHMuaWxvY1s6c2FtcGxlXQogICAgZWxzZToKICAgICAgICAjIGdyYWIgYSByYW5kb20gc2FtcGxlCiAgICAgICAgcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKS5zYW1wbGUoc2FtcGxlKi0xKQogICAgICAgIGxhYmVscyA9IHJhdy5wb3AoJ2xhYmVscycpCiAgICAKICAgICMgZG91YmxlIHNwbGl0IHRwIGdlbmVyYXRlIDMgZGF0YSBzZXRzOiB0cmFpbiwgdmFsaWRhdGlvbiBhbmQgdGVzdAogICAgeCwgeHRlc3QsIHksIHl0ZXN0ID0gdHJhaW5fdGVzdF9zcGxpdChyYXcsIGxhYmVscywgdHJhaW5fc2l6ZT0xLXRlc3Rfc2l6ZSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHJhbmRvbV9zdGF0ZT1yYW5kb21fc3RhdGUpCiAgIAogICAgeHRyYWluLCB4dmFsaWQsIHl0cmFpbiwgeXZhbGlkID0gdHJhaW5fdGVzdF9zcGxpdCh4LCB5LCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgdHJhaW5fc2l6ZT10cmFpbl92YWxfc3BsaXQsIAogICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKSAgICAgICAgCiAgIAogICAgIyBjcmVhdGUgY2xhc3NpZmllciBjbGFzcyBmcm9tIHN0cmluZyBhbmQgaW5zdGFudGlhdGUKICAgIHNwbGl0cyA9IFNLQ2xhc3NpZmllci5zcGxpdCgiLiIpCiAgICBjbGZjbGFzcyA9IGdldGF0dHIoaW1wb3J0bGliLmltcG9ydF9tb2R1bGUoIi4iLmpvaW4oc3BsaXRzWzotMV0pKSwgc3BsaXRzWy0xXSkKICAgIGNsZiA9IGNsZmNsYXNzKHJhbmRvbV9zdGF0ZT1yYW5kb21fc3RhdGUsIHZlcmJvc2U9aW50KHZlcmJvc2UgPT0gVHJ1ZSkpCgogICAgY2xmLmZpdCh4dHJhaW4sIAogICAgICAgICAgICB5dHJhaW4sCiAgICAgICAgICAgIGV2YWxfc2V0PVsoeHZhbGlkLCB5dmFsaWQpLCAoeHRyYWluLCB5dHJhaW4pXSwKICAgICAgICAgICAgZXZhbF9uYW1lcz1bJ3ZhbGlkJywgJ3RyYWluJ10sCiAgICAgICAgICAgIGNhbGxiYWNrcz1jYWxsYmFja3MsCiAgICAgICAgICAgIHZlcmJvc2U9dmVyYm9zZSkKICAgICAKICAgIGNvbnRleHQubG9nX3Jlc3VsdCgidHJhaW5fYWNjdXJhY3kiLCBmbG9hdChjbGYuc2NvcmUoeHRyYWluLCB5dHJhaW4pKSkKCiAgICAjIHNhdmUgbW9kZWwKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZHVtcChjbGYsIG9wZW4oZmlsZXBhdGgsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1maWxlcGF0aCkgIywgbGFiZWxzPWV4cF9sYWJlbHMpCiAgICAjIHNhdmUgdGVzdCBkYXRhCiAgICBmb3IgdCBpbiBbJ3gnLCAneSddOgogICAgICAgIGZuYW1lID0gdCArICd0ZXN0LnBrbCcKICAgICAgICBmaWxlcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgZm5hbWUpCiAgICAgICAgZHVtcCh4dGVzdCwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICAgICAgY29udGV4dC5sb2dfYXJ0aWZhY3QodCsndGVzdCcsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQ== + base_image: yjbds/mlrun-ds:latest + commands: [] + code_origin: https://github.com/yjb-ds/functions.git#3f7e0c78313c0f8da3f2ae8535b625f06f5c3ee4:/User/repos/functions/train/sklearn-classifier.py From d66ee336384c2b11ef9d6e6b5165672459bde1cf Mon Sep 17 00:00:00 2001 From: yasha Date: Wed, 22 Jan 2020 22:51:17 +0000 Subject: [PATCH 14/32] minor fixes, debugged & running --- fileutils/open_archive/function.yaml | 3 +- tests/train_classifier.ipynb | 513 ++++----------------------- 2 files changed, 64 insertions(+), 452 deletions(-) diff --git a/fileutils/open_archive/function.yaml b/fileutils/open_archive/function.yaml index 
b751e79a6..0fb7276fa 100644 --- a/fileutils/open_archive/function.yaml +++ b/fileutils/open_archive/function.yaml @@ -2,7 +2,8 @@ kind: job metadata: name: open-archive spec: - image: yjbds/mlrun-files:latest description: 'retrieve archive and extract all' build: functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA5OjQ3CgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgdXJsbGliLnJlcXVlc3QKCmltcG9ydCBvcwppbXBvcnQgemlwZmlsZQppbXBvcnQgdXJsbGliCmltcG9ydCB0YXJmaWxlCmltcG9ydCBqc29uCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKCmRlZiBvcGVuX2FyY2hpdmUoY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgICAgICAgICAgICAgIHRhcmdldF9kaXI6IHN0ciA9ICdjb250ZW50JywKICAgICAgICAgICAgICAgICBhcmNoaXZlX3VybDogc3RyID0gJycpOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgaW50byBhIHRhcmdldCBkaXJlY3RvcnkKICAgIAogICAgQ3VycmVudGx5IHN1cHBvcnRzIHppcCBhbmQgdGFyLmd6CiAgICAiIiIKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9kaXIsIGV4aXN0X29rPVRydWUpCiAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdWZXJpZmllZCBkaXJlY3RvcmllcycpCiAgICBwcmludChhcmNoaXZlX3VybCkKICAgIHNwbGl0cyA9IGFyY2hpdmVfdXJsLnNwbGl0KCcuJykKICAgIHByaW50KHNwbGl0cykKICAgIGlmIChzcGxpdHNbLTFdID09ICdneicpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ29wZW5pbmcgdGFyX2d6JykKICAgICAgICByZWYgPSB0YXJmaWxlLm9wZW4oZmlsZW9iaj11cmxsaWIucmVxdWVzdC51cmxvcGVuKGFyY2hpdmVfdXJsKSwgbW9kZT0ncnxneicpCiAgICBlbGlmIHNwbGl0c1stMV0gPT0gJ3ppcCc6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygnb3BlbmluZyB6aXAnKQogICAgICAgIHJlZiA9IHppcGZpbGUuWmlwRmlsZShhcmNoaXZlX3VybCwgJ3InKQoKICAgIHJlZi5leHRyYWN0YWxsKHRhcmdldF9kaXIpCiAgICByZWYuY2xvc2UoKQoKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCdjb250ZW50JywgdGFyZ2V0X3BhdGg9dGFyZ2V0X2RpcikKCg== + build_image: yjbds/mlrun-files:latest + commands: [] diff --git a/tests/train_classifier.ipynb b/tests/train_classifier.ipynb index 7ea3de621..d766912cd 100644 --- a/tests/train_classifier.ipynb +++ b/tests/train_classifier.ipynb @@ -4,7 +4,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# 
deploying yaml on optimized python images" + "# deploying yaml on optimized python images\n", + "\n", + "* one node\n", + "* lightgbm\n", + "* 10 million samples / 20 features\n", + "* code stored as YAML in GitHub\n", + "* precompiled images using CPU-optimized Python libraries" ] }, { @@ -35,12 +41,12 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "TARGET_CODE_BASE = '/User/repos/functions/' \n", - "N_SAMPLES = 10_000_000\n", + "N_SAMPLES = 10_000_000 # size of HIGGS data\n", "M_FEATURES = 20\n", "NEG_WEIGHT = 0.5\n", "TARGET_DATA_PATH = '/User/mlrun/sklearn-classifier'\n", @@ -50,7 +56,7 @@ "SKLEARN_CLASSIFIER = 'lightgbm.sklearn.LGBMClassifier'\n", "MODEL_KEY = 'model'\n", "MODEL_NAME = MODEL_KEY\n", - "VERBOSE = True" + "VERBOSE = False" ] }, { @@ -62,7 +68,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -73,170 +79,25 @@ }, { "cell_type": "code", - "execution_count": 9, - "metadata": { - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-22 22:34:26,684 starting remote build, image: .mlrun/func-default-binary-latest\n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", - "\u001b[36mINFO\u001b[0m[0000] 
Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", - "\u001b[36mINFO\u001b[0m[0046] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0066] RUN pip install mlrun \n", - "\u001b[36mINFO\u001b[0m[0066] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0066] args: [-c pip install mlrun] \n", - "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", - "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", - "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.7)\n", - "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", - "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", - "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", - "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", - "Requirement already satisfied: kfp>=0.1.29 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", - "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", - "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", - "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", - "Requirement already satisfied: 
nest-asyncio>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", - "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.1)\n", - "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", - "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", - "Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", - "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", - "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", - "Requirement already satisfied: botocore<1.15.0,>=1.14.7 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.7)\n", - "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", - "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", - "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", - "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", - "Requirement already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", - "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", - "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", - "Requirement already satisfied: six>=1.10 in 
/opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", - "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", - "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", - "Requirement already satisfied: certifi in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2019.9.11)\n", - "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", - "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", - "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", - "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", - "Requirement already satisfied: urllib3<1.25,>=1.15 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.24.1)\n", - "Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.8.0)\n", - "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", - "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", - "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", - "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", - "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", - "Requirement already 
satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", - "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", - "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", - "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", - "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", - "Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", - "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", - "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", - "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", - "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", - "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", - "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.7->boto3>=1.9->mlrun) (0.15.2)\n", - "Requirement already satisfied: setuptools>=21.0.0 in /opt/conda/lib/python3.7/site-packages (from 
kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (41.0.1.post20191122)\n", - "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", - "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", - "Requirement already satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", - "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", - "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", - "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", - "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", - "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", - "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", - "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", - "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", - "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from 
google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", - "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", - "Requirement already satisfied: nbformat in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", - "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", - "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", - "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", - "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", - "Requirement already satisfied: jupyter-core>=4.6.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", - "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", - "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", - "Requirement already satisfied: traitlets>=4.2.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", - "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", - "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in 
/opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", - "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", - "Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", - "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", - "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", - "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", - "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", - "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", - "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", - "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", - "Requirement already satisfied: 
MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", - "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", - "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", - "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", - "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", - "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", - "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", - "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /opt/conda/lib/python3.7/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", - "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", - "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", - "Requirement already satisfied: more-itertools in 
/opt/conda/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", - "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", - "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", - "\u001b[36mINFO\u001b[0m[0068] Taking snapshot of full filesystem... \n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": 4, + "metadata": {}, + "outputs": [], "source": [ - "binarydatagen.deploy()" + "# binarydatagen.deploy()" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 10, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -255,18 +116,18 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-22 22:35:44,742 starting run create_binary_classification uid=9330db1734df40afabbaf41cd386930c -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-22 22:35:44,823 Job is running in the background, pod: create-binary-classification-gcwdk\n", - "[mlrun] 2020-01-22 22:36:38,971 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-22 22:48:03,974 starting run create_binary_classification uid=ad9df1228d034fd5a11d732502f64aa2 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-22 22:48:04,072 Job is running in the 
background, pod: create-binary-classification-bnqlx\n", + "[mlrun] 2020-01-22 22:48:53,079 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-22 22:36:39,218 run executed, status=completed\n", + "[mlrun] 2020-01-22 22:48:53,341 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", "final state: succeeded\n" @@ -441,12 +302,12 @@ " \n", " \n", " \n", - "
...86930c
\n", + "
...f64aa2
\n", " 0\n", - " Jan 22 22:36:01\n", + " Jan 22 22:48:16\n", " completed\n", " binary\n", - "
host=create-binary-classification-gcwdk
kind=job
owner=admin
\n", + "
host=create-binary-classification-bnqlx
kind=job
owner=admin
\n", " \n", "
filename=simdata.pqt
key=simdata
m_features=20
n_samples=10000000
random_state=1
target_path=/User/mlrun/sklearn-classifier
weight=0.5
\n", " \n", @@ -455,12 +316,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -476,8 +337,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 9330db1734df40afabbaf41cd386930c , !mlrun logs 9330db1734df40afabbaf41cd386930c \n", - "[mlrun] 2020-01-22 22:36:45,628 run executed, status=completed\n" + "!mlrun get run ad9df1228d034fd5a11d732502f64aa2 , !mlrun logs ad9df1228d034fd5a11d732502f64aa2 \n", + "[mlrun] 2020-01-22 22:48:56,442 run executed, status=completed\n" ] } ], @@ -495,7 +356,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -505,7 +366,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -523,7 +384,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -534,170 +395,25 @@ }, { "cell_type": "code", - "execution_count": 15, - "metadata": { - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-22 22:36:49,836 starting remote build, image: .mlrun/func-default-sklearn-classifier-latest\n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - 
"\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", - "\u001b[36mINFO\u001b[0m[0044] Taking snapshot of full filesystem... \n", - "\u001b[36mINFO\u001b[0m[0063] RUN pip install mlrun \n", - "\u001b[36mINFO\u001b[0m[0063] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0063] args: [-c pip install mlrun] \n", - "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", - "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", - "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", - "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", - "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", - "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.1)\n", - "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", - "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", - "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", - "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.7)\n", - "Requirement already satisfied: nest-asyncio>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", - "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages 
(from mlrun) (1.1.1)\n", - "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", - "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", - "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", - "Requirement already satisfied: kfp>=0.1.29 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", - "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", - "Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", - "Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", - "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", - "Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2.8.0)\n", - "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", - "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", - "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", - "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", - "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", - "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) 
(5.1.1)\n", - "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", - "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", - "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", - "Requirement already satisfied: botocore<1.15.0,>=1.14.7 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.7)\n", - "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", - "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", - "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", - "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", - "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", - "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (1.24.1)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2019.9.11)\n", - "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", - "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", - "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", - "Requirement already satisfied: jsonschema>=3.0.1 in 
/opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", - "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", - "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", - "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", - "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", - "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", - "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", - "Requirement already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", - "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", - "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", - "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", - "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", - 
"Requirement already satisfied: traitlets>=4.2.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", - "Requirement already satisfied: nbformat in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", - "Requirement already satisfied: jupyter-core>=4.6.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", - "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", - "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", - "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", - "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", - "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", - "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", - "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", - "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", - "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", - "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from 
nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", - "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", - "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", - "Requirement already satisfied: setuptools>=18.5 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (41.0.1.post20191122)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", - "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", - "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", - "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", - "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.7->boto3>=1.9->mlrun) (0.15.2)\n", - "Requirement already satisfied: 
MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", - "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", - "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", - "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", - "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", - "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", - "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", - "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", - "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", - "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", - "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", - "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", - "Requirement already satisfied: wrapt<2,>=1.10 in 
/opt/conda/lib/python3.7/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", - "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /opt/conda/lib/python3.7/site-packages (from terminado>=0.8.1->notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", - "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", - "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", - "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", - "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", - "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", - "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", - "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", - "Requirement already satisfied: pyasn1>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", - "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages 
(from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", - "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", - "Requirement already satisfied: more-itertools in /opt/conda/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", - "\u001b[36mINFO\u001b[0m[0065] Taking snapshot of full filesystem... \n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": 10, + "metadata": {}, + "outputs": [], "source": [ - "trainfn.deploy()" + "# trainfn.deploy()" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 16, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -716,135 +432,30 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-22 22:38:03,841 starting run train uid=fcb2e3cad46c42648f8e08b5a834dc49 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-22 22:38:03,933 Job is running in the background, pod: train-p29xk\n", + "[mlrun] 2020-01-22 22:49:00,573 starting run train uid=902dec5bdd8a4d4baeb9333ac6d5e15e -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-22 22:49:00,663 Job is running in the background, pod: train-99tsp\n", "[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the \"boost_from_average\" parameter in \"binary\" objective is true.\n", "This may cause significantly different results comparing to the previous versions of 
LightGBM.\n", "Try to set boost_from_average=false, if your old models produce bad results\n", - "[LightGBM] [Info] Number of positive: 3375747, number of negative: 3374252\n", - "[LightGBM] [Info] Total Bins 5120\n", - "[LightGBM] [Info] Number of data: 6749999, number of used features: 20\n", "[LightGBM] [Warning] Cannot change bin_construct_sample_cnt after constructed Dataset handle.\n", - "[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500111 -> initscore=0.000443\n", - "[LightGBM] [Info] Start training from score 0.000443\n", - "[1]\ttrain's binary_logloss: 0.60911\tvalid's binary_logloss: 0.609096\n", - "[2]\ttrain's binary_logloss: 0.540195\tvalid's binary_logloss: 0.540177\n", - "[3]\ttrain's binary_logloss: 0.482774\tvalid's binary_logloss: 0.482749\n", - "[4]\ttrain's binary_logloss: 0.434212\tvalid's binary_logloss: 0.434186\n", - "[5]\ttrain's binary_logloss: 0.392909\tvalid's binary_logloss: 0.392881\n", - "[6]\ttrain's binary_logloss: 0.357139\tvalid's binary_logloss: 0.357108\n", - "[7]\ttrain's binary_logloss: 0.326295\tvalid's binary_logloss: 0.326261\n", - "[8]\ttrain's binary_logloss: 0.299304\tvalid's binary_logloss: 0.299273\n", - "[9]\ttrain's binary_logloss: 0.275675\tvalid's binary_logloss: 0.275645\n", - "[10]\ttrain's binary_logloss: 0.254966\tvalid's binary_logloss: 0.254934\n", - "[11]\ttrain's binary_logloss: 0.236862\tvalid's binary_logloss: 0.236828\n", - "[12]\ttrain's binary_logloss: 0.220836\tvalid's binary_logloss: 0.220798\n", - "[13]\ttrain's binary_logloss: 0.206668\tvalid's binary_logloss: 0.206627\n", - "[14]\ttrain's binary_logloss: 0.194208\tvalid's binary_logloss: 0.19416\n", - "[15]\ttrain's binary_logloss: 0.183073\tvalid's binary_logloss: 0.183023\n", - "[16]\ttrain's binary_logloss: 0.17317\tvalid's binary_logloss: 0.173116\n", - "[17]\ttrain's binary_logloss: 0.164385\tvalid's binary_logloss: 0.164327\n", - "[18]\ttrain's binary_logloss: 0.156572\tvalid's binary_logloss: 0.156507\n", - "[19]\ttrain's 
binary_logloss: 0.149617\tvalid's binary_logloss: 0.14955\n", - "[20]\ttrain's binary_logloss: 0.14341\tvalid's binary_logloss: 0.143343\n", - "[21]\ttrain's binary_logloss: 0.137916\tvalid's binary_logloss: 0.137848\n", - "[22]\ttrain's binary_logloss: 0.132986\tvalid's binary_logloss: 0.13292\n", - "[23]\ttrain's binary_logloss: 0.128592\tvalid's binary_logloss: 0.128526\n", - "[24]\ttrain's binary_logloss: 0.124697\tvalid's binary_logloss: 0.124637\n", - "[25]\ttrain's binary_logloss: 0.121201\tvalid's binary_logloss: 0.121145\n", - "[26]\ttrain's binary_logloss: 0.118092\tvalid's binary_logloss: 0.118033\n", - "[27]\ttrain's binary_logloss: 0.115328\tvalid's binary_logloss: 0.115271\n", - "[28]\ttrain's binary_logloss: 0.112852\tvalid's binary_logloss: 0.112799\n", - "[29]\ttrain's binary_logloss: 0.110664\tvalid's binary_logloss: 0.110613\n", - "[30]\ttrain's binary_logloss: 0.108725\tvalid's binary_logloss: 0.108676\n", - "[31]\ttrain's binary_logloss: 0.107009\tvalid's binary_logloss: 0.106961\n", - "[32]\ttrain's binary_logloss: 0.105486\tvalid's binary_logloss: 0.105438\n", - "[33]\ttrain's binary_logloss: 0.104131\tvalid's binary_logloss: 0.104087\n", - "[34]\ttrain's binary_logloss: 0.102953\tvalid's binary_logloss: 0.102909\n", - "[35]\ttrain's binary_logloss: 0.101899\tvalid's binary_logloss: 0.101858\n", - "[36]\ttrain's binary_logloss: 0.100973\tvalid's binary_logloss: 0.100934\n", - "[37]\ttrain's binary_logloss: 0.100167\tvalid's binary_logloss: 0.10013\n", - "[38]\ttrain's binary_logloss: 0.0994484\tvalid's binary_logloss: 0.0994155\n", - "[39]\ttrain's binary_logloss: 0.0987949\tvalid's binary_logloss: 0.0987651\n", - "[40]\ttrain's binary_logloss: 0.0982119\tvalid's binary_logloss: 0.0981823\n", - "[41]\ttrain's binary_logloss: 0.0976529\tvalid's binary_logloss: 0.0976525\n", - "[42]\ttrain's binary_logloss: 0.0972022\tvalid's binary_logloss: 0.0972023\n", - "[43]\ttrain's binary_logloss: 0.0968125\tvalid's binary_logloss: 0.0968164\n", - 
"[44]\ttrain's binary_logloss: 0.0964793\tvalid's binary_logloss: 0.0964842\n", - "[45]\ttrain's binary_logloss: 0.0961801\tvalid's binary_logloss: 0.0961877\n", - "[46]\ttrain's binary_logloss: 0.0959085\tvalid's binary_logloss: 0.0959191\n", - "[47]\ttrain's binary_logloss: 0.0956876\tvalid's binary_logloss: 0.0956995\n", - "[48]\ttrain's binary_logloss: 0.0954728\tvalid's binary_logloss: 0.095488\n", - "[49]\ttrain's binary_logloss: 0.0952993\tvalid's binary_logloss: 0.0953199\n", - "[50]\ttrain's binary_logloss: 0.095152\tvalid's binary_logloss: 0.0951747\n", - "[51]\ttrain's binary_logloss: 0.0950115\tvalid's binary_logloss: 0.0950384\n", - "[52]\ttrain's binary_logloss: 0.0948914\tvalid's binary_logloss: 0.0949213\n", - "[53]\ttrain's binary_logloss: 0.0947885\tvalid's binary_logloss: 0.0948198\n", - "[54]\ttrain's binary_logloss: 0.0946978\tvalid's binary_logloss: 0.0947301\n", - "[55]\ttrain's binary_logloss: 0.0946152\tvalid's binary_logloss: 0.0946518\n", - "[56]\ttrain's binary_logloss: 0.0945364\tvalid's binary_logloss: 0.0945734\n", - "[57]\ttrain's binary_logloss: 0.0944624\tvalid's binary_logloss: 0.0945029\n", - "[58]\ttrain's binary_logloss: 0.0944047\tvalid's binary_logloss: 0.0944471\n", - "[59]\ttrain's binary_logloss: 0.0943546\tvalid's binary_logloss: 0.0944008\n", - "[60]\ttrain's binary_logloss: 0.094306\tvalid's binary_logloss: 0.094354\n", - "[61]\ttrain's binary_logloss: 0.0942551\tvalid's binary_logloss: 0.0943054\n", - "[62]\ttrain's binary_logloss: 0.0942173\tvalid's binary_logloss: 0.0942695\n", - "[63]\ttrain's binary_logloss: 0.0941833\tvalid's binary_logloss: 0.0942386\n", - "[64]\ttrain's binary_logloss: 0.0941437\tvalid's binary_logloss: 0.0942044\n", - "[65]\ttrain's binary_logloss: 0.0941104\tvalid's binary_logloss: 0.0941727\n", - "[66]\ttrain's binary_logloss: 0.0940892\tvalid's binary_logloss: 0.0941538\n", - "[67]\ttrain's binary_logloss: 0.0940532\tvalid's binary_logloss: 0.0941245\n", - "[68]\ttrain's binary_logloss: 
0.0940219\tvalid's binary_logloss: 0.0940966\n", - "[69]\ttrain's binary_logloss: 0.0939963\tvalid's binary_logloss: 0.0940732\n", - "[70]\ttrain's binary_logloss: 0.093974\tvalid's binary_logloss: 0.0940533\n", - "[71]\ttrain's binary_logloss: 0.093948\tvalid's binary_logloss: 0.0940288\n", - "[72]\ttrain's binary_logloss: 0.0939228\tvalid's binary_logloss: 0.0940075\n", - "[73]\ttrain's binary_logloss: 0.0939065\tvalid's binary_logloss: 0.0939924\n", - "[74]\ttrain's binary_logloss: 0.0938874\tvalid's binary_logloss: 0.093976\n", - "[75]\ttrain's binary_logloss: 0.093874\tvalid's binary_logloss: 0.093965\n", - "[76]\ttrain's binary_logloss: 0.0938594\tvalid's binary_logloss: 0.0939508\n", + "[mlrun] 2020-01-22 22:50:23,162 log artifact model at model, size: None, db: Y\n", + "[mlrun] 2020-01-22 22:50:23,333 log artifact xtest at xtest.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-22 22:50:23,454 log artifact ytest at ytest.pkl, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-22 22:50:23,466 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", " labels = getattr(columns, 'labels', None) or [\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. 
Use .codes instead.\n", " labels, = index.labels\n", - "[77]\ttrain's binary_logloss: 0.0938408\tvalid's binary_logloss: 0.0939326\n", - "[78]\ttrain's binary_logloss: 0.093827\tvalid's binary_logloss: 0.0939201\n", - "[79]\ttrain's binary_logloss: 0.0938091\tvalid's binary_logloss: 0.0939058\n", - "[80]\ttrain's binary_logloss: 0.0937968\tvalid's binary_logloss: 0.0938946\n", - "[81]\ttrain's binary_logloss: 0.0937822\tvalid's binary_logloss: 0.0938844\n", - "[82]\ttrain's binary_logloss: 0.0937715\tvalid's binary_logloss: 0.0938753\n", - "[83]\ttrain's binary_logloss: 0.093761\tvalid's binary_logloss: 0.0938765\n", - "[84]\ttrain's binary_logloss: 0.0937495\tvalid's binary_logloss: 0.0938739\n", - "[85]\ttrain's binary_logloss: 0.0937354\tvalid's binary_logloss: 0.0938619\n", - "[86]\ttrain's binary_logloss: 0.0937245\tvalid's binary_logloss: 0.0938619\n", - "[87]\ttrain's binary_logloss: 0.0937136\tvalid's binary_logloss: 0.0938643\n", - "[88]\ttrain's binary_logloss: 0.0937044\tvalid's binary_logloss: 0.0938647\n", - "[89]\ttrain's binary_logloss: 0.0936888\tvalid's binary_logloss: 0.0938547\n", - "[90]\ttrain's binary_logloss: 0.0936775\tvalid's binary_logloss: 0.0938443\n", - "[91]\ttrain's binary_logloss: 0.0936707\tvalid's binary_logloss: 0.0938451\n", - "[92]\ttrain's binary_logloss: 0.0936562\tvalid's binary_logloss: 0.0938326\n", - "[93]\ttrain's binary_logloss: 0.0936513\tvalid's binary_logloss: 0.0938331\n", - "[94]\ttrain's binary_logloss: 0.0936458\tvalid's binary_logloss: 0.093835\n", - "[95]\ttrain's binary_logloss: 0.0936355\tvalid's binary_logloss: 0.0938356\n", - "[96]\ttrain's binary_logloss: 0.0936261\tvalid's binary_logloss: 0.0938356\n", - "[97]\ttrain's binary_logloss: 0.0936146\tvalid's binary_logloss: 0.0938254\n", - "[98]\ttrain's binary_logloss: 0.0936085\tvalid's binary_logloss: 0.0938261\n", - "[99]\ttrain's binary_logloss: 0.0936021\tvalid's binary_logloss: 0.0938265\n", - "[100]\ttrain's binary_logloss: 0.0935964\tvalid's 
binary_logloss: 0.0938274\n", - "[mlrun] 2020-01-22 22:39:24,741 log artifact model at model, size: None, db: Y\n", - "[mlrun] 2020-01-22 22:39:24,884 log artifact xtest at xtest.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-22 22:39:25,012 log artifact ytest at ytest.pkl, size: None, db: Y\n", - "\n", - "[mlrun] 2020-01-22 22:39:25,042 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -1017,26 +628,26 @@ " \n", " \n", " \n", - "
...34dc49
\n", + "
...d5e15e
\n", " 0\n", - " Jan 22 22:38:10\n", + " Jan 22 22:49:07\n", " completed\n", " sklearn-classifier\n", - "
host=train-p29xk
kind=job
owner=admin
\n", + "
host=train-99tsp
kind=job
owner=admin
\n", " \n", - "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=model
random_state=1
src_file=/User/mlrun/sklearn-classifier/simdata.pqt
verbose=True
\n", + "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=model
random_state=1
src_file=/User/mlrun/sklearn-classifier/simdata.pqt
verbose=False
\n", "
train_accuracy=0.9671342173532174
\n", "
model
xtest
ytest
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -1052,8 +663,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run fcb2e3cad46c42648f8e08b5a834dc49 , !mlrun logs fcb2e3cad46c42648f8e08b5a834dc49 \n", - "[mlrun] 2020-01-22 22:39:32,871 run executed, status=completed\n" + "!mlrun get run 902dec5bdd8a4d4baeb9333ac6d5e15e , !mlrun logs 902dec5bdd8a4d4baeb9333ac6d5e15e \n", + "[mlrun] 2020-01-22 22:50:29,828 run executed, status=completed\n" ] } ], @@ -1063,7 +674,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -1075,7 +686,7 @@ " 'ytest': 'ytest.pkl'}" ] }, - "execution_count": 19, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } From 8f99ee9b821064232026e10c79f37e335c6a086f Mon Sep 17 00:00:00 2001 From: yasha Date: Thu, 23 Jan 2020 15:13:25 +0000 Subject: [PATCH 15/32] added simple splitter function --- datagen/classification/binary.py | 1 + datagen/splitters/train_valid_test.py | 97 ++++ datagen/splitters/train_valid_test.yaml | 8 + tests/train_classifier.ipynb | 575 +++++++++++++++++-- tests/train_valid_test_split.ipynb | 731 ++++++++++++++++++++++++ 5 files changed, 1359 insertions(+), 53 deletions(-) create mode 100644 datagen/splitters/train_valid_test.py create mode 100644 datagen/splitters/train_valid_test.yaml create mode 100644 tests/train_valid_test_split.ipynb diff --git a/datagen/classification/binary.py b/datagen/classification/binary.py index 895fd8702..d6806f007 100644 --- a/datagen/classification/binary.py +++ b/datagen/classification/binary.py @@ -36,6 +36,7 @@ def create_binary_classification( If no filename is given it will default to: 'simdata-{n_samples}X{m_features}.parquet'. 
All of the scikit-learn parameters can be set using **sk_params + :param context: function context :param n_samples: number of rows/samples :param m_features: number of cols/features diff --git a/datagen/splitters/train_valid_test.py b/datagen/splitters/train_valid_test.py new file mode 100644 index 000000000..dc1deb5e3 --- /dev/null +++ b/datagen/splitters/train_valid_test.py @@ -0,0 +1,97 @@ +import pandas as pd +import os +import numpy as np +import pyarrow.parquet as pq +import pyarrow as pa +from cloudpickle import dump + +from sklearn.model_selection import train_test_split +from typing import Optional, Union +from mlrun.execution import MLClientCtx +from mlrun.datastore import DataItem + +def train_valid_test_splitter( + context: Optional[MLClientCtx] = None, + src_file: Union[DataItem, str] = '', + header: Union[DataItem, str, list] = '', + sample: int = -1, + label_column: str = 'labels', + test_size: float = 0.1, + train_val_split: float = 0.75, + target_path: str = '', + name: str = '', + key: str = '', + random_state = 1 +) -> None: + """Split raw data input into train, validation and test sets. + + :param context: the function context + :param src_file: ('') name of raw data file + :param header: ('') header artifact or list of column names. + :param sample: (-1) number of rows to select. -1 selects all rows, n >= 1 + selects the first n rows, and n < -1 selects a random sample + of |n| rows from the entire file + :param label_column: ground-truth (y) labels + :param test_size: (0.1) test set size + :param train_val_split: (0.75) Once the test set has been removed the + training set gets this proportion.
+ :param target_path: folder location of files + :param name: destination prefix name for model files + :param key: key for model artifact + :param random_state: (1) sklearn rng seed + """ + if isinstance(src_file, DataItem): + src_file = str(src_file) + srcfilepath = os.path.join(target_path, src_file) + + if (sample == -1) or (sample >= 1): + # get all rows, or contiguous sample starting at row 1. + raw = pd.read_parquet(srcfilepath, engine='pyarrow') + labels = raw.pop(label_column) + raw = raw.iloc[:sample, :] if sample >= 1 else raw + labels = labels.iloc[:sample] if sample >= 1 else labels + else: + # grab a random sample + raw = pd.read_parquet(srcfilepath, engine='pyarrow').sample(sample*-1) + labels = raw.pop(label_column) + + # double split to generate 3 data sets: train, validation and test + x, xtest, y, ytest = train_test_split(raw, labels, test_size=test_size, + random_state=random_state) + + xtrain, xvalid, ytrain, yvalid = train_test_split(x, y, + train_size=train_val_split, + random_state=random_state) + + if name: + name = '-' + name + + # save header + f = os.path.join(target_path, name + 'header.pkl') + dump(raw.columns.values, open(f, 'wb')) + context.log_artifact('header', target_path=f) + + # save data sets + f = os.path.join(target_path, name + 'xtrain.pqt') + xtrain.to_parquet(f) + context.log_artifact('xtrain', target_path=f) + + f = os.path.join(target_path, name + 'xvalid.pqt') + xvalid.to_parquet(f) + context.log_artifact('xvalid', target_path=f) + + f = os.path.join(target_path, name + 'xtest.pqt') + xtest.to_parquet(f) + context.log_artifact('xtest', target_path=f) + + f = os.path.join(target_path, name + 'ytrain.pqt') + pd.DataFrame({'labels': ytrain}).to_parquet(f) + context.log_artifact('ytrain', target_path=f) + + f = os.path.join(target_path, name + 'yvalid.pqt') + pd.DataFrame({'labels': yvalid}).to_parquet(f) + context.log_artifact('yvalid', target_path=f) + + f = os.path.join(target_path, name + 'ytest.pqt') + pd.DataFrame({'labels': ytest}).to_parquet(f) + 
context.log_artifact('ytest', target_path=f) \ No newline at end of file diff --git a/datagen/splitters/train_valid_test.yaml b/datagen/splitters/train_valid_test.yaml new file mode 100644 index 000000000..c3ffa7e97 --- /dev/null +++ b/datagen/splitters/train_valid_test.yaml @@ -0,0 +1,8 @@ +kind: job +metadata: + name: train-valid-test +spec: + build: + functionSourceCode: aW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgZHVtcAoKZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdApmcm9tIHR5cGluZyBpbXBvcnQgT3B0aW9uYWwsIFVuaW9uCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KCmRlZiB0cmFpbl92YWxpZF90ZXN0X3NwbGl0dGVyKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdID0gTm9uZSwKICAgIHNyY19maWxlOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgaGVhZGVyOiBVbmlvbltEYXRhSXRlbSwgc3RyLCBsaXN0XSA9ICcnLAogICAgc2FtcGxlOiBpbnQgPSAtMSwKICAgIGxhYmVsX2NvbHVtbjogc3RyID0gJ2xhYmVscycsCiAgICB0ZXN0X3NpemU6IGZsb2F0ID0gMC4xLAogICAgdHJhaW5fdmFsX3NwbGl0OiBmbG9hdCA9IDAuNzUsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAnJywKICAgIGtleTogc3RyID0gJycsCiAgICByYW5kb21fc3RhdGUgPSAxCikgLT4gTm9uZToKICAgICIiIlNwbGl0IHJhdyBkYXRhIGlucHV0IGludG8gdHJhaW4sIHZhbGlkYXRpb24gYW5kIHRlc3Qgc2V0cy4KCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIHNyY19maWxlOiAgICAgICAgKCdyYXcnKSBuYW1lIG9mIHJhdyBkYXRhIGZpbGUKICAgIDpwYXJhbSBoZWFkZXI6ICAgICAgICAgIChOb25lKSBoZWFkZXIgYXJ0aWZhY3Qgb3IgbGlzdCBvZiBjb2x1bW4gbmFtZXMuCiAgICA6cGFyYW0gc2FtcGxlOiAgICAgICAgICAoLTEpLiBTZWxlY3RzIHRoZSBmaXJzdCBuIHJvd3MsIG9yIHNlbGVjdCBhIHNhbXBsZSBzdGFydGluZwogICAgICAgICAgICAgICAgICAgICAgICAgICAgZnJvbSB0aGUgZmlyc3QuIElmIG5lZ2F0aXZlIDwtMSwgc2VsZWN0IGEgcmFuZG9tIHNhbXBsZSBmcm9tIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgdGhlIGVudGlyZSBmaWxlCiAgICA6cGFyYW0gbGFiZWxfY29sdW1uOiAgICBncm91bmQtdHJ1dGggKHkpIGxhYmVscwogICAgOnBhcmFtIHRlc3Rfc2l6ZTo
gICAgICAgKDAuMSkgdGVzdCBzZXQgc2l6ZQogICAgOnBhcmFtIHRyYWluX3ZhbF9zcGxpdDogKDAuNzUpIE9uY2UgdGhlIHRlc3Qgc2V0IGhhcyBiZWVuIHJlbW92ZWQgdGhlIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgdHJhaW5pbmcgc2V0IGdldHMgdGhpcyBwcm9wb3J0aW9uLgogICAgOnBhcmFtIHRhcmdldF9wYXRoOiAgICAgZm9sZGVyIGxvY2F0aW9uIG9mIGZpbGVzCiAgICA6cGFyYW0gbmFtZTogICAgICAgICAgICBkZXN0aW5hdGlvbiBwcmVmaXggbmFtZSBmb3IgbW9kZWwgZmlsZXMKICAgIDpwYXJhbSBrZXk6ICAgICAgICAgICAgIGtleSBmb3IgbW9kZWwgYXJ0aWZhY3QKICAgIDpwYXJhbSByYW5kb21fc3RhdGU6ICAgICgxKSBza2xlYXJuIHJuZyBzZWVkCiAgICAiIiIKICAgIGlmIGlzaW5zdGFuY2Uoc3JjX2ZpbGUsIERhdGFJdGVtKToKICAgICAgICBzcmNfZmlsZSA9IHN0cihzcmNfZmlsZSkKICAgIHNyY2ZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBzcmNfZmlsZSkKCiAgICBpZiAoc2FtcGxlID09IC0xKSBvciAoc2FtcGxlID49IDEpOgogICAgICAgICMgZ2V0IGFsbCByb3dzLCBvciBjb250aWd1b3VzIHNhbXBsZSBzdGFydGluZyBhdCByb3cgMS4KICAgICAgICByYXcgPSBwZC5yZWFkX3BhcnF1ZXQoc3JjZmlsZXBhdGgsIGVuZ2luZT0ncHlhcnJvdycpCiAgICAgICAgbGFiZWxzID0gcmF3LnBvcChsYWJlbF9jb2x1bW4pCiAgICAgICAgcmF3ID0gcmF3Lmlsb2NbOnNhbXBsZSwgOl0KICAgICAgICBsYWJlbHMgPSBsYWJlbHMuaWxvY1s6c2FtcGxlXQogICAgZWxzZToKICAgICAgICAjIGdyYWIgYSByYW5kb20gc2FtcGxlCiAgICAgICAgcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKS5zYW1wbGUoc2FtcGxlKi0xKQogICAgICAgIGxhYmVscyA9IHJhdy5wb3AobGFiZWxfY29sdW1uKQogICAgCiAgICAjIGRvdWJsZSBzcGxpdCB0cCBnZW5lcmF0ZSAzIGRhdGEgc2V0czogdHJhaW4sIHZhbGlkYXRpb24gYW5kIHRlc3QKICAgIHgsIHh0ZXN0LCB5LCB5dGVzdCA9IHRyYWluX3Rlc3Rfc3BsaXQocmF3LCBsYWJlbHMsIHRlc3Rfc2l6ZT10ZXN0X3NpemUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKQogICAKICAgIHh0cmFpbiwgeHZhbGlkLCB5dHJhaW4sIHl2YWxpZCA9IHRyYWluX3Rlc3Rfc3BsaXQoeCwgeSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHRyYWluX3NpemU9dHJhaW5fdmFsX3NwbGl0LCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgcmFuZG9tX3N0YXRlPXJhbmRvbV9zdGF0ZSkgICAgICAgIAoKICAgIGlmIG5hbWU6CiAgICAgICAgbmFtZSA9ICctJyArIG5hbWUKICAgIAogICAgIyBzYXZlIGhlYWRlcgogICAgZiA9IG9zLnBhdGguam9pbih
0YXJnZXRfcGF0aCwgbmFtZSArICdoZWFkZXIucGtsJykKICAgIGR1bXAocmF3LmNvbHVtbnMudmFsdWVzLCBvcGVuKGYsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgICMgc2F2ZSBkYXRhIHNldHMKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHRyYWluLnBxdCcpCiAgICB4dHJhaW4udG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h0cmFpbicsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHZhbGlkLnBxdCcpCiAgICB4dmFsaWQudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h2YWxpZCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHRlc3QucHF0JykKICAgIHh0ZXN0LnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd4dGVzdCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneXRyYWluLnBxdCcpCiAgICBwZC5EYXRhRnJhbWUoeydsYWJlbHMnOiB5dHJhaW59KS50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneXRyYWluJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd5dmFsaWQucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl2YWxpZH0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dmFsaWQnLCB0YXJnZXRfcGF0aD1mKQogICAgCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ3l0ZXN0LnBxdCcpCiAgICBwZC5EYXRhRnJhbWUoeydsYWJlbHMnOiB5dGVzdH0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dGVzdCcsIHRhcmdldF9wYXRoPWYp + base_image: yjbds/mlrun-ds:latest + commands: [] diff --git a/tests/train_classifier.ipynb b/tests/train_classifier.ipynb index d766912cd..332b0cbac 100644 --- a/tests/train_classifier.ipynb +++ b/tests/train_classifier.ipynb @@ -10,7 +10,9 @@ "* lightgbm\n", "* 10 mio samples / 20 features\n", "* code stored as yaml in github\n", - "* precomiled images using optimized for cpu python libraries" + "* precomiled images using optimized for cpu python libraries \n", + " * **[yjbds/mlrun-ds](https://hub.docker.com/repository/docker/yjbds/mlrun-ds)** a data science stack\n", + " * 
**[yjbds/mlrun-files](https://hub.docker.com/repository/docker/yjbds/mlrun-files)** a parquet/pandas stack" ] }, { @@ -46,7 +48,7 @@ "outputs": [], "source": [ "TARGET_CODE_BASE = '/User/repos/functions/' \n", - "N_SAMPLES = 10_000_000 # size of HIGGS data\n", + "N_SAMPLES = 100_000 # size of HIGGS data\n", "M_FEATURES = 20\n", "NEG_WEIGHT = 0.5\n", "TARGET_DATA_PATH = '/User/mlrun/sklearn-classifier'\n", @@ -79,25 +81,36 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "'ready'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# binarydatagen.deploy()" + "binarydatagen.deploy(skip_deployed=True)" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 5, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -116,18 +129,18 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-22 22:48:03,974 starting run create_binary_classification uid=ad9df1228d034fd5a11d732502f64aa2 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-22 22:48:04,072 Job is running in the background, pod: create-binary-classification-bnqlx\n", - "[mlrun] 2020-01-22 22:48:53,079 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-23 11:46:49,385 starting run create_binary_classification uid=e1164e49ef22478791f5b23fea2de60b -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-23 11:46:49,486 Job is running in the background, pod: create-binary-classification-j6gng\n", + "[mlrun] 2020-01-23 11:47:00,255 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-22 
22:48:53,341 run executed, status=completed\n", + "[mlrun] 2020-01-23 11:47:00,268 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", "final state: succeeded\n" @@ -302,26 +315,26 @@ " \n", " \n", " \n", - "
...f64aa2
\n", + "
...2de60b
\n", " 0\n", - " Jan 22 22:48:16\n", + " Jan 23 11:46:59\n", " completed\n", " binary\n", - "
host=create-binary-classification-bnqlx
kind=job
owner=admin
\n", + "
host=create-binary-classification-j6gng
kind=job
owner=admin
\n", " \n", - "
filename=simdata.pqt
key=simdata
m_features=20
n_samples=10000000
random_state=1
target_path=/User/mlrun/sklearn-classifier
weight=0.5
\n", + "
filename=simdata.pqt
key=simdata
m_features=20
n_samples=100000
random_state=1
target_path=/User/mlrun/sklearn-classifier
weight=0.5
\n", " \n", "
simdata
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -337,8 +350,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run ad9df1228d034fd5a11d732502f64aa2 , !mlrun logs ad9df1228d034fd5a11d732502f64aa2 \n", - "[mlrun] 2020-01-22 22:48:56,442 run executed, status=completed\n" + "!mlrun get run e1164e49ef22478791f5b23fea2de60b , !mlrun logs e1164e49ef22478791f5b23fea2de60b \n", + "[mlrun] 2020-01-23 11:47:08,728 run executed, status=completed\n" ] } ], @@ -356,7 +369,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -366,7 +379,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -384,7 +397,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ @@ -395,25 +408,170 @@ }, { "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], + "execution_count": 18, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-23 11:49:20,163 starting remote build, image: .mlrun/func-default-sklearn-classifier-latest\n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base 
image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", + "\u001b[36mINFO\u001b[0m[0044] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0065] RUN pip install mlrun \n", + "\u001b[36mINFO\u001b[0m[0065] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0065] args: [-c pip install mlrun] \n", + "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", + "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", + "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", + "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", + "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", + "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", + "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", + "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.7)\n", + "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", + "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in 
/opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", + "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", + "Requirement already satisfied: kfp>=0.1.29 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", + "Requirement already satisfied: nest-asyncio>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", + "Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", + "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.1)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", + "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", + "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", + "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", + "Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", + "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", + "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (1.24.1)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2019.9.11)\n", + "Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.7/site-packages (from croniter==0.3.31->mlrun) (2.8.0)\n", + 
"Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", + "Requirement already satisfied: botocore<1.15.0,>=1.14.7 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.7)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", + "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", + "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", + "Requirement already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", + "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", + "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", + "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", + "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", + "Requirement already satisfied: 
PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", + "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", + "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", + "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", + "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", + "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", + "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", + "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", + "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from 
gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.7->boto3>=1.9->mlrun) (0.15.2)\n", + "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", + "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", + "Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (41.0.1.post20191122)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", + "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", + "Requirement already satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", + "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from 
google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", + "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", + "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", + "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", + "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", + "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", + "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", + "Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", + "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", + "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: mistune<2,>=0.8.1 in 
/opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: nbformat>=4.4 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", + "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", + "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", + "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", + "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", + "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", + "Requirement already 
satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", + "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", + "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", + "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", + "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", + "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", + "Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", + "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", + "Requirement already satisfied: protobuf>=3.4.0 in 
/opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", + "Requirement already satisfied: more-itertools in /opt/conda/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", + "\u001b[36mINFO\u001b[0m[0067] Taking snapshot of full filesystem... \n" + ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# trainfn.deploy()" + "trainfn.deploy()" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 11, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -432,24 +590,24 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-22 22:49:00,573 starting run train uid=902dec5bdd8a4d4baeb9333ac6d5e15e -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-22 22:49:00,663 Job is running in the background, pod: train-99tsp\n", + "[mlrun] 2020-01-23 11:50:58,444 starting run train uid=d7118c8161b9487ea79b136cd2d4a0cc -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-23 11:50:58,533 Job is running in the background, pod: train-s9w4j\n", "[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the \"boost_from_average\" parameter in \"binary\" objective is true.\n", "This may cause significantly different results comparing to the previous versions of 
LightGBM.\n", "Try to set boost_from_average=false, if your old models produce bad results\n", "[LightGBM] [Warning] Cannot change bin_construct_sample_cnt after constructed Dataset handle.\n", - "[mlrun] 2020-01-22 22:50:23,162 log artifact model at model, size: None, db: Y\n", - "[mlrun] 2020-01-22 22:50:23,333 log artifact xtest at xtest.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-22 22:50:23,454 log artifact ytest at ytest.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-23 11:51:12,955 log artifact model at model, size: None, db: Y\n", + "[mlrun] 2020-01-23 11:51:12,974 log artifact xtest at xtest.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-23 11:51:12,998 log artifact ytest at ytest.pkl, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-22 22:50:23,466 run executed, status=completed\n", + "[mlrun] 2020-01-23 11:51:13,022 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", " labels = getattr(columns, 'labels', None) or [\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", @@ -628,26 +786,26 @@ " \n", " \n", " \n", - "
...d5e15e
\n", + "
...d4a0cc
\n", " 0\n", - " Jan 22 22:49:07\n", + " Jan 23 11:51:07\n", " completed\n", " sklearn-classifier\n", - "
host=train-99tsp
kind=job
owner=admin
\n", + "
host=train-s9w4j
kind=job
owner=admin
\n", " \n", "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=model
random_state=1
src_file=/User/mlrun/sklearn-classifier/simdata.pqt
verbose=False
\n", - "
train_accuracy=0.9671342173532174
\n", + "
train_accuracy=0.9546808100860753
\n", "
model
xtest
ytest
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -663,8 +821,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 902dec5bdd8a4d4baeb9333ac6d5e15e , !mlrun logs 902dec5bdd8a4d4baeb9333ac6d5e15e \n", - "[mlrun] 2020-01-22 22:50:29,828 run executed, status=completed\n" + "!mlrun get run d7118c8161b9487ea79b136cd2d4a0cc , !mlrun logs d7118c8161b9487ea79b136cd2d4a0cc \n", + "[mlrun] 2020-01-23 11:51:17,725 run executed, status=completed\n" ] } ], @@ -674,19 +832,19 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'train_accuracy': 0.9671342173532174,\n", + "{'train_accuracy': 0.9546808100860753,\n", " 'model': 'model',\n", " 'xtest': 'xtest.pkl',\n", " 'ytest': 'ytest.pkl'}" ] }, - "execution_count": 13, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } @@ -695,6 +853,317 @@ "tsk2.outputs" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "_____\n", + "## train another classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "____" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "task3 = mlrun.NewTask()\n", + "task3.with_params(\n", + " src_file=tsk1.output(KEY),\n", + " SKClassifier='xgboost.XGBClassifier',\n", + " name='xgb-classifier.pkl',\n", + " key='xgb-classifier',\n", + " verbose=VERBOSE,\n", + " random_state=RNG,\n", + " callbacks = [])" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-23 11:52:46,121 starting run train uid=3539274893904935adea979b410bf135 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-23 11:52:46,218 Job is running in the background, pod: 
train-qwzg9\n", + "[mlrun] 2020-01-23 11:52:56,785 Traceback (most recent call last):\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", + " val = handler(*args_list)\n", + " File \"main.py\", line 91, in train\n", + " verbose=verbose)\n", + "TypeError: fit() got an unexpected keyword argument 'eval_names'\n", + "\n", + "\n", + "[mlrun] 2020-01-23 11:52:56,796 exec error - fit() got an unexpected keyword argument 'eval_names'\n", + "[mlrun] 2020-01-23 11:52:56,830 run executed, status=error\n", + "runtime error: fit() got an unexpected keyword argument 'eval_names'\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", + " labels = getattr(columns, 'labels', None) or [\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", + " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", + " labels, = index.labels\n", + "fit() got an unexpected keyword argument 'eval_names'\n", + "final state: failed\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...0bf135
0Jan 23 11:52:52
error
sklearn-classifier
host=train-qwzg9
kind=job
owner=admin
SKClassifier=xgboost.XGBClassifier
callbacks=[]
key=xgb-classifier
name=xgb-classifier.pkl
random_state=1
src_file=/User/mlrun/sklearn-classifier/simdata.pqt
verbose=False
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run 3539274893904935adea979b410bf135 , !mlrun logs 3539274893904935adea979b410bf135 \n", + "[mlrun] 2020-01-23 11:53:05,425 run executed, status=error\n", + "runtime error: fit() got an unexpected keyword argument 'eval_names'\n" + ] + }, + { + "ename": "RunError", + "evalue": "fit() got an unexpected keyword argument 'eval_names'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtsk3\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrainfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhandler\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'train'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: 
{}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mRunError\u001b[0m: fit() got an unexpected keyword argument 'eval_names'" + ] + } + ], + "source": [ + "tsk3 = trainfn.run(task3, handler='train')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tsk3.outputs" + ] + }, { "cell_type": "markdown", "metadata": {}, diff --git a/tests/train_valid_test_split.ipynb b/tests/train_valid_test_split.ipynb new file mode 100644 index 000000000..9753ded7b --- /dev/null +++ b/tests/train_valid_test_split.ipynb @@ -0,0 +1,731 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## split data into train, validation and test sets" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "import os\n", + "import numpy as np\n", + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "CODE_BASE = '/User/repos/functions/datagen' \n", + "N_SAMPLES = 10_000_000 # size of HIGGS data\n", + 
"M_FEATURES = 20\n", + "NEG_WEIGHT = 0.5\n", + "RNG = 1\n", + "TARGET_DATA_PATH = '/User/mlrun/splitter'\n", + "SRC_FILE = 'simdata.pqt'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## generate some binary classification data" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-23 15:05:06,305 starting run create_binary_classification uid=971b0eb4b2b64d3a948cb29ad8735dd2 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-23 15:05:06,427 Job is running in the background, pod: create-binary-classification-dx457\n", + "[mlrun] 2020-01-23 15:05:53,240 log artifact simdata at /User/mlrun/splitter/simdata.pqt, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-23 15:05:53,450 run executed, status=completed\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", + " result = infer_dtype(pandas_collection)\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...735dd2
0Jan 23 15:05:18completedbinary
host=create-binary-classification-dx457
kind=job
owner=admin
filename=simdata.pqt
key=simdata
m_features=20
n_samples=10000000
random_state=1
target_path=/User/mlrun/splitter
weight=0.5
simdata
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run 971b0eb4b2b64d3a948cb29ad8735dd2 , !mlrun logs 971b0eb4b2b64d3a948cb29ad8735dd2 \n", + "[mlrun] 2020-01-23 15:05:55,758 run executed, status=completed\n" + ] + } + ], + "source": [ + "binarydatagen = mlrun.import_function(\n", + " os.path.join(CODE_BASE, 'classification', 'binary.yaml')\n", + ").apply(mlrun.mount_v3io())\n", + "\n", + "binarydatagen.deploy(skip_deployed=True)\n", + "\n", + "task1 = mlrun.NewTask()\n", + "task1.with_params(\n", + " n_samples=N_SAMPLES,\n", + " m_features=M_FEATURES,\n", + " weight=NEG_WEIGHT,\n", + " target_path=TARGET_DATA_PATH,\n", + " filename='simdata.pqt',\n", + " key='simdata',\n", + " random_state=RNG)\n", + "\n", + "tsk1 = binarydatagen.run(task1, handler='create_binary_classification')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'simdata': '/User/mlrun/splitter/simdata.pqt'}" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tsk1.outputs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## split the data" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# splitfn = mlrun.code_to_function(\n", + "# kind='job', \n", + "# filename=os.path.join(CODE_BASE, 'splitters', 'train_valid_test.py'))\n", + "\n", + "# splitfn.export(os.path.join(CODE_BASE, 'splitters', 'train_valid_test.yaml'))" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "splitter = mlrun.import_function(\n", + " os.path.join(CODE_BASE, 'splitters', 'train_valid_test.yaml')\n", + ").apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": 
"code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ready'" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "splitter.deploy(skip_deployed=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-23 15:05:55,845 starting run train_valid_test_splitter uid=d32826574ee84beab365d3acc30e4b31 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-23 15:05:55,923 Job is running in the background, pod: train-valid-test-splitter-b4gxz\n", + "[mlrun] 2020-01-23 15:06:21,321 log artifact header at /User/mlrun/splitter/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-23 15:06:30,458 log artifact xtrain at /User/mlrun/splitter/xtrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-23 15:06:33,972 log artifact xvalid at /User/mlrun/splitter/xvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-23 15:06:35,603 log artifact xtest at /User/mlrun/splitter/xtest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-23 15:06:36,264 log artifact ytrain at /User/mlrun/splitter/ytrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-23 15:06:36,705 log artifact yvalid at /User/mlrun/splitter/yvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-23 15:06:36,932 log artifact ytest at /User/mlrun/splitter/ytest.pqt, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-23 15:06:37,271 run executed, status=completed\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. 
Use .codes instead.\n", + " labels = getattr(columns, 'labels', None) or [\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", + " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", + " labels, = index.labels\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", + " result = infer_dtype(pandas_collection)\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...0e4b31
0Jan 23 15:06:05completedtrain-valid-test
host=train-valid-test-splitter-b4gxz
kind=job
owner=admin
random_state=1
src_file=/User/mlrun/splitter/simdata.pqt
target_path=/User/mlrun/splitter
header
xtrain
xvalid
xtest
ytrain
yvalid
ytest
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run d32826574ee84beab365d3acc30e4b31 , !mlrun logs d32826574ee84beab365d3acc30e4b31 \n", + "[mlrun] 2020-01-23 15:06:45,221 run executed, status=completed\n" + ] + } + ], + "source": [ + "task2 = mlrun.NewTask()\n", + "task2.with_params(\n", + " src_file=tsk1.outputs['simdata'],\n", + " target_path=TARGET_DATA_PATH,\n", + " random_state=RNG)\n", + "\n", + "tsk2 = splitter.run(task2, handler='train_valid_test_splitter')" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'header': '/User/mlrun/splitter/header.pkl',\n", + " 'xtrain': '/User/mlrun/splitter/xtrain.pqt',\n", + " 'xvalid': '/User/mlrun/splitter/xvalid.pqt',\n", + " 'xtest': '/User/mlrun/splitter/xtest.pqt',\n", + " 'ytrain': '/User/mlrun/splitter/ytrain.pqt',\n", + " 'yvalid': '/User/mlrun/splitter/yvalid.pqt',\n", + " 'ytest': '/User/mlrun/splitter/ytest.pqt'}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tsk2.outputs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## tests" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "# rounding error of one sample\n", + "ERROR = -1\n", + "xtrain_shape = pd.read_parquet(tsk2.outputs['xtrain'], engine='pyarrow').shape\n", + "ytrain_shape = pd.read_parquet(tsk2.outputs['ytrain'], engine='pyarrow').shape\n", + "\n", + "assert (int(.75*(N_SAMPLES*(1-.1)))+ERROR, M_FEATURES) == xtrain_shape, \"xtrain doesn't have the expected shape\"\n", + "assert ytrain_shape[0] == 
xtrain_shape[0], \"ytrain and xtrain have different shapes\"\n", + "assert ytrain_shape[1] == 1, \"ytrain (labels) has more than 1 column\"" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "xtest_shape = pd.read_parquet(tsk2.outputs['xtest'], engine='pyarrow').shape\n", + "ytest_shape = pd.read_parquet(tsk2.outputs['ytest'], engine='pyarrow').shape\n", + "assert (int(N_SAMPLES*.1), M_FEATURES) == xtest_shape, \"xtest doesn't have the expected shape\"\n", + "assert ytest_shape[0] == xtest_shape[0], \"ytest and xtest have different shapes\"\n", + "assert ytest_shape[1] == 1, \"ytest (test labels) has more than 1 column\"" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "from cloudpickle import load" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "assert len(load(open(tsk2.outputs['header'], 'rb'))) == M_FEATURES" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 767dea3f6c764cdd3748fef1ea73a4c30982b1e7 Mon Sep 17 00:00:00 2001 From: yasha Date: Sun, 26 Jan 2020 00:46:34 +0000 Subject: [PATCH 16/32] acquire-train-test completed --- datagen/README.md | 9 + evaluation/test-classifier.py | 178 +++++++++++++ evaluation/test-classifier.yaml | 8 + fileutils/README.md | 0 serving/README.md | 4 + tests/test_model.ipynb | 401 +++++++++++++++++++++++++++++ tests/train_valid_test_split.ipynb | 248 ++++++++++++++---- train/README.md | 5 + train/sklearn-classifier.py | 52 +++- 9 files changed, 843 insertions(+), 62 
deletions(-) create mode 100644 datagen/README.md create mode 100644 evaluation/test-classifier.py create mode 100644 evaluation/test-classifier.yaml create mode 100644 fileutils/README.md create mode 100644 serving/README.md create mode 100644 tests/test_model.ipynb create mode 100644 train/README.md diff --git a/datagen/README.md b/datagen/README.md new file mode 100644 index 000000000..a0bafad15 --- /dev/null +++ b/datagen/README.md @@ -0,0 +1,9 @@ +# data generators + +## classification + +**`binary`** generate binary classification data + +## splitters + +**`train_valid_test`** given a raw dataset, create 3 splits and save the results \ No newline at end of file diff --git a/evaluation/test-classifier.py b/evaluation/test-classifier.py new file mode 100644 index 000000000..b0e5cc5d1 --- /dev/null +++ b/evaluation/test-classifier.py @@ -0,0 +1,178 @@ +import os +import importlib +from cloudpickle import load + +import numpy as np +import pandas as pd +import lightgbm as lgb + +from sklearn.metrics import (roc_curve, confusion_matrix) +from sklearn.model_selection import train_test_split + +import matplotlib.pyplot as plt +from matplotlib.figure import Figure +import seaborn as sns + +from typing import Optional, Union, List + +from mlrun.execution import MLClientCtx +from mlrun.datastore import DataItem +from mlrun.artifacts import TableArtifact, PlotArtifact + + +def test_model( + context: Optional[MLClientCtx], + model: Union[DataItem, str], + xtest, + ytest, + target_path: str = '', + name: str = '', + key: str = '', + random_state = 1 +) -> None: + """Test a classifier model + + Using held-out test features, calls `model.predict(xtest)` and evaluates the accuracy of the + estimated model. 
+ + Can be part of a kubeflow pipeline as a test step or called + + :param context: the function context + :param model: estimated model file name as artifact store item + or pickle file name + :param xtest: test features file name as artifact store item + or pickle file name + :param header: (Optional) use if xtest does not have a header + :param ytest: test labels file name as artifact store + item or pickle file name + :param target_path: folder location of files + :param name: destination name for test results + :param key: key for model artifact + """ + # load model and data + if isinstance(model, DataItem): + clf = load(open(str(model), 'rb')) + else: + clf = load(open(model, 'rb')) + + if isinstance(xtest, DataItem): + xtest = pd.read_parquet(str(xtest)) + ytest = pd.read_parquet(str(ytest)) + else: + xtest = pd.read_parquet(xtest) + ytest = pd.read_parquet(ytest) + + if callable(getattr(clf, 'predict_proba', None)): + ypred_probs = clf.predict_proba(xtest)[:, 1] + ypred = np.where(ypred_probs >= 0.5, 1, 0) + plot_roc(context, ytest, ypred_probs, target_path) + else: + ypred = clf.predict(xtest) + ypred_probs = None + + plot_confusion_matrix(context, ytest, ypred, target_path) + + if hasattr(clf, 'feature_importances_'): + plot_importance(context, clf, xtest.columns.values, target_path) + +def _gcf_clear(plt): + plt.cla() + plt.clf() + plt.close() + +def plot_roc( + context: MLClientCtx, + y_labels, + y_probs, + target_path: str = '', + name='roc.png', + key='roc', + fmt='png' +): + """Plot an ROC curve from test data saved in an artifact store. 
+ + :param context: function context + :param y_labels: test data labels + :param y_probs: predicted probabilities for the test data + """ + fpr_xg, tpr_xg, _ = roc_curve(y_labels, y_probs) + plt.plot([0, 1], [0, 1], "k--") + plt.plot(fpr_xg, tpr_xg, label="roc") + plt.xlabel("false positive rate") + plt.ylabel("true positive rate") + plt.title("roc curve") + plt.legend(loc="best") + fig = plt.gcf() + + plotpath = os.path.join(target_path, name) + fig.savefig(plotpath, format=fmt) + context.log_artifact(PlotArtifact(key, body=fig)) + + _gcf_clear(plt) + +def plot_confusion_matrix( + context: MLClientCtx, + labels, + predictions, + target_path: str = '', + name: str ="confusion.png", + key: str ='confusion_matrix', + fmt: str = 'png' +): + """Create a confusion matrix. + Plot and save a confusion matrix using test data from a + pipeline step. + + :param context: function context + :param labels: test data labels + :param predictions: test data predictions + """ + cm = confusion_matrix(labels, + predictions, + sample_weight=None, + normalize='all') + sns.heatmap(cm, annot=True, cmap="Blues") + plotpath = os.path.join(target_path, name) + fig = plt.gcf() + fig.savefig(plotpath, format=fmt) + context.log_artifact(PlotArtifact(key, body=fig)) + + _gcf_clear(plt) + +def plot_importance( + context, + model, + header: List = [], + target_path: str = '', + name: str = 'feature-importances.png', + key: str = 'feature-importances', + fmt = 'png' +): + """Display estimated feature importances. 
+ + :param context: function context + :param model: fitted model exposing `feature_importances_` + :param header: list of feature names + """ + # create a feature importance table with desired labels + zipped = zip(model.feature_importances_, header) + + feature_imp = pd.DataFrame(sorted(zipped), columns=['freq','feature'] + ).sort_values(by="freq", ascending=False) + + plt.figure(figsize=(20, 10)) + sns.barplot(x="freq", y="feature", data=feature_imp) + plt.title('LightGBM Features') + plt.tight_layout() + fig = plt.gcf() + plotpath = os.path.join(target_path, name) + fig.savefig(plotpath, format=fmt) + context.log_artifact(PlotArtifact(key + '-plot', body=fig)) + + # feature importances are also saved as a table: + tablepath = os.path.join(target_path, key + '-table.csv') + feature_imp.to_csv(tablepath) + context.log_artifact(TableArtifact(key + '-table', target_path=tablepath)) + + # to ensure we don't overwrite this figure when creating the next: + _gcf_clear(plt) diff --git a/evaluation/test-classifier.yaml b/evaluation/test-classifier.yaml new file mode 100644 index 000000000..52d0ac16e --- /dev/null +++ b/evaluation/test-classifier.yaml @@ -0,0 +1,8 @@ +kind: job +metadata: + name: test-classifier +spec: + build: + functionSourceCode: 
aW1wb3J0IG9zCmltcG9ydCBpbXBvcnRsaWIKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgbG9hZAoKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IGxpZ2h0Z2JtIGFzIGxnYgoKZnJvbSBza2xlYXJuLm1ldHJpY3MgaW1wb3J0IChyb2NfY3VydmUsIGNvbmZ1c2lvbl9tYXRyaXgpCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQKCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQKZnJvbSBtYXRwbG90bGliLmZpZ3VyZSBpbXBvcnQgRmlndXJlCmltcG9ydCBzZWFib3JuIGFzIHNucwoKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbiwgTGlzdAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gbWxydW4uZGF0YXN0b3JlIGltcG9ydCBEYXRhSXRlbQpmcm9tIG1scnVuLmFydGlmYWN0cyBpbXBvcnQgVGFibGVBcnRpZmFjdCwgUGxvdEFydGlmYWN0CgoKZGVmIHRlc3RfbW9kZWwoCiAgICBjb250ZXh0OiBPcHRpb25hbFtNTENsaWVudEN0eF0sCiAgICBtb2RlbDogVW5pb25bRGF0YUl0ZW0sIHN0cl0sCiAgICB4dGVzdCwgCiAgICB5dGVzdCwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG5hbWU6IHN0ciA9ICcnLAogICAga2V5OiBzdHIgPSAnJywKICAgIHJhbmRvbV9zdGF0ZSA9IDEKKSAtPiBOb25lOgogICAgIiIiVGVzdCBhIGNsYXNzaWZpZXIgbW9kZWwKICAgIAogICAgVXNpbmcgaGVsZC1vdXQgdGVzdCBmZWF0dXJlcywgY2FsbHMgYG1vZGVsLnByZWRpY3QoeHRlc3QpYCBhbmQgZXZhbHVhdGVzIHRoZSBhY2N1cmFjeSBvZiB0aGUgCiAgICBlc3RpbWF0ZWQgbW9kZWwuCiAgICAKICAgIENhbiBiZSBwYXJ0IG9mIGEga3ViZWZsb3cgcGlwZWxpbmUgYXMgYSB0ZXN0IHN0ZXAgb3IgY2FsbGVkCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIHRoZSBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbW9kZWw6ICAgICAgICAgICBlc3RpbWF0ZWQgbW9kZWwgZmlsZSBuYW1lIGFzIGFydGlmYWN0IHN0b3JlIGl0ZW0KICAgICAgICAgICAgICAgICAgICAgICAgICAgIG9yIHBpY2tsZSBmaWxlIG5hbWUKICAgIDpwYXJhbSB4dGVzdDogICAgICAgICAgIHRlc3QgZmVhdHVyZXMgZmlsZSBuYW1lIGFzIGFydGlmYWN0IHN0b3JlIGl0ZW0KICAgICAgICAgICAgICAgICAgICAgICAgICAgIG9yIHBpY2tsZSBmaWxlIG5hbWUKICAgIDpwYXJhbSBoZWFkZXI6ICAgICAgICAgIChPcHRpb25hbCkgdXNlIGlmIHh0ZXN0IGRvZXMgbm90IGhhdmUgYSBoZWFkZXIKICAgIDpwYXJhbSB5dGVzdDogICAgICAgICAgIHRlc3QgbGFiZWxzIGZpbGUgbmFtZSBhcyBhcnRpZmFjdCBzdG9yZSAKICAgICAgICAgICAgICAgICAgICAgICAgICAgIGl0ZW0gb3IgcGlja2xlIGZpbGUgbmFtZQogICAgOnBhcmFtIHRhcmdldF9wYXRoOiAgICAgZm9sZGVyIGxvY2F0aW9uIG9mIGZpbGVzCiAgICA6cGFyYW0gbmFtZTog
ICAgICAgICAgICBkZXN0aW5hdGlvbiBuYW1lIGZvciB0ZXN0IHJlc3VsdHMKICAgIDpwYXJhbSBrZXk6ICAgICAgICAgICAgIGtleSBmb3IgbW9kZWwgYXJ0aWZhY3QKICAgICIiIgogICAgIyBsb2FkIG1vZGVsIGFuZCBkYXRhCiAgICBpZiBpc2luc3RhbmNlKG1vZGVsLCBEYXRhSXRlbSk6CiAgICAgICAgY2xmID0gbG9hZChvcGVuKHN0cihtb2RlbCksICdyYicpKQogICAgZWxzZToKICAgICAgICBjbGYgPSBsb2FkKG9wZW4obW9kZWwsICdyYicpKQoKICAgIGlmIGlzaW5zdGFuY2UoeHRlc3QsIERhdGFJdGVtKToKICAgICAgICB4dGVzdCA9IHBkLnJlYWRfcGFycXVldChzdHIoeHRlc3QpKQogICAgICAgIHl0ZXN0ID0gcGQucmVhZF9wYXJxdWV0KHN0cih5dGVzdCkpCiAgICBlbHNlOgogICAgICAgIHh0ZXN0ID0gcGQucmVhZF9wYXJxdWV0KHh0ZXN0KQogICAgICAgIHl0ZXN0ID0gcGQucmVhZF9wYXJxdWV0KHl0ZXN0KQogICAgCiAgICBpZiBjYWxsYWJsZShnZXRhdHRyKGNsZiwgJ3ByZWRpY3RfcHJvYmEnKSk6CiAgICAgICAgeXByZWRfcHJvYnMgPSBjbGYucHJlZGljdF9wcm9iYSh4dGVzdClbOiwgMV0KICAgICAgICB5cHJlZCA9IG5wLndoZXJlKHlwcmVkX3Byb2JzID49IDAuNSwgMSwgMCkKICAgICAgICBwbG90X3JvYyhjb250ZXh0LCB5dGVzdCwgeXByZWRfcHJvYnMsIHRhcmdldF9wYXRoKQogICAgZWxzZToKICAgICAgICB5cHJlZCA9IGNsZi5wcmVkaWN0KHh0ZXN0KQogICAgICAgIHlwcmVkX3Byb2JzID0gTm9uZQogICAgCiAgICBwbG90X2NvbmZ1c2lvbl9tYXRyaXgoY29udGV4dCwgeXRlc3QsIHlwcmVkLCB0YXJnZXRfcGF0aCkKCiAgICBpZiBoYXNhdHRyKGNsZiwgJ2ZlYXR1cmVfaW1wb3J0YW5jZXNfJyk6CiAgICAgICAgcGxvdF9pbXBvcnRhbmNlKGNvbnRleHQsIGNsZiwgeHRlc3QuY29sdW1ucy52YWx1ZXMsIHRhcmdldF9wYXRoKQoKZGVmIF9nY2ZfY2xlYXIocGx0KToKICAgIHBsdC5jbGEoKQogICAgcGx0LmNsZigpCiAgICBwbHQuY2xvc2UoKSAgICAgICAgCgpkZWYgcGxvdF9yb2MoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwgCiAgICB5X2xhYmVscywKICAgIHlfcHJvYnMsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lPSdyb2MucG5nJywKICAgIGtleT0ncm9jJywKICAgIGZtdD0ncG5nJwopOgogICAgIiIiUGxvdCBhbiBST0MgY3VydmUgZnJvbSB0ZXN0IGRhdGEgc2F2ZWQgaW4gYW4gYXJ0aWZhY3Qgc3RvcmUuCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSB5X2xhYmVsczogICAgICAgIHRlc3QgZGF0YSBsYWJlbHMKICAgIDpwYXJhbSB5X3Byb2JzOiAgICAgICAgIHRlc3QgZGF0YSAKICAgICIiIgogICAgZnByX3hnLCB0cHJfeGcsIF8gPSByb2NfY3VydmUoeV9sYWJlbHMsIHlfcHJvYnMpCiAgICBwbHQucGxvdChbMCwgMV0sIFswLCAxXSwgImstLSIpCiAgICBwbHQucGxvdChmcHJfeGcsIHRwcl94ZywgbGFiZWw9
InJvYyIpCiAgICBwbHQueGxhYmVsKCJmYWxzZSBwb3NpdGl2ZSByYXRlIikKICAgIHBsdC55bGFiZWwoInRydWUgcG9zaXRpdmUgcmF0ZSIpCiAgICBwbHQudGl0bGUoInJvYyBjdXJ2ZSIpCiAgICBwbHQubGVnZW5kKGxvYz0iYmVzdCIpCiAgICBmaWcgPSBwbHQuZ2NmKCkKCiAgICBwbG90cGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSkKICAgIGZpZy5zYXZlZmlnKHBsb3RwYXRoLCBmb3JtYXQ9Zm10KQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KGtleSwgYm9keT1maWcpKQoKICAgIF9nY2ZfY2xlYXIocGx0KQoKZGVmIHBsb3RfY29uZnVzaW9uX21hdHJpeCgKICAgIGNvbnRleHQ6IE1MQ2xpZW50Q3R4LCAKICAgIGxhYmVscywgCiAgICBwcmVkaWN0aW9ucywKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywgCiAgICBuYW1lOiBzdHIgPSJjb25mdXNpb24ucG5nIiwgCiAgICBrZXk6IHN0ciA9J2NvbmZ1c2lvbl9tYXRyaXgnLAogICAgZm10OiBzdHIgPSAncG5nJwopOgogICAgIiIiQ3JlYXRlIGEgY29uZnVzaW9uIG1hdHJpeC4KICAgIFBsb3QgYW5kIHNhdmUgYSBjb25mdXNpb24gbWF0cml4IHVzaW5nIHRlc3QgZGF0YSBmcm9tIGEKICAgIHBpcGVsaW5lIHN0ZXAuCgogICAgOnBhcmFtIGNvbnRleHQ6ICAgICAgICAgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIGxhYmVsczogICAgICAgICAgdGVzdCBkYXRhIGxhYmVscwogICAgOnBhcmFtIHByZWRpY3Rpb25zOiAgICAgdGVzdCBkYXRhIHByZWRpY3Rpb25zCiAgICAiIiIKICAgIGNtID0gY29uZnVzaW9uX21hdHJpeChsYWJlbHMsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBwcmVkaWN0aW9ucywKICAgICAgICAgICAgICAgICAgICAgICAgICAgIHNhbXBsZV93ZWlnaHQ9Tm9uZSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgIG5vcm1hbGl6ZT0nYWxsJykKICAgIHNucy5oZWF0bWFwKGNtLCBhbm5vdD1UcnVlLCBjbWFwPSJCbHVlcyIpCiAgICBwbG90cGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSkKICAgIGZpZyA9IHBsdC5nY2YoKQogICAgZmlnLnNhdmVmaWcocGxvdHBhdGgsIGZvcm1hdD1mbXQpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3Qoa2V5LCBib2R5PWZpZykpCgogICAgX2djZl9jbGVhcihwbHQpCgpkZWYgcGxvdF9pbXBvcnRhbmNlKAogICAgY29udGV4dCwKICAgIG1vZGVsLAogICAgaGVhZGVyOiBMaXN0ID0gW10sCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAnZmVhdHVyZS1pbXBvcnRhbmNlcy5wbmcnLAogICAga2V5OiBzdHIgPSAnZmVhdHVyZS1pbXBvcnRhbmNlcycsCiAgICBmbXQgPSAncG5nJwopOgogICAgIiIiRGlzcGxheSBlc3RpbWF0ZWQgZmVhdHVyZSBpbXBvcnRhbmNlcy4KCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBtb2RlbDogICAgICAgZml0dGVkIGxpZ2h0Z2Jt
IG1vZGVsCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGxpc3Qgb2YgZmVhdHVyZSBuYW1lcwogICAgIiIiCiAgICAjIGNyZWF0ZSBhIGZlYXR1cmUgaW1wb3J0YW5jZSB0YWJsZSB3aXRoIGRlc2lyZWQgbGFiZWxzCiAgICB6aXBwZWQgPSB6aXAobW9kZWwuZmVhdHVyZV9pbXBvcnRhbmNlc18sIGhlYWRlcikKCiAgICBmZWF0dXJlX2ltcCA9IHBkLkRhdGFGcmFtZShzb3J0ZWQoemlwcGVkKSwgY29sdW1ucz1bJ2ZyZXEnLCdmZWF0dXJlJ10KICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICApLnNvcnRfdmFsdWVzKGJ5PSJmcmVxIiwgYXNjZW5kaW5nPUZhbHNlKQoKICAgIHBsdC5maWd1cmUoZmlnc2l6ZT0oMjAsIDEwKSkKICAgIHNucy5iYXJwbG90KHg9ImZyZXEiLCB5PSJmZWF0dXJlIiwgZGF0YT1mZWF0dXJlX2ltcCkKICAgIHBsdC50aXRsZSgnTGlnaHRHQk0gRmVhdHVyZXMnKQogICAgcGx0LnRpZ2h0X2xheW91dCgpCiAgICBmaWcgPSBwbHQuZ2NmKCkKICAgIHBsb3RwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZmlnLnNhdmVmaWcocGxvdHBhdGgsIGZvcm1hdD0ncG5nJykKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdChrZXkgKyAnLXBsb3QnLCBib2R5PWZpZykpCgogICAgIyBmZWF0dXJlIGltcG9ydGFuY2VzIGFyZSBhbHNvIHNhdmVkIGFzIGEgdGFibGU6CiAgICB0YWJsZXBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIGtleSArICctdGFibGUuY3N2JykKICAgIGZlYXR1cmVfaW1wLnRvX2Nzdih0YWJsZXBhdGgpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChUYWJsZUFydGlmYWN0KGtleSArICctdGFibGUnLCB0YXJnZXRfcGF0aD10YWJsZXBhdGgpKQoKICAgICMgdG8gZW5zdXJlIHdlIGRvbid0IG92ZXJ3cml0ZSB0aGlzIGZpZ3VyZSB3aGVuIGNyZWF0aW5nIHRoZSBuZXh0OgogICAgX2djZl9jbGVhcihwbHQpCg== + base_image: yjbds/mlrun-ds:latest + commands: [] diff --git a/fileutils/README.md b/fileutils/README.md new file mode 100644 index 000000000..e69de29bb diff --git a/serving/README.md b/serving/README.md new file mode 100644 index 000000000..d6f2e8a36 --- /dev/null +++ b/serving/README.md @@ -0,0 +1,4 @@ +# serving models + +**`xgboost/xgb-serving.ipynb`** deploy an xgboost server model
+**`classifier_server.ipynb`** deploy any classifier model that has been pickled (cloudpickle) \ No newline at end of file diff --git a/tests/test_model.ipynb b/tests/test_model.ipynb new file mode 100644 index 000000000..233b043c9 --- /dev/null +++ b/tests/test_model.ipynb @@ -0,0 +1,401 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# test a model\n", + "\n", + "Test your model right after training in a kubeflow pipeline, or run this function independently. In addition, the plotting components in **[test_classifier.py]()** can also be run independently." + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "import os\n", + "import numpy as np\n", + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [], + "source": [ + "CODE_BASE = '/User/repos/functions'\n", + "\n", + "MODEL_FILE = '/User/mlrun/models/lgb-classifier.pkl'\n", + "\n", + "TARGET_DATA_PATH = '/User/mlrun/splitter'\n", + "XTEST_FILE = '/User/mlrun/splitter/xtest.pqt'\n", + "YTEST_FILE = '/User/mlrun/splitter/ytest.pqt'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## run tests" + ] + }, + { + "cell_type": "code", + "execution_count": 122, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 00:40:51,865 function spec saved to path: /User/repos/functions/evaluation/test-classifier.yaml\n" + ] + } + ], + "source": [ + "yaml_name = os.path.join(CODE_BASE, 'evaluation', 'test-classifier.yaml')\n", + "if not os.path.isfile(yaml_name):\n", + " testfn = mlrun.code_to_function(\n", + " kind='job', \n", + " image='yjbds/mlrun-ds:latest',\n", + " filename=os.path.join(CODE_BASE, 'evaluation', 'test-classifier.py'))\n", + "\n", 
+ " testfn.export(os.path.join(CODE_BASE, 'evaluation', 'test-classifier.yaml'))" + ] + }, + { + "cell_type": "code", + "execution_count": 123, + "metadata": {}, + "outputs": [], + "source": [ + "testfn = mlrun.import_function(\n", + " os.path.join(CODE_BASE, 'evaluation', 'test-classifier.yaml')\n", + ").apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "code", + "execution_count": 124, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ready'" + ] + }, + "execution_count": 124, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "testfn.deploy(skip_deployed=True, with_mlrun=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 125, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 125, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "task = mlrun.NewTask()\n", + "task.with_params(\n", + " model=MODEL_FILE,\n", + " xtest=XTEST_FILE,\n", + " ytest=YTEST_FILE,\n", + " target_path=TARGET_DATA_PATH)" + ] + }, + { + "cell_type": "code", + "execution_count": 126, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 00:41:47,446 starting run test_model uid=ea859673d08b4eccbd7746b9d36fb8e8 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 00:41:47,550 Job is running in the background, pod: test-model-shb4b\n", + "[mlrun] 2020-01-26 00:42:01,686 log artifact roc.html at roc.html, size: 40483, db: Y\n", + "[mlrun] 2020-01-26 00:42:02,895 log artifact confusion_matrix.html at confusion_matrix.html, size: 15292, db: Y\n", + "[mlrun] 2020-01-26 00:42:03,498 log artifact feature-importances-plot.html at feature-importances-plot.html, size: 67516, db: Y\n", + "[mlrun] 2020-01-26 00:42:03,525 log artifact feature-importances-table at /User/mlrun/splitter/feature-importances-table.csv, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-26 00:42:03,596 run executed, 
status=completed\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", + " labels = getattr(columns, 'labels', None) or [\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", + " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", + " labels, = index.labels\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...6fb8e8
0Jan 26 00:41:54completedtest-classifier
host=test-model-shb4b
kind=job
owner=admin
model=/User/mlrun/models/lgb-classifier.pkl
target_path=/User/mlrun/splitter
xtest=/User/mlrun/splitter/xtest.pqt
ytest=/User/mlrun/splitter/ytest.pqt
roc.html
confusion_matrix.html
feature-importances-plot.html
feature-importances-table
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run ea859673d08b4eccbd7746b9d36fb8e8 , !mlrun logs ea859673d08b4eccbd7746b9d36fb8e8 \n", + "[mlrun] 2020-01-26 00:42:06,762 run executed, status=completed\n" + ] + } + ], + "source": [ + "tsk_run = testfn.run(task, handler='test_model')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/tests/train_valid_test_split.ipynb b/tests/train_valid_test_split.ipynb index 9753ded7b..41c4064cc 100644 --- a/tests/train_valid_test_split.ipynb +++ b/tests/train_valid_test_split.ipynb @@ -28,13 +28,13 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "CODE_BASE = '/User/repos/functions/datagen' \n", "N_SAMPLES = 10_000_000 # size of HIGGS data\n", - "M_FEATURES = 20\n", + "M_FEATURES = 28\n", "NEG_WEIGHT = 0.5\n", "RNG = 1\n", "TARGET_DATA_PATH = '/User/mlrun/splitter'\n", @@ -50,18 +50,146 @@ }, { "cell_type": "code", - "execution_count": 3, - "metadata": {}, + "execution_count": 4, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-23 15:05:06,305 starting run create_binary_classification uid=971b0eb4b2b64d3a948cb29ad8735dd2 -> http://mlrun-api:8080\n", - "[mlrun] 
2020-01-23 15:05:06,427 Job is running in the background, pod: create-binary-classification-dx457\n", - "[mlrun] 2020-01-23 15:05:53,240 log artifact simdata at /User/mlrun/splitter/simdata.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-25 23:30:01,679 starting remote build, image: .mlrun/func-default-binary-latest\n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", + "\u001b[36mINFO\u001b[0m[0047] Taking snapshot of full filesystem... 
\n", + "\u001b[36mINFO\u001b[0m[0067] RUN pip install mlrun \n", + "\u001b[36mINFO\u001b[0m[0067] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0067] args: [-c pip install mlrun] \n", + "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", + "Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", + "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", + "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", + "Requirement already satisfied: kfp>=0.1.29 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", + "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", + "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.1)\n", + "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", + "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", + "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", + "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.7)\n", + "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", + "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", + "Requirement already satisfied: nest-asyncio>=1.0.0 in 
/opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", + "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", + "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: chardet<4.0,>=2.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.4)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", + "Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.7/site-packages (from croniter==0.3.31->mlrun) (2.8.0)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", + "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", + "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", + "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", + "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", + "Requirement 
already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", + "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", + "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", + "Requirement already satisfied: urllib3<1.25,>=1.15 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.24.1)\n", + "Requirement already satisfied: certifi in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2019.9.11)\n", + "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", + "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", + "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", + "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", + "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", + "Requirement already satisfied: botocore<1.15.0,>=1.14.7 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.7)\n", + "Requirement 
already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", + "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", + "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", + "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", + "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", + "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", + "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", + "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", + "Requirement already satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from 
Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", + "Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (41.0.1.post20191122)\n", + "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", + "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", + "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", + "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.7->boto3>=1.9->mlrun) (0.15.2)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", + "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", + "Requirement already 
satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", + "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", + "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", + "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", + "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", + "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", + 
"Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: nbformat>=4.4 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", + "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", + "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", + "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", + "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", + "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", + "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", + "Requirement already satisfied: zipp>=0.5 in 
/opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", + "Requirement already satisfied: pyasn1>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", + "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", + "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", + "Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", + "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", + "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", + "Requirement already satisfied: more-itertools in /opt/conda/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < 
\"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", + "\u001b[36mINFO\u001b[0m[0069] Taking snapshot of full filesystem... \n", + "[mlrun] 2020-01-25 23:31:19,830 starting run create_binary_classification uid=c0e0d32541bb4312aaba3c223860ca7d -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-25 23:31:19,909 Job is running in the background, pod: create-binary-classification-j8pzv\n", + "[mlrun] 2020-01-25 23:32:19,610 log artifact simdata at /User/mlrun/splitter/simdata.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-23 15:05:53,450 run executed, status=completed\n", + "[mlrun] 2020-01-25 23:32:19,918 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", "final state: succeeded\n" @@ -236,26 +364,26 @@ " \n", " \n", " \n", - "
...735dd2
\n", + "
...60ca7d
\n", " 0\n", - " Jan 23 15:05:18\n", + " Jan 25 23:31:30\n", " completed\n", " binary\n", - "
host=create-binary-classification-dx457
kind=job
owner=admin
\n", + "
host=create-binary-classification-j8pzv
kind=job
owner=admin
\n", " \n", - "
filename=simdata.pqt
key=simdata
m_features=20
n_samples=10000000
random_state=1
target_path=/User/mlrun/splitter
weight=0.5
\n", + "
filename=simdata.pqt
key=simdata
m_features=28
n_samples=10000000
random_state=1
target_path=/User/mlrun/splitter
weight=0.5
\n", " \n", "
simdata
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -271,8 +399,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 971b0eb4b2b64d3a948cb29ad8735dd2 , !mlrun logs 971b0eb4b2b64d3a948cb29ad8735dd2 \n", - "[mlrun] 2020-01-23 15:05:55,758 run executed, status=completed\n" + "!mlrun get run c0e0d32541bb4312aaba3c223860ca7d , !mlrun logs c0e0d32541bb4312aaba3c223860ca7d \n", + "[mlrun] 2020-01-25 23:32:29,316 run executed, status=completed\n" ] } ], @@ -298,7 +426,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -307,7 +435,7 @@ "{'simdata': '/User/mlrun/splitter/simdata.pqt'}" ] }, - "execution_count": 4, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -325,7 +453,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -338,7 +466,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -349,44 +477,51 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 8, "metadata": {}, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-25 23:32:48,510 starting remote build, image: .mlrun/func-default-train-valid-test-latest\n" + ] + }, { "data": { "text/plain": [ - "'ready'" + "True" ] }, - "execution_count": 7, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "splitter.deploy(skip_deployed=True)" + "splitter.deploy(skip_deployed=True, with_mlrun=False)" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-23 15:05:55,845 starting run train_valid_test_splitter uid=d32826574ee84beab365d3acc30e4b31 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-23 15:05:55,923 Job is running in the background, pod: 
train-valid-test-splitter-b4gxz\n", - "[mlrun] 2020-01-23 15:06:21,321 log artifact header at /User/mlrun/splitter/header.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-23 15:06:30,458 log artifact xtrain at /User/mlrun/splitter/xtrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-23 15:06:33,972 log artifact xvalid at /User/mlrun/splitter/xvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-23 15:06:35,603 log artifact xtest at /User/mlrun/splitter/xtest.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-23 15:06:36,264 log artifact ytrain at /User/mlrun/splitter/ytrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-23 15:06:36,705 log artifact yvalid at /User/mlrun/splitter/yvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-23 15:06:36,932 log artifact ytest at /User/mlrun/splitter/ytest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-25 23:32:50,077 starting run train_valid_test_splitter uid=eaf9cfd6724b437da1e926bbd2daa040 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-25 23:32:50,176 Job is running in the background, pod: train-valid-test-splitter-kqxj9\n", + "[mlrun] 2020-01-25 23:33:23,093 log artifact header at /User/mlrun/splitter/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-25 23:33:36,963 log artifact xtrain at /User/mlrun/splitter/xtrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-25 23:33:42,116 log artifact xvalid at /User/mlrun/splitter/xvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-25 23:33:44,494 log artifact xtest at /User/mlrun/splitter/xtest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-25 23:33:45,314 log artifact ytrain at /User/mlrun/splitter/ytrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-25 23:33:45,599 log artifact yvalid at /User/mlrun/splitter/yvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-25 23:33:45,811 log artifact ytest at /User/mlrun/splitter/ytest.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-23 15:06:37,271 run executed, status=completed\n", + "[mlrun] 2020-01-25 23:33:46,258 run executed, 
status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", " labels = getattr(columns, 'labels', None) or [\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", @@ -567,12 +702,12 @@ " \n", " \n", " \n", - "
...0e4b31
\n", + "
...daa040
\n", " 0\n", - " Jan 23 15:06:05\n", + " Jan 25 23:33:02\n", " completed\n", " train-valid-test\n", - "
host=train-valid-test-splitter-b4gxz
kind=job
owner=admin
\n", + "
host=train-valid-test-splitter-kqxj9
kind=job
owner=admin
\n", " \n", "
random_state=1
src_file=/User/mlrun/splitter/simdata.pqt
target_path=/User/mlrun/splitter
\n", " \n", @@ -581,12 +716,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -602,8 +737,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run d32826574ee84beab365d3acc30e4b31 , !mlrun logs d32826574ee84beab365d3acc30e4b31 \n", - "[mlrun] 2020-01-23 15:06:45,221 run executed, status=completed\n" + "!mlrun get run eaf9cfd6724b437da1e926bbd2daa040 , !mlrun logs eaf9cfd6724b437da1e926bbd2daa040 \n", + "[mlrun] 2020-01-25 23:33:49,482 run executed, status=completed\n" ] } ], @@ -619,7 +754,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -634,7 +769,7 @@ " 'ytest': '/User/mlrun/splitter/ytest.pqt'}" ] }, - "execution_count": 9, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -652,7 +787,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -661,7 +796,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -677,7 +812,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ @@ -690,7 +825,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ @@ -699,12 +834,19 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "assert len(load(open(tsk2.outputs['header'], 'rb'))) == M_FEATURES" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { diff --git a/train/README.md b/train/README.md new file mode 100644 index 000000000..6aa7c6d4b --- /dev/null +++ b/train/README.md @@ -0,0 +1,5 @@ +# training functions + +1. **`sklearn-classify`**
+train any sklearn classifier model + \ No newline at end of file diff --git a/train/sklearn-classifier.py b/train/sklearn-classifier.py index c103af7f4..795d38ace 100644 --- a/train/sklearn-classifier.py +++ b/train/sklearn-classifier.py @@ -81,17 +81,22 @@ def train( # create classifier class from string and instantiate splits = SKClassifier.split(".") clfclass = getattr(importlib.import_module(".".join(splits[:-1])), splits[-1]) - clf = clfclass(random_state=random_state, verbose=int(verbose == True)) + model = clfclass(random_state=random_state, verbose=int(verbose == True)) - clf.fit(xtrain, - ytrain, - eval_set=[(xvalid, yvalid), (xtrain, ytrain)], - eval_names=['valid', 'train'], - callbacks=callbacks, - verbose=verbose) + model.fit(xtrain, + ytrain, + eval_set=[(xvalid, yvalid), (xtrain, ytrain)], + eval_names=['valid', 'train'], + callbacks=callbacks, + verbose=verbose) context.log_result("train_accuracy", float(clf.score(xtrain, ytrain))) - + + # plot train and validation history, save and log + loss = np.asarray(model.evals_result_['train']['binary_logloss'], dtype=np.float) + val_loss = np.asarray(model.evals_result_['valid']['binary_logloss'], dtype=np.float) + plot_validation(loss, val_loss) + # save model filepath = os.path.join(target_path, name) dump(clf, open(filepath, 'wb')) @@ -101,4 +106,33 @@ def train( fname = t + 'test.pkl' filepath = os.path.join(target_path, fname) dump(xtest, open(filepath, 'wb')) - context.log_artifact(t+'test', target_path=filepath) \ No newline at end of file + context.log_artifact(t+'test', target_path=filepath) + + +def plot_validation(train_metric, valid_metric): + """Plot train and validation loss curves + + These curves represent the training round losses from the training + and validation sets. 
+ + :param train_metric: train metric + :param valid_metric: validation metric + """ + # generate plot + plt.plot(train_metric) + plt.plot(valid_metric) + plt.title("training validation results") + plt.xlabel("epoch") + plt.ylabel("") + plt.legend(["train", "valid"]) + fig = plt.gcf() + + # save figure and log artifact + plotpath = path.join(target_path, "history.png") + plt.savefig(plotpath) + context.log_artifact(PlotArtifact('training-validation-plot', body=fig, target_path=plotpath)) + + # plot cleanup + plt.cla() + plt.clf() + plt.close() From 25e611e4bd05320d342708ce786522bfecaa0e51 Mon Sep 17 00:00:00 2001 From: yasha Date: Sun, 26 Jan 2020 00:47:25 +0000 Subject: [PATCH 17/32] rename file --- tests/{test_model.ipynb => test_classifier.ipynb} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename tests/{test_model.ipynb => test_classifier.ipynb} (100%) diff --git a/tests/test_model.ipynb b/tests/test_classifier.ipynb similarity index 100% rename from tests/test_model.ipynb rename to tests/test_classifier.ipynb From e4d74d784d42fb25cc75cbcab6d817bb1d2b150c Mon Sep 17 00:00:00 2001 From: yasha Date: Sun, 26 Jan 2020 13:27:43 +0000 Subject: [PATCH 18/32] minor fixes --- datagen/classification/binary.yaml | 13 +- datagen/splitters/train_valid_test.yaml | 11 + evaluation/test-classifier.py | 2 + evaluation/test-classifier.yaml | 13 +- tests/arc_to_parquet.ipynb | 193 +-------- tests/create_binary_data.ipynb | 90 ++-- tests/test_classifier.ipynb | 99 +++-- tests/train_classifier.ipynb | 519 +++++++++--------------- tests/train_valid_test_split.ipynb | 269 ++++-------- train/sklearn-classifier.py | 96 ++--- train/sklearn-classifier.yaml | 14 +- 11 files changed, 474 insertions(+), 845 deletions(-) diff --git a/datagen/classification/binary.yaml b/datagen/classification/binary.yaml index e95e8c57d..90434e184 100644 --- a/datagen/classification/binary.yaml +++ b/datagen/classification/binary.yaml @@ -2,12 +2,17 @@ kind: job metadata: name: binary tag: '' - hash: 
35ea8daeef209d353ee6b75d3cca2b61b16b8e6a + hash: 0a0a5369f0fcf38a0f26b29aa8295046e8fcb4a7 project: '' spec: - description: 'create binary classification data' + command: '' + args: [] + volumes: [] + volume_mounts: [] + env: [] + description: '' build: - functionSourceCode: IyBDb3B5cmlnaHQgMjAxOSBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKaW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgcHlhcnJvdyBhcyBwYQppbXBvcnQgcHlhcnJvdy5wYXJxdWV0IGFzIHBxCmZyb20gdHlwaW5nIGltcG9ydCBPcHRpb25hbCwgTGlzdCwgQW55CmZyb20gc2tsZWFybi5kYXRhc2V0cyBpbXBvcnQgbWFrZV9jbGFzc2lmaWNhdGlvbgoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CgoKZGVmIGNyZWF0ZV9iaW5hcnlfY2xhc3NpZmljYXRpb24oCiAgICBjb250ZXh0IDogTUxDbGllbnRDdHggPSBOb25lLAogICAgbl9zYW1wbGVzIDogaW50ID0gMTAwXzAwMCwKICAgIG1fZmVhdHVyZXMgOiBpbnQgPSAyMCwKICAgIGZlYXR1cmVzX2hkciA6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lLAogICAgd2VpZ2h0IDogZmxvYXQgPSAwLjUwLAogICAgcmFuZG9tX3N0YXRlIDogaW50ID0xLAogICAgZmlsZW5hbWUgOiBPcHRpb25hbFtzdHJdID0gTm9uZSwKICAgIHRhcmdldF9wYXRoIDogc3RyID0gIiIsCiAgICBrZXkgOiBzdHIgPSAiIgopOgogICAgIiIiQ3JlYXRlIGEgYmluYXJ5IGNsYXNzaWZpY2F0aW9uIHNhbXBsZSBkYXRhc2V0IGFuZCBzYXZlLgogICAgSWYgbm8gZmlsZW5hbWUgaXMgZ2l2ZW4gaXQgd2lsbCBkZWZhdWx0IHRvOgogICAgJ3NpbWRhdGEte25fc2FtcGxlc31Ye21fZmVhdHVyZXN9LnBhcnF1ZXQnLgogICAgQWxsIG9mIHRoZSBzY2lraXQtbGVhcm4gcGFyYW1ldGVycyBjY
W4gYmUgc2V0IHVzaW5nICoqc2tfcGFyYW1zCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIG5fc2FtcGxlczogICAgIG51bWJlciBvZiByb3dzL3NhbXBsZXMKICAgIDpwYXJhbSBtX2ZlYXR1cmVzOiAgICBudW1iZXIgb2YgY29scy9mZWF0dXJlcwogICAgOnBhcmFtIGZlYXR1cmVzX2hkcjogIGhlYWRlciBmb3IgZmVhdHVyZXMgYXJyYXkKICAgIDpwYXJhbSB3ZWlnaHQ6ICAgICAgICBmcmFjdGlvbiBvZiBzYW1wbGUgKG5lZykKICAgIDpwYXJhbSByYW5kb21fc3RhdGU6ICBybmcgc2VlZCAoc2VlIGh0dHBzOi8vc2Npa2l0LWxlYXJuLm9yZy9zdGFibGUvZ2xvc3NhcnkuaHRtbCN0ZXJtLXJhbmRvbS1zdGF0ZSkKICAgIDpwYXJhbSBmaWxlbmFtZTogICAgICBvcHRpb25hbCBuYW1lIGZvciBzdG9yZWQgZGF0YSBmaWxlCiAgICA6cGFyYW0gdGFyZ2V0X3BhdGg6ICAgZGVzdGltYXRpb24gZm9yIGZpbGUKICAgIDpwYXJhbSBrZXk6ICAgICAgICAgICBrZXkgb2YgZGF0YSBpbiBhcnRpZmFjdCBzdG9yZQogICAgUmV0dXJucyBmaWxlbmFtZSBvZiBjcmVhdGVkIGRhdGEgKGluY2x1ZGVzIHBhdGgpLgogICAgIiIiCiAgICAjIGNoZWNrIGRpcmVjdG9yaWVzIGV4aXN0IGFuZCBjcmVhdGUgZmlsZW5hbWUgaWYgTm9uZToKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9wYXRoLCBleGlzdF9vaz1UcnVlKQogICAgaWYgbm90IGZpbGVuYW1lOgogICAgICAgIG5hbWUgPSBmInNpbWRhdGEte25fc2FtcGxlczowLjBlfVh7bV9mZWF0dXJlc30ucGFycXVldCIucmVwbGFjZSgiKyIsICIiKQogICAgICAgIGZpbGVuYW1lID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZWxzZToKICAgICAgICBmaWxlbmFtZSA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgZmlsZW5hbWUpCiAgICAKICAgIGZlYXR1cmVzLCBsYWJlbHMgPSBtYWtlX2NsYXNzaWZpY2F0aW9uKAogICAgICAgIG5fc2FtcGxlcz1uX3NhbXBsZXMsCiAgICAgICAgbl9mZWF0dXJlcz1tX2ZlYXR1cmVzLAogICAgICAgIHdlaWdodHM9W3dlaWdodF0sICAjIEZhbHNlCiAgICAgICAgbl9jbGFzc2VzPTIsCiAgICAgICAgcmFuZG9tX3N0YXRlPXJhbmRvbV9zdGF0ZSkKCiAgICAjIG1ha2UgZGF0YWZyYW1lcywgYWRkIGNvbHVtbiBuYW1lcywgY29uY2F0ZW5hdGUgKFgsIHkpCiAgICBYID0gcGQuRGF0YUZyYW1lKGZlYXR1cmVzKQogICAgaWYgbm90IGZlYXR1cmVzX2hkcjoKICAgICAgICBYLmNvbHVtbnMgPSBbImZlYXRfIiArIHN0cih4KSBmb3IgeCBpbiByYW5nZShtX2ZlYXR1cmVzKV0KICAgIGVsc2U6CiAgICAgICAgWC5jb2x1bW5zID0gZmVhdHVyZXNfaGRyCgogICAgeSA9IHBkLkRhdGFGcmFtZShsYWJlbHMsIGNvbHVtbnM9WyJsYWJlbHMiXSkKICAgIGRhdGEgPSBwZC5jb25jYXQoW1gsIHldLCBheGlzPTEpCgogICAgcHEud3JpdGVfdGFibGUocGEuVGFibGUuZnJvbV9wYW5kYXMoZGF0YSksIGZpbGVuYW1lKQogICAgY
29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1maWxlbmFtZSkK + functionSourceCode: IyBDb3B5cmlnaHQgMjAxOSBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKaW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgcHlhcnJvdyBhcyBwYQppbXBvcnQgcHlhcnJvdy5wYXJxdWV0IGFzIHBxCmZyb20gdHlwaW5nIGltcG9ydCBPcHRpb25hbCwgTGlzdCwgQW55CmZyb20gc2tsZWFybi5kYXRhc2V0cyBpbXBvcnQgbWFrZV9jbGFzc2lmaWNhdGlvbgoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CgoKZGVmIGNyZWF0ZV9iaW5hcnlfY2xhc3NpZmljYXRpb24oCiAgICBjb250ZXh0IDogTUxDbGllbnRDdHggPSBOb25lLAogICAgbl9zYW1wbGVzIDogaW50ID0gMTAwXzAwMCwKICAgIG1fZmVhdHVyZXMgOiBpbnQgPSAyMCwKICAgIGZlYXR1cmVzX2hkciA6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lLAogICAgd2VpZ2h0IDogZmxvYXQgPSAwLjUwLAogICAgcmFuZG9tX3N0YXRlIDogaW50ID0xLAogICAgZmlsZW5hbWUgOiBPcHRpb25hbFtzdHJdID0gTm9uZSwKICAgIHRhcmdldF9wYXRoIDogc3RyID0gIiIsCiAgICBrZXkgOiBzdHIgPSAiIgopOgogICAgIiIiQ3JlYXRlIGEgYmluYXJ5IGNsYXNzaWZpY2F0aW9uIHNhbXBsZSBkYXRhc2V0IGFuZCBzYXZlLgogICAgSWYgbm8gZmlsZW5hbWUgaXMgZ2l2ZW4gaXQgd2lsbCBkZWZhdWx0IHRvOgogICAgJ3NpbWRhdGEte25fc2FtcGxlc31Ye21fZmVhdHVyZXN9LnBhcnF1ZXQnLgogICAgQWxsIG9mIHRoZSBzY2lraXQtbGVhcm4gcGFyYW1ldGVycyBjYW4gYmUgc2V0IHVzaW5nICoqc2tfcGFyYW1zCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbl9zYW1wbGVzOiAgICAgbnVtYmVyIG9mIHJvd3Mvc2FtcGxlcwogICAgOnBhcmFtIG1fZmVhdH
VyZXM6ICAgIG51bWJlciBvZiBjb2xzL2ZlYXR1cmVzCiAgICA6cGFyYW0gZmVhdHVyZXNfaGRyOiAgaGVhZGVyIGZvciBmZWF0dXJlcyBhcnJheQogICAgOnBhcmFtIHdlaWdodDogICAgICAgIGZyYWN0aW9uIG9mIHNhbXBsZSAobmVnKQogICAgOnBhcmFtIHJhbmRvbV9zdGF0ZTogIHJuZyBzZWVkIChzZWUgaHR0cHM6Ly9zY2lraXQtbGVhcm4ub3JnL3N0YWJsZS9nbG9zc2FyeS5odG1sI3Rlcm0tcmFuZG9tLXN0YXRlKQogICAgOnBhcmFtIGZpbGVuYW1lOiAgICAgIG9wdGlvbmFsIG5hbWUgZm9yIHN0b3JlZCBkYXRhIGZpbGUKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICBkZXN0aW1hdGlvbiBmb3IgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgIGtleSBvZiBkYXRhIGluIGFydGlmYWN0IHN0b3JlCiAgICBSZXR1cm5zIGZpbGVuYW1lIG9mIGNyZWF0ZWQgZGF0YSAoaW5jbHVkZXMgcGF0aCkuCiAgICAiIiIKICAgICMgY2hlY2sgZGlyZWN0b3JpZXMgZXhpc3QgYW5kIGNyZWF0ZSBmaWxlbmFtZSBpZiBOb25lOgogICAgb3MubWFrZWRpcnModGFyZ2V0X3BhdGgsIGV4aXN0X29rPVRydWUpCiAgICBpZiBub3QgZmlsZW5hbWU6CiAgICAgICAgbmFtZSA9IGYic2ltZGF0YS17bl9zYW1wbGVzOjAuMGV9WHttX2ZlYXR1cmVzfS5wYXJxdWV0Ii5yZXBsYWNlKCIrIiwgIiIpCiAgICAgICAgZmlsZW5hbWUgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBlbHNlOgogICAgICAgIGZpbGVuYW1lID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBmaWxlbmFtZSkKICAgIAogICAgZmVhdHVyZXMsIGxhYmVscyA9IG1ha2VfY2xhc3NpZmljYXRpb24oCiAgICAgICAgbl9zYW1wbGVzPW5fc2FtcGxlcywKICAgICAgICBuX2ZlYXR1cmVzPW1fZmVhdHVyZXMsCiAgICAgICAgd2VpZ2h0cz1bd2VpZ2h0XSwgICMgRmFsc2UKICAgICAgICBuX2NsYXNzZXM9MiwKICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKQoKICAgICMgbWFrZSBkYXRhZnJhbWVzLCBhZGQgY29sdW1uIG5hbWVzLCBjb25jYXRlbmF0ZSAoWCwgeSkKICAgIFggPSBwZC5EYXRhRnJhbWUoZmVhdHVyZXMpCiAgICBpZiBub3QgZmVhdHVyZXNfaGRyOgogICAgICAgIFguY29sdW1ucyA9IFsiZmVhdF8iICsgc3RyKHgpIGZvciB4IGluIHJhbmdlKG1fZmVhdHVyZXMpXQogICAgZWxzZToKICAgICAgICBYLmNvbHVtbnMgPSBmZWF0dXJlc19oZHIKCiAgICB5ID0gcGQuRGF0YUZyYW1lKGxhYmVscywgY29sdW1ucz1bImxhYmVscyJdKQogICAgZGF0YSA9IHBkLmNvbmNhdChbWCwgeV0sIGF4aXM9MSkKCiAgICBwcS53cml0ZV90YWJsZShwYS5UYWJsZS5mcm9tX3BhbmRhcyhkYXRhKSwgZmlsZW5hbWUpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWZpbGVuYW1lKQo= base_image: yjbds/mlrun-ds:latest commands: [] - code_origin: 
https://github.com/yjb-ds/functions.git#3f7e0c78313c0f8da3f2ae8535b625f06f5c3ee4:/User/repos/functions/datagen/classification/binary.py + code_origin: https://github.com/yjb-ds/functions.git#25e611e4bd05320d342708ce786522bfecaa0e51:/User/repos/functions/datagen/classification/binary.py diff --git a/datagen/splitters/train_valid_test.yaml b/datagen/splitters/train_valid_test.yaml index c3ffa7e97..2df2f0e88 100644 --- a/datagen/splitters/train_valid_test.yaml +++ b/datagen/splitters/train_valid_test.yaml @@ -1,8 +1,19 @@ kind: job metadata: name: train-valid-test + tag: '' + hash: 877224f2dd10beff5f7e9cb9b4821a8685aab9db + project: '' spec: + command: '' + args: [] + image: yjbds/mlrun-ds:latest + volumes: [] + volume_mounts: [] + env: [] + description: '' build: functionSourceCode: aW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgZHVtcAoKZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdApmcm9tIHR5cGluZyBpbXBvcnQgT3B0aW9uYWwsIFVuaW9uCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KCmRlZiB0cmFpbl92YWxpZF90ZXN0X3NwbGl0dGVyKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdID0gTm9uZSwKICAgIHNyY19maWxlOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgaGVhZGVyOiBVbmlvbltEYXRhSXRlbSwgc3RyLCBsaXN0XSA9ICcnLAogICAgc2FtcGxlOiBpbnQgPSAtMSwKICAgIGxhYmVsX2NvbHVtbjogc3RyID0gJ2xhYmVscycsCiAgICB0ZXN0X3NpemU6IGZsb2F0ID0gMC4xLAogICAgdHJhaW5fdmFsX3NwbGl0OiBmbG9hdCA9IDAuNzUsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAnJywKICAgIGtleTogc3RyID0gJycsCiAgICByYW5kb21fc3RhdGUgPSAxCikgLT4gTm9uZToKICAgICIiIlNwbGl0IHJhdyBkYXRhIGlucHV0IGludG8gdHJhaW4sIHZhbGlkYXRpb24gYW5kIHRlc3Qgc2V0cy4KCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIHNyY19maWxlOiAgICAgICAgKCdyYXcnKSBuYW1lIG9mIHJhdyBkYXRhIGZpbGUKICAgIDpwYXJhbSBoZWFkZXI6ICAgICAgICAgIChOb25lKSBoZWFkZXIgYXJ0aWZhY3Qgb3IgbGlzdCBv
ZiBjb2x1bW4gbmFtZXMuCiAgICA6cGFyYW0gc2FtcGxlOiAgICAgICAgICAoLTEpLiBTZWxlY3RzIHRoZSBmaXJzdCBuIHJvd3MsIG9yIHNlbGVjdCBhIHNhbXBsZSBzdGFydGluZwogICAgICAgICAgICAgICAgICAgICAgICAgICAgZnJvbSB0aGUgZmlyc3QuIElmIG5lZ2F0aXZlIDwtMSwgc2VsZWN0IGEgcmFuZG9tIHNhbXBsZSBmcm9tIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgdGhlIGVudGlyZSBmaWxlCiAgICA6cGFyYW0gbGFiZWxfY29sdW1uOiAgICBncm91bmQtdHJ1dGggKHkpIGxhYmVscwogICAgOnBhcmFtIHRlc3Rfc2l6ZTogICAgICAgKDAuMSkgdGVzdCBzZXQgc2l6ZQogICAgOnBhcmFtIHRyYWluX3ZhbF9zcGxpdDogKDAuNzUpIE9uY2UgdGhlIHRlc3Qgc2V0IGhhcyBiZWVuIHJlbW92ZWQgdGhlIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgdHJhaW5pbmcgc2V0IGdldHMgdGhpcyBwcm9wb3J0aW9uLgogICAgOnBhcmFtIHRhcmdldF9wYXRoOiAgICAgZm9sZGVyIGxvY2F0aW9uIG9mIGZpbGVzCiAgICA6cGFyYW0gbmFtZTogICAgICAgICAgICBkZXN0aW5hdGlvbiBwcmVmaXggbmFtZSBmb3IgbW9kZWwgZmlsZXMKICAgIDpwYXJhbSBrZXk6ICAgICAgICAgICAgIGtleSBmb3IgbW9kZWwgYXJ0aWZhY3QKICAgIDpwYXJhbSByYW5kb21fc3RhdGU6ICAgICgxKSBza2xlYXJuIHJuZyBzZWVkCiAgICAiIiIKICAgIGlmIGlzaW5zdGFuY2Uoc3JjX2ZpbGUsIERhdGFJdGVtKToKICAgICAgICBzcmNfZmlsZSA9IHN0cihzcmNfZmlsZSkKICAgIHNyY2ZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBzcmNfZmlsZSkKCiAgICBpZiAoc2FtcGxlID09IC0xKSBvciAoc2FtcGxlID49IDEpOgogICAgICAgICMgZ2V0IGFsbCByb3dzLCBvciBjb250aWd1b3VzIHNhbXBsZSBzdGFydGluZyBhdCByb3cgMS4KICAgICAgICByYXcgPSBwZC5yZWFkX3BhcnF1ZXQoc3JjZmlsZXBhdGgsIGVuZ2luZT0ncHlhcnJvdycpCiAgICAgICAgbGFiZWxzID0gcmF3LnBvcChsYWJlbF9jb2x1bW4pCiAgICAgICAgcmF3ID0gcmF3Lmlsb2NbOnNhbXBsZSwgOl0KICAgICAgICBsYWJlbHMgPSBsYWJlbHMuaWxvY1s6c2FtcGxlXQogICAgZWxzZToKICAgICAgICAjIGdyYWIgYSByYW5kb20gc2FtcGxlCiAgICAgICAgcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKS5zYW1wbGUoc2FtcGxlKi0xKQogICAgICAgIGxhYmVscyA9IHJhdy5wb3AobGFiZWxfY29sdW1uKQogICAgCiAgICAjIGRvdWJsZSBzcGxpdCB0cCBnZW5lcmF0ZSAzIGRhdGEgc2V0czogdHJhaW4sIHZhbGlkYXRpb24gYW5kIHRlc3QKICAgIHgsIHh0ZXN0LCB5LCB5dGVzdCA9IHRyYWluX3Rlc3Rfc3BsaXQocmF3LCBsYWJlbHMsIHRlc3Rfc2l6ZT10ZXN0X3NpemUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKQogICAKICAgIHh0
cmFpbiwgeHZhbGlkLCB5dHJhaW4sIHl2YWxpZCA9IHRyYWluX3Rlc3Rfc3BsaXQoeCwgeSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHRyYWluX3NpemU9dHJhaW5fdmFsX3NwbGl0LCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgcmFuZG9tX3N0YXRlPXJhbmRvbV9zdGF0ZSkgICAgICAgIAoKICAgIGlmIG5hbWU6CiAgICAgICAgbmFtZSA9ICctJyArIG5hbWUKICAgIAogICAgIyBzYXZlIGhlYWRlcgogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICdoZWFkZXIucGtsJykKICAgIGR1bXAocmF3LmNvbHVtbnMudmFsdWVzLCBvcGVuKGYsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgICMgc2F2ZSBkYXRhIHNldHMKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHRyYWluLnBxdCcpCiAgICB4dHJhaW4udG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h0cmFpbicsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHZhbGlkLnBxdCcpCiAgICB4dmFsaWQudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h2YWxpZCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHRlc3QucHF0JykKICAgIHh0ZXN0LnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd4dGVzdCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneXRyYWluLnBxdCcpCiAgICBwZC5EYXRhRnJhbWUoeydsYWJlbHMnOiB5dHJhaW59KS50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneXRyYWluJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd5dmFsaWQucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl2YWxpZH0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dmFsaWQnLCB0YXJnZXRfcGF0aD1mKQogICAgCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ3l0ZXN0LnBxdCcpCiAgICBwZC5EYXRhRnJhbWUoeydsYWJlbHMnOiB5dGVzdH0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dGVzdCcsIHRhcmdldF9wYXRoPWYp base_image: yjbds/mlrun-ds:latest commands: [] + code_origin: https://github.com/yjb-ds/functions.git#25e611e4bd05320d342708ce786522bfecaa0e51:/User/repos/functions/datagen/splitters/train_valid_test.py diff 
--git a/evaluation/test-classifier.py b/evaluation/test-classifier.py index b0e5cc5d1..82e693f95 100644 --- a/evaluation/test-classifier.py +++ b/evaluation/test-classifier.py @@ -19,6 +19,8 @@ from mlrun.datastore import DataItem from mlrun.artifacts import TableArtifact, PlotArtifact +import warnings +warnings.simplefilter(action='ignore', category=FutureWarning) def test_model( context: Optional[MLClientCtx], diff --git a/evaluation/test-classifier.yaml b/evaluation/test-classifier.yaml index 52d0ac16e..eb865089f 100644 --- a/evaluation/test-classifier.yaml +++ b/evaluation/test-classifier.yaml @@ -1,8 +1,19 @@ kind: job metadata: name: test-classifier + tag: '' + hash: c768dc0c66298e9a0ebe79a118698713fc84cfbd + project: '' spec: + command: '' + args: [] + image: yjbds/mlrun-ds:latest + volumes: [] + volume_mounts: [] + env: [] + description: '' build: - functionSourceCode: aW1wb3J0IG9zCmltcG9ydCBpbXBvcnRsaWIKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgbG9hZAoKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IGxpZ2h0Z2JtIGFzIGxnYgoKZnJvbSBza2xlYXJuLm1ldHJpY3MgaW1wb3J0IChyb2NfY3VydmUsIGNvbmZ1c2lvbl9tYXRyaXgpCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQKCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQKZnJvbSBtYXRwbG90bGliLmZpZ3VyZSBpbXBvcnQgRmlndXJlCmltcG9ydCBzZWFib3JuIGFzIHNucwoKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbiwgTGlzdAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gbWxydW4uZGF0YXN0b3JlIGltcG9ydCBEYXRhSXRlbQpmcm9tIG1scnVuLmFydGlmYWN0cyBpbXBvcnQgVGFibGVBcnRpZmFjdCwgUGxvdEFydGlmYWN0CgoKZGVmIHRlc3RfbW9kZWwoCiAgICBjb250ZXh0OiBPcHRpb25hbFtNTENsaWVudEN0eF0sCiAgICBtb2RlbDogVW5pb25bRGF0YUl0ZW0sIHN0cl0sCiAgICB4dGVzdCwgCiAgICB5dGVzdCwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG5hbWU6IHN0ciA9ICcnLAogICAga2V5OiBzdHIgPSAnJywKICAgIHJhbmRvbV9zdGF0ZSA9IDEKKSAtPiBOb25lOgogICAgIiIiVGVzdCBhIGNsYXNzaWZpZXIgbW9kZWwKICAgIAogICAgVXNpbmcgaGVsZC1vdXQgdGVzdCBmZWF0dXJlcywgY2FsbHMgYG1vZGVsLnByZWRpY3QoeHRlc3QpYCBhbmQgZXZhbHVhdGVzIHRoZSBhY2N1cmFjeS
BvZiB0aGUgCiAgICBlc3RpbWF0ZWQgbW9kZWwuCiAgICAKICAgIENhbiBiZSBwYXJ0IG9mIGEga3ViZWZsb3cgcGlwZWxpbmUgYXMgYSB0ZXN0IHN0ZXAgb3IgY2FsbGVkCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIHRoZSBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbW9kZWw6ICAgICAgICAgICBlc3RpbWF0ZWQgbW9kZWwgZmlsZSBuYW1lIGFzIGFydGlmYWN0IHN0b3JlIGl0ZW0KICAgICAgICAgICAgICAgICAgICAgICAgICAgIG9yIHBpY2tsZSBmaWxlIG5hbWUKICAgIDpwYXJhbSB4dGVzdDogICAgICAgICAgIHRlc3QgZmVhdHVyZXMgZmlsZSBuYW1lIGFzIGFydGlmYWN0IHN0b3JlIGl0ZW0KICAgICAgICAgICAgICAgICAgICAgICAgICAgIG9yIHBpY2tsZSBmaWxlIG5hbWUKICAgIDpwYXJhbSBoZWFkZXI6ICAgICAgICAgIChPcHRpb25hbCkgdXNlIGlmIHh0ZXN0IGRvZXMgbm90IGhhdmUgYSBoZWFkZXIKICAgIDpwYXJhbSB5dGVzdDogICAgICAgICAgIHRlc3QgbGFiZWxzIGZpbGUgbmFtZSBhcyBhcnRpZmFjdCBzdG9yZSAKICAgICAgICAgICAgICAgICAgICAgICAgICAgIGl0ZW0gb3IgcGlja2xlIGZpbGUgbmFtZQogICAgOnBhcmFtIHRhcmdldF9wYXRoOiAgICAgZm9sZGVyIGxvY2F0aW9uIG9mIGZpbGVzCiAgICA6cGFyYW0gbmFtZTogICAgICAgICAgICBkZXN0aW5hdGlvbiBuYW1lIGZvciB0ZXN0IHJlc3VsdHMKICAgIDpwYXJhbSBrZXk6ICAgICAgICAgICAgIGtleSBmb3IgbW9kZWwgYXJ0aWZhY3QKICAgICIiIgogICAgIyBsb2FkIG1vZGVsIGFuZCBkYXRhCiAgICBpZiBpc2luc3RhbmNlKG1vZGVsLCBEYXRhSXRlbSk6CiAgICAgICAgY2xmID0gbG9hZChvcGVuKHN0cihtb2RlbCksICdyYicpKQogICAgZWxzZToKICAgICAgICBjbGYgPSBsb2FkKG9wZW4obW9kZWwsICdyYicpKQoKICAgIGlmIGlzaW5zdGFuY2UoeHRlc3QsIERhdGFJdGVtKToKICAgICAgICB4dGVzdCA9IHBkLnJlYWRfcGFycXVldChzdHIoeHRlc3QpKQogICAgICAgIHl0ZXN0ID0gcGQucmVhZF9wYXJxdWV0KHN0cih5dGVzdCkpCiAgICBlbHNlOgogICAgICAgIHh0ZXN0ID0gcGQucmVhZF9wYXJxdWV0KHh0ZXN0KQogICAgICAgIHl0ZXN0ID0gcGQucmVhZF9wYXJxdWV0KHl0ZXN0KQogICAgCiAgICBpZiBjYWxsYWJsZShnZXRhdHRyKGNsZiwgJ3ByZWRpY3RfcHJvYmEnKSk6CiAgICAgICAgeXByZWRfcHJvYnMgPSBjbGYucHJlZGljdF9wcm9iYSh4dGVzdClbOiwgMV0KICAgICAgICB5cHJlZCA9IG5wLndoZXJlKHlwcmVkX3Byb2JzID49IDAuNSwgMSwgMCkKICAgICAgICBwbG90X3JvYyhjb250ZXh0LCB5dGVzdCwgeXByZWRfcHJvYnMsIHRhcmdldF9wYXRoKQogICAgZWxzZToKICAgICAgICB5cHJlZCA9IGNsZi5wcmVkaWN0KHh0ZXN0KQogICAgICAgIHlwcmVkX3Byb2JzID0gTm9uZQogICAgCiAgICBwbG90X2NvbmZ1c2lvbl9tYXRyaXgoY29udGV4dCwgeXRlc3QsIHlwcmVkLCB0YXJnZXRfcGF0aCkKCiAgICBpZi
BoYXNhdHRyKGNsZiwgJ2ZlYXR1cmVfaW1wb3J0YW5jZXNfJyk6CiAgICAgICAgcGxvdF9pbXBvcnRhbmNlKGNvbnRleHQsIGNsZiwgeHRlc3QuY29sdW1ucy52YWx1ZXMsIHRhcmdldF9wYXRoKQoKZGVmIF9nY2ZfY2xlYXIocGx0KToKICAgIHBsdC5jbGEoKQogICAgcGx0LmNsZigpCiAgICBwbHQuY2xvc2UoKSAgICAgICAgCgpkZWYgcGxvdF9yb2MoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwgCiAgICB5X2xhYmVscywKICAgIHlfcHJvYnMsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lPSdyb2MucG5nJywKICAgIGtleT0ncm9jJywKICAgIGZtdD0ncG5nJwopOgogICAgIiIiUGxvdCBhbiBST0MgY3VydmUgZnJvbSB0ZXN0IGRhdGEgc2F2ZWQgaW4gYW4gYXJ0aWZhY3Qgc3RvcmUuCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSB5X2xhYmVsczogICAgICAgIHRlc3QgZGF0YSBsYWJlbHMKICAgIDpwYXJhbSB5X3Byb2JzOiAgICAgICAgIHRlc3QgZGF0YSAKICAgICIiIgogICAgZnByX3hnLCB0cHJfeGcsIF8gPSByb2NfY3VydmUoeV9sYWJlbHMsIHlfcHJvYnMpCiAgICBwbHQucGxvdChbMCwgMV0sIFswLCAxXSwgImstLSIpCiAgICBwbHQucGxvdChmcHJfeGcsIHRwcl94ZywgbGFiZWw9InJvYyIpCiAgICBwbHQueGxhYmVsKCJmYWxzZSBwb3NpdGl2ZSByYXRlIikKICAgIHBsdC55bGFiZWwoInRydWUgcG9zaXRpdmUgcmF0ZSIpCiAgICBwbHQudGl0bGUoInJvYyBjdXJ2ZSIpCiAgICBwbHQubGVnZW5kKGxvYz0iYmVzdCIpCiAgICBmaWcgPSBwbHQuZ2NmKCkKCiAgICBwbG90cGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSkKICAgIGZpZy5zYXZlZmlnKHBsb3RwYXRoLCBmb3JtYXQ9Zm10KQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KGtleSwgYm9keT1maWcpKQoKICAgIF9nY2ZfY2xlYXIocGx0KQoKZGVmIHBsb3RfY29uZnVzaW9uX21hdHJpeCgKICAgIGNvbnRleHQ6IE1MQ2xpZW50Q3R4LCAKICAgIGxhYmVscywgCiAgICBwcmVkaWN0aW9ucywKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywgCiAgICBuYW1lOiBzdHIgPSJjb25mdXNpb24ucG5nIiwgCiAgICBrZXk6IHN0ciA9J2NvbmZ1c2lvbl9tYXRyaXgnLAogICAgZm10OiBzdHIgPSAncG5nJwopOgogICAgIiIiQ3JlYXRlIGEgY29uZnVzaW9uIG1hdHJpeC4KICAgIFBsb3QgYW5kIHNhdmUgYSBjb25mdXNpb24gbWF0cml4IHVzaW5nIHRlc3QgZGF0YSBmcm9tIGEKICAgIHBpcGVsaW5lIHN0ZXAuCgogICAgOnBhcmFtIGNvbnRleHQ6ICAgICAgICAgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIGxhYmVsczogICAgICAgICAgdGVzdCBkYXRhIGxhYmVscwogICAgOnBhcmFtIHByZWRpY3Rpb25zOiAgICAgdGVzdCBkYXRhIHByZWRpY3Rpb25zCiAgICAiIiIKICAgIGNtID0gY29uZnVzaW9uX21hdHJpeChsYWJlbHMsCiAgICAgICAgICAgICAgIC
AgICAgICAgICAgICBwcmVkaWN0aW9ucywKICAgICAgICAgICAgICAgICAgICAgICAgICAgIHNhbXBsZV93ZWlnaHQ9Tm9uZSwKICAgICAgICAgICAgICAgICAgICAgICAgICAgIG5vcm1hbGl6ZT0nYWxsJykKICAgIHNucy5oZWF0bWFwKGNtLCBhbm5vdD1UcnVlLCBjbWFwPSJCbHVlcyIpCiAgICBwbG90cGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSkKICAgIGZpZyA9IHBsdC5nY2YoKQogICAgZmlnLnNhdmVmaWcocGxvdHBhdGgsIGZvcm1hdD1mbXQpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3Qoa2V5LCBib2R5PWZpZykpCgogICAgX2djZl9jbGVhcihwbHQpCgpkZWYgcGxvdF9pbXBvcnRhbmNlKAogICAgY29udGV4dCwKICAgIG1vZGVsLAogICAgaGVhZGVyOiBMaXN0ID0gW10sCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAnZmVhdHVyZS1pbXBvcnRhbmNlcy5wbmcnLAogICAga2V5OiBzdHIgPSAnZmVhdHVyZS1pbXBvcnRhbmNlcycsCiAgICBmbXQgPSAncG5nJwopOgogICAgIiIiRGlzcGxheSBlc3RpbWF0ZWQgZmVhdHVyZSBpbXBvcnRhbmNlcy4KCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBtb2RlbDogICAgICAgZml0dGVkIGxpZ2h0Z2JtIG1vZGVsCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGxpc3Qgb2YgZmVhdHVyZSBuYW1lcwogICAgIiIiCiAgICAjIGNyZWF0ZSBhIGZlYXR1cmUgaW1wb3J0YW5jZSB0YWJsZSB3aXRoIGRlc2lyZWQgbGFiZWxzCiAgICB6aXBwZWQgPSB6aXAobW9kZWwuZmVhdHVyZV9pbXBvcnRhbmNlc18sIGhlYWRlcikKCiAgICBmZWF0dXJlX2ltcCA9IHBkLkRhdGFGcmFtZShzb3J0ZWQoemlwcGVkKSwgY29sdW1ucz1bJ2ZyZXEnLCdmZWF0dXJlJ10KICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICApLnNvcnRfdmFsdWVzKGJ5PSJmcmVxIiwgYXNjZW5kaW5nPUZhbHNlKQoKICAgIHBsdC5maWd1cmUoZmlnc2l6ZT0oMjAsIDEwKSkKICAgIHNucy5iYXJwbG90KHg9ImZyZXEiLCB5PSJmZWF0dXJlIiwgZGF0YT1mZWF0dXJlX2ltcCkKICAgIHBsdC50aXRsZSgnTGlnaHRHQk0gRmVhdHVyZXMnKQogICAgcGx0LnRpZ2h0X2xheW91dCgpCiAgICBmaWcgPSBwbHQuZ2NmKCkKICAgIHBsb3RwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZmlnLnNhdmVmaWcocGxvdHBhdGgsIGZvcm1hdD0ncG5nJykKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdChrZXkgKyAnLXBsb3QnLCBib2R5PWZpZykpCgogICAgIyBmZWF0dXJlIGltcG9ydGFuY2VzIGFyZSBhbHNvIHNhdmVkIGFzIGEgdGFibGU6CiAgICB0YWJsZXBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIGtleSArICctdGFibGUuY3N2JykKICAgIGZlYXR1cmVfaW1wLnRvX2Nzdih0YWJsZXBhdGgpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChUYWJsZUFydGlmYWN0KG
tleSArICctdGFibGUnLCB0YXJnZXRfcGF0aD10YWJsZXBhdGgpKQoKICAgICMgdG8gZW5zdXJlIHdlIGRvbid0IG92ZXJ3cml0ZSB0aGlzIGZpZ3VyZSB3aGVuIGNyZWF0aW5nIHRoZSBuZXh0OgogICAgX2djZl9jbGVhcihwbHQpCg== + functionSourceCode: aW1wb3J0IG9zCmltcG9ydCBpbXBvcnRsaWIKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgbG9hZAoKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IGxpZ2h0Z2JtIGFzIGxnYgoKZnJvbSBza2xlYXJuLm1ldHJpY3MgaW1wb3J0IChyb2NfY3VydmUsIGNvbmZ1c2lvbl9tYXRyaXgpCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQKCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQKZnJvbSBtYXRwbG90bGliLmZpZ3VyZSBpbXBvcnQgRmlndXJlCmltcG9ydCBzZWFib3JuIGFzIHNucwoKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbiwgTGlzdAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gbWxydW4uZGF0YXN0b3JlIGltcG9ydCBEYXRhSXRlbQpmcm9tIG1scnVuLmFydGlmYWN0cyBpbXBvcnQgVGFibGVBcnRpZmFjdCwgUGxvdEFydGlmYWN0CgppbXBvcnQgd2FybmluZ3MKd2FybmluZ3Muc2ltcGxlZmlsdGVyKGFjdGlvbj0naWdub3JlJywgY2F0ZWdvcnk9RnV0dXJlV2FybmluZykKCmRlZiB0ZXN0X21vZGVsKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdLAogICAgbW9kZWw6IFVuaW9uW0RhdGFJdGVtLCBzdHJdLAogICAgeHRlc3QsIAogICAgeXRlc3QsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAnJywKICAgIGtleTogc3RyID0gJycsCiAgICByYW5kb21fc3RhdGUgPSAxCikgLT4gTm9uZToKICAgICIiIlRlc3QgYSBjbGFzc2lmaWVyIG1vZGVsCiAgICAKICAgIFVzaW5nIGhlbGQtb3V0IHRlc3QgZmVhdHVyZXMsIGNhbGxzIGBtb2RlbC5wcmVkaWN0KHh0ZXN0KWAgYW5kIGV2YWx1YXRlcyB0aGUgYWNjdXJhY3kgb2YgdGhlIAogICAgZXN0aW1hdGVkIG1vZGVsLgogICAgCiAgICBDYW4gYmUgcGFydCBvZiBhIGt1YmVmbG93IHBpcGVsaW5lIGFzIGEgdGVzdCBzdGVwIG9yIGNhbGxlZAogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIG1vZGVsOiAgICAgICAgICAgZXN0aW1hdGVkIG1vZGVsIGZpbGUgbmFtZSBhcyBhcnRpZmFjdCBzdG9yZSBpdGVtCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBvciBwaWNrbGUgZmlsZSBuYW1lCiAgICA6cGFyYW0geHRlc3Q6ICAgICAgICAgICB0ZXN0IGZlYXR1cmVzIGZpbGUgbmFtZSBhcyBhcnRpZmFjdCBzdG9yZSBpdGVtCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBvciBwaWNrbGUgZmlsZSBuYW1lCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgICAgICAoT3B0aW9uYWw
pIHVzZSBpZiB4dGVzdCBkb2VzIG5vdCBoYXZlIGEgaGVhZGVyCiAgICA6cGFyYW0geXRlc3Q6ICAgICAgICAgICB0ZXN0IGxhYmVscyBmaWxlIG5hbWUgYXMgYXJ0aWZhY3Qgc3RvcmUgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBpdGVtIG9yIHBpY2tsZSBmaWxlIG5hbWUKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGZvbGRlciBsb2NhdGlvbiBvZiBmaWxlcwogICAgOnBhcmFtIG5hbWU6ICAgICAgICAgICAgZGVzdGluYXRpb24gbmFtZSBmb3IgdGVzdCByZXN1bHRzCiAgICA6cGFyYW0ga2V5OiAgICAgICAgICAgICBrZXkgZm9yIG1vZGVsIGFydGlmYWN0CiAgICAiIiIKICAgICMgbG9hZCBtb2RlbCBhbmQgZGF0YQogICAgaWYgaXNpbnN0YW5jZShtb2RlbCwgRGF0YUl0ZW0pOgogICAgICAgIGNsZiA9IGxvYWQob3BlbihzdHIobW9kZWwpLCAncmInKSkKICAgIGVsc2U6CiAgICAgICAgY2xmID0gbG9hZChvcGVuKG1vZGVsLCAncmInKSkKCiAgICBpZiBpc2luc3RhbmNlKHh0ZXN0LCBEYXRhSXRlbSk6CiAgICAgICAgeHRlc3QgPSBwZC5yZWFkX3BhcnF1ZXQoc3RyKHh0ZXN0KSkKICAgICAgICB5dGVzdCA9IHBkLnJlYWRfcGFycXVldChzdHIoeXRlc3QpKQogICAgZWxzZToKICAgICAgICB4dGVzdCA9IHBkLnJlYWRfcGFycXVldCh4dGVzdCkKICAgICAgICB5dGVzdCA9IHBkLnJlYWRfcGFycXVldCh5dGVzdCkKICAgIAogICAgaWYgY2FsbGFibGUoZ2V0YXR0cihjbGYsICdwcmVkaWN0X3Byb2JhJykpOgogICAgICAgIHlwcmVkX3Byb2JzID0gY2xmLnByZWRpY3RfcHJvYmEoeHRlc3QpWzosIDFdCiAgICAgICAgeXByZWQgPSBucC53aGVyZSh5cHJlZF9wcm9icyA+PSAwLjUsIDEsIDApCiAgICAgICAgcGxvdF9yb2MoY29udGV4dCwgeXRlc3QsIHlwcmVkX3Byb2JzLCB0YXJnZXRfcGF0aCkKICAgIGVsc2U6CiAgICAgICAgeXByZWQgPSBjbGYucHJlZGljdCh4dGVzdCkKICAgICAgICB5cHJlZF9wcm9icyA9IE5vbmUKICAgIAogICAgcGxvdF9jb25mdXNpb25fbWF0cml4KGNvbnRleHQsIHl0ZXN0LCB5cHJlZCwgdGFyZ2V0X3BhdGgpCgogICAgaWYgaGFzYXR0cihjbGYsICdmZWF0dXJlX2ltcG9ydGFuY2VzXycpOgogICAgICAgIHBsb3RfaW1wb3J0YW5jZShjb250ZXh0LCBjbGYsIHh0ZXN0LmNvbHVtbnMudmFsdWVzLCB0YXJnZXRfcGF0aCkKCmRlZiBfZ2NmX2NsZWFyKHBsdCk6CiAgICBwbHQuY2xhKCkKICAgIHBsdC5jbGYoKQogICAgcGx0LmNsb3NlKCkgICAgICAgIAoKZGVmIHBsb3Rfcm9jKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgeV9sYWJlbHMsCiAgICB5X3Byb2JzLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICcnLAogICAgbmFtZT0ncm9jLnBuZycsCiAgICBrZXk9J3JvYycsCiAgICBmbXQ9J3BuZycKKToKICAgICIiIlBsb3QgYW4gUk9DIGN1cnZlIGZyb20gdGVzdCBkYXRhIHNhdmVkIGluIGFuIGFydGlmYWN0IHN0b3JlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICBmdW5
jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0geV9sYWJlbHM6ICAgICAgICB0ZXN0IGRhdGEgbGFiZWxzCiAgICA6cGFyYW0geV9wcm9iczogICAgICAgICB0ZXN0IGRhdGEgCiAgICAiIiIKICAgIGZwcl94ZywgdHByX3hnLCBfID0gcm9jX2N1cnZlKHlfbGFiZWxzLCB5X3Byb2JzKQogICAgcGx0LnBsb3QoWzAsIDFdLCBbMCwgMV0sICJrLS0iKQogICAgcGx0LnBsb3QoZnByX3hnLCB0cHJfeGcsIGxhYmVsPSJyb2MiKQogICAgcGx0LnhsYWJlbCgiZmFsc2UgcG9zaXRpdmUgcmF0ZSIpCiAgICBwbHQueWxhYmVsKCJ0cnVlIHBvc2l0aXZlIHJhdGUiKQogICAgcGx0LnRpdGxlKCJyb2MgY3VydmUiKQogICAgcGx0LmxlZ2VuZChsb2M9ImJlc3QiKQogICAgZmlnID0gcGx0LmdjZigpCgogICAgcGxvdHBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBmaWcuc2F2ZWZpZyhwbG90cGF0aCwgZm9ybWF0PWZtdCkKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdChrZXksIGJvZHk9ZmlnKSkKCiAgICBfZ2NmX2NsZWFyKHBsdCkKCmRlZiBwbG90X2NvbmZ1c2lvbl9tYXRyaXgoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwgCiAgICBsYWJlbHMsIAogICAgcHJlZGljdGlvbnMsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsIAogICAgbmFtZTogc3RyID0iY29uZnVzaW9uLnBuZyIsIAogICAga2V5OiBzdHIgPSdjb25mdXNpb25fbWF0cml4JywKICAgIGZtdDogc3RyID0gJ3BuZycKKToKICAgICIiIkNyZWF0ZSBhIGNvbmZ1c2lvbiBtYXRyaXguCiAgICBQbG90IGFuZCBzYXZlIGEgY29uZnVzaW9uIG1hdHJpeCB1c2luZyB0ZXN0IGRhdGEgZnJvbSBhCiAgICBwaXBlbGluZSBzdGVwLgoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBsYWJlbHM6ICAgICAgICAgIHRlc3QgZGF0YSBsYWJlbHMKICAgIDpwYXJhbSBwcmVkaWN0aW9uczogICAgIHRlc3QgZGF0YSBwcmVkaWN0aW9ucwogICAgIiIiCiAgICBjbSA9IGNvbmZ1c2lvbl9tYXRyaXgobGFiZWxzLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgcHJlZGljdGlvbnMsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBzYW1wbGVfd2VpZ2h0PU5vbmUsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBub3JtYWxpemU9J2FsbCcpCiAgICBzbnMuaGVhdG1hcChjbSwgYW5ub3Q9VHJ1ZSwgY21hcD0iQmx1ZXMiKQogICAgcGxvdHBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBmaWcgPSBwbHQuZ2NmKCkKICAgIGZpZy5zYXZlZmlnKHBsb3RwYXRoLCBmb3JtYXQ9Zm10KQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KGtleSwgYm9keT1maWcpKQoKICAgIF9nY2ZfY2xlYXIocGx0KQoKZGVmIHBsb3RfaW1wb3J0YW5jZSgKICAgIGNvbnRleHQsCiAgICBtb2RlbCwKICAgIGhlYWRlcjogTGlzdCA9IFtdLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICc
nLAogICAgbmFtZTogc3RyID0gJ2ZlYXR1cmUtaW1wb3J0YW5jZXMucG5nJywKICAgIGtleTogc3RyID0gJ2ZlYXR1cmUtaW1wb3J0YW5jZXMnLAogICAgZm10ID0gJ3BuZycKKToKICAgICIiIkRpc3BsYXkgZXN0aW1hdGVkIGZlYXR1cmUgaW1wb3J0YW5jZXMuCgogICAgOnBhcmFtIGNvbnRleHQ6ICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbW9kZWw6ICAgICAgIGZpdHRlZCBsaWdodGdibSBtb2RlbAogICAgOnBhcmFtIGhlYWRlcjogICAgICBsaXN0IG9mIGZlYXR1cmUgbmFtZXMKICAgICIiIgogICAgIyBjcmVhdGUgYSBmZWF0dXJlIGltcG9ydGFuY2UgdGFibGUgd2l0aCBkZXNpcmVkIGxhYmVscwogICAgemlwcGVkID0gemlwKG1vZGVsLmZlYXR1cmVfaW1wb3J0YW5jZXNfLCBoZWFkZXIpCgogICAgZmVhdHVyZV9pbXAgPSBwZC5EYXRhRnJhbWUoc29ydGVkKHppcHBlZCksIGNvbHVtbnM9WydmcmVxJywnZmVhdHVyZSddCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgKS5zb3J0X3ZhbHVlcyhieT0iZnJlcSIsIGFzY2VuZGluZz1GYWxzZSkKCiAgICBwbHQuZmlndXJlKGZpZ3NpemU9KDIwLCAxMCkpCiAgICBzbnMuYmFycGxvdCh4PSJmcmVxIiwgeT0iZmVhdHVyZSIsIGRhdGE9ZmVhdHVyZV9pbXApCiAgICBwbHQudGl0bGUoJ0xpZ2h0R0JNIEZlYXR1cmVzJykKICAgIHBsdC50aWdodF9sYXlvdXQoKQogICAgZmlnID0gcGx0LmdjZigpCiAgICBwbG90cGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSkKICAgIGZpZy5zYXZlZmlnKHBsb3RwYXRoLCBmb3JtYXQ9J3BuZycpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3Qoa2V5ICsgJy1wbG90JywgYm9keT1maWcpKQoKICAgICMgZmVhdHVyZSBpbXBvcnRhbmNlcyBhcmUgYWxzbyBzYXZlZCBhcyBhIHRhYmxlOgogICAgdGFibGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBrZXkgKyAnLXRhYmxlLmNzdicpCiAgICBmZWF0dXJlX2ltcC50b19jc3YodGFibGVwYXRoKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoVGFibGVBcnRpZmFjdChrZXkgKyAnLXRhYmxlJywgdGFyZ2V0X3BhdGg9dGFibGVwYXRoKSkKCiAgICAjIHRvIGVuc3VyZSB3ZSBkb24ndCBvdmVyd3JpdGUgdGhpcyBmaWd1cmUgd2hlbiBjcmVhdGluZyB0aGUgbmV4dDoKICAgIF9nY2ZfY2xlYXIocGx0KQo= base_image: yjbds/mlrun-ds:latest commands: [] + code_origin: https://github.com/yjb-ds/functions.git#25e611e4bd05320d342708ce786522bfecaa0e51:/User/repos/functions/evaluation/test-classifier.py diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index 354f52e7d..84530415e 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -7,110 +7,15 @@ "# archive to parquet" 
] }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import mlrun\n", - "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" - ] - }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ - "# nuclio: ignore\n", - "import nuclio" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "%nuclio: setting spec.build.baseImage to 'yjbds/mlrun-files:latest'\n" - ] - } - ], - "source": [ - "%nuclio config spec.build.baseImage = \"yjbds/mlrun-files:latest\"" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ + "import mlrun\n", "import os\n", - "import json\n", - "from pathlib import Path\n", - "import pandas as pd\n", - "import pyarrow.parquet as pq\n", - "import pyarrow as pa\n", - "from cloudpickle import dump, load\n", - "\n", - "from mlrun.execution import MLClientCtx\n", - "from typing import IO, AnyStr, Union, List, Optional\n", - "\n", - "\n", - "def arc_to_parquet(\n", - " context: MLClientCtx,\n", - " archive_url: Union[str, Path, IO[AnyStr]],\n", - " header: Optional[List[str]] = None,\n", - " target_path: str = \"\",\n", - " name: str = \"\",\n", - " chunksize: int = 10_000,\n", - " log_data: bool = True,\n", - " add_uid: bool = False,\n", - " key: str = \"raw_data\",\n", - ") -> None:\n", - " \"\"\"Open a file/object archive and save as a parquet file.\n", - " \n", - " :param context: function context\n", - " :param archive_url: any valid string path consistent with the path variable\n", - " of pandas.read_csv, including strings as file paths, as urls, \n", - " pathlib.Path objects, etc...\n", - " :param header: column names\n", - " :param target_path: destination folder of table\n", - " :param name: name file to be saved locally, also\n", - " :param chunksize: (0) row size retrieved per iteration\n", - " :param key: 
key in artifact store (when log_data=True)\n", - " \"\"\"\n", - " if not name.endswith(\".parquet\"):\n", - " name += \".parquet\"\n", - "\n", - " dest_path = os.path.join(target_path, name)\n", - " os.makedirs(os.path.join(target_path), exist_ok=True)\n", - " if not os.path.isfile(dest_path):\n", - " context.logger.info(\"destination file does not exist, downloading\")\n", - " pqwriter = None\n", - " for i, df in enumerate(pd.read_csv(archive_url, chunksize=chunksize, names=header)):\n", - " parquet_schema = pa.Table.from_pandas(df=df).schema\n", - " if i == 0:\n", - " pqwriter = pq.ParquetWriter(dest_path, parquet_schema)\n", - " table = pa.Table.from_pandas(df, parquet_schema)\n", - " pqwriter.write_table(table)\n", - " if pqwriter:\n", - " pqwriter.close()\n", - "\n", - " context.logger.info(f\"saved table to {dest_path}\")\n", - " else:\n", - " context.logger.info(\"destination file already exists\")\n", - "\n", - " context.log_artifact(key, target_path=dest_path)\n", - " # log header\n", - " filepath = os.path.join(target_path, 'header.pkl')\n", - " dump(header, open(filepath, 'wb'))\n", - " context.log_artifact('header', target_path=filepath) " + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" ] }, { @@ -119,20 +24,7 @@ "metadata": {}, "outputs": [], "source": [ - "# nuclio: end-code" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "# create job function object from notebook code\n", - "fn = mlrun.code_to_function(\n", - " 'arc to parquet',\n", - " runtime='job', \n", - " handler=arc_to_parquet)" + "CODE_BASE = '/User/repos/functions/' \n" ] }, { @@ -149,66 +41,14 @@ "outputs": [], "source": [ "# load function from a local Python file\n", - "# fn = mlrun.code_to_function('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py', kind='job')" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - 
"text": [ - "[mlrun] 2020-01-22 17:42:17,438 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" - ] - } - ], - "source": [ - "# export function yaml\n", - "fn.export('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# import function yaml\n", - "# fn = mlrun.import_function('/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# push yaml to github" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# load function from Github\n", - "# fn = mlrun.import_function(\n", - "# 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils/arc_to_parquet/arc_to_parquet.yaml')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# configure function: mount on the Iguazio data fabric, set as interactive (return stdout)\n", - "fn.apply(mlrun.mount_v3io())\n", - "fn.interactive = True" + "yaml_name = os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", + "if not os.path.isfile(yaml_name):\n", + " testfn = mlrun.code_to_function(CODE_BASE + '/arc_to_parquet/arc_to_parquet.py', \n", + " kind='job')\n", + " testfn.build_config(base_image='yjbds/mlrun-ds:latest')\n", + " testfn.export(yaml_name)\n", + " testfn.apply(mlrun.mount_v3io())\n", + " fn.interactive = True" ] }, { @@ -231,16 +71,7 @@ "metadata": {}, "outputs": [], "source": [ - "fn.deploy()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# fn.with_code()" + "fn.deploy(skip_deployed=True, with_mlrun=False)" ] }, { diff --git a/tests/create_binary_data.ipynb 
b/tests/create_binary_data.ipynb index 146dbbf47..9e35912ae 100644 --- a/tests/create_binary_data.ipynb +++ b/tests/create_binary_data.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -14,33 +14,42 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "TARGET_CODE_PATH = '/User/repos/functions/datagen/classification'\n", - "N_SAMPLES = 10_000\n", - "M_FEATURES = 20\n", + "N_SAMPLES = 100_000\n", + "M_FEATURES = 28\n", "NEG_WEIGHT = 0.5\n", "TARGET_DATA_PATH = '/User/mlrun/datagen'\n", - "KEY = 'bindata'" + "KEY = 'simdata'" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 15, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 13:24:11,657 function spec saved to path: /User/repos/functions/datagen/classification/binary.yaml\n" + ] + } + ], "source": [ - "# mlrun.code_to_function(\n", - "# filename=os.path.join(TARGET_CODE_PATH, 'binary.py'), \n", - "# kind='job'\n", - "# ).export(os.path.join(TARGET_CODE_PATH, 'binary.yaml'))" + "testfn = mlrun.code_to_function(\n", + " filename=os.path.join(TARGET_CODE_PATH, 'binary.py'), \n", + " kind='job')\n", + "testfn.build_config(base_image='yjbds/mlrun-ds:latest')\n", + "testfn.export(os.path.join(TARGET_CODE_PATH, 'binary.yaml'))" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ @@ -51,29 +60,40 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 17, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "'ready'" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# binarydatagen.deploy()" + "binarydatagen.deploy(skip_deployed=True, with_mlrun=False)" ] }, { "cell_type": 
"code", - "execution_count": 6, + "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-22 20:01:50,178 starting run create_binary_classification uid=b52bfe49d1644faf806a3f90c288012d -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-22 20:01:50,258 Job is running in the background, pod: create-binary-classification-fcvdh\n", + "[mlrun] 2020-01-26 13:24:12,576 starting run create_binary_classification uid=86d601dc16eb4fac90f08974153d0d9d -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 13:24:12,655 Job is running in the background, pod: create-binary-classification-7zlqd\n", + "[mlrun] 2020-01-26 13:24:25,635 log artifact simdata at /User/mlrun/datagen/simdata-1e05X28.parquet, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-26 13:24:25,648 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", - "[mlrun] 2020-01-22 20:02:02,615 log artifact bindata at /User/mlrun/datagen/simdata-1e04X20.parquet, size: None, db: Y\n", - "\n", - "[mlrun] 2020-01-22 20:02:02,629 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -246,26 +266,26 @@ " \n", " \n", " \n", - "
...88012d
\n", + "
...3d0d9d
\n", " 0\n", - " Jan 22 20:02:02\n", + " Jan 26 13:24:24\n", " completed\n", " binary\n", - "
host=create-binary-classification-fcvdh
kind=job
owner=admin
\n", + "
host=create-binary-classification-7zlqd
kind=job
owner=admin
\n", " \n", - "
key=bindata
m_features=20
n_samples=10000
target_path=/User/mlrun/datagen
weight=0.5
\n", + "
key=simdata
m_features=28
n_samples=100000
target_path=/User/mlrun/datagen
weight=0.5
\n", " \n", - "
bindata
\n", + "
simdata
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -281,17 +301,17 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run b52bfe49d1644faf806a3f90c288012d , !mlrun logs b52bfe49d1644faf806a3f90c288012d \n", - "[mlrun] 2020-01-22 20:02:09,459 run executed, status=completed\n" + "!mlrun get run 86d601dc16eb4fac90f08974153d0d9d , !mlrun logs 86d601dc16eb4fac90f08974153d0d9d \n", + "[mlrun] 2020-01-26 13:24:31,859 run executed, status=completed\n" ] }, { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 6, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } @@ -317,7 +337,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ diff --git a/tests/test_classifier.ipynb b/tests/test_classifier.ipynb index 233b043c9..91510af1b 100644 --- a/tests/test_classifier.ipynb +++ b/tests/test_classifier.ipynb @@ -11,7 +11,7 @@ }, { "cell_type": "code", - "execution_count": 57, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -30,7 +30,7 @@ }, { "cell_type": "code", - "execution_count": 58, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -52,17 +52,9 @@ }, { "cell_type": "code", - "execution_count": 122, + "execution_count": 3, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-26 00:40:51,865 function spec saved to path: /User/repos/functions/evaluation/test-classifier.yaml\n" - ] - } - ], + "outputs": [], "source": [ "yaml_name = os.path.join(CODE_BASE, 'evaluation', 'test-classifier.yaml')\n", "if not os.path.isfile(yaml_name):\n", @@ -70,13 +62,14 @@ " kind='job', \n", " image='yjbds/mlrun-ds:latest',\n", " filename=os.path.join(CODE_BASE, 'evaluation', 'test-classifier.py'))\n", + " testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", "\n", " testfn.export(os.path.join(CODE_BASE, 'evaluation', 'test-classifier.yaml'))" ] }, { "cell_type": 
"code", - "execution_count": 123, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -87,7 +80,7 @@ }, { "cell_type": "code", - "execution_count": 124, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -96,7 +89,7 @@ "'ready'" ] }, - "execution_count": 124, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -107,16 +100,16 @@ }, { "cell_type": "code", - "execution_count": 125, + "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 125, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -132,28 +125,32 @@ }, { "cell_type": "code", - "execution_count": 126, + "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 00:41:47,446 starting run test_model uid=ea859673d08b4eccbd7746b9d36fb8e8 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 00:41:47,550 Job is running in the background, pod: test-model-shb4b\n", - "[mlrun] 2020-01-26 00:42:01,686 log artifact roc.html at roc.html, size: 40483, db: Y\n", - "[mlrun] 2020-01-26 00:42:02,895 log artifact confusion_matrix.html at confusion_matrix.html, size: 15292, db: Y\n", - "[mlrun] 2020-01-26 00:42:03,498 log artifact feature-importances-plot.html at feature-importances-plot.html, size: 67516, db: Y\n", - "[mlrun] 2020-01-26 00:42:03,525 log artifact feature-importances-table at /User/mlrun/splitter/feature-importances-table.csv, size: None, db: Y\n", + "[mlrun] 2020-01-26 13:19:07,552 starting run test_model uid=b2dff34b7d184eb89aac9e919764e9c4 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 13:19:07,644 Job is running in the background, pod: test-model-4qkft\n", + "[mlrun] 2020-01-26 13:19:17,924 Traceback (most recent call last):\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", + " val = handler(*args_list)\n", + " File \"main.py\", line 68, 
in test_model\n", + " ypred_probs = clf.predict_proba(xtest)[:, 1]\n", + " File \"/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py\", line 858, in predict_proba\n", + " pred_leaf, pred_contrib, **kwargs)\n", + " File \"/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py\", line 658, in predict\n", + " % (self._n_features, n_features))\n", + "ValueError: Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n", "\n", - "[mlrun] 2020-01-26 00:42:03,596 run executed, status=completed\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", - " labels = getattr(columns, 'labels', None) or [\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", - " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", - " labels, = index.labels\n", - "final state: succeeded\n" + "\n", + "[mlrun] 2020-01-26 13:19:17,941 exec error - Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n", + "[mlrun] 2020-01-26 13:19:17,968 run executed, status=error\n", + "Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n", + "runtime error: Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n", + "final state: failed\n" ] }, { @@ -325,26 +322,26 @@ " \n", " \n", " \n", - "
...6fb8e8
\n", + "
...64e9c4
\n", " 0\n", - " Jan 26 00:41:54\n", - " completed\n", + " Jan 26 13:19:14\n", + "
error
\n", " test-classifier\n", - "
host=test-model-shb4b
kind=job
owner=admin
\n", + "
host=test-model-4qkft
kind=job
owner=admin
\n", " \n", "
model=/User/mlrun/models/lgb-classifier.pkl
target_path=/User/mlrun/splitter
xtest=/User/mlrun/splitter/xtest.pqt
ytest=/User/mlrun/splitter/ytest.pqt
\n", " \n", - "
roc.html
confusion_matrix.html
feature-importances-plot.html
feature-importances-table
\n", + " \n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -360,8 +357,22 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run ea859673d08b4eccbd7746b9d36fb8e8 , !mlrun logs ea859673d08b4eccbd7746b9d36fb8e8 \n", - "[mlrun] 2020-01-26 00:42:06,762 run executed, status=completed\n" + "!mlrun get run b2dff34b7d184eb89aac9e919764e9c4 , !mlrun logs b2dff34b7d184eb89aac9e919764e9c4 \n", + "[mlrun] 2020-01-26 13:19:26,852 run executed, status=error\n", + "runtime error: Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n" + ] + }, + { + "ename": "RunError", + "evalue": "Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 ", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtsk_run\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtestfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhandler\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'test_model'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: 
{}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mRunError\u001b[0m: Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 " ] } ], diff --git a/tests/train_classifier.ipynb b/tests/train_classifier.ipynb index 332b0cbac..f5bc34c61 100644 --- a/tests/train_classifier.ipynb +++ b/tests/train_classifier.ipynb @@ -47,7 +47,7 @@ "metadata": {}, "outputs": [], "source": [ - "TARGET_CODE_BASE = '/User/repos/functions/' \n", + "CODE_BASE = '/User/repos/functions/' \n", "N_SAMPLES = 100_000 # size of HIGGS data\n", "M_FEATURES = 20\n", "NEG_WEIGHT = 0.5\n", @@ -57,7 +57,7 @@ "RNG = 1\n", "SKLEARN_CLASSIFIER = 'lightgbm.sklearn.LGBMClassifier'\n", "MODEL_KEY = 'model'\n", - "MODEL_NAME = MODEL_KEY\n", + "MODEL_NAME = 'lgb-classifier.pkl'\n", "VERBOSE = False" ] }, @@ -75,13 +75,13 @@ "outputs": [], "source": [ "binarydatagen = mlrun.import_function(\n", - " os.path.join(TARGET_CODE_BASE+'datagen/classification', 'binary.yaml')\n", + " os.path.join(CODE_BASE+'datagen/classification', 'binary.yaml')\n", ").apply(mlrun.mount_v3io())" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -90,27 +90,27 @@ "'ready'" ] }, - "execution_count": 
5, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "binarydatagen.deploy(skip_deployed=True)" + "binarydatagen.deploy(skip_deployed=True, with_mlrun=False)" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 6, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -129,18 +129,18 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-23 11:46:49,385 starting run create_binary_classification uid=e1164e49ef22478791f5b23fea2de60b -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-23 11:46:49,486 Job is running in the background, pod: create-binary-classification-j6gng\n", - "[mlrun] 2020-01-23 11:47:00,255 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 13:15:51,762 starting run create_binary_classification uid=39417bbf476c45b7a5cb0809e883b979 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 13:15:51,849 Job is running in the background, pod: create-binary-classification-vsdvh\n", + "[mlrun] 2020-01-26 13:16:02,850 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-23 11:47:00,268 run executed, status=completed\n", + "[mlrun] 2020-01-26 13:16:02,862 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", "final state: succeeded\n" @@ -315,12 +315,12 @@ " \n", " \n", " \n", - "
...2de60b
\n", + "
...83b979
\n", " 0\n", - " Jan 23 11:46:59\n", + " Jan 26 13:16:02\n", " completed\n", " binary\n", - "
host=create-binary-classification-j6gng
kind=job
owner=admin
\n", + "
host=create-binary-classification-vsdvh
kind=job
owner=admin
\n", " \n", "
filename=simdata.pqt
key=simdata
m_features=20
n_samples=100000
random_state=1
target_path=/User/mlrun/sklearn-classifier
weight=0.5
\n", " \n", @@ -329,12 +329,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -350,8 +350,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run e1164e49ef22478791f5b23fea2de60b , !mlrun logs e1164e49ef22478791f5b23fea2de60b \n", - "[mlrun] 2020-01-23 11:47:08,728 run executed, status=completed\n" + "!mlrun get run 39417bbf476c45b7a5cb0809e883b979 , !mlrun logs 39417bbf476c45b7a5cb0809e883b979 \n", + "[mlrun] 2020-01-26 13:16:11,051 run executed, status=completed\n" ] } ], @@ -363,257 +363,100 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "____\n", - "# tests" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "df = pd.read_parquet(os.path.join(TARGET_DATA_PATH, FILE_NAME), engine='pyarrow')" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "assert tsk1.output(KEY) == os.path.join(TARGET_DATA_PATH, FILE_NAME), \"binary.yaml failed to create a file\"\n", - "assert df.shape== (N_SAMPLES, M_FEATURES+1), \"simulation data artifact is not of the correct dimensions\"" + "______" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "_____\n", - "## train a classifier" + "## split the generated data" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ - "trainfn = mlrun.import_function(\n", - " os.path.join(TARGET_CODE_BASE+'train/sklearn-classifier.yaml')\n", + "splitter = mlrun.import_function(\n", + " os.path.join(CODE_BASE+'datagen/splitters', 'train_valid_test.yaml')\n", ").apply(mlrun.mount_v3io())" ] }, { "cell_type": "code", - "execution_count": 18, - "metadata": { - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, + "execution_count": 8, + "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-23 11:49:20,163 starting remote build, image: 
.mlrun/func-default-sklearn-classifier-latest\n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", - "\u001b[36mINFO\u001b[0m[0044] Taking snapshot of full filesystem... 
\n", - "\u001b[36mINFO\u001b[0m[0065] RUN pip install mlrun \n", - "\u001b[36mINFO\u001b[0m[0065] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0065] args: [-c pip install mlrun] \n", - "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", - "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", - "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", - "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", - "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", - "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", - "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", - "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", - "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", - "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.7)\n", - "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", - "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", - "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", - "Requirement already satisfied: kfp>=0.1.29 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", - "Requirement already satisfied: nest-asyncio>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", - "Requirement already satisfied: aiohttp>=3.5.0 in 
/opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", - "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.1)\n", - "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", - "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", - "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", - "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", - "Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", - "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (3.0.4)\n", - "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", - "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (1.24.1)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2019.9.11)\n", - "Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.7/site-packages (from croniter==0.3.31->mlrun) (2.8.0)\n", - "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", - "Requirement already satisfied: botocore<1.15.0,>=1.14.7 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.7)\n", - "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", - "Requirement already satisfied: greenlet>=0.4.14; 
platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", - "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", - "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", - "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", - "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", - "Requirement already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", - "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", - "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", - "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", - "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", - "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", - "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", - "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", - "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", - "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from 
aiohttp>=3.5.0->mlrun) (3.0.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", - "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", - "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", - "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", - "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", - "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", - "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", - "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", - "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", - "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", - "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.7->boto3>=1.9->mlrun) (0.15.2)\n", - "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", - "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in 
/opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", - "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", - "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", - "Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (41.0.1.post20191122)\n", - "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", - "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", - "Requirement already satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", - "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", - "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", - "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", - "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", - "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", - "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from 
jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", - "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", - "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", - "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", - "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", - "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", - "Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", - "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", - "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", - "Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", - "Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: nbformat>=4.4 in 
/opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", - "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", - "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", - "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", - "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", - "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", - "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", - "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", - "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", - "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", - "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", - "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from 
cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", - "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", - "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", - "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", - "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", - "Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", - "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", - "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", - "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", - "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", - "Requirement 
already satisfied: more-itertools in /opt/conda/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", - "\u001b[36mINFO\u001b[0m[0067] Taking snapshot of full filesystem... \n" - ] - }, { "data": { "text/plain": [ - "True" + "'ready'" ] }, - "execution_count": 18, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "trainfn.deploy()" + "splitter.deploy(skip_deployed=True, with_mlrun=False)" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 19, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "task2 = mlrun.NewTask()\n", - "task2.with_params(\n", - " src_file=tsk1.output(KEY),\n", - " SKClassifier=SKLEARN_CLASSIFIER,\n", - " name=MODEL_NAME,\n", - " key=MODEL_KEY,\n", - " verbose=VERBOSE,\n", - " random_state=RNG,\n", - " callbacks = [])" + "task1 = mlrun.NewTask()\n", + "task1.with_params(\n", + " src_file=TARGET_DATA_PATH + '/' + FILE_NAME,\n", + " sample=20_000,\n", + " target_path=TARGET_DATA_PATH,\n", + " random_state=RNG)" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-23 11:50:58,444 starting run train uid=d7118c8161b9487ea79b136cd2d4a0cc -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-23 11:50:58,533 Job is running in the background, pod: train-s9w4j\n", - "[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the \"boost_from_average\" parameter in \"binary\" objective is true.\n", - "This may cause significantly different results comparing to the previous versions of LightGBM.\n", - "Try to set boost_from_average=false, if your old models produce bad results\n", - "[LightGBM] [Warning] Cannot change bin_construct_sample_cnt 
after constructed Dataset handle.\n", - "[mlrun] 2020-01-23 11:51:12,955 log artifact model at model, size: None, db: Y\n", - "[mlrun] 2020-01-23 11:51:12,974 log artifact xtest at xtest.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-23 11:51:12,998 log artifact ytest at ytest.pkl, size: None, db: Y\n", - "\n", - "[mlrun] 2020-01-23 11:51:13,022 run executed, status=completed\n", + "[mlrun] 2020-01-26 13:16:11,109 starting run train_valid_test_splitter uid=ecb802dffb1b43269c25f43fd7a4919a -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 13:16:11,191 Job is running in the background, pod: train-valid-test-splitter-7k25p\n", + "[mlrun] 2020-01-26 13:16:21,068 log artifact header at /User/mlrun/sklearn-classifier/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-26 13:16:21,156 log artifact xtrain at /User/mlrun/sklearn-classifier/xtrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 13:16:21,220 log artifact xvalid at /User/mlrun/sklearn-classifier/xvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 13:16:21,262 log artifact xtest at /User/mlrun/sklearn-classifier/xtest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 13:16:21,280 log artifact ytrain at /User/mlrun/sklearn-classifier/ytrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 13:16:21,298 log artifact yvalid at /User/mlrun/sklearn-classifier/yvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 13:16:21,312 log artifact ytest at /User/mlrun/sklearn-classifier/ytest.pqt, size: None, db: Y\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. 
Use .codes instead.\n", " labels = getattr(columns, 'labels', None) or [\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", " labels, = index.labels\n", + "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", + " result = infer_dtype(pandas_collection)\n", + "\n", + "[mlrun] 2020-01-26 13:16:21,325 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -786,26 +629,26 @@ " \n", " \n", " \n", - "
...d4a0cc
\n", + "
...a4919a
\n", " 0\n", - " Jan 23 11:51:07\n", + " Jan 26 13:16:20\n", " completed\n", - " sklearn-classifier\n", - "
host=train-s9w4j
kind=job
owner=admin
\n", + " train-valid-test\n", + "
host=train-valid-test-splitter-7k25p
kind=job
owner=admin
\n", " \n", - "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=model
random_state=1
src_file=/User/mlrun/sklearn-classifier/simdata.pqt
verbose=False
\n", - "
train_accuracy=0.9546808100860753
\n", - "
model
xtest
ytest
\n", + "
random_state=1
sample=20000
src_file=/User/mlrun/sklearn-classifier/simdata.pqt
target_path=/User/mlrun/sklearn-classifier
\n", + " \n", + "
header
xtrain
xvalid
xtest
ytrain
yvalid
ytest
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -821,36 +664,39 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run d7118c8161b9487ea79b136cd2d4a0cc , !mlrun logs d7118c8161b9487ea79b136cd2d4a0cc \n", - "[mlrun] 2020-01-23 11:51:17,725 run executed, status=completed\n" + "!mlrun get run ecb802dffb1b43269c25f43fd7a4919a , !mlrun logs ecb802dffb1b43269c25f43fd7a4919a \n", + "[mlrun] 2020-01-26 13:16:30,357 run executed, status=completed\n" ] } ], "source": [ - "tsk2 = trainfn.run(task2, handler='train')" + "tsk1 = splitter.run(task1, handler='train_valid_test_splitter')" ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'train_accuracy': 0.9546808100860753,\n", - " 'model': 'model',\n", - " 'xtest': 'xtest.pkl',\n", - " 'ytest': 'ytest.pkl'}" + "{'header': '/User/mlrun/sklearn-classifier/header.pkl',\n", + " 'xtrain': '/User/mlrun/sklearn-classifier/xtrain.pqt',\n", + " 'xvalid': '/User/mlrun/sklearn-classifier/xvalid.pqt',\n", + " 'xtest': '/User/mlrun/sklearn-classifier/xtest.pqt',\n", + " 'ytrain': '/User/mlrun/sklearn-classifier/ytrain.pqt',\n", + " 'yvalid': '/User/mlrun/sklearn-classifier/yvalid.pqt',\n", + " 'ytest': '/User/mlrun/sklearn-classifier/ytest.pqt'}" ] }, - "execution_count": 21, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "tsk2.outputs" + "tsk1.outputs" ] }, { @@ -858,74 +704,120 @@ "metadata": {}, "source": [ "_____\n", - "## train another classifier" + "## train a classifier" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 12, "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 13:16:30,608 function spec saved to path: /User/repos/functions/train/sklearn-classifier.yaml\n" + ] + } + ], "source": [ - "____" + "yaml_name = os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml')\n", + "if not 
os.path.isfile(yaml_name):\n", + " testfn = mlrun.code_to_function(\n", + " kind='job', \n", + " image='yjbds/mlrun-ds:latest',\n", + " filename=os.path.join(CODE_BASE, 'train', 'sklearn-classifier.py'))\n", + " testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", + " testfn.export(os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml'))" ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "trainfn = mlrun.import_function(\n", + " os.path.join(CODE_BASE+'train/sklearn-classifier.yaml')\n", + ").apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "code", + "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "'ready'" ] }, - "execution_count": 24, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "task3 = mlrun.NewTask()\n", - "task3.with_params(\n", + "trainfn.deploy(skip_deployed=True, with_mlrun=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "task2 = mlrun.NewTask()\n", + "task2.with_params(\n", " src_file=tsk1.output(KEY),\n", - " SKClassifier='xgboost.XGBClassifier',\n", - " name='xgb-classifier.pkl',\n", - " key='xgb-classifier',\n", - " verbose=VERBOSE,\n", - " random_state=RNG,\n", - " callbacks = [])" + " SKClassifier=SKLEARN_CLASSIFIER,\n", + " callbacks = [],\n", + " xtrain=tsk1.outputs['xtrain'],\n", + " ytrain=tsk1.outputs['ytrain'],\n", + " xvalid=tsk1.outputs['xvalid'],\n", + " yvalid=tsk1.outputs['yvalid'],\n", + " target_path='/User/mlrun/models',\n", + " name=MODEL_NAME,\n", + " key=MODEL_KEY,\n", + " verbose=VERBOSE)" ] }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ - "[mlrun] 2020-01-23 11:52:46,121 starting run train uid=3539274893904935adea979b410bf135 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-23 11:52:46,218 Job is running in the background, pod: train-qwzg9\n", - "[mlrun] 2020-01-23 11:52:56,785 Traceback (most recent call last):\n", - " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", - " val = handler(*args_list)\n", - " File \"main.py\", line 91, in train\n", - " verbose=verbose)\n", - "TypeError: fit() got an unexpected keyword argument 'eval_names'\n", - "\n", + "[mlrun] 2020-01-26 13:17:59,993 starting run train uid=b57510063377418ab0f90b33d14b6117 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 13:18:00,097 Job is running in the background, pod: train-wr4m2\n", + "[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the \"boost_from_average\" parameter in \"binary\" objective is true.\n", + "This may cause significantly different results comparing to the previous versions of LightGBM.\n", + "Try to set boost_from_average=false, if your old models produce bad results\n", + "[LightGBM] [Warning] Cannot change bin_construct_sample_cnt after constructed Dataset handle.\n", + "[mlrun] 2020-01-26 13:18:15,384 log artifact training-validation-plot.html at training-validation-plot.html, size: 32700, db: Y\n", + "[mlrun] 2020-01-26 13:18:15,498 log artifact model at /User/mlrun/models/lgb-classifier.pkl, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-23 11:52:56,796 exec error - fit() got an unexpected keyword argument 'eval_names'\n", - "[mlrun] 2020-01-23 11:52:56,830 run executed, status=error\n", - "runtime error: fit() got an unexpected keyword argument 'eval_names'\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. 
Use .codes instead.\n", - " labels = getattr(columns, 'labels', None) or [\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", - " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", - " labels, = index.labels\n", - "fit() got an unexpected keyword argument 'eval_names'\n", - "final state: failed\n" + "[mlrun] 2020-01-26 13:18:15,519 run executed, status=completed\n", + "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", + " y = column_or_1d(y, warn=True)\n", + "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:268: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", + " y = column_or_1d(y, warn=True)\n", + "final state: succeeded\n" ] }, { @@ -1097,26 +989,26 @@ " \n", " \n", " \n", - "
...0bf135
\n", + "
...4b6117
\n", " 0\n", - " Jan 23 11:52:52\n", - "
error
\n", + " Jan 26 13:18:08\n", + " completed\n", " sklearn-classifier\n", - "
host=train-qwzg9
kind=job
owner=admin
\n", - " \n", - "
SKClassifier=xgboost.XGBClassifier
callbacks=[]
key=xgb-classifier
name=xgb-classifier.pkl
random_state=1
src_file=/User/mlrun/sklearn-classifier/simdata.pqt
verbose=False
\n", - " \n", + "
host=train-wr4m2
kind=job
owner=admin
\n", " \n", + "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=lgb-classifier.pkl
src_file=None
target_path=/User/mlrun/models
verbose=False
xtrain=/User/mlrun/sklearn-classifier/xtrain.pqt
xvalid=/User/mlrun/sklearn-classifier/xvalid.pqt
ytrain=/User/mlrun/sklearn-classifier/ytrain.pqt
yvalid=/User/mlrun/sklearn-classifier/yvalid.pqt
\n", + "
train_accuracy=0.9781481481481481
\n", + "
training-validation-plot.html
model
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -1132,64 +1024,35 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 3539274893904935adea979b410bf135 , !mlrun logs 3539274893904935adea979b410bf135 \n", - "[mlrun] 2020-01-23 11:53:05,425 run executed, status=error\n", - "runtime error: fit() got an unexpected keyword argument 'eval_names'\n" - ] - }, - { - "ename": "RunError", - "evalue": "fit() got an unexpected keyword argument 'eval_names'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtsk3\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrainfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhandler\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'train'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m 
\u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mRunError\u001b[0m: fit() got an unexpected keyword argument 'eval_names'" + "!mlrun get run b57510063377418ab0f90b33d14b6117 , !mlrun logs b57510063377418ab0f90b33d14b6117 \n", + "[mlrun] 2020-01-26 13:18:19,266 run executed, status=completed\n" ] } ], "source": [ - "tsk3 = trainfn.run(task3, handler='train')" + "tsk2 = trainfn.run(task2, handler='train')" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tsk3.outputs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## evaluation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "run plots here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## model optimization" - ] - }, - { - "cell_type": "markdown", + "execution_count": 22, "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'train_accuracy': 0.9781481481481481,\n", + " 'training-validation-plot.html': 'training-validation-plot.html',\n", + " 'model': '/User/mlrun/models/lgb-classifier.pkl'}" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "onnx here" + "tsk2.outputs" ] }, { diff --git a/tests/train_valid_test_split.ipynb b/tests/train_valid_test_split.ipynb index 41c4064cc..22b5040a4 100644 --- a/tests/train_valid_test_split.ipynb +++ b/tests/train_valid_test_split.ipynb @@ -28,17 +28,18 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ 
"CODE_BASE = '/User/repos/functions/datagen' \n", - "N_SAMPLES = 10_000_000 # size of HIGGS data\n", + "N_SAMPLES = 100_000\n", "M_FEATURES = 28\n", "NEG_WEIGHT = 0.5\n", "RNG = 1\n", "TARGET_DATA_PATH = '/User/mlrun/splitter'\n", - "SRC_FILE = 'simdata.pqt'" + "SRC_FILE = 'simdata.pqt'\n", + "KEY = 'simdata'" ] }, { @@ -50,148 +51,20 @@ }, { "cell_type": "code", - "execution_count": 4, - "metadata": { - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, + "execution_count": 15, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-25 23:30:01,679 starting remote build, image: .mlrun/func-default-binary-latest\n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory \n", - "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", - "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", - "\u001b[36mINFO\u001b[0m[0047] Taking snapshot of full filesystem... 
\n", - "\u001b[36mINFO\u001b[0m[0067] RUN pip install mlrun \n", - "\u001b[36mINFO\u001b[0m[0067] cmd: /bin/sh \n", - "\u001b[36mINFO\u001b[0m[0067] args: [-c pip install mlrun] \n", - "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", - "Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", - "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", - "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", - "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", - "Requirement already satisfied: kfp>=0.1.29 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", - "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", - "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", - "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.1)\n", - "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", - "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", - "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", - "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.7)\n", - "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", - "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", - "Requirement already satisfied: nest-asyncio>=1.0.0 in 
/opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", - "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", - "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", - "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: chardet<4.0,>=2.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.4)\n", - "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", - "Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.7/site-packages (from croniter==0.3.31->mlrun) (2.8.0)\n", - "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", - "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", - "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", - "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", - "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", - "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", - "Requirement 
already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", - "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", - "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", - "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", - "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.10.1)\n", - "Requirement already satisfied: urllib3<1.25,>=1.15 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.24.1)\n", - "Requirement already satisfied: certifi in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2019.9.11)\n", - "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", - "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", - "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", - "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", - "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", - "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", - "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.1)\n", - "Requirement already satisfied: botocore<1.15.0,>=1.14.7 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.7)\n", - "Requirement 
already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", - "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", - "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", - "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", - "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", - "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.5)\n", - "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", - "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", - "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", - "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", - "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", - "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", - "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", - "Requirement already satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from 
Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", - "Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (41.0.1.post20191122)\n", - "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", - "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", - "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", - "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", - "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", - "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", - "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", - "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.7->boto3>=1.9->mlrun) (0.15.2)\n", - "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", - "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", - "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", - "Requirement already 
satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", - "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", - "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", - "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", - "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", - "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", - "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.15.2)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", - "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", - "Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", - 
"Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", - "Requirement already satisfied: nbformat>=4.4 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (5.0.3)\n", - "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", - "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", - "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", - "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", - "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", - "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", - "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", - "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", - "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", - "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", - "Requirement already satisfied: zipp>=0.5 in 
/opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.0.0)\n", - "Requirement already satisfied: pyasn1>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", - "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", - "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", - "Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", - "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", - "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", - "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", - "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", - "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", - "Requirement already satisfied: more-itertools in /opt/conda/lib/python3.7/site-packages (from zipp>=0.5->importlib-metadata; python_version < 
\"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (8.1.0)\n", - "\u001b[36mINFO\u001b[0m[0069] Taking snapshot of full filesystem... \n", - "[mlrun] 2020-01-25 23:31:19,830 starting run create_binary_classification uid=c0e0d32541bb4312aaba3c223860ca7d -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-25 23:31:19,909 Job is running in the background, pod: create-binary-classification-j8pzv\n", - "[mlrun] 2020-01-25 23:32:19,610 log artifact simdata at /User/mlrun/splitter/simdata.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 12:00:09,764 starting run create_binary_classification uid=561adae458ac4df1a16d7ac371d2e450 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 12:00:09,844 Job is running in the background, pod: create-binary-classification-qqm6z\n", + "[mlrun] 2020-01-26 12:00:22,759 log artifact simdata at /User/mlrun/splitter/simdata.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-25 23:32:19,918 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", + "[mlrun] 2020-01-26 12:00:22,773 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -364,26 +237,26 @@ " \n", " \n", " \n", - "
...60ca7d
\n", + "
...d2e450
\n", " 0\n", - " Jan 25 23:31:30\n", + " Jan 26 12:00:21\n", " completed\n", " binary\n", - "
host=create-binary-classification-j8pzv
kind=job
owner=admin
\n", + "
host=create-binary-classification-qqm6z
kind=job
owner=admin
\n", " \n", - "
filename=simdata.pqt
key=simdata
m_features=28
n_samples=10000000
random_state=1
target_path=/User/mlrun/splitter
weight=0.5
\n", + "
filename=/User/mlrun/splitter/simdata.pqt
key=simdata
m_features=28
n_samples=100000
random_state=1
target_path=/User/mlrun/splitter
weight=0.5
\n", " \n", "
simdata
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -399,8 +272,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run c0e0d32541bb4312aaba3c223860ca7d , !mlrun logs c0e0d32541bb4312aaba3c223860ca7d \n", - "[mlrun] 2020-01-25 23:32:29,316 run executed, status=completed\n" + "!mlrun get run 561adae458ac4df1a16d7ac371d2e450 , !mlrun logs 561adae458ac4df1a16d7ac371d2e450 \n", + "[mlrun] 2020-01-26 12:00:29,005 run executed, status=completed\n" ] } ], @@ -417,8 +290,8 @@ " m_features=M_FEATURES,\n", " weight=NEG_WEIGHT,\n", " target_path=TARGET_DATA_PATH,\n", - " filename='simdata.pqt',\n", - " key='simdata',\n", + " filename=TARGET_DATA_PATH + '/' + SRC_FILE,\n", + " key=KEY,\n", " random_state=RNG)\n", "\n", "tsk1 = binarydatagen.run(task1, handler='create_binary_classification')" @@ -426,7 +299,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 16, "metadata": {}, "outputs": [ { @@ -435,7 +308,7 @@ "{'simdata': '/User/mlrun/splitter/simdata.pqt'}" ] }, - "execution_count": 5, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -453,20 +326,31 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 28, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 13:01:10,237 function spec saved to path: /User/repos/functions/datagen/splitters/train_valid_test.yaml\n" + ] + } + ], "source": [ - "# splitfn = mlrun.code_to_function(\n", - "# kind='job', \n", - "# filename=os.path.join(CODE_BASE, 'splitters', 'train_valid_test.py'))\n", - "\n", - "# splitfn.export(os.path.join(CODE_BASE, 'splitters', 'train_valid_test.yaml'))" + "yaml_name = os.path.join(CODE_BASE, 'splitters', 'train_valid_test.yaml')\n", + "if not os.path.isfile(yaml_name):\n", + " testfn = mlrun.code_to_function(\n", + " kind='job', \n", + " image='yjbds/mlrun-ds:latest',\n", + " filename=os.path.join(CODE_BASE, 'splitters', 
'train_valid_test.py'))\n", + " testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", + " testfn.export(yaml_name)" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ @@ -477,23 +361,16 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 19, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-25 23:32:48,510 starting remote build, image: .mlrun/func-default-train-valid-test-latest\n" - ] - }, { "data": { "text/plain": [ - "True" + "'ready'" ] }, - "execution_count": 8, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -504,24 +381,24 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-25 23:32:50,077 starting run train_valid_test_splitter uid=eaf9cfd6724b437da1e926bbd2daa040 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-25 23:32:50,176 Job is running in the background, pod: train-valid-test-splitter-kqxj9\n", - "[mlrun] 2020-01-25 23:33:23,093 log artifact header at /User/mlrun/splitter/header.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-25 23:33:36,963 log artifact xtrain at /User/mlrun/splitter/xtrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-25 23:33:42,116 log artifact xvalid at /User/mlrun/splitter/xvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-25 23:33:44,494 log artifact xtest at /User/mlrun/splitter/xtest.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-25 23:33:45,314 log artifact ytrain at /User/mlrun/splitter/ytrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-25 23:33:45,599 log artifact yvalid at /User/mlrun/splitter/yvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-25 23:33:45,811 log artifact ytest at /User/mlrun/splitter/ytest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 12:00:29,121 starting run 
train_valid_test_splitter uid=5be904e5c6a14a9ba59e016e212f4499 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 12:00:29,206 Job is running in the background, pod: train-valid-test-splitter-kqfdd\n", + "[mlrun] 2020-01-26 12:00:39,440 log artifact header at /User/mlrun/splitter/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-26 12:00:39,753 log artifact xtrain at /User/mlrun/splitter/xtrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 12:00:39,889 log artifact xvalid at /User/mlrun/splitter/xvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 12:00:40,042 log artifact xtest at /User/mlrun/splitter/xtest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 12:00:40,073 log artifact ytrain at /User/mlrun/splitter/ytrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 12:00:40,100 log artifact yvalid at /User/mlrun/splitter/yvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 12:00:40,124 log artifact ytest at /User/mlrun/splitter/ytest.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-25 23:33:46,258 run executed, status=completed\n", + "[mlrun] 2020-01-26 12:00:40,141 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", " labels = getattr(columns, 'labels', None) or [\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", @@ -702,12 +579,12 @@ " \n", " \n", " \n", - "
...daa040
\n", + "
...2f4499
\n", " 0\n", - " Jan 25 23:33:02\n", + " Jan 26 12:00:39\n", " completed\n", " train-valid-test\n", - "
host=train-valid-test-splitter-kqxj9
kind=job
owner=admin
\n", + "
host=train-valid-test-splitter-kqfdd
kind=job
owner=admin
\n", " \n", "
random_state=1
src_file=/User/mlrun/splitter/simdata.pqt
target_path=/User/mlrun/splitter
\n", " \n", @@ -716,12 +593,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -737,8 +614,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run eaf9cfd6724b437da1e926bbd2daa040 , !mlrun logs eaf9cfd6724b437da1e926bbd2daa040 \n", - "[mlrun] 2020-01-25 23:33:49,482 run executed, status=completed\n" + "!mlrun get run 5be904e5c6a14a9ba59e016e212f4499 , !mlrun logs 5be904e5c6a14a9ba59e016e212f4499 \n", + "[mlrun] 2020-01-26 12:00:48,372 run executed, status=completed\n" ] } ], @@ -754,7 +631,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 21, "metadata": {}, "outputs": [ { @@ -769,7 +646,7 @@ " 'ytest': '/User/mlrun/splitter/ytest.pqt'}" ] }, - "execution_count": 10, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } @@ -787,7 +664,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ @@ -796,7 +673,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -812,7 +689,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ @@ -825,7 +702,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -834,7 +711,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ diff --git a/train/sklearn-classifier.py b/train/sklearn-classifier.py index 795d38ace..35f9bb297 100644 --- a/train/sklearn-classifier.py +++ b/train/sklearn-classifier.py @@ -1,21 +1,30 @@ -from mlrun.execution import MLClientCtx -from mlrun.datastore import DataItem +import numpy as np import pandas as pd -import lightgbm as lgb + +import matplotlib.pyplot as plt +from matplotlib.figure import Figure +import seaborn as sns + from typing import Optional, Union import os -from sklearn.model_selection import train_test_split 
import importlib from cloudpickle import dump +from mlrun.execution import MLClientCtx +from mlrun.datastore import DataItem +from mlrun.artifacts import TableArtifact, PlotArtifact + +import warnings +warnings.simplefilter(action='ignore', category=FutureWarning) + def train( context: Optional[MLClientCtx] = None, - src_file: Union[DataItem, str] = '', SKClassifier: str = '', callbacks = [], - test_size: float = 0.1, - train_val_split: float = 0.75, - sample: int = -1, + xtrain: Union[DataItem, str] = '', + ytrain: Union[DataItem, str] = '', + xvalid: Union[DataItem, str] = '', + yvalid: Union[DataItem, str] = '', target_path: str = '', name: str = '', key: str = '', @@ -30,16 +39,12 @@ def train( :param context: the function context - :param src_file: ('raw') name of raw data file - :param sample: (-1). Selects the first n rows, or select a sample starting - from the first. If negative <-1, select a random sample from - the entire file - :param header: (None) header artifact or list of column names. :param SKClassifier: string module and classname of classifier - :param callbacks - :param test_size: (0.1) test set size - :param train_val_split: (0.75) Once the test set has been removed the - training set gets this proportion. 
+ :param callbacks: sklearn classifier fit function callbacks + :param xtrain: + :param ytrain: + :param xvalid: + :param yvalid: :param target_path: folder location of files :param name: destination name for model file :param key: key for model artifact @@ -54,30 +59,11 @@ def train( ``` """ # load data - if isinstance(src_file, DataItem): - src_file = str(src_file) - srcfilepath = os.path.join(target_path, src_file) + xtrain = pd.read_parquet(str(xtrain), engine='pyarrow') + ytrain = pd.read_parquet(str(ytrain), engine='pyarrow') + xvalid = pd.read_parquet(str(xvalid), engine='pyarrow') + yvalid = pd.read_parquet(str(yvalid), engine='pyarrow') - # save only a sample, intended for debugging - if (sample == -1) or (sample >= 1): - # get all rows, or contiguous sample starting at row 1. - raw = pd.read_parquet(srcfilepath, engine='pyarrow') - labels = raw.pop('labels') - raw = raw.iloc[:sample, :] - labels = labels.iloc[:sample] - else: - # grab a random sample - raw = pd.read_parquet(srcfilepath, engine='pyarrow').sample(sample*-1) - labels = raw.pop('labels') - - # double split tp generate 3 data sets: train, validation and test - x, xtest, y, ytest = train_test_split(raw, labels, train_size=1-test_size, - random_state=random_state) - - xtrain, xvalid, ytrain, yvalid = train_test_split(x, y, - train_size=train_val_split, - random_state=random_state) - # create classifier class from string and instantiate splits = SKClassifier.split(".") clfclass = getattr(importlib.import_module(".".join(splits[:-1])), splits[-1]) @@ -90,33 +76,35 @@ def train( callbacks=callbacks, verbose=verbose) - context.log_result("train_accuracy", float(clf.score(xtrain, ytrain))) + context.log_result("train_accuracy", float(model.score(xtrain, ytrain))) # plot train and validation history, save and log loss = np.asarray(model.evals_result_['train']['binary_logloss'], dtype=np.float) val_loss = np.asarray(model.evals_result_['valid']['binary_logloss'], dtype=np.float) - plot_validation(loss, 
val_loss) + plot_validation(context, loss, val_loss, target_path) # save model filepath = os.path.join(target_path, name) - dump(clf, open(filepath, 'wb')) - context.log_artifact(key, target_path=filepath) #, labels=exp_labels) - # save test data - for t in ['x', 'y']: - fname = t + 'test.pkl' - filepath = os.path.join(target_path, fname) - dump(xtest, open(filepath, 'wb')) - context.log_artifact(t+'test', target_path=filepath) + dump(model, open(filepath, 'wb')) + context.log_artifact(key, target_path=filepath) - -def plot_validation(train_metric, valid_metric): +def plot_validation( + context: MLClientCtx, + train_metric, + valid_metric, + target_path: str = '', + name: str = "history.png", + key: str = 'training-validation-plot' +): """Plot train and validation loss curves These curves represent the training round losses from the training and validation sets. + :param context: the function context :param train_metric: train metric :param valid_metric: validation metric + :param target_path: destination path for train/validation history plot artifact """ # generate plot plt.plot(train_metric) @@ -128,9 +116,9 @@ def plot_validation(train_metric, valid_metric): fig = plt.gcf() # save figure and log artifact - plotpath = path.join(target_path, "history.png") + plotpath = os.path.join(target_path, name) plt.savefig(plotpath) - context.log_artifact(PlotArtifact('training-validation-plot', body=fig, target_path=plotpath)) + context.log_artifact(PlotArtifact(key, body=fig)) # plot cleanup plt.cla() diff --git a/train/sklearn-classifier.yaml b/train/sklearn-classifier.yaml index 77bfd49b7..21d86f528 100644 --- a/train/sklearn-classifier.yaml +++ b/train/sklearn-classifier.yaml @@ -1,9 +1,19 @@ kind: job metadata: name: sklearn-classifier + tag: '' + hash: 3d4b7b654a757bb047ac767b082b8529bd7b009e + project: '' spec: + command: '' + args: [] + image: yjbds/mlrun-ds:latest + volumes: [] + volume_mounts: [] + env: [] + description: '' build: - functionSourceCode: 
ZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gbWxydW4uZGF0YXN0b3JlIGltcG9ydCBEYXRhSXRlbQppbXBvcnQgcGFuZGFzIGFzIHBkCmltcG9ydCBsaWdodGdibSBhcyBsZ2IKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbgppbXBvcnQgb3MKZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdAppbXBvcnQgaW1wb3J0bGliCmZyb20gY2xvdWRwaWNrbGUgaW1wb3J0IGR1bXAKCmRlZiB0cmFpbigKICAgIGNvbnRleHQ6IE9wdGlvbmFsW01MQ2xpZW50Q3R4XSA9IE5vbmUsCiAgICBzcmNfZmlsZTogVW5pb25bRGF0YUl0ZW0sIHN0cl0gPSAnJywKICAgIFNLQ2xhc3NpZmllcjogc3RyICA9ICcnLAogICAgY2FsbGJhY2tzICA9IFtdLAogICAgdGVzdF9zaXplOiBmbG9hdCA9IDAuMSwKICAgIHRyYWluX3ZhbF9zcGxpdDogZmxvYXQgPSAwLjc1LAogICAgc2FtcGxlOiBpbnQgPSAtMSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG5hbWU6IHN0ciA9ICcnLAogICAga2V5OiBzdHIgPSAnJywKICAgIHZlcmJvc2U6IGJvb2wgPSBGYWxzZSwKICAgIHJhbmRvbV9zdGF0ZSA9IDEKKSAtPiBOb25lOgogICAgIiIiVHJhaW4gYW5kIHNhdmUgYW4gU2Npa2l0bGVhcm4gbW9kZWwuCiAgICAKICAgIFRoZSBkYXRhIHNvdXJjZSBjYW4gZWl0aGVyIGJlIGEgc3RyaW5nIGZpbGUgbmFtZSBvciBhbiBhcnRpZmFjdCBpdGVtLgogICAgCiAgICBUaGUgaGVhZGVyIGlzIGVpdGggYSBsaXN0IG9mIGNvbHVtbiBuYW1lcywgYW4gYXJ0aWZhY3QgaGVhZGVyIGl0ZW0sIG9yIE5vbmUuCiAgICAKICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICAgICAgdGhlIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBzcmNfZmlsZTogICAgICAgICgncmF3JykgbmFtZSBvZiByYXcgZGF0YSBmaWxlCiAgICA6cGFyYW0gc2FtcGxlOiAgICAgICAgICAoLTEpLiBTZWxlY3RzIHRoZSBmaXJzdCBuIHJvd3MsIG9yIHNlbGVjdCBhIHNhbXBsZSBzdGFydGluZwogICAgICAgICAgICAgICAgICAgICAgICAgICAgZnJvbSB0aGUgZmlyc3QuIElmIG5lZ2F0aXZlIDwtMSwgc2VsZWN0IGEgcmFuZG9tIHNhbXBsZSBmcm9tIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgdGhlIGVudGlyZSBmaWxlCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgICAgICAoTm9uZSkgaGVhZGVyIGFydGlmYWN0IG9yIGxpc3Qgb2YgY29sdW1uIG5hbWVzLgogICAgOnBhcmFtIFNLQ2xhc3NpZmllcjogICAgc3RyaW5nIG1vZHVsZSBhbmQgY2xhc3NuYW1lIG9mIGNsYXNzaWZpZXIKICAgIDpwYXJhbSBjYWxsYmFja3MKICAgIDpwYXJhbSB0ZXN0X3NpemU6ICAgICAgICgwLjEpIHRlc3Qgc2V0IHNpemUKICAgIDpwYXJhbSB0cmFpbl92YWxfc3BsaXQ6ICgwLjc1KSBPbmNlIHRoZSB0ZXN0IHNldCBoYXMgYmVlbiByZW1vdmVkIHRoZSAKICAgICAgICAgICAgICAgICAgICAgICAgICAgIHRyYWluaW5nIHNldCBn
ZXRzIHRoaXMgcHJvcG9ydGlvbi4KICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGZvbGRlciBsb2NhdGlvbiBvZiBmaWxlcwogICAgOnBhcmFtIG5hbWU6ICAgICAgICAgICAgZGVzdGluYXRpb24gbmFtZSBmb3IgbW9kZWwgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgICAga2V5IGZvciBtb2RlbCBhcnRpZmFjdAogICAgOnBhcmFtIHZlcmJvc2UgOiAgICAgICAgKEZhbHNlKSBzaG93IG1ldHJpY3MgZm9yIHRyYWluaW5nL3ZhbGlkYXRpb24gc3RlcHMuCiAgICA6cGFyYW0gcmFuZG9tX3N0YXRlOiAgICAoMSkgc2tsZWFybiBybmcgc2VlZAogICAgCiAgICBleGFtcGxlIGNhbGxiYWNrczoKICAgIGBgYAogICAgZnJvbSBsaWdodGdibSBpbXBvcnQgcmVjb3JkX2V2YWx1YXRpb24KICAgIGV2YWxfcmVzdWx0cyA9IGRpY3QoKQogICAgY2FsbGJhY2tzID0gW3JlY29yZF9ldmFsdWF0aW9uKGV2YWxfcmVzdWx0cyldCiAgICBgYGAKICAgICIiIgogICAgIyBsb2FkIGRhdGEKICAgIGlmIGlzaW5zdGFuY2Uoc3JjX2ZpbGUsIERhdGFJdGVtKToKICAgICAgICBzcmNfZmlsZSA9IHN0cihzcmNfZmlsZSkKICAgIHNyY2ZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBzcmNfZmlsZSkKCiAgICAjIHNhdmUgb25seSBhIHNhbXBsZSwgaW50ZW5kZWQgZm9yIGRlYnVnZ2luZwogICAgaWYgKHNhbXBsZSA9PSAtMSkgb3IgKHNhbXBsZSA+PSAxKToKICAgICAgICAjIGdldCBhbGwgcm93cywgb3IgY29udGlndW91cyBzYW1wbGUgc3RhcnRpbmcgYXQgcm93IDEuCiAgICAgICAgcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKQogICAgICAgIGxhYmVscyA9IHJhdy5wb3AoJ2xhYmVscycpCiAgICAgICAgcmF3ID0gcmF3Lmlsb2NbOnNhbXBsZSwgOl0KICAgICAgICBsYWJlbHMgPSBsYWJlbHMuaWxvY1s6c2FtcGxlXQogICAgZWxzZToKICAgICAgICAjIGdyYWIgYSByYW5kb20gc2FtcGxlCiAgICAgICAgcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKS5zYW1wbGUoc2FtcGxlKi0xKQogICAgICAgIGxhYmVscyA9IHJhdy5wb3AoJ2xhYmVscycpCiAgICAKICAgICMgZG91YmxlIHNwbGl0IHRwIGdlbmVyYXRlIDMgZGF0YSBzZXRzOiB0cmFpbiwgdmFsaWRhdGlvbiBhbmQgdGVzdAogICAgeCwgeHRlc3QsIHksIHl0ZXN0ID0gdHJhaW5fdGVzdF9zcGxpdChyYXcsIGxhYmVscywgdHJhaW5fc2l6ZT0xLXRlc3Rfc2l6ZSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHJhbmRvbV9zdGF0ZT1yYW5kb21fc3RhdGUpCiAgIAogICAgeHRyYWluLCB4dmFsaWQsIHl0cmFpbiwgeXZhbGlkID0gdHJhaW5fdGVzdF9zcGxpdCh4LCB5LCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgdHJhaW5fc2l6ZT10cmFpbl92YWxfc3BsaXQsIAogICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKSAgICAgICAgCiAgIAogICAgIyBjcmVhdGUgY2xhc3NpZmllciBjbGFzcyBmcm9tIHN0cmluZyBhbmQgaW5zdGFudGlhdGUKICAgIHNwbGl0cyA9IFNLQ2xhc3NpZmllci5zcGxpdCgiLiIpCiAgICBjbGZjbGFzcyA9IGdldGF0dHIoaW1wb3J0bGliLmltcG9ydF9tb2R1bGUoIi4iLmpvaW4oc3BsaXRzWzotMV0pKSwgc3BsaXRzWy0xXSkKICAgIGNsZiA9IGNsZmNsYXNzKHJhbmRvbV9zdGF0ZT1yYW5kb21fc3RhdGUsIHZlcmJvc2U9aW50KHZlcmJvc2UgPT0gVHJ1ZSkpCgogICAgY2xmLmZpdCh4dHJhaW4sIAogICAgICAgICAgICB5dHJhaW4sCiAgICAgICAgICAgIGV2YWxfc2V0PVsoeHZhbGlkLCB5dmFsaWQpLCAoeHRyYWluLCB5dHJhaW4pXSwKICAgICAgICAgICAgZXZhbF9uYW1lcz1bJ3ZhbGlkJywgJ3RyYWluJ10sCiAgICAgICAgICAgIGNhbGxiYWNrcz1jYWxsYmFja3MsCiAgICAgICAgICAgIHZlcmJvc2U9dmVyYm9zZSkKICAgICAKICAgIGNvbnRleHQubG9nX3Jlc3VsdCgidHJhaW5fYWNjdXJhY3kiLCBmbG9hdChjbGYuc2NvcmUoeHRyYWluLCB5dHJhaW4pKSkKCiAgICAjIHNhdmUgbW9kZWwKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZHVtcChjbGYsIG9wZW4oZmlsZXBhdGgsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1maWxlcGF0aCkgIywgbGFiZWxzPWV4cF9sYWJlbHMpCiAgICAjIHNhdmUgdGVzdCBkYXRhCiAgICBmb3IgdCBpbiBbJ3gnLCAneSddOgogICAgICAgIGZuYW1lID0gdCArICd0ZXN0LnBrbCcKICAgICAgICBmaWxlcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgZm5hbWUpCiAgICAgICAgZHVtcCh4dGVzdCwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICAgICAgY29udGV4dC5sb2dfYXJ0aWZhY3QodCsndGVzdCcsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQ== + functionSourceCode: 
aW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQKZnJvbSBtYXRwbG90bGliLmZpZ3VyZSBpbXBvcnQgRmlndXJlCmltcG9ydCBzZWFib3JuIGFzIHNucwoKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbgppbXBvcnQgb3MKaW1wb3J0IGltcG9ydGxpYgpmcm9tIGNsb3VkcGlja2xlIGltcG9ydCBkdW1wCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKZnJvbSBtbHJ1bi5kYXRhc3RvcmUgaW1wb3J0IERhdGFJdGVtCmZyb20gbWxydW4uYXJ0aWZhY3RzIGltcG9ydCBUYWJsZUFydGlmYWN0LCBQbG90QXJ0aWZhY3QKCmltcG9ydCB3YXJuaW5ncwp3YXJuaW5ncy5zaW1wbGVmaWx0ZXIoYWN0aW9uPSdpZ25vcmUnLCBjYXRlZ29yeT1GdXR1cmVXYXJuaW5nKQoKZGVmIHRyYWluKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdID0gTm9uZSwKICAgIFNLQ2xhc3NpZmllcjogc3RyICA9ICcnLAogICAgY2FsbGJhY2tzICA9IFtdLAogICAgeHRyYWluOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeXRyYWluOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeHZhbGlkOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeXZhbGlkOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICcnLAogICAgbmFtZTogc3RyID0gJycsCiAgICBrZXk6IHN0ciA9ICcnLAogICAgdmVyYm9zZTogYm9vbCA9IEZhbHNlLAogICAgcmFuZG9tX3N0YXRlID0gMQopIC0+IE5vbmU6CiAgICAiIiJUcmFpbiBhbmQgc2F2ZSBhbiBTY2lraXRsZWFybiBtb2RlbC4KICAgIAogICAgVGhlIGRhdGEgc291cmNlIGNhbiBlaXRoZXIgYmUgYSBzdHJpbmcgZmlsZSBuYW1lIG9yIGFuIGFydGlmYWN0IGl0ZW0uCiAgICAKICAgIFRoZSBoZWFkZXIgaXMgZWl0aCBhIGxpc3Qgb2YgY29sdW1uIG5hbWVzLCBhbiBhcnRpZmFjdCBoZWFkZXIgaXRlbSwgb3IgTm9uZS4KICAgIAogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIFNLQ2xhc3NpZmllcjogICAgc3RyaW5nIG1vZHVsZSBhbmQgY2xhc3NuYW1lIG9mIGNsYXNzaWZpZXIKICAgIDpwYXJhbSBjYWxsYmFja3M6ICAgICAgIHNrbGVhcm4gY2xhc3NpZmllciBmaXQgZnVuY3Rpb24gY2FsbGJhY2tzCiAgICA6cGFyYW0geHRyYWluOiAgICAgICAgICAKICAgIDpwYXJhbSB5dHJhaW46CiAgICA6cGFyYW0geHZhbGlkOgogICAgOnBhcmFtIHl2YWxpZDoKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGZvbGRlciBsb2NhdGlvbiBvZiBmaWxlcwogICAgOnBhcmFtIG5hbWU6ICAgICAgICAgICAgZGVzdGluYXRpb24gbmFtZSBmb3IgbW9kZWwgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgICAga2V5IGZvciBtb2RlbCBhcnRpZmFjdAogICAgOnBhcmFtIHZlcmJvc2Ug
OiAgICAgICAgKEZhbHNlKSBzaG93IG1ldHJpY3MgZm9yIHRyYWluaW5nL3ZhbGlkYXRpb24gc3RlcHMuCiAgICA6cGFyYW0gcmFuZG9tX3N0YXRlOiAgICAoMSkgc2tsZWFybiBybmcgc2VlZAogICAgCiAgICBleGFtcGxlIGNhbGxiYWNrczoKICAgIGBgYAogICAgZnJvbSBsaWdodGdibSBpbXBvcnQgcmVjb3JkX2V2YWx1YXRpb24KICAgIGV2YWxfcmVzdWx0cyA9IGRpY3QoKQogICAgY2FsbGJhY2tzID0gW3JlY29yZF9ldmFsdWF0aW9uKGV2YWxfcmVzdWx0cyldCiAgICBgYGAKICAgICIiIgogICAgIyBsb2FkIGRhdGEKICAgIHh0cmFpbiA9IHBkLnJlYWRfcGFycXVldChzdHIoeHRyYWluKSwgZW5naW5lPSdweWFycm93JykKICAgIHl0cmFpbiA9IHBkLnJlYWRfcGFycXVldChzdHIoeXRyYWluKSwgZW5naW5lPSdweWFycm93JykKICAgIHh2YWxpZCA9IHBkLnJlYWRfcGFycXVldChzdHIoeHZhbGlkKSwgZW5naW5lPSdweWFycm93JykKICAgIHl2YWxpZCA9IHBkLnJlYWRfcGFycXVldChzdHIoeXZhbGlkKSwgZW5naW5lPSdweWFycm93JykKCiAgICAjIGNyZWF0ZSBjbGFzc2lmaWVyIGNsYXNzIGZyb20gc3RyaW5nIGFuZCBpbnN0YW50aWF0ZQogICAgc3BsaXRzID0gU0tDbGFzc2lmaWVyLnNwbGl0KCIuIikKICAgIGNsZmNsYXNzID0gZ2V0YXR0cihpbXBvcnRsaWIuaW1wb3J0X21vZHVsZSgiLiIuam9pbihzcGxpdHNbOi0xXSkpLCBzcGxpdHNbLTFdKQogICAgbW9kZWwgPSBjbGZjbGFzcyhyYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlLCB2ZXJib3NlPWludCh2ZXJib3NlID09IFRydWUpKQoKICAgIG1vZGVsLmZpdCh4dHJhaW4sIAogICAgICAgICAgICAgIHl0cmFpbiwKICAgICAgICAgICAgICBldmFsX3NldD1bKHh2YWxpZCwgeXZhbGlkKSwgKHh0cmFpbiwgeXRyYWluKV0sCiAgICAgICAgICAgICAgZXZhbF9uYW1lcz1bJ3ZhbGlkJywgJ3RyYWluJ10sCiAgICAgICAgICAgICAgY2FsbGJhY2tzPWNhbGxiYWNrcywKICAgICAgICAgICAgICB2ZXJib3NlPXZlcmJvc2UpCiAgICAgCiAgICBjb250ZXh0LmxvZ19yZXN1bHQoInRyYWluX2FjY3VyYWN5IiwgZmxvYXQobW9kZWwuc2NvcmUoeHRyYWluLCB5dHJhaW4pKSkKICAgIAogICAgIyBwbG90IHRyYWluIGFuZCB2YWxpZGF0aW9uIGhpc3RvcnksIHNhdmUgYW5kIGxvZwogICAgbG9zcyA9IG5wLmFzYXJyYXkobW9kZWwuZXZhbHNfcmVzdWx0X1sndHJhaW4nXVsnYmluYXJ5X2xvZ2xvc3MnXSwgZHR5cGU9bnAuZmxvYXQpCiAgICB2YWxfbG9zcyA9IG5wLmFzYXJyYXkobW9kZWwuZXZhbHNfcmVzdWx0X1sndmFsaWQnXVsnYmluYXJ5X2xvZ2xvc3MnXSwgZHR5cGU9bnAuZmxvYXQpCiAgICBwbG90X3ZhbGlkYXRpb24oY29udGV4dCwgbG9zcywgdmFsX2xvc3MsIHRhcmdldF9wYXRoKQogICAgCiAgICAjIHNhdmUgbW9kZWwKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZHVtcChtb2RlbCwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBj
b250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWZpbGVwYXRoKQogICAgICAgIApkZWYgcGxvdF92YWxpZGF0aW9uKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICB0cmFpbl9tZXRyaWMsCiAgICB2YWxpZF9tZXRyaWMsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAiaGlzdG9yeS5wbmciLAogICAga2V5OiBzdHIgPSAndHJhaW5pbmctdmFsaWRhdGlvbi1wbG90JwopOgogICAgIiIiUGxvdCB0cmFpbiBhbmQgdmFsaWRhdGlvbiBsb3NzIGN1cnZlcwoKICAgIFRoZXNlIGN1cnZlcyByZXByZXNlbnQgdGhlIHRyYWluaW5nIHJvdW5kIGxvc3NlcyBmcm9tIHRoZSB0cmFpbmluZwogICAgYW5kIHZhbGlkYXRpb24gc2V0cy4KICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICAgICAgdGhlIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSB0cmFpbl9tZXRyaWM6ICAgIHRyYWluIG1ldHJpYwogICAgOnBhcmFtIHZhbGlkX21ldHJpYzogICAgdmFsaWRhdGlvbiBtZXRyaWMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGRlc3RpbmF0aW4gcGF0aCBmb3IgdHJhaW4vdm9saWRhdGlvbiBoaXN0b3J5IHBsb3QgYXJ0aWZhY3QKICAgICIiIgogICAgIyBnZW5lcmF0ZSBwbG90CiAgICBwbHQucGxvdCh0cmFpbl9tZXRyaWMpCiAgICBwbHQucGxvdCh2YWxpZF9tZXRyaWMpCiAgICBwbHQudGl0bGUoInRyYWluaW5nIHZhbGlkYXRpb24gcmVzdWx0cyIpCiAgICBwbHQueGxhYmVsKCJlcG9jaCIpCiAgICBwbHQueWxhYmVsKCIiKQogICAgcGx0LmxlZ2VuZChbInRyYWluIiwgInZhbGlkIl0pCiAgICBmaWcgPSBwbHQuZ2NmKCkKCiAgICAjIHNhdmUgZmlndXJlIGFuZCBsb2cgYXJ0aWZhY3QKICAgIHBsb3RwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgcGx0LnNhdmVmaWcocGxvdHBhdGgpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3Qoa2V5LCBib2R5PWZpZykpCgogICAgIyBwbG90IGNsZWFudXAKICAgIHBsdC5jbGEoKQogICAgcGx0LmNsZigpCiAgICBwbHQuY2xvc2UoKSAgICAgICAgCg== base_image: yjbds/mlrun-ds:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#3f7e0c78313c0f8da3f2ae8535b625f06f5c3ee4:/User/repos/functions/train/sklearn-classifier.py + code_origin: https://github.com/yjb-ds/functions.git#25e611e4bd05320d342708ce786522bfecaa0e51:/User/repos/functions/train/sklearn-classifier.py From e613e55761fd1ed325ad88155877924aa5b49ccc Mon Sep 17 00:00:00 2001 From: yasha Date: Sun, 26 Jan 2020 14:45:06 +0000 Subject: [PATCH 19/32] tests: all output to models folder --- datagen/classification/binary.yaml | 4 +- 
fileutils/arc_to_parquet/arc_to_parquet.py | 4 +- fileutils/arc_to_parquet/arc_to_parquet.yaml | 9 +- tests/arc_to_parquet.ipynb | 387 +++++++++++++++---- tests/create_binary_data.ipynb | 63 +-- tests/test_classifier.ipynb | 90 ++--- tests/train_classifier.ipynb | 193 +++++---- tests/train_valid_test_split.ipynb | 135 +++---- 8 files changed, 542 insertions(+), 343 deletions(-) diff --git a/datagen/classification/binary.yaml b/datagen/classification/binary.yaml index 90434e184..ae6a759cf 100644 --- a/datagen/classification/binary.yaml +++ b/datagen/classification/binary.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: binary tag: '' - hash: 0a0a5369f0fcf38a0f26b29aa8295046e8fcb4a7 + hash: 0527f27939f7f6b39d435d9e62d484c0bab308c8 project: '' spec: command: '' @@ -15,4 +15,4 @@ spec: functionSourceCode: IyBDb3B5cmlnaHQgMjAxOSBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKaW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgcHlhcnJvdyBhcyBwYQppbXBvcnQgcHlhcnJvdy5wYXJxdWV0IGFzIHBxCmZyb20gdHlwaW5nIGltcG9ydCBPcHRpb25hbCwgTGlzdCwgQW55CmZyb20gc2tsZWFybi5kYXRhc2V0cyBpbXBvcnQgbWFrZV9jbGFzc2lmaWNhdGlvbgoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CgoKZGVmIGNyZWF0ZV9iaW5hcnlfY2xhc3NpZmljYXRpb24oCiAgICBjb250ZXh0IDogTUxDbGllbnRDdHggPSBOb25lLAogICAgbl9zYW1wbGVzIDogaW50ID0gMTAwXzAwMCwKICAgIG1fZmVhdHVyZ
XMgOiBpbnQgPSAyMCwKICAgIGZlYXR1cmVzX2hkciA6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lLAogICAgd2VpZ2h0IDogZmxvYXQgPSAwLjUwLAogICAgcmFuZG9tX3N0YXRlIDogaW50ID0xLAogICAgZmlsZW5hbWUgOiBPcHRpb25hbFtzdHJdID0gTm9uZSwKICAgIHRhcmdldF9wYXRoIDogc3RyID0gIiIsCiAgICBrZXkgOiBzdHIgPSAiIgopOgogICAgIiIiQ3JlYXRlIGEgYmluYXJ5IGNsYXNzaWZpY2F0aW9uIHNhbXBsZSBkYXRhc2V0IGFuZCBzYXZlLgogICAgSWYgbm8gZmlsZW5hbWUgaXMgZ2l2ZW4gaXQgd2lsbCBkZWZhdWx0IHRvOgogICAgJ3NpbWRhdGEte25fc2FtcGxlc31Ye21fZmVhdHVyZXN9LnBhcnF1ZXQnLgogICAgQWxsIG9mIHRoZSBzY2lraXQtbGVhcm4gcGFyYW1ldGVycyBjYW4gYmUgc2V0IHVzaW5nICoqc2tfcGFyYW1zCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbl9zYW1wbGVzOiAgICAgbnVtYmVyIG9mIHJvd3Mvc2FtcGxlcwogICAgOnBhcmFtIG1fZmVhdHVyZXM6ICAgIG51bWJlciBvZiBjb2xzL2ZlYXR1cmVzCiAgICA6cGFyYW0gZmVhdHVyZXNfaGRyOiAgaGVhZGVyIGZvciBmZWF0dXJlcyBhcnJheQogICAgOnBhcmFtIHdlaWdodDogICAgICAgIGZyYWN0aW9uIG9mIHNhbXBsZSAobmVnKQogICAgOnBhcmFtIHJhbmRvbV9zdGF0ZTogIHJuZyBzZWVkIChzZWUgaHR0cHM6Ly9zY2lraXQtbGVhcm4ub3JnL3N0YWJsZS9nbG9zc2FyeS5odG1sI3Rlcm0tcmFuZG9tLXN0YXRlKQogICAgOnBhcmFtIGZpbGVuYW1lOiAgICAgIG9wdGlvbmFsIG5hbWUgZm9yIHN0b3JlZCBkYXRhIGZpbGUKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICBkZXN0aW1hdGlvbiBmb3IgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgIGtleSBvZiBkYXRhIGluIGFydGlmYWN0IHN0b3JlCiAgICBSZXR1cm5zIGZpbGVuYW1lIG9mIGNyZWF0ZWQgZGF0YSAoaW5jbHVkZXMgcGF0aCkuCiAgICAiIiIKICAgICMgY2hlY2sgZGlyZWN0b3JpZXMgZXhpc3QgYW5kIGNyZWF0ZSBmaWxlbmFtZSBpZiBOb25lOgogICAgb3MubWFrZWRpcnModGFyZ2V0X3BhdGgsIGV4aXN0X29rPVRydWUpCiAgICBpZiBub3QgZmlsZW5hbWU6CiAgICAgICAgbmFtZSA9IGYic2ltZGF0YS17bl9zYW1wbGVzOjAuMGV9WHttX2ZlYXR1cmVzfS5wYXJxdWV0Ii5yZXBsYWNlKCIrIiwgIiIpCiAgICAgICAgZmlsZW5hbWUgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBlbHNlOgogICAgICAgIGZpbGVuYW1lID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBmaWxlbmFtZSkKICAgIAogICAgZmVhdHVyZXMsIGxhYmVscyA9IG1ha2VfY2xhc3NpZmljYXRpb24oCiAgICAgICAgbl9zYW1wbGVzPW5fc2FtcGxlcywKICAgICAgICBuX2ZlYXR1cmVzPW1fZmVhdHVyZXMsCiAgICAgICAgd2VpZ2h0cz1bd2VpZ2h0XSwgICMgRmFsc2UKICAgICAgICBuX2NsYXNzZXM9MiwKICAgI
CAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKQoKICAgICMgbWFrZSBkYXRhZnJhbWVzLCBhZGQgY29sdW1uIG5hbWVzLCBjb25jYXRlbmF0ZSAoWCwgeSkKICAgIFggPSBwZC5EYXRhRnJhbWUoZmVhdHVyZXMpCiAgICBpZiBub3QgZmVhdHVyZXNfaGRyOgogICAgICAgIFguY29sdW1ucyA9IFsiZmVhdF8iICsgc3RyKHgpIGZvciB4IGluIHJhbmdlKG1fZmVhdHVyZXMpXQogICAgZWxzZToKICAgICAgICBYLmNvbHVtbnMgPSBmZWF0dXJlc19oZHIKCiAgICB5ID0gcGQuRGF0YUZyYW1lKGxhYmVscywgY29sdW1ucz1bImxhYmVscyJdKQogICAgZGF0YSA9IHBkLmNvbmNhdChbWCwgeV0sIGF4aXM9MSkKCiAgICBwcS53cml0ZV90YWJsZShwYS5UYWJsZS5mcm9tX3BhbmRhcyhkYXRhKSwgZmlsZW5hbWUpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWZpbGVuYW1lKQo= base_image: yjbds/mlrun-ds:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#25e611e4bd05320d342708ce786522bfecaa0e51:/User/repos/functions/datagen/classification/binary.py + code_origin: https://github.com/yjb-ds/functions.git#e4d74d784d42fb25cc75cbcab6d817bb1d2b150c:/User/repos/functions/datagen/classification/binary.py diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py b/fileutils/arc_to_parquet/arc_to_parquet.py index 56cf8f5a5..fd7b697e0 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/arc_to_parquet.py @@ -46,8 +46,8 @@ def arc_to_parquet( :param chunksize: (0) row size retrieved per iteration :param key: key in artifact store (when log_data=True) """ - if not name.endswith(".parquet"): - name += ".parquet" + if not name.endswith(".pqt"): + name += ".pqt" dest_path = os.path.join(target_path, name) os.makedirs(os.path.join(target_path), exist_ok=True) diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index 0233eb15a..74e2278b3 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: c6826488913674ec359334111ac9612c79881a2e + hash: 51478ccddd3f23791f0b4f370fa08781cac91cd1 project: '' spec: command: '' @@ -12,8 +12,7 @@ 
spec: env: [] description: '' build: - functionSourceCode: IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIyIDE3OjQyCgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBhcnF1ZXQiKToKICAgICAgICBuYW1lICs9ICIucGFycXVldCIKCiAgICBkZXN0X3BhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBvcy5tYWtlZGlycyhvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgpLCBleGlzdF9vaz1UcnVlKQogI
CAgaWYgbm90IG9zLnBhdGguaXNmaWxlKGRlc3RfcGF0aCk6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBkb2VzIG5vdCBleGlzdCwgZG93bmxvYWRpbmciKQogICAgICAgIHBxd3JpdGVyID0gTm9uZQogICAgICAgIGZvciBpLCBkZiBpbiBlbnVtZXJhdGUocGQucmVhZF9jc3YoYXJjaGl2ZV91cmwsIGNodW5rc2l6ZT1jaHVua3NpemUsIG5hbWVzPWhlYWRlcikpOgogICAgICAgICAgICBwYXJxdWV0X3NjaGVtYSA9IHBhLlRhYmxlLmZyb21fcGFuZGFzKGRmPWRmKS5zY2hlbWEKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgcGFycXVldF9zY2hlbWEpCiAgICAgICAgICAgIHRhYmxlID0gcGEuVGFibGUuZnJvbV9wYW5kYXMoZGYsIHBhcnF1ZXRfc2NoZW1hKQogICAgICAgICAgICBwcXdyaXRlci53cml0ZV90YWJsZSh0YWJsZSkKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpICAgICAgIAoK - base_image: yjbds/mlrun-files:latest + functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRo
OiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBxdCIpOgogICAgICAgIG5hbWUgKz0gIi5wcXQiCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCBjaHVua3NpemU9Y2h1bmtzaXplLCBuYW1lcz1oZWFkZXIpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBwcXdyaXRlci53cml0ZV90YWJsZSh0YWJsZSkKCiAgICAgICAgaWYgcHF3cml0ZXI6CiAgICAgICAgICAgIHBxd3JpdGVyLmNsb3NlKCkKCiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbyhmInNhdmVkIHRhYmxlIHRvIHtkZXN0X3BhdGh9IikKICAgIGVsc2U6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBhbHJlYWR5IGV4aXN0cyIpCgogICAgY29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1kZXN0X3BhdGgpCiAgICAjIGxvZyBoZWFkZXIKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpCg== + base_image: yjbds/mlrun-ds:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#3f7e0c78313c0f8da3f2ae8535b625f06f5c3ee4:arc - to parquet.ipynb + code_origin: https://github.com/yjb-ds/functions.git#e4d74d784d42fb25cc75cbcab6d817bb1d2b150c:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py diff --git a/tests/arc_to_parquet.ipynb 
b/tests/arc_to_parquet.ipynb index 84530415e..81c407f05 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -9,7 +9,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -20,11 +20,21 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ - "CODE_BASE = '/User/repos/functions/' \n" + "CODE_BASE = '/User/repos/functions/'\n", + "TARGET_PATH = '/User/mlrun/models'\n", + "# ARCHIVE = \"https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz\"\n", + "ARCHIVE = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz\"\n", + "FILE_NAME = 'higgs.pqt'\n", + "KEY = 'higgs'\n", + "\n", + "HEADER = ['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi',\n", + " 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt',\n", + " 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv',\n", + " 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']" ] }, { @@ -36,98 +46,323 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 20, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 14:25:57,524 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + ] + } + ], "source": [ "# load function from a local Python file\n", + "arctoparq = mlrun.code_to_function(\n", + " filename=os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.py'), \n", + " kind='job')\n", + "arctoparq.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", "yaml_name = os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", - "if not os.path.isfile(yaml_name):\n", - " testfn = 
mlrun.code_to_function(CODE_BASE + '/arc_to_parquet/arc_to_parquet.py', \n", - " kind='job')\n", - " testfn.build_config(base_image='yjbds/mlrun-ds:latest')\n", - " testfn.export(yaml_name)\n", - " testfn.apply(mlrun.mount_v3io())\n", - " fn.interactive = True" + "arctoparq.export(yaml_name)" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 21, "metadata": {}, + "outputs": [], "source": [ - "#### deploy / build" + "arctoparq = mlrun.import_function(\n", + " os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", + ").apply(mlrun.mount_v3io())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The following triggers a build when run for the first time using specs found in the yaml file above. Unless that file changes, this only needs to be run once, even after the notebook has been restarted:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fn.deploy(skip_deployed=True, with_mlrun=False)" + "#### deploy / build" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Also note that the build time can be reduced if you specifiy a pre-built image with all required packages pre-installed." + "The following triggers a build when run for the first time using specs found in the yaml file above. 
Unless that file changes, this only needs to be run once, even after the notebook has been restarted:" ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 22, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "'ready'" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# useful constants\n", - "target_path = '/User/mlrun/models'\n", - "# archive = \"https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz\"\n", - "archive = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz\"\n", - "parquet_file = 'higgs.parquet' # the file extension is not necessary\n", - "parquet_file_path = target_path + \"/\" + parquet_file\n", - "artifact_key = 'higgs_large'" + "arctoparq.deploy(skip_deployed=True, with_mlrun=False)" ] }, { - "cell_type": "code", - "execution_count": 19, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "HIGGS_HEADER = ['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi',\n", - " 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt',\n", - " 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv',\n", - " 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']" + "Also note that the build time can be reduced if you specify a pre-built image with all required packages pre-installed." 
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 23, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 14:26:02,799 starting run arc2parq uid=9bfb3ce77e0549f2b1016c128423130e -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 14:26:02,882 Job is running in the background, pod: arc2parq-9t75c\n", + "[mlrun] 2020-01-26 14:26:07,133 destination file does not exist, downloading\n", + "[mlrun] 2020-01-26 14:31:08,548 saved table to /User/mlrun/models/higgs.pqt\n", + "[mlrun] 2020-01-26 14:31:08,577 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:31:08,596 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-26 14:31:08,619 run executed, status=completed\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...23130e
0Jan 26 14:26:07completedarc-to-parquet
host=arc2parq-9t75c
kind=job
owner=admin
archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
header=['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi', 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt', 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
key=higgs
name=higgs.pqt
target_path=/User/mlrun/models
higgs
header
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run 9bfb3ce77e0549f2b1016c128423130e , !mlrun logs 9bfb3ce77e0549f2b1016c128423130e \n", + "[mlrun] 2020-01-26 14:31:13,245 run executed, status=completed\n" + ] + } + ], "source": [ "# create and run the task\n", "arc_to_parq_task = mlrun.NewTask(\n", " 'arc2parq', \n", " handler='arc_to_parquet', \n", " params={\n", - " 'target_path': target_path,\n", - " 'name' : parquet_file, \n", - " 'key' : artifact_key,\n", - " 'archive_url': archive,\n", - " 'header' : HIGGS_HEADER},\n", - " outputs=[artifact_key])\n", + " 'target_path': TARGET_PATH,\n", + " 'name' : FILE_NAME, \n", + " 'key' : KEY,\n", + " 'archive_url': ARCHIVE,\n", + " 'header' : HEADER},\n", + " outputs=[KEY])\n", "\n", "# run\n", - "run = fn.run(arc_to_parq_task)" + "run = arctoparq.run(arc_to_parq_task)" ] }, { @@ -146,7 +381,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ @@ -157,7 +392,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -167,26 +402,26 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 27, "metadata": {}, "outputs": [], "source": [ - "assert artifact_key in run.outputs.keys(), f\"mlrun.functions: key {artifact_key} not found in outputs\"\n", - "assert os.path.isfile(parquet_file_path), f\"mlrun.functions: artifact source not found at {parquet_file_path}\"" + "assert KEY in run.outputs.keys(), f\"mlrun.functions: key {KEY} not found in outputs\"\n", + "assert os.path.isfile(TARGET_PATH+'/'+ FILE_NAME), f\"mlrun.functions: artifact source not found at {TARGET_PATH+'/'+ FILE_NAME}\"" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 28, "metadata": {}, "outputs": 
[], "source": [ - "copied = pd.read_parquet(parquet_file_path, engine=\"pyarrow\")" + "copied = pd.read_parquet(TARGET_PATH+'/'+ FILE_NAME, engine=\"pyarrow\")" ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 29, "metadata": {}, "outputs": [ { @@ -361,25 +596,25 @@ ], "text/plain": [ " labels lepton_pT lepton_eta lepton_phi missing_energy_magnitude \\\n", - "0 1.0 0.869293 -0.635082 0.225690 0.327470 \n", - "1 1.0 0.907542 0.329147 0.359412 1.497970 \n", - "2 1.0 0.798835 1.470639 -1.635975 0.453773 \n", - "3 0.0 1.344385 -0.876626 0.935913 1.992050 \n", - "4 1.0 1.105009 0.321356 1.522401 0.882808 \n", + "0 1.0 0.869293 -0.635082 0.225690 0.327470 \n", + "1 1.0 0.907542 0.329147 0.359412 1.497970 \n", + "2 1.0 0.798835 1.470639 -1.635975 0.453773 \n", + "3 0.0 1.344385 -0.876626 0.935913 1.992050 \n", + "4 1.0 1.105009 0.321356 1.522401 0.882808 \n", "\n", " missing_energy_phi jet_1_pt jet_1_eta jet_1_phi jet_1_b-tag ... \\\n", - "0 -0.689993 0.754202 -0.248573 -1.092064 0.000000 ... \n", - "1 -0.313010 1.095531 -0.557525 -1.588230 2.173076 ... \n", - "2 0.425629 1.104875 1.282322 1.381664 0.000000 ... \n", - "3 0.882454 1.786066 -1.646778 -0.942383 0.000000 ... \n", - "4 -1.205349 0.681466 -1.070464 -0.921871 0.000000 ... \n", + "0 -0.689993 0.754202 -0.248573 -1.092064 0.000000 ... \n", + "1 -0.313010 1.095531 -0.557525 -1.588230 2.173076 ... \n", + "2 0.425629 1.104875 1.282322 1.381664 0.000000 ... \n", + "3 0.882454 1.786066 -1.646778 -0.942383 0.000000 ... \n", + "4 -1.205349 0.681466 -1.070464 -0.921871 0.000000 ... 
\n", "\n", " jet_4_eta jet_4_phi jet_4_b-tag m_jj m_jjj m_lv m_jlv \\\n", - "0 -0.010455 -0.045767 3.101961 1.353760 0.979563 0.978076 0.920005 \n", - "1 -1.138930 -0.000819 0.000000 0.302220 0.833048 0.985700 0.978098 \n", - "2 1.128848 0.900461 0.000000 0.909753 1.108330 0.985692 0.951331 \n", - "3 -0.678379 -1.360356 0.000000 0.946652 1.028704 0.998656 0.728281 \n", - "4 -0.373566 0.113041 0.000000 0.755856 1.361057 0.986610 0.838085 \n", + "0 -0.010455 -0.045767 3.101961 1.353760 0.979563 0.978076 0.920005 \n", + "1 -1.138930 -0.000819 0.000000 0.302220 0.833048 0.985700 0.978098 \n", + "2 1.128848 0.900461 0.000000 0.909753 1.108330 0.985692 0.951331 \n", + "3 -0.678379 -1.360356 0.000000 0.946652 1.028704 0.998656 0.728281 \n", + "4 -0.373566 0.113041 0.000000 0.755856 1.361057 0.986610 0.838085 \n", "\n", " m_bb m_wbb m_wwbb \n", "0 0.721657 0.988751 0.876678 \n", @@ -391,7 +626,7 @@ "[5 rows x 29 columns]" ] }, - "execution_count": 21, + "execution_count": 29, "metadata": {}, "output_type": "execute_result" } @@ -402,7 +637,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 30, "metadata": {}, "outputs": [ { @@ -411,7 +646,7 @@ "(11000000, 29)" ] }, - "execution_count": 22, + "execution_count": 30, "metadata": {}, "output_type": "execute_result" } diff --git a/tests/create_binary_data.ipynb b/tests/create_binary_data.ipynb index 9e35912ae..c2f5fcdd6 100644 --- a/tests/create_binary_data.ipynb +++ b/tests/create_binary_data.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "code", - "execution_count": 1, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ @@ -14,7 +14,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ @@ -22,20 +22,20 @@ "N_SAMPLES = 100_000\n", "M_FEATURES = 28\n", "NEG_WEIGHT = 0.5\n", - "TARGET_DATA_PATH = '/User/mlrun/datagen'\n", + "TARGET_DATA_PATH = '/User/mlrun/models'\n", "KEY = 'simdata'" ] }, { "cell_type": "code", - 
"execution_count": 15, + "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 13:24:11,657 function spec saved to path: /User/repos/functions/datagen/classification/binary.yaml\n" + "[mlrun] 2020-01-26 14:32:06,367 function spec saved to path: /User/repos/functions/datagen/classification/binary.yaml\n" ] } ], @@ -49,7 +49,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -60,16 +60,23 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 24, "metadata": {}, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 14:32:09,075 starting remote build, image: .mlrun/func-default-binary-latest\n" + ] + }, { "data": { "text/plain": [ - "'ready'" + "True" ] }, - "execution_count": 17, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -80,18 +87,18 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 13:24:12,576 starting run create_binary_classification uid=86d601dc16eb4fac90f08974153d0d9d -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 13:24:12,655 Job is running in the background, pod: create-binary-classification-7zlqd\n", - "[mlrun] 2020-01-26 13:24:25,635 log artifact simdata at /User/mlrun/datagen/simdata-1e05X28.parquet, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:32:15,900 starting run create_binary_classification uid=9dd358cd04554c9aa138275c0ec080aa -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 14:32:15,989 Job is running in the background, pod: create-binary-classification-crlf9\n", + "[mlrun] 2020-01-26 14:32:26,865 log artifact simdata at /User/mlrun/models/simdata-1e05X28.parquet, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 13:24:25,648 run executed, status=completed\n", + 
"[mlrun] 2020-01-26 14:32:26,877 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", "final state: succeeded\n" @@ -266,26 +273,26 @@ " \n", " \n", " \n", - "
...3d0d9d
\n", + "
...c080aa
\n", " 0\n", - " Jan 26 13:24:24\n", + " Jan 26 14:32:26\n", " completed\n", " binary\n", - "
host=create-binary-classification-7zlqd
kind=job
owner=admin
\n", + "
host=create-binary-classification-crlf9
kind=job
owner=admin
\n", " \n", - "
key=simdata
m_features=28
n_samples=100000
target_path=/User/mlrun/datagen
weight=0.5
\n", + "
key=simdata
m_features=28
n_samples=100000
target_path=/User/mlrun/models
weight=0.5
\n", " \n", - "
simdata
\n", + "
simdata
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -301,17 +308,17 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 86d601dc16eb4fac90f08974153d0d9d , !mlrun logs 86d601dc16eb4fac90f08974153d0d9d \n", - "[mlrun] 2020-01-26 13:24:31,859 run executed, status=completed\n" + "!mlrun get run 9dd358cd04554c9aa138275c0ec080aa , !mlrun logs 9dd358cd04554c9aa138275c0ec080aa \n", + "[mlrun] 2020-01-26 14:32:35,166 run executed, status=completed\n" ] }, { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 18, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } @@ -337,7 +344,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ diff --git a/tests/test_classifier.ipynb b/tests/test_classifier.ipynb index 91510af1b..bf065d78a 100644 --- a/tests/test_classifier.ipynb +++ b/tests/test_classifier.ipynb @@ -11,7 +11,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ @@ -30,7 +30,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ @@ -38,9 +38,9 @@ "\n", "MODEL_FILE = '/User/mlrun/models/lgb-classifier.pkl'\n", "\n", - "TARGET_DATA_PATH = '/User/mlrun/splitter'\n", - "XTEST_FILE = '/User/mlrun/splitter/xtest.pqt'\n", - "YTEST_FILE = '/User/mlrun/splitter/ytest.pqt'" + "TARGET_DATA_PATH = '/User/mlrun/models'\n", + "XTEST_FILE = '/User/mlrun/models/xtest.pqt'\n", + "YTEST_FILE = '/User/mlrun/models/ytest.pqt'" ] }, { @@ -52,7 +52,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ @@ -69,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -80,7 +80,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 24, "metadata": {}, "outputs": [ { @@ -89,7 
+89,7 @@ "'ready'" ] }, - "execution_count": 5, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -100,16 +100,16 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 6, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } @@ -125,32 +125,22 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 13:19:07,552 starting run test_model uid=b2dff34b7d184eb89aac9e919764e9c4 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 13:19:07,644 Job is running in the background, pod: test-model-4qkft\n", - "[mlrun] 2020-01-26 13:19:17,924 Traceback (most recent call last):\n", - " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", - " val = handler(*args_list)\n", - " File \"main.py\", line 68, in test_model\n", - " ypred_probs = clf.predict_proba(xtest)[:, 1]\n", - " File \"/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py\", line 858, in predict_proba\n", - " pred_leaf, pred_contrib, **kwargs)\n", - " File \"/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py\", line 658, in predict\n", - " % (self._n_features, n_features))\n", - "ValueError: Number of features of the model must match the input. 
Model n_features_ is 20 and input n_features is 28 \n", + "[mlrun] 2020-01-26 14:37:14,256 starting run test_model uid=6f193cffb1714302ae174b06ca9538f6 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 14:37:14,354 Job is running in the background, pod: test-model-x4z5p\n", + "[mlrun] 2020-01-26 14:37:24,867 log artifact roc.html at roc.html, size: 32259, db: Y\n", + "[mlrun] 2020-01-26 14:37:25,080 log artifact confusion_matrix.html at confusion_matrix.html, size: 16188, db: Y\n", + "[mlrun] 2020-01-26 14:37:25,667 log artifact feature-importances-plot.html at feature-importances-plot.html, size: 68260, db: Y\n", + "[mlrun] 2020-01-26 14:37:25,682 log artifact feature-importances-table at /User/mlrun/models/feature-importances-table.csv, size: None, db: Y\n", "\n", - "\n", - "[mlrun] 2020-01-26 13:19:17,941 exec error - Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n", - "[mlrun] 2020-01-26 13:19:17,968 run executed, status=error\n", - "Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n", - "runtime error: Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n", - "final state: failed\n" + "[mlrun] 2020-01-26 14:37:25,739 run executed, status=completed\n", + "final state: succeeded\n" ] }, { @@ -322,26 +312,26 @@ " \n", " \n", " \n", - "
...64e9c4
\n", + "
...9538f6
\n", " 0\n", - " Jan 26 13:19:14\n", - "
error
\n", + " Jan 26 14:37:20\n", + " completed\n", " test-classifier\n", - "
host=test-model-4qkft
kind=job
owner=admin
\n", - " \n", - "
model=/User/mlrun/models/lgb-classifier.pkl
target_path=/User/mlrun/splitter
xtest=/User/mlrun/splitter/xtest.pqt
ytest=/User/mlrun/splitter/ytest.pqt
\n", + "
host=test-model-x4z5p
kind=job
owner=admin
\n", " \n", + "
model=/User/mlrun/models/lgb-classifier.pkl
target_path=/User/mlrun/models
xtest=/User/mlrun/models/xtest.pqt
ytest=/User/mlrun/models/ytest.pqt
\n", " \n", + "
roc.html
confusion_matrix.html
feature-importances-plot.html
feature-importances-table
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -357,22 +347,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run b2dff34b7d184eb89aac9e919764e9c4 , !mlrun logs b2dff34b7d184eb89aac9e919764e9c4 \n", - "[mlrun] 2020-01-26 13:19:26,852 run executed, status=error\n", - "runtime error: Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 \n" - ] - }, - { - "ename": "RunError", - "evalue": "Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 ", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtsk_run\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtestfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhandler\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'test_model'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: 
{}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mRunError\u001b[0m: Number of features of the model must match the input. Model n_features_ is 20 and input n_features is 28 " + "!mlrun get run 6f193cffb1714302ae174b06ca9538f6 , !mlrun logs 6f193cffb1714302ae174b06ca9538f6 \n", + "[mlrun] 2020-01-26 14:37:33,579 run executed, status=completed\n" ] } ], diff --git a/tests/train_classifier.ipynb b/tests/train_classifier.ipynb index f5bc34c61..449aea8e5 100644 --- a/tests/train_classifier.ipynb +++ b/tests/train_classifier.ipynb @@ -24,7 +24,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -43,15 +43,15 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "CODE_BASE = '/User/repos/functions/' \n", "N_SAMPLES = 100_000 # size of HIGGS data\n", - "M_FEATURES = 20\n", + "M_FEATURES = 28\n", "NEG_WEIGHT = 0.5\n", - "TARGET_DATA_PATH = '/User/mlrun/sklearn-classifier'\n", + "TARGET_DATA_PATH = '/User/mlrun/models'\n", "FILE_NAME = 'simdata.pqt'\n", "KEY = 'simdata'\n", "RNG = 1\n", @@ -70,7 +70,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 30, "metadata": {}, "outputs": 
[], "source": [ @@ -81,7 +81,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 31, "metadata": {}, "outputs": [ { @@ -90,7 +90,7 @@ "'ready'" ] }, - "execution_count": 4, + "execution_count": 31, "metadata": {}, "output_type": "execute_result" } @@ -101,16 +101,16 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 5, + "execution_count": 32, "metadata": {}, "output_type": "execute_result" } @@ -129,18 +129,18 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 13:15:51,762 starting run create_binary_classification uid=39417bbf476c45b7a5cb0809e883b979 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 13:15:51,849 Job is running in the background, pod: create-binary-classification-vsdvh\n", - "[mlrun] 2020-01-26 13:16:02,850 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:35:40,509 starting run create_binary_classification uid=245e550ff213469681114228327a8e02 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 14:35:40,606 Job is running in the background, pod: create-binary-classification-7295j\n", + "[mlrun] 2020-01-26 14:35:53,548 log artifact simdata at /User/mlrun/models/simdata.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 13:16:02,862 run executed, status=completed\n", + "[mlrun] 2020-01-26 14:35:53,560 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", "final state: succeeded\n" @@ -315,26 +315,26 @@ " \n", " \n", " \n", - "
...83b979
\n", + "
...7a8e02
\n", " 0\n", - " Jan 26 13:16:02\n", + " Jan 26 14:35:52\n", " completed\n", " binary\n", - "
host=create-binary-classification-vsdvh
kind=job
owner=admin
\n", + "
host=create-binary-classification-7295j
kind=job
owner=admin
\n", " \n", - "
filename=simdata.pqt
key=simdata
m_features=20
n_samples=100000
random_state=1
target_path=/User/mlrun/sklearn-classifier
weight=0.5
\n", + "
filename=simdata.pqt
key=simdata
m_features=28
n_samples=100000
random_state=1
target_path=/User/mlrun/models
weight=0.5
\n", " \n", - "
simdata
\n", + "
simdata
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -350,8 +350,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 39417bbf476c45b7a5cb0809e883b979 , !mlrun logs 39417bbf476c45b7a5cb0809e883b979 \n", - "[mlrun] 2020-01-26 13:16:11,051 run executed, status=completed\n" + "!mlrun get run 245e550ff213469681114228327a8e02 , !mlrun logs 245e550ff213469681114228327a8e02 \n", + "[mlrun] 2020-01-26 14:35:59,827 run executed, status=completed\n" ] } ], @@ -375,7 +375,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 34, "metadata": {}, "outputs": [], "source": [ @@ -386,7 +386,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 35, "metadata": {}, "outputs": [ { @@ -395,7 +395,7 @@ "'ready'" ] }, - "execution_count": 8, + "execution_count": 35, "metadata": {}, "output_type": "execute_result" } @@ -406,16 +406,16 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 9, + "execution_count": 36, "metadata": {}, "output_type": "execute_result" } @@ -431,22 +431,20 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 13:16:11,109 starting run train_valid_test_splitter uid=ecb802dffb1b43269c25f43fd7a4919a -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 13:16:11,191 Job is running in the background, pod: train-valid-test-splitter-7k25p\n", - "[mlrun] 2020-01-26 13:16:21,068 log artifact header at /User/mlrun/sklearn-classifier/header.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-26 13:16:21,156 log artifact xtrain at /User/mlrun/sklearn-classifier/xtrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 13:16:21,220 log artifact xvalid at /User/mlrun/sklearn-classifier/xvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 13:16:21,262 log artifact xtest at 
/User/mlrun/sklearn-classifier/xtest.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 13:16:21,280 log artifact ytrain at /User/mlrun/sklearn-classifier/ytrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 13:16:21,298 log artifact yvalid at /User/mlrun/sklearn-classifier/yvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 13:16:21,312 log artifact ytest at /User/mlrun/sklearn-classifier/ytest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:35:59,880 starting run train_valid_test_splitter uid=907ad4a876fa4205a40a668956446468 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 14:35:59,974 Job is running in the background, pod: train-valid-test-splitter-vdn6h\n", + "[mlrun] 2020-01-26 14:36:09,842 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:36:09,953 log artifact xtrain at /User/mlrun/models/xtrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:36:10,052 log artifact xvalid at /User/mlrun/models/xvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:36:10,104 log artifact xtest at /User/mlrun/models/xtest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:36:10,146 log artifact ytrain at /User/mlrun/models/ytrain.pqt, size: None, db: Y\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", " labels = getattr(columns, 'labels', None) or [\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", @@ -455,8 +453,10 @@ " labels, = index.labels\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. 
To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", + "[mlrun] 2020-01-26 14:36:10,182 log artifact yvalid at /User/mlrun/models/yvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:36:10,211 log artifact ytest at /User/mlrun/models/ytest.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 13:16:21,325 run executed, status=completed\n", + "[mlrun] 2020-01-26 14:36:10,236 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -629,26 +629,26 @@ " \n", " \n", " \n", - "
...a4919a
\n", + "
...446468
\n", " 0\n", - " Jan 26 13:16:20\n", + " Jan 26 14:36:09\n", " completed\n", " train-valid-test\n", - "
host=train-valid-test-splitter-7k25p
kind=job
owner=admin
\n", + "
host=train-valid-test-splitter-vdn6h
kind=job
owner=admin
\n", " \n", - "
random_state=1
sample=20000
src_file=/User/mlrun/sklearn-classifier/simdata.pqt
target_path=/User/mlrun/sklearn-classifier
\n", + "
random_state=1
sample=20000
src_file=/User/mlrun/models/simdata.pqt
target_path=/User/mlrun/models
\n", " \n", - "
header
xtrain
xvalid
xtest
ytrain
yvalid
ytest
\n", + "
header
xtrain
xvalid
xtest
ytrain
yvalid
ytest
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -664,8 +664,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run ecb802dffb1b43269c25f43fd7a4919a , !mlrun logs ecb802dffb1b43269c25f43fd7a4919a \n", - "[mlrun] 2020-01-26 13:16:30,357 run executed, status=completed\n" + "!mlrun get run 907ad4a876fa4205a40a668956446468 , !mlrun logs 907ad4a876fa4205a40a668956446468 \n", + "[mlrun] 2020-01-26 14:36:19,229 run executed, status=completed\n" ] } ], @@ -675,22 +675,22 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'header': '/User/mlrun/sklearn-classifier/header.pkl',\n", - " 'xtrain': '/User/mlrun/sklearn-classifier/xtrain.pqt',\n", - " 'xvalid': '/User/mlrun/sklearn-classifier/xvalid.pqt',\n", - " 'xtest': '/User/mlrun/sklearn-classifier/xtest.pqt',\n", - " 'ytrain': '/User/mlrun/sklearn-classifier/ytrain.pqt',\n", - " 'yvalid': '/User/mlrun/sklearn-classifier/yvalid.pqt',\n", - " 'ytest': '/User/mlrun/sklearn-classifier/ytest.pqt'}" + "{'header': '/User/mlrun/models/header.pkl',\n", + " 'xtrain': '/User/mlrun/models/xtrain.pqt',\n", + " 'xvalid': '/User/mlrun/models/xvalid.pqt',\n", + " 'xtest': '/User/mlrun/models/xtest.pqt',\n", + " 'ytrain': '/User/mlrun/models/ytrain.pqt',\n", + " 'yvalid': '/User/mlrun/models/yvalid.pqt',\n", + " 'ytest': '/User/mlrun/models/ytest.pqt'}" ] }, - "execution_count": 11, + "execution_count": 38, "metadata": {}, "output_type": "execute_result" } @@ -709,23 +709,14 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 39, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-26 13:16:30,608 function spec saved to path: /User/repos/functions/train/sklearn-classifier.yaml\n" - ] - } - ], + "outputs": [], "source": [ "yaml_name = os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml')\n", "if not os.path.isfile(yaml_name):\n", " testfn = 
mlrun.code_to_function(\n", " kind='job', \n", - " image='yjbds/mlrun-ds:latest',\n", " filename=os.path.join(CODE_BASE, 'train', 'sklearn-classifier.py'))\n", " testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", " testfn.export(os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml'))" @@ -733,7 +724,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 40, "metadata": {}, "outputs": [], "source": [ @@ -744,7 +735,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 41, "metadata": {}, "outputs": [ { @@ -753,7 +744,7 @@ "'ready'" ] }, - "execution_count": 14, + "execution_count": 41, "metadata": {}, "output_type": "execute_result" } @@ -764,16 +755,16 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 19, + "execution_count": 42, "metadata": {}, "output_type": "execute_result" } @@ -796,23 +787,23 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 13:17:59,993 starting run train uid=b57510063377418ab0f90b33d14b6117 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 13:18:00,097 Job is running in the background, pod: train-wr4m2\n", + "[mlrun] 2020-01-26 14:36:19,413 starting run train uid=6eb38fb5166b40099e6e579f00c1ad22 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 14:36:19,496 Job is running in the background, pod: train-n4qgh\n", "[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the \"boost_from_average\" parameter in \"binary\" objective is true.\n", "This may cause significantly different results comparing to the previous versions of LightGBM.\n", "Try to set boost_from_average=false, if your old models produce bad results\n", "[LightGBM] [Warning] Cannot change bin_construct_sample_cnt after constructed 
Dataset handle.\n", - "[mlrun] 2020-01-26 13:18:15,384 log artifact training-validation-plot.html at training-validation-plot.html, size: 32700, db: Y\n", - "[mlrun] 2020-01-26 13:18:15,498 log artifact model at /User/mlrun/models/lgb-classifier.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:36:31,442 log artifact training-validation-plot.html at training-validation-plot.html, size: 32968, db: Y\n", + "[mlrun] 2020-01-26 14:36:31,512 log artifact model at /User/mlrun/models/lgb-classifier.pkl, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 13:18:15,519 run executed, status=completed\n", + "[mlrun] 2020-01-26 14:36:31,540 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:268: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", @@ -989,26 +980,26 @@ " \n", " \n", " \n", - "
...4b6117
\n", + "
...c1ad22
\n", " 0\n", - " Jan 26 13:18:08\n", + " Jan 26 14:36:25\n", " completed\n", " sklearn-classifier\n", - "
host=train-wr4m2
kind=job
owner=admin
\n", + "
host=train-n4qgh
kind=job
owner=admin
\n", " \n", - "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=lgb-classifier.pkl
src_file=None
target_path=/User/mlrun/models
verbose=False
xtrain=/User/mlrun/sklearn-classifier/xtrain.pqt
xvalid=/User/mlrun/sklearn-classifier/xvalid.pqt
ytrain=/User/mlrun/sklearn-classifier/ytrain.pqt
yvalid=/User/mlrun/sklearn-classifier/yvalid.pqt
\n", - "
train_accuracy=0.9781481481481481
\n", - "
training-validation-plot.html
model
\n", + "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=lgb-classifier.pkl
src_file=None
target_path=/User/mlrun/models
verbose=False
xtrain=/User/mlrun/models/xtrain.pqt
xvalid=/User/mlrun/models/xvalid.pqt
ytrain=/User/mlrun/models/ytrain.pqt
yvalid=/User/mlrun/models/yvalid.pqt
\n", + "
train_accuracy=0.9856296296296296
\n", + "
training-validation-plot.html
model
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -1024,8 +1015,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run b57510063377418ab0f90b33d14b6117 , !mlrun logs b57510063377418ab0f90b33d14b6117 \n", - "[mlrun] 2020-01-26 13:18:19,266 run executed, status=completed\n" + "!mlrun get run 6eb38fb5166b40099e6e579f00c1ad22 , !mlrun logs 6eb38fb5166b40099e6e579f00c1ad22 \n", + "[mlrun] 2020-01-26 14:36:38,665 run executed, status=completed\n" ] } ], @@ -1035,18 +1026,18 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'train_accuracy': 0.9781481481481481,\n", + "{'train_accuracy': 0.9856296296296296,\n", " 'training-validation-plot.html': 'training-validation-plot.html',\n", " 'model': '/User/mlrun/models/lgb-classifier.pkl'}" ] }, - "execution_count": 22, + "execution_count": 45, "metadata": {}, "output_type": "execute_result" } diff --git a/tests/train_valid_test_split.ipynb b/tests/train_valid_test_split.ipynb index 22b5040a4..6eafca913 100644 --- a/tests/train_valid_test_split.ipynb +++ b/tests/train_valid_test_split.ipynb @@ -9,7 +9,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 30, "metadata": {}, "outputs": [], "source": [ @@ -28,7 +28,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 31, "metadata": {}, "outputs": [], "source": [ @@ -37,7 +37,7 @@ "M_FEATURES = 28\n", "NEG_WEIGHT = 0.5\n", "RNG = 1\n", - "TARGET_DATA_PATH = '/User/mlrun/splitter'\n", + "TARGET_DATA_PATH = '/User/mlrun/models'\n", "SRC_FILE = 'simdata.pqt'\n", "KEY = 'simdata'" ] @@ -51,20 +51,20 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 12:00:09,764 starting run create_binary_classification uid=561adae458ac4df1a16d7ac371d2e450 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 
12:00:09,844 Job is running in the background, pod: create-binary-classification-qqm6z\n", - "[mlrun] 2020-01-26 12:00:22,759 log artifact simdata at /User/mlrun/splitter/simdata.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:33:32,512 starting run create_binary_classification uid=e1d104202c33479eaf866d1276a911d1 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 14:33:32,615 Job is running in the background, pod: create-binary-classification-c7gwh\n", + "[mlrun] 2020-01-26 14:33:43,575 log artifact simdata at /User/mlrun/models/simdata.pqt, size: None, db: Y\n", "\n", + "[mlrun] 2020-01-26 14:33:43,586 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", " result = infer_dtype(pandas_collection)\n", - "[mlrun] 2020-01-26 12:00:22,773 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -237,26 +237,26 @@ " \n", " \n", " \n", - "
...d2e450
\n", + "
...a911d1
\n", " 0\n", - " Jan 26 12:00:21\n", + " Jan 26 14:33:42\n", " completed\n", " binary\n", - "
host=create-binary-classification-qqm6z
kind=job
owner=admin
\n", + "
host=create-binary-classification-c7gwh
kind=job
owner=admin
\n", " \n", - "
filename=/User/mlrun/splitter/simdata.pqt
key=simdata
m_features=28
n_samples=100000
random_state=1
target_path=/User/mlrun/splitter
weight=0.5
\n", + "
filename=/User/mlrun/models/simdata.pqt
key=simdata
m_features=28
n_samples=100000
random_state=1
target_path=/User/mlrun/models
weight=0.5
\n", " \n", - "
simdata
\n", + "
simdata
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -272,8 +272,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 561adae458ac4df1a16d7ac371d2e450 , !mlrun logs 561adae458ac4df1a16d7ac371d2e450 \n", - "[mlrun] 2020-01-26 12:00:29,005 run executed, status=completed\n" + "!mlrun get run e1d104202c33479eaf866d1276a911d1 , !mlrun logs e1d104202c33479eaf866d1276a911d1 \n", + "[mlrun] 2020-01-26 14:33:51,818 run executed, status=completed\n" ] } ], @@ -299,16 +299,16 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'simdata': '/User/mlrun/splitter/simdata.pqt'}" + "{'simdata': '/User/mlrun/models/simdata.pqt'}" ] }, - "execution_count": 16, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } @@ -326,23 +326,14 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 34, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-26 13:01:10,237 function spec saved to path: /User/repos/functions/datagen/splitters/train_valid_test.yaml\n" - ] - } - ], + "outputs": [], "source": [ "yaml_name = os.path.join(CODE_BASE, 'splitters', 'train_valid_test.yaml')\n", "if not os.path.isfile(yaml_name):\n", " testfn = mlrun.code_to_function(\n", " kind='job', \n", - " image='yjbds/mlrun-ds:latest',\n", " filename=os.path.join(CODE_BASE, 'splitters', 'train_valid_test.py'))\n", " testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", " testfn.export(yaml_name)" @@ -350,7 +341,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 35, "metadata": {}, "outputs": [], "source": [ @@ -361,7 +352,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 36, "metadata": {}, "outputs": [ { @@ -370,7 +361,7 @@ "'ready'" ] }, - "execution_count": 19, + "execution_count": 36, "metadata": {}, "output_type": "execute_result" } @@ 
-381,24 +372,24 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 12:00:29,121 starting run train_valid_test_splitter uid=5be904e5c6a14a9ba59e016e212f4499 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 12:00:29,206 Job is running in the background, pod: train-valid-test-splitter-kqfdd\n", - "[mlrun] 2020-01-26 12:00:39,440 log artifact header at /User/mlrun/splitter/header.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-26 12:00:39,753 log artifact xtrain at /User/mlrun/splitter/xtrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 12:00:39,889 log artifact xvalid at /User/mlrun/splitter/xvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 12:00:40,042 log artifact xtest at /User/mlrun/splitter/xtest.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 12:00:40,073 log artifact ytrain at /User/mlrun/splitter/ytrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 12:00:40,100 log artifact yvalid at /User/mlrun/splitter/yvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 12:00:40,124 log artifact ytest at /User/mlrun/splitter/ytest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:34:07,430 starting run train_valid_test_splitter uid=6c61b309093146f0bd97a387d76aa71e -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 14:34:07,521 Job is running in the background, pod: train-valid-test-splitter-vrdv6\n", + "[mlrun] 2020-01-26 14:34:17,884 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:34:18,258 log artifact xtrain at /User/mlrun/models/xtrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:34:18,389 log artifact xvalid at /User/mlrun/models/xvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:34:18,469 log artifact xtest at /User/mlrun/models/xtest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:34:18,496 log artifact ytrain at /User/mlrun/models/ytrain.pqt, 
size: None, db: Y\n", + "[mlrun] 2020-01-26 14:34:18,520 log artifact yvalid at /User/mlrun/models/yvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 14:34:18,536 log artifact ytest at /User/mlrun/models/ytest.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 12:00:40,141 run executed, status=completed\n", + "[mlrun] 2020-01-26 14:34:18,559 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", " labels = getattr(columns, 'labels', None) or [\n", "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", @@ -579,26 +570,26 @@ " \n", " \n", " \n", - "
...2f4499
\n", + "
...6aa71e
\n", " 0\n", - " Jan 26 12:00:39\n", + " Jan 26 14:34:17\n", " completed\n", " train-valid-test\n", - "
host=train-valid-test-splitter-kqfdd
kind=job
owner=admin
\n", + "
host=train-valid-test-splitter-vrdv6
kind=job
owner=admin
\n", " \n", - "
random_state=1
src_file=/User/mlrun/splitter/simdata.pqt
target_path=/User/mlrun/splitter
\n", + "
random_state=1
src_file=/User/mlrun/models/simdata.pqt
target_path=/User/mlrun/models
\n", " \n", - "
header
xtrain
xvalid
xtest
ytrain
yvalid
ytest
\n", + "
header
xtrain
xvalid
xtest
ytrain
yvalid
ytest
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -614,8 +605,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 5be904e5c6a14a9ba59e016e212f4499 , !mlrun logs 5be904e5c6a14a9ba59e016e212f4499 \n", - "[mlrun] 2020-01-26 12:00:48,372 run executed, status=completed\n" + "!mlrun get run 6c61b309093146f0bd97a387d76aa71e , !mlrun logs 6c61b309093146f0bd97a387d76aa71e \n", + "[mlrun] 2020-01-26 14:34:26,698 run executed, status=completed\n" ] } ], @@ -631,22 +622,22 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'header': '/User/mlrun/splitter/header.pkl',\n", - " 'xtrain': '/User/mlrun/splitter/xtrain.pqt',\n", - " 'xvalid': '/User/mlrun/splitter/xvalid.pqt',\n", - " 'xtest': '/User/mlrun/splitter/xtest.pqt',\n", - " 'ytrain': '/User/mlrun/splitter/ytrain.pqt',\n", - " 'yvalid': '/User/mlrun/splitter/yvalid.pqt',\n", - " 'ytest': '/User/mlrun/splitter/ytest.pqt'}" + "{'header': '/User/mlrun/models/header.pkl',\n", + " 'xtrain': '/User/mlrun/models/xtrain.pqt',\n", + " 'xvalid': '/User/mlrun/models/xvalid.pqt',\n", + " 'xtest': '/User/mlrun/models/xtest.pqt',\n", + " 'ytrain': '/User/mlrun/models/ytrain.pqt',\n", + " 'yvalid': '/User/mlrun/models/yvalid.pqt',\n", + " 'ytest': '/User/mlrun/models/ytest.pqt'}" ] }, - "execution_count": 21, + "execution_count": 44, "metadata": {}, "output_type": "execute_result" } @@ -664,7 +655,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 45, "metadata": {}, "outputs": [], "source": [ @@ -673,7 +664,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 46, "metadata": {}, "outputs": [], "source": [ @@ -689,7 +680,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 47, "metadata": {}, "outputs": [], "source": [ @@ -702,7 +693,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 48, "metadata": {}, "outputs": [], "source": [ 
@@ -711,7 +702,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 49, "metadata": {}, "outputs": [], "source": [ From 98867b1abd4157fa17cfe7cc932e09dcf3f047f5 Mon Sep 17 00:00:00 2001 From: yasha Date: Sun, 26 Jan 2020 19:51:00 +0000 Subject: [PATCH 20/32] resolved pyarrow versions issue --- datagen/splitters/train_valid_test.py | 17 +- datagen/splitters/train_valid_test.yaml | 7 +- evaluation/test-classifier.py | 5 +- evaluation/test-classifier.yaml | 4 +- fileutils/arc_to_parquet/arc_to_parquet.py | 12 + fileutils/arc_to_parquet/arc_to_parquet.yaml | 6 +- tests/arc_to_parquet.ipynb | 89 +- tests/test_classifier.ipynb | 78 +- tests/train_classifier.ipynb | 873 ++++--------------- tests/train_valid_test_split.ipynb | 534 +++++------- train/sklearn-classifier.yaml | 5 +- 11 files changed, 508 insertions(+), 1122 deletions(-) diff --git a/datagen/splitters/train_valid_test.py b/datagen/splitters/train_valid_test.py index dc1deb5e3..9a8976331 100644 --- a/datagen/splitters/train_valid_test.py +++ b/datagen/splitters/train_valid_test.py @@ -5,11 +5,17 @@ import pyarrow as pa from cloudpickle import dump +import pyarrow.parquet as pq +import pyarrow as pa + from sklearn.model_selection import train_test_split from typing import Optional, Union from mlrun.execution import MLClientCtx from mlrun.datastore import DataItem +import warnings +warnings.simplefilter(action='ignore', category=FutureWarning) + def train_valid_test_splitter( context: Optional[MLClientCtx] = None, src_file: Union[DataItem, str] = '', @@ -40,19 +46,18 @@ def train_valid_test_splitter( :param key: key for model artifact :param random_state: (1) sklearn rng seed """ - if isinstance(src_file, DataItem): - src_file = str(src_file) - srcfilepath = os.path.join(target_path, src_file) + srcfilepath = os.path.join(target_path, str(src_file)) if (sample == -1) or (sample >= 1): # get all rows, or contiguous sample starting at row 1. 
- raw = pd.read_parquet(srcfilepath, engine='pyarrow') + raw = pq.read_table(srcfilepath).to_pandas() labels = raw.pop(label_column) raw = raw.iloc[:sample, :] labels = labels.iloc[:sample] else: # grab a random sample - raw = pd.read_parquet(srcfilepath, engine='pyarrow').sample(sample*-1) + #raw = pd.read_parquet(srcfilepath, engine='pyarrow').sample(sample*-1) + raw = pq.read_table(srcfilepath).to_pandas().sample(sample*-1) labels = raw.pop(label_column) # double split tp generate 3 data sets: train, validation and test @@ -94,4 +99,4 @@ def train_valid_test_splitter( f = os.path.join(target_path, name + 'ytest.pqt') pd.DataFrame({'labels': ytest}).to_parquet(f) - context.log_artifact('ytest', target_path=f) \ No newline at end of file + context.log_artifact('ytest', target_path=f) diff --git a/datagen/splitters/train_valid_test.yaml b/datagen/splitters/train_valid_test.yaml index 2df2f0e88..1c1603e3c 100644 --- a/datagen/splitters/train_valid_test.yaml +++ b/datagen/splitters/train_valid_test.yaml @@ -2,18 +2,17 @@ kind: job metadata: name: train-valid-test tag: '' - hash: 877224f2dd10beff5f7e9cb9b4821a8685aab9db + hash: a20a8322b51297f4491727c3a2beb3b3ec505999 project: '' spec: command: '' args: [] - image: yjbds/mlrun-ds:latest volumes: [] volume_mounts: [] env: [] description: '' build: - functionSourceCode: 
aW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgZHVtcAoKZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdApmcm9tIHR5cGluZyBpbXBvcnQgT3B0aW9uYWwsIFVuaW9uCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KCmRlZiB0cmFpbl92YWxpZF90ZXN0X3NwbGl0dGVyKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdID0gTm9uZSwKICAgIHNyY19maWxlOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgaGVhZGVyOiBVbmlvbltEYXRhSXRlbSwgc3RyLCBsaXN0XSA9ICcnLAogICAgc2FtcGxlOiBpbnQgPSAtMSwKICAgIGxhYmVsX2NvbHVtbjogc3RyID0gJ2xhYmVscycsCiAgICB0ZXN0X3NpemU6IGZsb2F0ID0gMC4xLAogICAgdHJhaW5fdmFsX3NwbGl0OiBmbG9hdCA9IDAuNzUsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAnJywKICAgIGtleTogc3RyID0gJycsCiAgICByYW5kb21fc3RhdGUgPSAxCikgLT4gTm9uZToKICAgICIiIlNwbGl0IHJhdyBkYXRhIGlucHV0IGludG8gdHJhaW4sIHZhbGlkYXRpb24gYW5kIHRlc3Qgc2V0cy4KCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIHNyY19maWxlOiAgICAgICAgKCdyYXcnKSBuYW1lIG9mIHJhdyBkYXRhIGZpbGUKICAgIDpwYXJhbSBoZWFkZXI6ICAgICAgICAgIChOb25lKSBoZWFkZXIgYXJ0aWZhY3Qgb3IgbGlzdCBvZiBjb2x1bW4gbmFtZXMuCiAgICA6cGFyYW0gc2FtcGxlOiAgICAgICAgICAoLTEpLiBTZWxlY3RzIHRoZSBmaXJzdCBuIHJvd3MsIG9yIHNlbGVjdCBhIHNhbXBsZSBzdGFydGluZwogICAgICAgICAgICAgICAgICAgICAgICAgICAgZnJvbSB0aGUgZmlyc3QuIElmIG5lZ2F0aXZlIDwtMSwgc2VsZWN0IGEgcmFuZG9tIHNhbXBsZSBmcm9tIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgdGhlIGVudGlyZSBmaWxlCiAgICA6cGFyYW0gbGFiZWxfY29sdW1uOiAgICBncm91bmQtdHJ1dGggKHkpIGxhYmVscwogICAgOnBhcmFtIHRlc3Rfc2l6ZTogICAgICAgKDAuMSkgdGVzdCBzZXQgc2l6ZQogICAgOnBhcmFtIHRyYWluX3ZhbF9zcGxpdDogKDAuNzUpIE9uY2UgdGhlIHRlc3Qgc2V0IGhhcyBiZWVuIHJlbW92ZWQgdGhlIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgdHJhaW5pbmcgc2V0IGdldHMgdGhpcyBwcm9wb3J0aW9uLgogICAgOnBhcmFtIHRhcmdldF9wYXRoOiAgICAgZm9sZGVyIGxvY2F0aW9uIG9mIGZpbGVzCiAgICA6cGFyYW0gbmFtZTogICAgICAgICAgICBkZXN0aW5hdGlvbiBwcmVmaXggbmFtZSBmb3IgbW9kZWwgZmls
ZXMKICAgIDpwYXJhbSBrZXk6ICAgICAgICAgICAgIGtleSBmb3IgbW9kZWwgYXJ0aWZhY3QKICAgIDpwYXJhbSByYW5kb21fc3RhdGU6ICAgICgxKSBza2xlYXJuIHJuZyBzZWVkCiAgICAiIiIKICAgIGlmIGlzaW5zdGFuY2Uoc3JjX2ZpbGUsIERhdGFJdGVtKToKICAgICAgICBzcmNfZmlsZSA9IHN0cihzcmNfZmlsZSkKICAgIHNyY2ZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBzcmNfZmlsZSkKCiAgICBpZiAoc2FtcGxlID09IC0xKSBvciAoc2FtcGxlID49IDEpOgogICAgICAgICMgZ2V0IGFsbCByb3dzLCBvciBjb250aWd1b3VzIHNhbXBsZSBzdGFydGluZyBhdCByb3cgMS4KICAgICAgICByYXcgPSBwZC5yZWFkX3BhcnF1ZXQoc3JjZmlsZXBhdGgsIGVuZ2luZT0ncHlhcnJvdycpCiAgICAgICAgbGFiZWxzID0gcmF3LnBvcChsYWJlbF9jb2x1bW4pCiAgICAgICAgcmF3ID0gcmF3Lmlsb2NbOnNhbXBsZSwgOl0KICAgICAgICBsYWJlbHMgPSBsYWJlbHMuaWxvY1s6c2FtcGxlXQogICAgZWxzZToKICAgICAgICAjIGdyYWIgYSByYW5kb20gc2FtcGxlCiAgICAgICAgcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKS5zYW1wbGUoc2FtcGxlKi0xKQogICAgICAgIGxhYmVscyA9IHJhdy5wb3AobGFiZWxfY29sdW1uKQogICAgCiAgICAjIGRvdWJsZSBzcGxpdCB0cCBnZW5lcmF0ZSAzIGRhdGEgc2V0czogdHJhaW4sIHZhbGlkYXRpb24gYW5kIHRlc3QKICAgIHgsIHh0ZXN0LCB5LCB5dGVzdCA9IHRyYWluX3Rlc3Rfc3BsaXQocmF3LCBsYWJlbHMsIHRlc3Rfc2l6ZT10ZXN0X3NpemUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKQogICAKICAgIHh0cmFpbiwgeHZhbGlkLCB5dHJhaW4sIHl2YWxpZCA9IHRyYWluX3Rlc3Rfc3BsaXQoeCwgeSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHRyYWluX3NpemU9dHJhaW5fdmFsX3NwbGl0LCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgcmFuZG9tX3N0YXRlPXJhbmRvbV9zdGF0ZSkgICAgICAgIAoKICAgIGlmIG5hbWU6CiAgICAgICAgbmFtZSA9ICctJyArIG5hbWUKICAgIAogICAgIyBzYXZlIGhlYWRlcgogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICdoZWFkZXIucGtsJykKICAgIGR1bXAocmF3LmNvbHVtbnMudmFsdWVzLCBvcGVuKGYsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgICMgc2F2ZSBkYXRhIHNldHMKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHRyYWluLnBxdCcpCiAgICB4dHJhaW4udG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h0cmFpbicsIHRhcmdldF9wYXRoPWYp
CiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHZhbGlkLnBxdCcpCiAgICB4dmFsaWQudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h2YWxpZCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneHRlc3QucHF0JykKICAgIHh0ZXN0LnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd4dGVzdCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneXRyYWluLnBxdCcpCiAgICBwZC5EYXRhRnJhbWUoeydsYWJlbHMnOiB5dHJhaW59KS50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneXRyYWluJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd5dmFsaWQucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl2YWxpZH0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dmFsaWQnLCB0YXJnZXRfcGF0aD1mKQogICAgCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ3l0ZXN0LnBxdCcpCiAgICBwZC5EYXRhRnJhbWUoeydsYWJlbHMnOiB5dGVzdH0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dGVzdCcsIHRhcmdldF9wYXRoPWYp + functionSourceCode: 
aW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgZHVtcAoKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQoKZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdApmcm9tIHR5cGluZyBpbXBvcnQgT3B0aW9uYWwsIFVuaW9uCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KCmltcG9ydCB3YXJuaW5ncwp3YXJuaW5ncy5zaW1wbGVmaWx0ZXIoYWN0aW9uPSdpZ25vcmUnLCBjYXRlZ29yeT1GdXR1cmVXYXJuaW5nKQoKZGVmIHRyYWluX3ZhbGlkX3Rlc3Rfc3BsaXR0ZXIoCiAgICBjb250ZXh0OiBPcHRpb25hbFtNTENsaWVudEN0eF0gPSBOb25lLAogICAgc3JjX2ZpbGU6IFVuaW9uW0RhdGFJdGVtLCBzdHJdID0gJycsCiAgICBoZWFkZXI6IFVuaW9uW0RhdGFJdGVtLCBzdHIsIGxpc3RdID0gJycsCiAgICBzYW1wbGU6IGludCA9IC0xLAogICAgbGFiZWxfY29sdW1uOiBzdHIgPSAnbGFiZWxzJywKICAgIHRlc3Rfc2l6ZTogZmxvYXQgPSAwLjEsCiAgICB0cmFpbl92YWxfc3BsaXQ6IGZsb2F0ID0gMC43NSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG5hbWU6IHN0ciA9ICcnLAogICAga2V5OiBzdHIgPSAnJywKICAgIHJhbmRvbV9zdGF0ZSA9IDEKKSAtPiBOb25lOgogICAgIiIiU3BsaXQgcmF3IGRhdGEgaW5wdXQgaW50byB0cmFpbiwgdmFsaWRhdGlvbiBhbmQgdGVzdCBzZXRzLgoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIHRoZSBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gc3JjX2ZpbGU6ICAgICAgICAoJ3JhdycpIG5hbWUgb2YgcmF3IGRhdGEgZmlsZQogICAgOnBhcmFtIGhlYWRlcjogICAgICAgICAgKE5vbmUpIGhlYWRlciBhcnRpZmFjdCBvciBsaXN0IG9mIGNvbHVtbiBuYW1lcy4KICAgIDpwYXJhbSBzYW1wbGU6ICAgICAgICAgICgtMSkuIFNlbGVjdHMgdGhlIGZpcnN0IG4gcm93cywgb3Igc2VsZWN0IGEgc2FtcGxlIHN0YXJ0aW5nCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBmcm9tIHRoZSBmaXJzdC4gSWYgbmVnYXRpdmUgPC0xLCBzZWxlY3QgYSByYW5kb20gc2FtcGxlIGZyb20gCiAgICAgICAgICAgICAgICAgICAgICAgICAgICB0aGUgZW50aXJlIGZpbGUKICAgIDpwYXJhbSBsYWJlbF9jb2x1bW46ICAgIGdyb3VuZC10cnV0aCAoeSkgbGFiZWxzCiAgICA6cGFyYW0gdGVzdF9zaXplOiAgICAgICAoMC4xKSB0ZXN0IHNldCBzaXplCiAgICA6cGFyYW0gdHJhaW5fdmFsX3NwbGl0OiAoMC43NSkgT25jZSB0aGUgdGVzdCBzZXQgaGFzIGJlZW4gcmVtb3ZlZCB0aGUgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICB0cmFpbmluZyBzZXQgZ2V0cyB0aGlz
IHByb3BvcnRpb24uCiAgICA6cGFyYW0gdGFyZ2V0X3BhdGg6ICAgICBmb2xkZXIgbG9jYXRpb24gb2YgZmlsZXMKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgICAgIGRlc3RpbmF0aW9uIHByZWZpeCBuYW1lIGZvciBtb2RlbCBmaWxlcwogICAgOnBhcmFtIGtleTogICAgICAgICAgICAga2V5IGZvciBtb2RlbCBhcnRpZmFjdAogICAgOnBhcmFtIHJhbmRvbV9zdGF0ZTogICAgKDEpIHNrbGVhcm4gcm5nIHNlZWQKICAgICIiIgogICAgc3JjZmlsZXBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIHN0cihzcmNfZmlsZSkpCgogICAgaWYgKHNhbXBsZSA9PSAtMSkgb3IgKHNhbXBsZSA+PSAxKToKICAgICAgICAjIGdldCBhbGwgcm93cywgb3IgY29udGlndW91cyBzYW1wbGUgc3RhcnRpbmcgYXQgcm93IDEuCiAgICAgICAgcmF3ID0gcHEucmVhZF90YWJsZShzcmNmaWxlcGF0aCkudG9fcGFuZGFzKCkKICAgICAgICBsYWJlbHMgPSByYXcucG9wKGxhYmVsX2NvbHVtbikKICAgICAgICByYXcgPSByYXcuaWxvY1s6c2FtcGxlLCA6XQogICAgICAgIGxhYmVscyA9IGxhYmVscy5pbG9jWzpzYW1wbGVdCiAgICBlbHNlOgogICAgICAgICMgZ3JhYiBhIHJhbmRvbSBzYW1wbGUKICAgICAgICAjcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKS5zYW1wbGUoc2FtcGxlKi0xKQogICAgICAgIHJhdyA9IHBxLnJlYWRfdGFibGUoc3JjZmlsZXBhdGgpLnRvX3BhbmRhcygpLnNhbXBsZShzYW1wbGUqLTEpCiAgICAgICAgbGFiZWxzID0gcmF3LnBvcChsYWJlbF9jb2x1bW4pCiAgICAKICAgICMgZG91YmxlIHNwbGl0IHRwIGdlbmVyYXRlIDMgZGF0YSBzZXRzOiB0cmFpbiwgdmFsaWRhdGlvbiBhbmQgdGVzdAogICAgeCwgeHRlc3QsIHksIHl0ZXN0ID0gdHJhaW5fdGVzdF9zcGxpdChyYXcsIGxhYmVscywgdGVzdF9zaXplPXRlc3Rfc2l6ZSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHJhbmRvbV9zdGF0ZT1yYW5kb21fc3RhdGUpCiAgIAogICAgeHRyYWluLCB4dmFsaWQsIHl0cmFpbiwgeXZhbGlkID0gdHJhaW5fdGVzdF9zcGxpdCh4LCB5LCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgdHJhaW5fc2l6ZT10cmFpbl92YWxfc3BsaXQsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKSAgICAgICAgCgogICAgaWYgbmFtZToKICAgICAgICBuYW1lID0gJy0nICsgbmFtZQogICAgCiAgICAjIHNhdmUgaGVhZGVyCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ2hlYWRlci5wa2wnKQogICAgZHVtcChyYXcuY29sdW1ucy52YWx1ZXMsIG9wZW4oZiwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgIyBzYXZlIGRhdGEgc2V0cwog
ICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dHJhaW4ucHF0JykKICAgIHh0cmFpbi50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneHRyYWluJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dmFsaWQucHF0JykKICAgIHh2YWxpZC50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneHZhbGlkJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dGVzdC5wcXQnKQogICAgeHRlc3QudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h0ZXN0JywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd5dHJhaW4ucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl0cmFpbn0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dHJhaW4nLCB0YXJnZXRfcGF0aD1mKQogICAgCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ3l2YWxpZC5wcXQnKQogICAgcGQuRGF0YUZyYW1lKHsnbGFiZWxzJzogeXZhbGlkfSkudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3l2YWxpZCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneXRlc3QucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl0ZXN0fSkudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3l0ZXN0JywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgY29udGV4dC5sb2dnZXIuaW5mbygnbnVtcHknLCBucC5fX3ZlcnNpb25fXykKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ3BhbmRhcyAnLCBwZC5fX3ZlcnNpb25fXykKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ3B5YXJyb3cnLCBwYS5fX3ZlcnNpb25fXyk= base_image: yjbds/mlrun-ds:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#25e611e4bd05320d342708ce786522bfecaa0e51:/User/repos/functions/datagen/splitters/train_valid_test.py + code_origin: https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/datagen/splitters/train_valid_test.py diff --git a/evaluation/test-classifier.py b/evaluation/test-classifier.py index 82e693f95..80a71ad03 100644 --- a/evaluation/test-classifier.py +++ b/evaluation/test-classifier.py @@ -52,10 +52,7 @@ def test_model( :param key: key for model artifact """ # load 
model and data - if isinstance(model, DataItem): - clf = load(open(str(model), 'rb')) - else: - clf = load(open(model, 'rb')) + clf = load(open(str(model), 'rb')) if isinstance(xtest, DataItem): xtest = pd.read_parquet(str(xtest)) diff --git a/evaluation/test-classifier.yaml b/evaluation/test-classifier.yaml index eb865089f..893ec30e4 100644 --- a/evaluation/test-classifier.yaml +++ b/evaluation/test-classifier.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: test-classifier tag: '' - hash: c768dc0c66298e9a0ebe79a118698713fc84cfbd + hash: 2946ecad9f24c488575cc7b4476528df09027080 project: '' spec: command: '' @@ -16,4 +16,4 @@ spec: functionSourceCode: aW1wb3J0IG9zCmltcG9ydCBpbXBvcnRsaWIKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgbG9hZAoKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IGxpZ2h0Z2JtIGFzIGxnYgoKZnJvbSBza2xlYXJuLm1ldHJpY3MgaW1wb3J0IChyb2NfY3VydmUsIGNvbmZ1c2lvbl9tYXRyaXgpCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQKCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQKZnJvbSBtYXRwbG90bGliLmZpZ3VyZSBpbXBvcnQgRmlndXJlCmltcG9ydCBzZWFib3JuIGFzIHNucwoKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbiwgTGlzdAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gbWxydW4uZGF0YXN0b3JlIGltcG9ydCBEYXRhSXRlbQpmcm9tIG1scnVuLmFydGlmYWN0cyBpbXBvcnQgVGFibGVBcnRpZmFjdCwgUGxvdEFydGlmYWN0CgppbXBvcnQgd2FybmluZ3MKd2FybmluZ3Muc2ltcGxlZmlsdGVyKGFjdGlvbj0naWdub3JlJywgY2F0ZWdvcnk9RnV0dXJlV2FybmluZykKCmRlZiB0ZXN0X21vZGVsKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdLAogICAgbW9kZWw6IFVuaW9uW0RhdGFJdGVtLCBzdHJdLAogICAgeHRlc3QsIAogICAgeXRlc3QsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAnJywKICAgIGtleTogc3RyID0gJycsCiAgICByYW5kb21fc3RhdGUgPSAxCikgLT4gTm9uZToKICAgICIiIlRlc3QgYSBjbGFzc2lmaWVyIG1vZGVsCiAgICAKICAgIFVzaW5nIGhlbGQtb3V0IHRlc3QgZmVhdHVyZXMsIGNhbGxzIGBtb2RlbC5wcmVkaWN0KHh0ZXN0KWAgYW5kIGV2YWx1YXRlcyB0aGUgYWNjdXJhY3kgb2YgdGhlIAogICAgZXN0aW1hdGVkIG1vZGVsLgogICAgCiAgICBDYW4gYmUgcGFydCBvZiBhIGt1YmVmbG93IHBpcGVsaW5lIGFzIGEgdGVzdCBzdGVwIG9yIGNhbG
xlZAogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIG1vZGVsOiAgICAgICAgICAgZXN0aW1hdGVkIG1vZGVsIGZpbGUgbmFtZSBhcyBhcnRpZmFjdCBzdG9yZSBpdGVtCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBvciBwaWNrbGUgZmlsZSBuYW1lCiAgICA6cGFyYW0geHRlc3Q6ICAgICAgICAgICB0ZXN0IGZlYXR1cmVzIGZpbGUgbmFtZSBhcyBhcnRpZmFjdCBzdG9yZSBpdGVtCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBvciBwaWNrbGUgZmlsZSBuYW1lCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgICAgICAoT3B0aW9uYWwpIHVzZSBpZiB4dGVzdCBkb2VzIG5vdCBoYXZlIGEgaGVhZGVyCiAgICA6cGFyYW0geXRlc3Q6ICAgICAgICAgICB0ZXN0IGxhYmVscyBmaWxlIG5hbWUgYXMgYXJ0aWZhY3Qgc3RvcmUgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBpdGVtIG9yIHBpY2tsZSBmaWxlIG5hbWUKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGZvbGRlciBsb2NhdGlvbiBvZiBmaWxlcwogICAgOnBhcmFtIG5hbWU6ICAgICAgICAgICAgZGVzdGluYXRpb24gbmFtZSBmb3IgdGVzdCByZXN1bHRzCiAgICA6cGFyYW0ga2V5OiAgICAgICAgICAgICBrZXkgZm9yIG1vZGVsIGFydGlmYWN0CiAgICAiIiIKICAgICMgbG9hZCBtb2RlbCBhbmQgZGF0YQogICAgaWYgaXNpbnN0YW5jZShtb2RlbCwgRGF0YUl0ZW0pOgogICAgICAgIGNsZiA9IGxvYWQob3BlbihzdHIobW9kZWwpLCAncmInKSkKICAgIGVsc2U6CiAgICAgICAgY2xmID0gbG9hZChvcGVuKG1vZGVsLCAncmInKSkKCiAgICBpZiBpc2luc3RhbmNlKHh0ZXN0LCBEYXRhSXRlbSk6CiAgICAgICAgeHRlc3QgPSBwZC5yZWFkX3BhcnF1ZXQoc3RyKHh0ZXN0KSkKICAgICAgICB5dGVzdCA9IHBkLnJlYWRfcGFycXVldChzdHIoeXRlc3QpKQogICAgZWxzZToKICAgICAgICB4dGVzdCA9IHBkLnJlYWRfcGFycXVldCh4dGVzdCkKICAgICAgICB5dGVzdCA9IHBkLnJlYWRfcGFycXVldCh5dGVzdCkKICAgIAogICAgaWYgY2FsbGFibGUoZ2V0YXR0cihjbGYsICdwcmVkaWN0X3Byb2JhJykpOgogICAgICAgIHlwcmVkX3Byb2JzID0gY2xmLnByZWRpY3RfcHJvYmEoeHRlc3QpWzosIDFdCiAgICAgICAgeXByZWQgPSBucC53aGVyZSh5cHJlZF9wcm9icyA+PSAwLjUsIDEsIDApCiAgICAgICAgcGxvdF9yb2MoY29udGV4dCwgeXRlc3QsIHlwcmVkX3Byb2JzLCB0YXJnZXRfcGF0aCkKICAgIGVsc2U6CiAgICAgICAgeXByZWQgPSBjbGYucHJlZGljdCh4dGVzdCkKICAgICAgICB5cHJlZF9wcm9icyA9IE5vbmUKICAgIAogICAgcGxvdF9jb25mdXNpb25fbWF0cml4KGNvbnRleHQsIHl0ZXN0LCB5cHJlZCwgdGFyZ2V0X3BhdGgpCgogICAgaWYgaGFzYXR0cihjbGYsICdmZWF0dXJlX2ltcG9ydGFuY2VzXycpOgogICAgICAgIHBsb3RfaW1wb3J0YW5jZShjb250ZXh0LCBjbGYsIHh0ZXN0LmNvbHVtbnMudmFsdW
VzLCB0YXJnZXRfcGF0aCkKCmRlZiBfZ2NmX2NsZWFyKHBsdCk6CiAgICBwbHQuY2xhKCkKICAgIHBsdC5jbGYoKQogICAgcGx0LmNsb3NlKCkgICAgICAgIAoKZGVmIHBsb3Rfcm9jKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgeV9sYWJlbHMsCiAgICB5X3Byb2JzLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICcnLAogICAgbmFtZT0ncm9jLnBuZycsCiAgICBrZXk9J3JvYycsCiAgICBmbXQ9J3BuZycKKToKICAgICIiIlBsb3QgYW4gUk9DIGN1cnZlIGZyb20gdGVzdCBkYXRhIHNhdmVkIGluIGFuIGFydGlmYWN0IHN0b3JlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0geV9sYWJlbHM6ICAgICAgICB0ZXN0IGRhdGEgbGFiZWxzCiAgICA6cGFyYW0geV9wcm9iczogICAgICAgICB0ZXN0IGRhdGEgCiAgICAiIiIKICAgIGZwcl94ZywgdHByX3hnLCBfID0gcm9jX2N1cnZlKHlfbGFiZWxzLCB5X3Byb2JzKQogICAgcGx0LnBsb3QoWzAsIDFdLCBbMCwgMV0sICJrLS0iKQogICAgcGx0LnBsb3QoZnByX3hnLCB0cHJfeGcsIGxhYmVsPSJyb2MiKQogICAgcGx0LnhsYWJlbCgiZmFsc2UgcG9zaXRpdmUgcmF0ZSIpCiAgICBwbHQueWxhYmVsKCJ0cnVlIHBvc2l0aXZlIHJhdGUiKQogICAgcGx0LnRpdGxlKCJyb2MgY3VydmUiKQogICAgcGx0LmxlZ2VuZChsb2M9ImJlc3QiKQogICAgZmlnID0gcGx0LmdjZigpCgogICAgcGxvdHBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBmaWcuc2F2ZWZpZyhwbG90cGF0aCwgZm9ybWF0PWZtdCkKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdChrZXksIGJvZHk9ZmlnKSkKCiAgICBfZ2NmX2NsZWFyKHBsdCkKCmRlZiBwbG90X2NvbmZ1c2lvbl9tYXRyaXgoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwgCiAgICBsYWJlbHMsIAogICAgcHJlZGljdGlvbnMsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsIAogICAgbmFtZTogc3RyID0iY29uZnVzaW9uLnBuZyIsIAogICAga2V5OiBzdHIgPSdjb25mdXNpb25fbWF0cml4JywKICAgIGZtdDogc3RyID0gJ3BuZycKKToKICAgICIiIkNyZWF0ZSBhIGNvbmZ1c2lvbiBtYXRyaXguCiAgICBQbG90IGFuZCBzYXZlIGEgY29uZnVzaW9uIG1hdHJpeCB1c2luZyB0ZXN0IGRhdGEgZnJvbSBhCiAgICBwaXBlbGluZSBzdGVwLgoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBsYWJlbHM6ICAgICAgICAgIHRlc3QgZGF0YSBsYWJlbHMKICAgIDpwYXJhbSBwcmVkaWN0aW9uczogICAgIHRlc3QgZGF0YSBwcmVkaWN0aW9ucwogICAgIiIiCiAgICBjbSA9IGNvbmZ1c2lvbl9tYXRyaXgobGFiZWxzLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgcHJlZGljdGlvbnMsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBzYW1wbGVfd2VpZ2h0PU5vbmUsCiAgICAgICAgICAgICAgICAgICAgIC
AgICAgICBub3JtYWxpemU9J2FsbCcpCiAgICBzbnMuaGVhdG1hcChjbSwgYW5ub3Q9VHJ1ZSwgY21hcD0iQmx1ZXMiKQogICAgcGxvdHBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBmaWcgPSBwbHQuZ2NmKCkKICAgIGZpZy5zYXZlZmlnKHBsb3RwYXRoLCBmb3JtYXQ9Zm10KQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KGtleSwgYm9keT1maWcpKQoKICAgIF9nY2ZfY2xlYXIocGx0KQoKZGVmIHBsb3RfaW1wb3J0YW5jZSgKICAgIGNvbnRleHQsCiAgICBtb2RlbCwKICAgIGhlYWRlcjogTGlzdCA9IFtdLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICcnLAogICAgbmFtZTogc3RyID0gJ2ZlYXR1cmUtaW1wb3J0YW5jZXMucG5nJywKICAgIGtleTogc3RyID0gJ2ZlYXR1cmUtaW1wb3J0YW5jZXMnLAogICAgZm10ID0gJ3BuZycKKToKICAgICIiIkRpc3BsYXkgZXN0aW1hdGVkIGZlYXR1cmUgaW1wb3J0YW5jZXMuCgogICAgOnBhcmFtIGNvbnRleHQ6ICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbW9kZWw6ICAgICAgIGZpdHRlZCBsaWdodGdibSBtb2RlbAogICAgOnBhcmFtIGhlYWRlcjogICAgICBsaXN0IG9mIGZlYXR1cmUgbmFtZXMKICAgICIiIgogICAgIyBjcmVhdGUgYSBmZWF0dXJlIGltcG9ydGFuY2UgdGFibGUgd2l0aCBkZXNpcmVkIGxhYmVscwogICAgemlwcGVkID0gemlwKG1vZGVsLmZlYXR1cmVfaW1wb3J0YW5jZXNfLCBoZWFkZXIpCgogICAgZmVhdHVyZV9pbXAgPSBwZC5EYXRhRnJhbWUoc29ydGVkKHppcHBlZCksIGNvbHVtbnM9WydmcmVxJywnZmVhdHVyZSddCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgKS5zb3J0X3ZhbHVlcyhieT0iZnJlcSIsIGFzY2VuZGluZz1GYWxzZSkKCiAgICBwbHQuZmlndXJlKGZpZ3NpemU9KDIwLCAxMCkpCiAgICBzbnMuYmFycGxvdCh4PSJmcmVxIiwgeT0iZmVhdHVyZSIsIGRhdGE9ZmVhdHVyZV9pbXApCiAgICBwbHQudGl0bGUoJ0xpZ2h0R0JNIEZlYXR1cmVzJykKICAgIHBsdC50aWdodF9sYXlvdXQoKQogICAgZmlnID0gcGx0LmdjZigpCiAgICBwbG90cGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSkKICAgIGZpZy5zYXZlZmlnKHBsb3RwYXRoLCBmb3JtYXQ9J3BuZycpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3Qoa2V5ICsgJy1wbG90JywgYm9keT1maWcpKQoKICAgICMgZmVhdHVyZSBpbXBvcnRhbmNlcyBhcmUgYWxzbyBzYXZlZCBhcyBhIHRhYmxlOgogICAgdGFibGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBrZXkgKyAnLXRhYmxlLmNzdicpCiAgICBmZWF0dXJlX2ltcC50b19jc3YodGFibGVwYXRoKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoVGFibGVBcnRpZmFjdChrZXkgKyAnLXRhYmxlJywgdGFyZ2V0X3BhdGg9dGFibGVwYXRoKSkKCiAgICAjIHRvIGVuc3VyZSB3ZSBkb24ndCBvdmVyd3JpdGUgdGhpcyBmaWd1cmUgd2hlbiBjcm
VhdGluZyB0aGUgbmV4dDoKICAgIF9nY2ZfY2xlYXIocGx0KQo= base_image: yjbds/mlrun-ds:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#25e611e4bd05320d342708ce786522bfecaa0e51:/User/repos/functions/evaluation/test-classifier.py + code_origin: https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/evaluation/test-classifier.py diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py b/fileutils/arc_to_parquet/arc_to_parquet.py index fd7b697e0..1b7e9887b 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/arc_to_parquet.py @@ -11,6 +11,18 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. + +import ssl + +try: + _create_unverified_https_context = ssl._create_unverified_context +except AttributeError: + # Legacy Python that doesn't verify HTTPS certificates by default + pass +else: + # Handle target environment that doesn't support HTTPS verification + ssl._create_default_https_context = _create_unverified_https_context + import os import json from pathlib import Path diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index 74e2278b3..a88ddf2cd 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: 51478ccddd3f23791f0b4f370fa08781cac91cd1 + hash: bc46a566c14672288af0ee28d3a7e2d031d0d37d project: '' spec: command: '' @@ -12,7 +12,7 @@ spec: env: [] description: '' build: - functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRo
OiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBxdCIpOgogICAgICAgIG5hbWUgKz0gIi5wcXQiCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCBjaHVua3NpemU9Y2h1bmtzaXplLCBuYW1lcz1oZWFkZXIpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBwcXdyaXRlci53cml0ZV90YWJsZSh0YWJsZSkKCiAgICAgICAgaWYgcHF3cml0ZXI6CiAgICAgICAgICAgIHBxd3JpdGVyLmNsb3NlKCkKCiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbyhmInNhdmVkIHRhYmxlIHRvIHtkZXN0X3BhdGh9IikKICAgIGVsc2U6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBhbHJlYWR5IGV4aXN0cyIpCgogICAgY29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1kZXN0X3BhdGgpCiAgICAjIGxvZyBoZWFkZXIKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpCg== + functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFz
IGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBxdCIpOgogICAgICAgIG5hbWUgKz0gIi5wcXQiCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCBjaHVua3NpemU9Y2h1bmtzaXplLCBuYW1lcz1oZWFkZXIpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBwcXdyaXRlci53cml0ZV90YWJsZSh0YWJsZSkKCiAgICAgICAgaWYgcHF3cml0ZXI6CiAgICAgICAgICAgIHBxd3JpdGVyLmNsb3NlKCkKCiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbyhmInNhdmVkIHRhYmxlIHRvIHtkZXN0X3BhdGh9IikKICAgIGVsc2U6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBhbHJlYWR5IGV4aXN0cyIpCgogICAgY29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1kZXN0X3BhdGgpCiAgICAjIGxvZyBoZWFkZXIKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0
LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpCg== base_image: yjbds/mlrun-ds:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#e4d74d784d42fb25cc75cbcab6d817bb1d2b150c:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py + code_origin: https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index 81c407f05..136174c78 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -9,7 +9,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -18,9 +18,16 @@ "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## parameters" + ] + }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -37,6 +44,15 @@ " 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']" ] }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "os.makedirs(TARGET_PATH, exist_ok=True)" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -46,14 +62,14 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 14:25:57,524 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + "[mlrun] 2020-01-26 18:24:11,226 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" ] } ], @@ -69,7 +85,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -94,7 +110,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ 
-103,7 +119,7 @@ "'ready'" ] }, - "execution_count": 22, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -112,30 +128,23 @@ "arctoparq.deploy(skip_deployed=True, with_mlrun=False)" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Also note that the build time can be reduced if you specifiy a pre-built image with all required packages pre-installed." - ] - }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 14:26:02,799 starting run arc2parq uid=9bfb3ce77e0549f2b1016c128423130e -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 14:26:02,882 Job is running in the background, pod: arc2parq-9t75c\n", - "[mlrun] 2020-01-26 14:26:07,133 destination file does not exist, downloading\n", - "[mlrun] 2020-01-26 14:31:08,548 saved table to /User/mlrun/models/higgs.pqt\n", - "[mlrun] 2020-01-26 14:31:08,577 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:31:08,596 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-26 18:24:16,244 starting run arc2parq uid=ddb58fa1dfb644b7875bc4d92033a1e4 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 18:24:16,326 Job is running in the background, pod: arc2parq-42tm7\n", + "[mlrun] 2020-01-26 18:24:21,234 destination file does not exist, downloading\n", + "[mlrun] 2020-01-26 18:29:29,278 saved table to /User/mlrun/models/higgs.pqt\n", + "[mlrun] 2020-01-26 18:29:29,294 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 18:29:29,307 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 14:31:08,619 run executed, status=completed\n", + "[mlrun] 2020-01-26 18:29:29,332 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -308,12 +317,12 @@ " \n", " \n", " \n", - "
...23130e
\n", + "
...33a1e4
\n", " 0\n", - " Jan 26 14:26:07\n", + " Jan 26 18:24:21\n", " completed\n", " arc-to-parquet\n", - "
host=arc2parq-9t75c
kind=job
owner=admin
\n", + "
host=arc2parq-42tm7
kind=job
owner=admin
\n", " \n", "
archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
header=['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi', 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt', 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
key=higgs
name=higgs.pqt
target_path=/User/mlrun/models
\n", " \n", @@ -322,12 +331,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -343,8 +352,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 9bfb3ce77e0549f2b1016c128423130e , !mlrun logs 9bfb3ce77e0549f2b1016c128423130e \n", - "[mlrun] 2020-01-26 14:31:13,245 run executed, status=completed\n" + "!mlrun get run ddb58fa1dfb644b7875bc4d92033a1e4 , !mlrun logs ddb58fa1dfb644b7875bc4d92033a1e4 \n", + "[mlrun] 2020-01-26 18:29:36,692 run executed, status=completed\n" ] } ], @@ -381,7 +390,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -392,7 +401,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ @@ -402,7 +411,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ @@ -412,7 +421,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ @@ -421,7 +430,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 19, "metadata": {}, "outputs": [ { @@ -626,7 +635,7 @@ "[5 rows x 29 columns]" ] }, - "execution_count": 29, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -637,7 +646,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 20, "metadata": {}, "outputs": [ { @@ -646,7 +655,7 @@ "(11000000, 29)" ] }, - "execution_count": 30, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } @@ -664,7 +673,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ diff --git a/tests/test_classifier.ipynb b/tests/test_classifier.ipynb index bf065d78a..f91df44fc 100644 --- a/tests/test_classifier.ipynb +++ b/tests/test_classifier.ipynb @@ -11,7 +11,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 1, "metadata": {}, 
"outputs": [], "source": [ @@ -30,7 +30,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -52,24 +52,30 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 3, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 19:24:09,737 function spec saved to path: /User/repos/functions/evaluation/test-classifier.yaml\n" + ] + } + ], "source": [ - "yaml_name = os.path.join(CODE_BASE, 'evaluation', 'test-classifier.yaml')\n", - "if not os.path.isfile(yaml_name):\n", - " testfn = mlrun.code_to_function(\n", - " kind='job', \n", - " image='yjbds/mlrun-ds:latest',\n", - " filename=os.path.join(CODE_BASE, 'evaluation', 'test-classifier.py'))\n", - " testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", + "testfn = mlrun.code_to_function(\n", + " kind='job', \n", + " image='yjbds/mlrun-ds:latest',\n", + " filename=os.path.join(CODE_BASE, 'evaluation', 'test-classifier.py'))\n", + "testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", "\n", - " testfn.export(os.path.join(CODE_BASE, 'evaluation', 'test-classifier.yaml'))" + "testfn.export(os.path.join(CODE_BASE, 'evaluation', 'test-classifier.yaml'))" ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -80,7 +86,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -89,7 +95,7 @@ "'ready'" ] }, - "execution_count": 24, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -100,16 +106,16 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 25, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -125,21 +131,21 @@ }, { "cell_type": 
"code", - "execution_count": 26, + "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 14:37:14,256 starting run test_model uid=6f193cffb1714302ae174b06ca9538f6 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 14:37:14,354 Job is running in the background, pod: test-model-x4z5p\n", - "[mlrun] 2020-01-26 14:37:24,867 log artifact roc.html at roc.html, size: 32259, db: Y\n", - "[mlrun] 2020-01-26 14:37:25,080 log artifact confusion_matrix.html at confusion_matrix.html, size: 16188, db: Y\n", - "[mlrun] 2020-01-26 14:37:25,667 log artifact feature-importances-plot.html at feature-importances-plot.html, size: 68260, db: Y\n", - "[mlrun] 2020-01-26 14:37:25,682 log artifact feature-importances-table at /User/mlrun/models/feature-importances-table.csv, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:27:16,202 starting run test_model uid=60d2146665834f8ba1ca829227156ac8 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 19:27:16,305 Job is running in the background, pod: test-model-w96c5\n", + "[mlrun] 2020-01-26 19:27:27,264 log artifact roc.html at roc.html, size: 41071, db: Y\n", + "[mlrun] 2020-01-26 19:27:28,585 log artifact confusion_matrix.html at confusion_matrix.html, size: 14016, db: Y\n", + "[mlrun] 2020-01-26 19:27:29,158 log artifact feature-importances-plot.html at feature-importances-plot.html, size: 71976, db: Y\n", + "[mlrun] 2020-01-26 19:27:29,177 log artifact feature-importances-table at /User/mlrun/models/feature-importances-table.csv, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 14:37:25,739 run executed, status=completed\n", + "[mlrun] 2020-01-26 19:27:29,251 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -312,26 +318,26 @@ " \n", " \n", " \n", - "
...9538f6
\n", + "
...156ac8
\n", " 0\n", - " Jan 26 14:37:20\n", + " Jan 26 19:27:24\n", " completed\n", " test-classifier\n", - "
host=test-model-x4z5p
kind=job
owner=admin
\n", + "
host=test-model-w96c5
kind=job
owner=admin
\n", " \n", "
model=/User/mlrun/models/lgb-classifier.pkl
target_path=/User/mlrun/models
xtest=/User/mlrun/models/xtest.pqt
ytest=/User/mlrun/models/ytest.pqt
\n", " \n", - "
roc.html
confusion_matrix.html
feature-importances-plot.html
feature-importances-table
\n", + "
roc.html
confusion_matrix.html
feature-importances-plot.html
feature-importances-table
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -347,8 +353,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 6f193cffb1714302ae174b06ca9538f6 , !mlrun logs 6f193cffb1714302ae174b06ca9538f6 \n", - "[mlrun] 2020-01-26 14:37:33,579 run executed, status=completed\n" + "!mlrun get run 60d2146665834f8ba1ca829227156ac8 , !mlrun logs 60d2146665834f8ba1ca829227156ac8 \n", + "[mlrun] 2020-01-26 19:27:35,599 run executed, status=completed\n" ] } ], diff --git a/tests/train_classifier.ipynb b/tests/train_classifier.ipynb index 449aea8e5..b552e71f6 100644 --- a/tests/train_classifier.ipynb +++ b/tests/train_classifier.ipynb @@ -24,7 +24,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -43,18 +43,13 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "CODE_BASE = '/User/repos/functions/' \n", - "N_SAMPLES = 100_000 # size of HIGGS data\n", - "M_FEATURES = 28\n", - "NEG_WEIGHT = 0.5\n", "TARGET_DATA_PATH = '/User/mlrun/models'\n", - "FILE_NAME = 'simdata.pqt'\n", - "KEY = 'simdata'\n", - "RNG = 1\n", + "\n", "SKLEARN_CLASSIFIER = 'lightgbm.sklearn.LGBMClassifier'\n", "MODEL_KEY = 'model'\n", "MODEL_NAME = 'lgb-classifier.pkl'\n", @@ -65,686 +60,187 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## generate some binary classifiaction data" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [], - "source": [ - "binarydatagen = mlrun.import_function(\n", - " os.path.join(CODE_BASE+'datagen/classification', 'binary.yaml')\n", - ").apply(mlrun.mount_v3io())" + "_____\n", + "## train a classifier" ] }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 3, "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "'ready'" - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + 
"output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 19:17:00,914 function spec saved to path: /User/repos/functions/train/sklearn-classifier.yaml\n" + ] } ], "source": [ - "binarydatagen.deploy(skip_deployed=True, with_mlrun=False)" + "testfn = mlrun.code_to_function(\n", + " kind='job', \n", + " filename=os.path.join(CODE_BASE, 'train', 'sklearn-classifier.py'))\n", + "testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", + "testfn.export(os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml'))" ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 4, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "task1 = mlrun.NewTask()\n", - "task1.with_params(\n", - " n_samples=N_SAMPLES,\n", - " m_features=M_FEATURES,\n", - " weight=NEG_WEIGHT,\n", - " target_path=TARGET_DATA_PATH,\n", - " filename=FILE_NAME,\n", - " key=KEY,\n", - " random_state=RNG)" + "trainfn = mlrun.import_function(\n", + " os.path.join(CODE_BASE+'train/sklearn-classifier.yaml')\n", + ").apply(mlrun.mount_v3io())" ] }, { "cell_type": "code", - "execution_count": 33, - "metadata": {}, + "execution_count": 5, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 14:35:40,509 starting run create_binary_classification uid=245e550ff213469681114228327a8e02 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 14:35:40,606 Job is running in the background, pod: create-binary-classification-7295j\n", - "[mlrun] 2020-01-26 14:35:53,548 log artifact simdata at /User/mlrun/models/simdata.pqt, size: None, db: Y\n", - "\n", - "[mlrun] 2020-01-26 14:35:53,560 run executed, status=completed\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas 
will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", - " result = infer_dtype(pandas_collection)\n", - "final state: succeeded\n" + "[mlrun] 2020-01-26 19:17:15,462 starting remote build, image: .mlrun/func-default-sklearn-classifier-latest\n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:d7724d11d33770dd3f65bee87ce7bf9f182428e96d53343f82ab5fce506f875b: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Built cross stage deps: map[] \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:d7724d11d33770dd3f65bee87ce7bf9f182428e96d53343f82ab5fce506f875b: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-ds:latest \n", + "\u001b[36mINFO\u001b[0m[0000] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", + "\u001b[36mINFO\u001b[0m[0048] Taking snapshot of full filesystem... 
\n", + "\u001b[36mINFO\u001b[0m[0063] RUN pip install mlrun \n", + "\u001b[36mINFO\u001b[0m[0063] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0063] args: [-c pip install mlrun] \n", + "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.3)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.0)\n", + "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.5)\n", + "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.3)\n", + "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", + "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.1.1)\n", + "Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", + "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.20.1)\n", + "Requirement already satisfied: nest-asyncio>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.2)\n", + "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", + "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", + "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", + "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", + "Requirement already satisfied: Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", + "Requirement already satisfied: kfp>=0.1.29 in 
/opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", + "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.9)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.6)\n", + "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", + "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", + "Requirement already satisfied: tornado<6,>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.1.1)\n", + "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.17.4)\n", + "Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2.8.0)\n", + "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.1)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement 
already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: chardet<4.0,>=2.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.4)\n", + "Requirement already satisfied: idna<2.8,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.6)\n", + "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (1.24.1)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2019.9.11)\n", + "Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.0)\n", + "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.10.3)\n", + "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", + "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", + "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", + "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", + "Requirement already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from 
kfp>=0.1.29->mlrun) (3.2.0)\n", + "Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", + "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", + "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", + "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", + "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", + "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.11.0)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.2)\n", + "Requirement already satisfied: botocore<1.15.0,>=1.14.9 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.9)\n", + "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", + "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from 
nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", + "Requirement already satisfied: nbformat>=4.4 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (5.0.4)\n", + "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", + "Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", + "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", + "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.16.0)\n", + "Requirement already satisfied: setuptools>=18.5 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (41.0.1.post20191122)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.2)\n", + "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", + "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: backcall in 
/opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", + "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.3)\n", + "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", + "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", + "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", + "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", + "Requirement already satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", + "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.4.0)\n", + "Requirement already satisfied: 
asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (0.24.0)\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", + "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", + "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", + "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", + "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.9->boto3>=1.9->mlrun) (0.15.2)\n", + "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", + "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", + "Requirement already satisfied: parso>=0.5.2 in 
/opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.5.2)\n", + "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", + "Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.1.0)\n", + "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.18)\n", + "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", + "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", + "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", + "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", + "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + "\u001b[36mINFO\u001b[0m[0066] Taking snapshot of full filesystem... 
\n" ] }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...7a8e02
0Jan 26 14:35:52completedbinary
host=create-binary-classification-7295j
kind=job
owner=admin
filename=simdata.pqt
key=simdata
m_features=28
n_samples=100000
random_state=1
target_path=/User/mlrun/models
weight=0.5
simdata
\n", - "
\n", - "
\n", - "
\n", - " Title\n", - " ×\n", - "
\n", - " \n", - "
\n", - "
\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 245e550ff213469681114228327a8e02 , !mlrun logs 245e550ff213469681114228327a8e02 \n", - "[mlrun] 2020-01-26 14:35:59,827 run executed, status=completed\n" - ] - } - ], - "source": [ - "tsk1 = binarydatagen.run(task1, handler='create_binary_classification')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "______" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## split the generated data" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [], - "source": [ - "splitter = mlrun.import_function(\n", - " os.path.join(CODE_BASE+'datagen/splitters', 'train_valid_test.yaml')\n", - ").apply(mlrun.mount_v3io())" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'ready'" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "splitter.deploy(skip_deployed=True, with_mlrun=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "task1 = mlrun.NewTask()\n", - "task1.with_params(\n", - " src_file=TARGET_DATA_PATH + '/' + FILE_NAME,\n", - " sample=20_000,\n", - " target_path=TARGET_DATA_PATH,\n", - " random_state=RNG)" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-26 14:35:59,880 starting run train_valid_test_splitter uid=907ad4a876fa4205a40a668956446468 -> http://mlrun-api:8080\n", - "[mlrun] 
2020-01-26 14:35:59,974 Job is running in the background, pod: train-valid-test-splitter-vdn6h\n", - "[mlrun] 2020-01-26 14:36:09,842 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:36:09,953 log artifact xtrain at /User/mlrun/models/xtrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:36:10,052 log artifact xvalid at /User/mlrun/models/xvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:36:10,104 log artifact xtest at /User/mlrun/models/xtest.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:36:10,146 log artifact ytrain at /User/mlrun/models/ytrain.pqt, size: None, db: Y\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", - " labels = getattr(columns, 'labels', None) or [\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", - " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", - " labels, = index.labels\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", - " result = infer_dtype(pandas_collection)\n", - "[mlrun] 2020-01-26 14:36:10,182 log artifact yvalid at /User/mlrun/models/yvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:36:10,211 log artifact ytest at /User/mlrun/models/ytest.pqt, size: None, db: Y\n", - "\n", - "[mlrun] 2020-01-26 14:36:10,236 run executed, status=completed\n", - "final state: succeeded\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...446468
0Jan 26 14:36:09completedtrain-valid-test
host=train-valid-test-splitter-vdn6h
kind=job
owner=admin
random_state=1
sample=20000
src_file=/User/mlrun/models/simdata.pqt
target_path=/User/mlrun/models
header
xtrain
xvalid
xtest
ytrain
yvalid
ytest
\n", - "
\n", - "
\n", - "
\n", - " Title\n", - " ×\n", - "
\n", - " \n", - "
\n", - "
\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 907ad4a876fa4205a40a668956446468 , !mlrun logs 907ad4a876fa4205a40a668956446468 \n", - "[mlrun] 2020-01-26 14:36:19,229 run executed, status=completed\n" - ] - } - ], - "source": [ - "tsk1 = splitter.run(task1, handler='train_valid_test_splitter')" - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'header': '/User/mlrun/models/header.pkl',\n", - " 'xtrain': '/User/mlrun/models/xtrain.pqt',\n", - " 'xvalid': '/User/mlrun/models/xvalid.pqt',\n", - " 'xtest': '/User/mlrun/models/xtest.pqt',\n", - " 'ytrain': '/User/mlrun/models/ytrain.pqt',\n", - " 'yvalid': '/User/mlrun/models/yvalid.pqt',\n", - " 'ytest': '/User/mlrun/models/ytest.pqt'}" - ] - }, - "execution_count": 38, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "tsk1.outputs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "_____\n", - "## train a classifier" - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [], - "source": [ - "yaml_name = os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml')\n", - "if not os.path.isfile(yaml_name):\n", - " testfn = mlrun.code_to_function(\n", - " kind='job', \n", - " filename=os.path.join(CODE_BASE, 'train', 'sklearn-classifier.py'))\n", - " testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", - " testfn.export(os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml'))" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [], - "source": [ - "trainfn = mlrun.import_function(\n", - " os.path.join(CODE_BASE+'train/sklearn-classifier.yaml')\n", - ").apply(mlrun.mount_v3io())" - ] - }, - { - "cell_type": 
"code", - "execution_count": 41, - "metadata": {}, - "outputs": [ { "data": { "text/plain": [ - "'ready'" + "True" ] }, - "execution_count": 41, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -755,30 +251,29 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 42, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "task2 = mlrun.NewTask()\n", - "task2.with_params(\n", - " src_file=tsk1.output(KEY),\n", + "task = mlrun.NewTask()\n", + "task.with_params(\n", " SKClassifier=SKLEARN_CLASSIFIER,\n", " callbacks = [],\n", - " xtrain=tsk1.outputs['xtrain'],\n", - " ytrain=tsk1.outputs['ytrain'],\n", - " xvalid=tsk1.outputs['xvalid'],\n", - " yvalid=tsk1.outputs['yvalid'],\n", + " xtrain='/User/mlrun/models/xtrain.pqt',\n", + " ytrain='/User/mlrun/models/ytrain.pqt',\n", + " xvalid='/User/mlrun/models/xvalid.pqt',\n", + " yvalid='/User/mlrun/models/yvalid.pqt',\n", " target_path='/User/mlrun/models',\n", " name=MODEL_NAME,\n", " key=MODEL_KEY,\n", @@ -787,23 +282,23 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 14:36:19,413 starting run train uid=6eb38fb5166b40099e6e579f00c1ad22 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 14:36:19,496 Job is running in the background, pod: train-n4qgh\n", + "[mlrun] 2020-01-26 19:20:00,276 starting run train uid=984d1c76d3744b8593395d4cec4c06e7 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 19:20:00,370 Job is running in the background, pod: train-7mmh2\n", "[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the \"boost_from_average\" parameter in \"binary\" objective is true.\n", "This may cause significantly different results comparing to the previous versions of LightGBM.\n", "Try to 
set boost_from_average=false, if your old models produce bad results\n", "[LightGBM] [Warning] Cannot change bin_construct_sample_cnt after constructed Dataset handle.\n", - "[mlrun] 2020-01-26 14:36:31,442 log artifact training-validation-plot.html at training-validation-plot.html, size: 32968, db: Y\n", - "[mlrun] 2020-01-26 14:36:31,512 log artifact model at /User/mlrun/models/lgb-classifier.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:21:48,490 log artifact training-validation-plot.html at training-validation-plot.html, size: 36420, db: Y\n", + "[mlrun] 2020-01-26 19:21:48,562 log artifact model at /User/mlrun/models/lgb-classifier.pkl, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 14:36:31,540 run executed, status=completed\n", + "[mlrun] 2020-01-26 19:21:48,603 run executed, status=completed\n", "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:268: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", @@ -980,26 +475,26 @@ " \n", " \n", " \n", - "
...c1ad22
\n", + "
...4c06e7
\n", " 0\n", - " Jan 26 14:36:25\n", + " Jan 26 19:20:08\n", " completed\n", " sklearn-classifier\n", - "
host=train-n4qgh
kind=job
owner=admin
\n", + "
host=train-7mmh2
kind=job
owner=admin
\n", " \n", - "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=lgb-classifier.pkl
src_file=None
target_path=/User/mlrun/models
verbose=False
xtrain=/User/mlrun/models/xtrain.pqt
xvalid=/User/mlrun/models/xvalid.pqt
ytrain=/User/mlrun/models/ytrain.pqt
yvalid=/User/mlrun/models/yvalid.pqt
\n", - "
train_accuracy=0.9856296296296296
\n", - "
training-validation-plot.html
model
\n", + "
SKClassifier=lightgbm.sklearn.LGBMClassifier
callbacks=[]
key=model
name=lgb-classifier.pkl
target_path=/User/mlrun/models
verbose=False
xtrain=/User/mlrun/models/xtrain.pqt
xvalid=/User/mlrun/models/xvalid.pqt
ytrain=/User/mlrun/models/ytrain.pqt
yvalid=/User/mlrun/models/yvalid.pqt
\n", + "
train_accuracy=0.732269862931968
\n", + "
training-validation-plot.html
model
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -1015,35 +510,35 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 6eb38fb5166b40099e6e579f00c1ad22 , !mlrun logs 6eb38fb5166b40099e6e579f00c1ad22 \n", - "[mlrun] 2020-01-26 14:36:38,665 run executed, status=completed\n" + "!mlrun get run 984d1c76d3744b8593395d4cec4c06e7 , !mlrun logs 984d1c76d3744b8593395d4cec4c06e7 \n", + "[mlrun] 2020-01-26 19:21:52,483 run executed, status=completed\n" ] } ], "source": [ - "tsk2 = trainfn.run(task2, handler='train')" + "tsk = trainfn.run(task, handler='train')" ] }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'train_accuracy': 0.9856296296296296,\n", + "{'train_accuracy': 0.732269862931968,\n", " 'training-validation-plot.html': 'training-validation-plot.html',\n", " 'model': '/User/mlrun/models/lgb-classifier.pkl'}" ] }, - "execution_count": 45, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "tsk2.outputs" + "tsk.outputs" ] }, { diff --git a/tests/train_valid_test_split.ipynb b/tests/train_valid_test_split.ipynb index 6eafca913..b1d568b59 100644 --- a/tests/train_valid_test_split.ipynb +++ b/tests/train_valid_test_split.ipynb @@ -9,7 +9,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -23,336 +23,66 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## parameters" + "## parameters\n", + "\n", + "**Please be sure to run the notebook [arc_to_parquet](arc_to_parquet.ipynb) before running this one.**" ] }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ - "CODE_BASE = '/User/repos/functions/datagen' \n", - "N_SAMPLES = 100_000\n", - "M_FEATURES = 28\n", - "NEG_WEIGHT = 0.5\n", + "CODE_BASE = '/User/repos/functions' \n", "RNG = 1\n", "TARGET_DATA_PATH = '/User/mlrun/models'\n", - 
"SRC_FILE = 'simdata.pqt'\n", - "KEY = 'simdata'" + "SRC_FILE = 'higgs.pqt'\n", + "KEY = 'higgs'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## generate some binary classification data" + "## split the data" ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 14:33:32,512 starting run create_binary_classification uid=e1d104202c33479eaf866d1276a911d1 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 14:33:32,615 Job is running in the background, pod: create-binary-classification-c7gwh\n", - "[mlrun] 2020-01-26 14:33:43,575 log artifact simdata at /User/mlrun/models/simdata.pqt, size: None, db: Y\n", - "\n", - "[mlrun] 2020-01-26 14:33:43,586 run executed, status=completed\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.\n", - " result = infer_dtype(pandas_collection)\n", - "final state: succeeded\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...a911d1
0Jan 26 14:33:42completedbinary
host=create-binary-classification-c7gwh
kind=job
owner=admin
filename=/User/mlrun/models/simdata.pqt
key=simdata
m_features=28
n_samples=100000
random_state=1
target_path=/User/mlrun/models
weight=0.5
simdata
\n", - "
\n", - "
\n", - "
\n", - " Title\n", - " ×\n", - "
\n", - " \n", - "
\n", - "
\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run e1d104202c33479eaf866d1276a911d1 , !mlrun logs e1d104202c33479eaf866d1276a911d1 \n", - "[mlrun] 2020-01-26 14:33:51,818 run executed, status=completed\n" + "[mlrun] 2020-01-26 19:13:33,873 function spec saved to path: /User/repos/functions/datagen/splitters/train_valid_test.yaml\n" ] } ], "source": [ - "binarydatagen = mlrun.import_function(\n", - " os.path.join(CODE_BASE, 'classification', 'binary.yaml')\n", - ").apply(mlrun.mount_v3io())\n", - "\n", - "binarydatagen.deploy(skip_deployed=True)\n", - "\n", - "task1 = mlrun.NewTask()\n", - "task1.with_params(\n", - " n_samples=N_SAMPLES,\n", - " m_features=M_FEATURES,\n", - " weight=NEG_WEIGHT,\n", - " target_path=TARGET_DATA_PATH,\n", - " filename=TARGET_DATA_PATH + '/' + SRC_FILE,\n", - " key=KEY,\n", - " random_state=RNG)\n", - "\n", - "tsk1 = binarydatagen.run(task1, handler='create_binary_classification')" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'simdata': '/User/mlrun/models/simdata.pqt'}" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "tsk1.outputs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## split the data" + "testfn = mlrun.code_to_function(\n", + " kind='job', \n", + " filename=os.path.join(CODE_BASE, 'datagen/splitters', 'train_valid_test.py'))\n", + "testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", + "testfn.export(yaml_name)" ] }, { "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [], - "source": [ - "yaml_name = os.path.join(CODE_BASE, 'splitters', 'train_valid_test.yaml')\n", - "if not os.path.isfile(yaml_name):\n", - " testfn = 
mlrun.code_to_function(\n", - " kind='job', \n", - " filename=os.path.join(CODE_BASE, 'splitters', 'train_valid_test.py'))\n", - " testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", - " testfn.export(yaml_name)" - ] - }, - { - "cell_type": "code", - "execution_count": 35, + "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "splitter = mlrun.import_function(\n", - " os.path.join(CODE_BASE, 'splitters', 'train_valid_test.yaml')\n", + " os.path.join(CODE_BASE, 'datagen/splitters', 'train_valid_test.yaml')\n", ").apply(mlrun.mount_v3io())" ] }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 55, "metadata": {}, "outputs": [ { @@ -361,7 +91,7 @@ "'ready'" ] }, - "execution_count": 36, + "execution_count": 55, "metadata": {}, "output_type": "execute_result" } @@ -372,32 +102,132 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 14:34:07,430 starting run train_valid_test_splitter uid=6c61b309093146f0bd97a387d76aa71e -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 14:34:07,521 Job is running in the background, pod: train-valid-test-splitter-vrdv6\n", - "[mlrun] 2020-01-26 14:34:17,884 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:34:18,258 log artifact xtrain at /User/mlrun/models/xtrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:34:18,389 log artifact xvalid at /User/mlrun/models/xvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:34:18,469 log artifact xtest at /User/mlrun/models/xtest.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:34:18,496 log artifact ytrain at /User/mlrun/models/ytrain.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:34:18,520 log artifact yvalid at /User/mlrun/models/yvalid.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 14:34:18,536 log artifact ytest at 
/User/mlrun/models/ytest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:13:38,909 starting run train_valid_test_splitter uid=2238f0c5856e4359a068cd881dc0c62b -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 19:13:39,003 Job is running in the background, pod: train-valid-test-splitter-zlfcj\n", + "[mlrun] 2020-01-26 19:14:07,630 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:14:17,951 log artifact xtrain at /User/mlrun/models/xtrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:14:21,585 log artifact xvalid at /User/mlrun/models/xvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:14:23,245 log artifact xtest at /User/mlrun/models/xtest.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:14:24,139 log artifact ytrain at /User/mlrun/models/ytrain.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:14:24,519 log artifact yvalid at /User/mlrun/models/yvalid.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-26 19:14:24,755 log artifact ytest at /User/mlrun/models/ytest.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 14:34:18,559 run executed, status=completed\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:708: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", - " labels = getattr(columns, 'labels', None) or [\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:735: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead\n", - " return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.\n", - " labels, = index.labels\n", - "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py:114: FutureWarning: A future version of pandas will default to `skipna=True`. 
To silence this warning, pass `skipna=True|False` explicitly.\n", - " result = infer_dtype(pandas_collection)\n", + "[mlrun] 2020-01-26 19:14:25,528 run executed, status=completed\n", + "--- Logging error ---\n", + "Traceback (most recent call last):\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 1025, in emit\n", + " msg = self.format(record)\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 869, in format\n", + " return fmt.format(record)\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 608, in format\n", + " record.message = record.getMessage()\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 369, in getMessage\n", + " msg = msg % self.args\n", + "TypeError: not all arguments converted during string formatting\n", + "Call stack:\n", + " File \"/opt/conda/bin/mlrun\", line 10, in \n", + " sys.exit(main())\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 764, in __call__\n", + " return self.main(*args, **kwargs)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 717, in main\n", + " rv = self.invoke(ctx)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 1137, in invoke\n", + " return _process_result(sub_ctx.command.invoke(sub_ctx))\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 956, in invoke\n", + " return ctx.invoke(self.callback, **ctx.params)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 555, in invoke\n", + " return callback(*args, **kwargs)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/__main__.py\", line 167, in run\n", + " resp = fn.run(runobj, watch=watch, schedule=schedule)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/base.py\", line 294, in run\n", + " resp = self._run(runspec, execution)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 89, in _run\n", + " sout, serr = 
exec_from_params(fn, runobj, context)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", + " val = handler(*args_list)\n", + " File \"main.py\", line 104, in train_valid_test_splitter\n", + " context.logger.info('numpy', np.__version__)\n", + "Message: 'numpy'\n", + "Arguments: ('1.17.4',)\n", + "--- Logging error ---\n", + "Traceback (most recent call last):\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 1025, in emit\n", + " msg = self.format(record)\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 869, in format\n", + " return fmt.format(record)\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 608, in format\n", + " record.message = record.getMessage()\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 369, in getMessage\n", + " msg = msg % self.args\n", + "TypeError: not all arguments converted during string formatting\n", + "Call stack:\n", + " File \"/opt/conda/bin/mlrun\", line 10, in \n", + " sys.exit(main())\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 764, in __call__\n", + " return self.main(*args, **kwargs)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 717, in main\n", + " rv = self.invoke(ctx)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 1137, in invoke\n", + " return _process_result(sub_ctx.command.invoke(sub_ctx))\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 956, in invoke\n", + " return ctx.invoke(self.callback, **ctx.params)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 555, in invoke\n", + " return callback(*args, **kwargs)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/__main__.py\", line 167, in run\n", + " resp = fn.run(runobj, watch=watch, schedule=schedule)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/base.py\", line 
294, in run\n", + " resp = self._run(runspec, execution)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 89, in _run\n", + " sout, serr = exec_from_params(fn, runobj, context)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", + " val = handler(*args_list)\n", + " File \"main.py\", line 105, in train_valid_test_splitter\n", + " context.logger.info('pandas ', pd.__version__)\n", + "Message: 'pandas '\n", + "Arguments: ('0.25.3',)\n", + "--- Logging error ---\n", + "Traceback (most recent call last):\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 1025, in emit\n", + " msg = self.format(record)\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 869, in format\n", + " return fmt.format(record)\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 608, in format\n", + " record.message = record.getMessage()\n", + " File \"/opt/conda/lib/python3.7/logging/__init__.py\", line 369, in getMessage\n", + " msg = msg % self.args\n", + "TypeError: not all arguments converted during string formatting\n", + "Call stack:\n", + " File \"/opt/conda/bin/mlrun\", line 10, in \n", + " sys.exit(main())\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 764, in __call__\n", + " return self.main(*args, **kwargs)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 717, in main\n", + " rv = self.invoke(ctx)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 1137, in invoke\n", + " return _process_result(sub_ctx.command.invoke(sub_ctx))\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 956, in invoke\n", + " return ctx.invoke(self.callback, **ctx.params)\n", + " File \"/opt/conda/lib/python3.7/site-packages/click/core.py\", line 555, in invoke\n", + " return callback(*args, **kwargs)\n", + " File 
\"/opt/conda/lib/python3.7/site-packages/mlrun/__main__.py\", line 167, in run\n", + " resp = fn.run(runobj, watch=watch, schedule=schedule)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/base.py\", line 294, in run\n", + " resp = self._run(runspec, execution)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 89, in _run\n", + " sout, serr = exec_from_params(fn, runobj, context)\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 174, in exec_from_params\n", + " val = handler(*args_list)\n", + " File \"main.py\", line 106, in train_valid_test_splitter\n", + " context.logger.info('pyarrow', pa.__version__)\n", + "Message: 'pyarrow'\n", + "Arguments: ('0.15.1',)\n", "final state: succeeded\n" ] }, @@ -570,26 +400,26 @@ " \n", " \n", " \n", - "
...6aa71e
\n", + "
...c0c62b
\n", " 0\n", - " Jan 26 14:34:17\n", + " Jan 26 19:13:46\n", " completed\n", " train-valid-test\n", - "
host=train-valid-test-splitter-vrdv6
kind=job
owner=admin
\n", + "
host=train-valid-test-splitter-zlfcj
kind=job
owner=admin
\n", " \n", - "
random_state=1
src_file=/User/mlrun/models/simdata.pqt
target_path=/User/mlrun/models
\n", + "
random_state=1
src_file=higgs.pqt
target_path=/User/mlrun/models
\n", " \n", "
header
xtrain
xvalid
xtest
ytrain
yvalid
ytest
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -605,15 +435,15 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 6c61b309093146f0bd97a387d76aa71e , !mlrun logs 6c61b309093146f0bd97a387d76aa71e \n", - "[mlrun] 2020-01-26 14:34:26,698 run executed, status=completed\n" + "!mlrun get run 2238f0c5856e4359a068cd881dc0c62b , !mlrun logs 2238f0c5856e4359a068cd881dc0c62b \n", + "[mlrun] 2020-01-26 19:14:28,394 run executed, status=completed\n" ] } ], "source": [ "task2 = mlrun.NewTask()\n", "task2.with_params(\n", - " src_file=tsk1.outputs['simdata'],\n", + " src_file='higgs.pqt',\n", " target_path=TARGET_DATA_PATH,\n", " random_state=RNG)\n", "\n", @@ -622,7 +452,7 @@ }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 57, "metadata": {}, "outputs": [ { @@ -637,7 +467,7 @@ " 'ytest': '/User/mlrun/models/ytest.pqt'}" ] }, - "execution_count": 44, + "execution_count": 57, "metadata": {}, "output_type": "execute_result" } @@ -655,7 +485,7 @@ }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -664,36 +494,50 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 31, "metadata": {}, "outputs": [], "source": [ - "# rounding error of one sample\n", - "ERROR = -1\n", - "xtrain_shape = pd.read_parquet(tsk2.outputs['xtrain'], engine='pyarrow').shape\n", - "ytrain_shape = pd.read_parquet(tsk2.outputs['ytrain'], engine='pyarrow').shape\n", - "\n", - "assert (int(.75*(N_SAMPLES*(1-.1)))+ERROR, M_FEATURES) == xtrain_shape, \"xtrain doesn't have the expected shape\"\n", + "n_samples, n_features = pd.read_parquet(os.path.join(TARGET_DATA_PATH, SRC_FILE), engine='pyarrow').shape" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "xtrain_shape = pd.read_parquet(tsk2.outputs['xtrain'], engine='pyarrow').shape" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + 
"outputs": [], + "source": [ + "rounding_err = -1\n", + "assert (int(n_samples*.75*.9)+rounding_err, M_FEATURES) == xtrain_shape, \"xtrain doesn't have the expected shape\"\n", "assert ytrain_shape[0] == xtrain_shape[0], \"ytrain and xtrain have different shapes\"\n", "assert ytrain_shape[1] == 1, \"ytrain (labels) has more than 1 column\"" ] }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "xtest_shape = pd.read_parquet(tsk2.outputs['xtest'], engine='pyarrow').shape\n", "ytest_shape = pd.read_parquet(tsk2.outputs['ytest'], engine='pyarrow').shape\n", - "assert (int(N_SAMPLES*.1), M_FEATURES) == xtest_shape, \"xtest doesn't have the expected shape\"\n", + "assert (int(n_samples*.1), M_FEATURES) == xtest_shape, \"xtest doesn't have the expected shape\"\n", "assert ytest_shape[0] == xtest_shape[0], \"ytest and xtest have different shapes\"\n", "assert ytest_shape[1] == 1, \"ytest (test labels) has more than 1 column\"" ] }, { "cell_type": "code", - "execution_count": 48, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -702,11 +546,31 @@ }, { "cell_type": "code", - "execution_count": 49, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "assert len(load(open(tsk2.outputs['header'], 'rb'))) == M_FEATURES" + "assert len(load(open(tsk2.outputs['header'], 'rb'))) == n_features" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(load(open(tsk2.outputs['header'], 'rb')))" ] }, { diff --git a/train/sklearn-classifier.yaml b/train/sklearn-classifier.yaml index 21d86f528..668f0aac0 100644 --- a/train/sklearn-classifier.yaml +++ b/train/sklearn-classifier.yaml @@ -2,12 +2,11 @@ kind: job metadata: name: sklearn-classifier tag: '' - hash: 
3d4b7b654a757bb047ac767b082b8529bd7b009e + hash: 14f1603a259311e015900f245810d9bf474dbb20 project: '' spec: command: '' args: [] - image: yjbds/mlrun-ds:latest volumes: [] volume_mounts: [] env: [] @@ -16,4 +15,4 @@ spec: functionSourceCode: aW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQKZnJvbSBtYXRwbG90bGliLmZpZ3VyZSBpbXBvcnQgRmlndXJlCmltcG9ydCBzZWFib3JuIGFzIHNucwoKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbgppbXBvcnQgb3MKaW1wb3J0IGltcG9ydGxpYgpmcm9tIGNsb3VkcGlja2xlIGltcG9ydCBkdW1wCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKZnJvbSBtbHJ1bi5kYXRhc3RvcmUgaW1wb3J0IERhdGFJdGVtCmZyb20gbWxydW4uYXJ0aWZhY3RzIGltcG9ydCBUYWJsZUFydGlmYWN0LCBQbG90QXJ0aWZhY3QKCmltcG9ydCB3YXJuaW5ncwp3YXJuaW5ncy5zaW1wbGVmaWx0ZXIoYWN0aW9uPSdpZ25vcmUnLCBjYXRlZ29yeT1GdXR1cmVXYXJuaW5nKQoKZGVmIHRyYWluKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdID0gTm9uZSwKICAgIFNLQ2xhc3NpZmllcjogc3RyICA9ICcnLAogICAgY2FsbGJhY2tzICA9IFtdLAogICAgeHRyYWluOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeXRyYWluOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeHZhbGlkOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeXZhbGlkOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICcnLAogICAgbmFtZTogc3RyID0gJycsCiAgICBrZXk6IHN0ciA9ICcnLAogICAgdmVyYm9zZTogYm9vbCA9IEZhbHNlLAogICAgcmFuZG9tX3N0YXRlID0gMQopIC0+IE5vbmU6CiAgICAiIiJUcmFpbiBhbmQgc2F2ZSBhbiBTY2lraXRsZWFybiBtb2RlbC4KICAgIAogICAgVGhlIGRhdGEgc291cmNlIGNhbiBlaXRoZXIgYmUgYSBzdHJpbmcgZmlsZSBuYW1lIG9yIGFuIGFydGlmYWN0IGl0ZW0uCiAgICAKICAgIFRoZSBoZWFkZXIgaXMgZWl0aCBhIGxpc3Qgb2YgY29sdW1uIG5hbWVzLCBhbiBhcnRpZmFjdCBoZWFkZXIgaXRlbSwgb3IgTm9uZS4KICAgIAogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIFNLQ2xhc3NpZmllcjogICAgc3RyaW5nIG1vZHVsZSBhbmQgY2xhc3NuYW1lIG9mIGNsYXNzaWZpZXIKICAgIDpwYXJhbSBjYWxsYmFja3M6ICAgICAgIHNrbGVhcm4gY2xhc3NpZmllciBmaXQgZnVuY3Rpb24gY2FsbGJhY2tzCiAgICA6cGFyYW0geHRyYWluOiAgICAgICAgICAKICAgIDpwYXJhbSB5dHJhaW46CiAgICA6cGFyYW0geHZhbGlkOgogICAgOnBhcmFtIHl2YWxpZDoKIC
AgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGZvbGRlciBsb2NhdGlvbiBvZiBmaWxlcwogICAgOnBhcmFtIG5hbWU6ICAgICAgICAgICAgZGVzdGluYXRpb24gbmFtZSBmb3IgbW9kZWwgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgICAga2V5IGZvciBtb2RlbCBhcnRpZmFjdAogICAgOnBhcmFtIHZlcmJvc2UgOiAgICAgICAgKEZhbHNlKSBzaG93IG1ldHJpY3MgZm9yIHRyYWluaW5nL3ZhbGlkYXRpb24gc3RlcHMuCiAgICA6cGFyYW0gcmFuZG9tX3N0YXRlOiAgICAoMSkgc2tsZWFybiBybmcgc2VlZAogICAgCiAgICBleGFtcGxlIGNhbGxiYWNrczoKICAgIGBgYAogICAgZnJvbSBsaWdodGdibSBpbXBvcnQgcmVjb3JkX2V2YWx1YXRpb24KICAgIGV2YWxfcmVzdWx0cyA9IGRpY3QoKQogICAgY2FsbGJhY2tzID0gW3JlY29yZF9ldmFsdWF0aW9uKGV2YWxfcmVzdWx0cyldCiAgICBgYGAKICAgICIiIgogICAgIyBsb2FkIGRhdGEKICAgIHh0cmFpbiA9IHBkLnJlYWRfcGFycXVldChzdHIoeHRyYWluKSwgZW5naW5lPSdweWFycm93JykKICAgIHl0cmFpbiA9IHBkLnJlYWRfcGFycXVldChzdHIoeXRyYWluKSwgZW5naW5lPSdweWFycm93JykKICAgIHh2YWxpZCA9IHBkLnJlYWRfcGFycXVldChzdHIoeHZhbGlkKSwgZW5naW5lPSdweWFycm93JykKICAgIHl2YWxpZCA9IHBkLnJlYWRfcGFycXVldChzdHIoeXZhbGlkKSwgZW5naW5lPSdweWFycm93JykKCiAgICAjIGNyZWF0ZSBjbGFzc2lmaWVyIGNsYXNzIGZyb20gc3RyaW5nIGFuZCBpbnN0YW50aWF0ZQogICAgc3BsaXRzID0gU0tDbGFzc2lmaWVyLnNwbGl0KCIuIikKICAgIGNsZmNsYXNzID0gZ2V0YXR0cihpbXBvcnRsaWIuaW1wb3J0X21vZHVsZSgiLiIuam9pbihzcGxpdHNbOi0xXSkpLCBzcGxpdHNbLTFdKQogICAgbW9kZWwgPSBjbGZjbGFzcyhyYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlLCB2ZXJib3NlPWludCh2ZXJib3NlID09IFRydWUpKQoKICAgIG1vZGVsLmZpdCh4dHJhaW4sIAogICAgICAgICAgICAgIHl0cmFpbiwKICAgICAgICAgICAgICBldmFsX3NldD1bKHh2YWxpZCwgeXZhbGlkKSwgKHh0cmFpbiwgeXRyYWluKV0sCiAgICAgICAgICAgICAgZXZhbF9uYW1lcz1bJ3ZhbGlkJywgJ3RyYWluJ10sCiAgICAgICAgICAgICAgY2FsbGJhY2tzPWNhbGxiYWNrcywKICAgICAgICAgICAgICB2ZXJib3NlPXZlcmJvc2UpCiAgICAgCiAgICBjb250ZXh0LmxvZ19yZXN1bHQoInRyYWluX2FjY3VyYWN5IiwgZmxvYXQobW9kZWwuc2NvcmUoeHRyYWluLCB5dHJhaW4pKSkKICAgIAogICAgIyBwbG90IHRyYWluIGFuZCB2YWxpZGF0aW9uIGhpc3RvcnksIHNhdmUgYW5kIGxvZwogICAgbG9zcyA9IG5wLmFzYXJyYXkobW9kZWwuZXZhbHNfcmVzdWx0X1sndHJhaW4nXVsnYmluYXJ5X2xvZ2xvc3MnXSwgZHR5cGU9bnAuZmxvYXQpCiAgICB2YWxfbG9zcyA9IG5wLmFzYXJyYXkobW9kZWwuZXZhbHNfcmVzdWx0X1sndmFsaWQnXVsnYmluYXJ5X2xvZ2xvc3MnXSwgZHR5cG
U9bnAuZmxvYXQpCiAgICBwbG90X3ZhbGlkYXRpb24oY29udGV4dCwgbG9zcywgdmFsX2xvc3MsIHRhcmdldF9wYXRoKQogICAgCiAgICAjIHNhdmUgbW9kZWwKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZHVtcChtb2RlbCwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWZpbGVwYXRoKQogICAgICAgIApkZWYgcGxvdF92YWxpZGF0aW9uKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICB0cmFpbl9tZXRyaWMsCiAgICB2YWxpZF9tZXRyaWMsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAiaGlzdG9yeS5wbmciLAogICAga2V5OiBzdHIgPSAndHJhaW5pbmctdmFsaWRhdGlvbi1wbG90JwopOgogICAgIiIiUGxvdCB0cmFpbiBhbmQgdmFsaWRhdGlvbiBsb3NzIGN1cnZlcwoKICAgIFRoZXNlIGN1cnZlcyByZXByZXNlbnQgdGhlIHRyYWluaW5nIHJvdW5kIGxvc3NlcyBmcm9tIHRoZSB0cmFpbmluZwogICAgYW5kIHZhbGlkYXRpb24gc2V0cy4KICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICAgICAgdGhlIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSB0cmFpbl9tZXRyaWM6ICAgIHRyYWluIG1ldHJpYwogICAgOnBhcmFtIHZhbGlkX21ldHJpYzogICAgdmFsaWRhdGlvbiBtZXRyaWMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGRlc3RpbmF0aW4gcGF0aCBmb3IgdHJhaW4vdm9saWRhdGlvbiBoaXN0b3J5IHBsb3QgYXJ0aWZhY3QKICAgICIiIgogICAgIyBnZW5lcmF0ZSBwbG90CiAgICBwbHQucGxvdCh0cmFpbl9tZXRyaWMpCiAgICBwbHQucGxvdCh2YWxpZF9tZXRyaWMpCiAgICBwbHQudGl0bGUoInRyYWluaW5nIHZhbGlkYXRpb24gcmVzdWx0cyIpCiAgICBwbHQueGxhYmVsKCJlcG9jaCIpCiAgICBwbHQueWxhYmVsKCIiKQogICAgcGx0LmxlZ2VuZChbInRyYWluIiwgInZhbGlkIl0pCiAgICBmaWcgPSBwbHQuZ2NmKCkKCiAgICAjIHNhdmUgZmlndXJlIGFuZCBsb2cgYXJ0aWZhY3QKICAgIHBsb3RwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgcGx0LnNhdmVmaWcocGxvdHBhdGgpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3Qoa2V5LCBib2R5PWZpZykpCgogICAgIyBwbG90IGNsZWFudXAKICAgIHBsdC5jbGEoKQogICAgcGx0LmNsZigpCiAgICBwbHQuY2xvc2UoKSAgICAgICAgCg== base_image: yjbds/mlrun-ds:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#25e611e4bd05320d342708ce786522bfecaa0e51:/User/repos/functions/train/sklearn-classifier.py + code_origin: 
https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/train/sklearn-classifier.py From 9f080fed89f32412dcebe59bb1d248921aa8aad2 Mon Sep 17 00:00:00 2001 From: yasha Date: Sun, 26 Jan 2020 19:54:38 +0000 Subject: [PATCH 21/32] gitignore issue --- .gitignore | 1 + .gitignore.swp | Bin 12288 -> 0 bytes 2 files changed, 1 insertion(+) delete mode 100644 .gitignore.swp diff --git a/.gitignore b/.gitignore index db580678d..f09384efb 100644 --- a/.gitignore +++ b/.gitignore @@ -2,3 +2,4 @@ models/ .ipynb_checkpoints *.gz *.csv +*.swp diff --git a/.gitignore.swp b/.gitignore.swp deleted file mode 100644 index ea205f57df386d1b6f3827c0f793b0dc8a8814e9..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 12288 zcmeI%y-LGS6u|LQp^G4jzCcx63Q1apB(s}?4niH3=A-de%?!oyY43s1QJ4GXhmP zvHYFU+3B!;)sK%3dk6cY&Ge=V0tg_000IagfB*srv_hcFI`V2J-D;-oHo5aFHzf%H z1Q0*~0R#|0009ILKmY**S|~7%MBaBr7A98z|9`&!zfApUVrk;TM7`g_`!O;C2q1s} x0tg_000IagfB*vjRiJN&#dG}->D$R%Y?aN@;4Zw6!-q_(v@GOFNAY9f$Tw{nIbr|+ From f6fd45ca67de8b7bd69630254b68054cf7b73a79 Mon Sep 17 00:00:00 2001 From: yasha Date: Mon, 27 Jan 2020 08:27:52 +0000 Subject: [PATCH 22/32] fix image source arc-to-parquet --- datagen/features/features-engineer.py | 110 ++++++++++ datagen/features/features-engineer.yaml | 18 ++ fileutils/arc_to_parquet/arc_to_parquet.yaml | 6 +- tests/arc_to_parquet.ipynb | 68 +++---- tests/features_engineer.ipynb | 203 +++++++++++++++++++ train/sklearn-classifier.py | 38 ++++ 6 files changed, 406 insertions(+), 37 deletions(-) create mode 100644 datagen/features/features-engineer.py create mode 100644 datagen/features/features-engineer.yaml create mode 100644 tests/features_engineer.ipynb diff --git a/datagen/features/features-engineer.py b/datagen/features/features-engineer.py new file mode 100644 index 000000000..c234b96a7 --- /dev/null +++ b/datagen/features/features-engineer.py @@ -0,0 +1,110 @@ +# Copyright 2019 Iguazio +# +# 
Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +from cloudpickle import dump, load + +import numpy as np +from sklearn.base import BaseEstimator, TransformerMixin + +from typing import Optional, Union + +from mlrun.execution import MLClientCtx +from mlrun.datastore import DataItem +from mlrun.artifacts import TableArtifact, PlotArtifact + +import warnings +warnings.simplefilter(action="ignore", category=FutureWarning) + +class FeaturesEngineer(BaseEstimator, TransformerMixin): + """Engineer features from raw input. + A standard transformer mixin that can be inserted into a scikit-learn Pipeline. + + To use, + >>> ffg = FeaturesEngineer() + >>> ffg.fit(X) + >>> x_transformed = ffg.transform(X) + or + >>> ffg = FeaturesEngineer() + >>> x_transformed = ffg.fit_transform(X) + + In a pipeline: + >>> from sklearn.pipeline import Pipeline + >>> from sklearn.preprocessing import StandardScaler + >>> transformers = [('feature_gen', FeaturesEngineer()), + ('scaler', StandardScaler())] + >>> transformer_pipe = Pipeline(transformers) + """ + def fit(self, X, y=None): + """fit is unused here, but ANY model can be inserted here + """ + return self + + def transform(self, X, y=None): + """Transform raw input data as a preprocessing step, (if fit + estimates a model, then run transform only after calling fit). + + :param X: Raw input features + + Returns a DataFrame of features. 
+ """ + x = X.copy() + + # do some cool feature engineering: here we replace the last column by an N(2,2) series + m = 2.0 + s = 2.0 + n, f = x.shape + + if isinstance(x, np.ndarray): + x[:, f - 1] = np.random.normal(m, s, n) + else: + x.iloc[:, f - 1] = np.random.normal(m, s, n) + + x = x.astype("float") + + return x + +def features_engineer( + context: MLClientCtx, + X: Union[DataItem, str], + target_path: str = '', + model_key: str = 'features-model', + features_key: str = 'features' +): + """Generate features from an input array + + The features model will be saved for reuse in an inference pipeline or when + testing the model. In addition, the transformed features array is made available + through the artifact store. + + :param context: the function context + :param X: input array + :param target_path: destination folder for artifacts + :param model_key: estimated models are saved under this key in the + artifact store + :param features_key: transformed features matrix + """ + feng = FeaturesEngineer() + + X = pd.read_parquet(str(X), engine='pyarrow') + + feng.fit(X) + Xt = feng.transform(X) + + filepath = os.path.join(target_path, model_key+'.pkl') + dump(feng, open(filepath, 'wb')) + context.log_artifact(model_key, target_path=filepath) + + filepath = os.path.join(target_path, features_key+'.pkl') + dump(Xt, open(filepath, 'wb')) + context.log_artifact(features_key, target_path=filepath) \ No newline at end of file diff --git a/datagen/features/features-engineer.yaml b/datagen/features/features-engineer.yaml new file mode 100644 index 000000000..06dff7eef --- /dev/null +++ b/datagen/features/features-engineer.yaml @@ -0,0 +1,18 @@ +kind: job +metadata: + name: features-engineer + tag: '' + hash: be29081b9a995b0b6e6bcd3f5d2bc53f8168670a + project: '' +spec: + command: '' + args: [] + volumes: [] + volume_mounts: [] + env: [] + description: '' + build: + functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOSBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKaW1wb3J0IG51bXB5IGFzIG5wCmZyb20gc2tsZWFybi5iYXNlIGltcG9ydCBCYXNlRXN0aW1hdG9yLCBUcmFuc2Zvcm1lck1peGluCgpmcm9tIHR5cGluZyBpbXBvcnQgT3B0aW9uYWwsIFVuaW9uCgpyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KZnJvbSBtbHJ1bi5hcnRpZmFjdHMgaW1wb3J0IFRhYmxlQXJ0aWZhY3QsIFBsb3RBcnRpZmFjdAoKaW1wb3J0IHdhcm5pbmdzCndhcm5pbmdzLnNpbXBsZWZpbHRlcihhY3Rpb249Imlnbm9yZSIsIGNhdGVnb3J5PUZ1dHVyZVdhcm5pbmcpCgpjbGFzcyBGZWF0dXJlc0VuZ2luZWVyKEJhc2VFc3RpbWF0b3IsIFRyYW5zZm9ybWVyTWl4aW4pOgogICAgIiIiRW5naW5lZXIgZmVhdHVyZXMgZnJvbSByYXcgaW5wdXQuCiAgICBBIHN0YW5kYXJkIHRyYW5zZm9ybWVyIG1peGluIHRoYXQgY2FuIGJlIGluc2VydGVkIGludG8gYSBzY2lraXQgbGVhcm4gUGlwZWxpbmUuCiAgICAKICAgIFRvIHVzZSwgCiAgICA+Pj4gZmZnID0gRmVhdHVyZXNFbmdpbmVlcigpCiAgICA+Pj4gZmZnLmZpdChYKQogICAgPj4+IHhfdHJhbnNmb3JtZWQgPSBmZmcudHJhbnNmb3JtKFgpCiAgICBvcgogICAgPj4+IGZmZyA9IEZlYXR1cmVzRW5naW5lZXIoKQogICAgPj4+IHhfdHJhbnNmb3JtZWQgPSBmZmcuZml0X3RyYW5zZm9ybShYKQogICAgCiAgICBJbiBhIHBpcGVsaW5lOgogICAgPj4+IGZyb20gc2tsZWFybi5waXBlbGluZSBpbXBvcnQgUGlwZWxpbmUKICAgID4+PiBmcm9tIHNrbGVhcm4ucHJlcHJvY2Vzc2luZyBpbXBvcnQgU3RhbmRhcmRTY2FsZXIKICAgID4+PiB0cmFuc2Zvcm1lcnMgPSBbKCdmZWF0dXJlX2dlbicsIEZlYXR1cmVzRW5naW5l
ZXJGZWF0dXJlKCkpLCAKICAgICAgICAgICAgICAgICAgICAgICAgKCdzY2FsZXInLCBTdGFuZGFyZFNjYWxlcigpKV0KICAgID4+PiB0cmFuc2Zvcm1lcl9waXBlID0gUGlwZWxpbmUodHJhbnNmb3JtZXJzKQogICAgIiIiCiAgICBkZWYgZml0KHNlbGYsIFgsIHk9Tm9uZSk6CiAgICAgICAgIiIiZml0IGlzIHVudXNlZCBoZXJlLCBidXQgQU5ZIG1vZGVsIGNhbiBiZSBpbnNlcnRlZCBoZXJlCiAgICAgICAgIiIiCiAgICAgICAgcmV0dXJuIHNlbGYKCiAgICBkZWYgdHJhbnNmb3JtKHNlbGYsIFgsIHk9Tm9uZSk6CiAgICAgICAgIiIiVHJhbnNmb3JtIHJhdyBpbnB1dCBkYXRhIGFzIGEgcHJlcHJvY2Vzc2luZyBzdGVwLCAoaWYgZml0CiAgICAgICAgZXN0aW1hdGVzIGEgbW9kZWwsIHRoZW4gcnVuIHRyYW5zZm9ybSBvbmx5IGFmdGVyIGNhbGxpbmcgZml0KS4KICAgICAgICAKICAgICAgICA6cGFyYW0gWDogUmF3IGlucHV0IGZlYXR1cmVzCiAgICAgICAgCiAgICAgICAgUmV0dXJucyBhIERhdGFGcmFtZSBvZiBmZWF0dXJlcy4KICAgICAgICAiIiIKICAgICAgICB4ID0gWC5jb3B5KCkKCiAgICAgICAgIyBkbyBzb21lIGNvb2wgZmVhdHVyZSBlbmdpbmVlcmluZzpoZXJlIHdlIHJlcGxhY2UgYnkgYSBOKDIsMikgc2VyaWVzCiAgICAgICAgbSA9IDIuMAogICAgICAgIHMgPSAyLjAKICAgICAgICBuLCBmID0geC5zaGFwZQoKICAgICAgICBpZiB0eXBlKHgpID09IG5wLm5kYXJyYXk6CiAgICAgICAgICAgIHhbOiwgZiAtIDFdID0gbnAucmFuZG9tLm5vcm1hbChtLCBzLCBuKQogICAgICAgIGVsc2U6CiAgICAgICAgICAgIHgudmFsdWVzWzosIGYgLSAxXSA9IG5wLnJhbmRvbS5ub3JtYWwobSwgcywgbikKCiAgICAgICAgeCA9IHguYXN0eXBlKCJmbG9hdCIpCgogICAgICAgIHJldHVybiB4CiAgICAKZGVmIGZlYXR1cmVzX2VuZ2luZWVyKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICBYOiBVbmlvbltEYXRhSXRlbSwgc3RyXSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG1vZGVsX2tleTogc3RyID0gJ2ZlYXR1cmVzLW1vZGVsJwogICAgZmVhdHVyZXNfa2V5OiBzdHIgPSAnZmVhdHVyZXMnCik6CiAgICAiIiJHZW5lcmF0ZSBmZWF0dXJlcyBmcm9tIGFuIGlucHV0IGFycmF5CiAgICAKICAgIFRoZSBmZWF0dXJlcyBtb2RlbCB3aWxsIGJlIHNhdmVkIGZvciByZXVzZSBpbiBhbiBpbmZlcmVuY2UgcGlwZWxpbmUgb3Igd2hlbgogICAgdGVzdGluZyB0aGUgbW9kZWwuIEluIGFkZGl0aW9uLCB0aGUgdHJhbnNmb3JtZWQgZmVhdHVyZXMgYXJyYXkgaXMgbWFkZSBhdmFpbGFibGUKICAgIHRocm91Z2ggdGhlIGFydGlmYWN0IHN0b3JlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIFg6ICAgICAgICAgICAgICAgaW5wdXQgYXJyYXkKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGRlc3RpbmF0aW9uIGZvbGRlciBmb3IgYXJ0aWZhY3RzCiAgICA6cGFy
YW0gbW9kZWxfa2V5ICAgICAgICBlc3RpbWF0ZWQgbW9kZWxzIGFyZSBzYXZlZCB1bmRlciB0aGlzIGtleSBpbiB0aGUgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBhcnRpZmFjdCBzdG9yZQogICAgOnBhcmFtIGZlYXR1cmVzX2tleTogICAgdHJhbnNmb3JtZWQgZmVhdHVyZXMgbWF0cml4CiAgICAiIiIKICAgIGZlbmcgPSBGZWF0dXJlc0VuZ2luZWVyKCkKICAgIAogICAgWCA9IHBkLnJlYWRfcGFycXVldChzdHIoWCksIGVuZ2luZT0ncHlhcnJvdycpCiAgICAKICAgIGZlbmcuZml0KFgpCiAgICBYdCA9IGZlbmcudHJhbnNmb3JtKFgpCiAgICAKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBtb2RlbF9rZXkrJy5wa2wnKQogICAgZHVtcChmZW5nLCBvcGVuKGZpbGVwYXRoLCAnd2InKSkKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KG1vZGVsX2tleSwgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpCiAgICAKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBmZWF0dXJlc19rZXkrJy5wa2wnKQogICAgZHVtcChmZW5nLCBvcGVuKGZpbGVwYXRoLCAnd2InKSkKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KGZlYXR1cmVzX2tleSwgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgp + base_image: yjbds/mlrun-ds:latest + commands: [] + code_origin: https://github.com/yjb-ds/functions.git#9f080fed89f32412dcebe59bb1d248921aa8aad2:/User/repos/functions/datagen/features/features-engineer.py diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index a88ddf2cd..2dd5dc24f 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: bc46a566c14672288af0ee28d3a7e2d031d0d37d + hash: c0e13d89c6e78a46c42fea8b96e7a4604bc51ba1 project: '' spec: command: '' @@ -13,6 +13,6 @@ spec: description: '' build: functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFz
IGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBxdCIpOgogICAgICAgIG5hbWUgKz0gIi5wcXQiCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCBjaHVua3NpemU9Y2h1bmtzaXplLCBuYW1lcz1oZWFkZXIpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBwcXdyaXRlci53cml0ZV90YWJsZSh0YWJsZSkKCiAgICAgICAgaWYgcHF3cml0ZXI6CiAgICAgICAgICAgIHBxd3JpdGVyLmNsb3NlKCkKCiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbyhmInNhdmVkIHRhYmxlIHRvIHtkZXN0X3BhdGh9IikKICAgIGVsc2U6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBhbHJlYWR5IGV4aXN0cyIpCgogICAgY29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1kZXN0X3BhdGgpCiAgICAjIGxvZyBoZWFkZXIKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0
LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpCg== - base_image: yjbds/mlrun-ds:latest + base_image: yjbds/mlrun-files:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py + code_origin: https://github.com/yjb-ds/functions.git#9f080fed89f32412dcebe59bb1d248921aa8aad2:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index 136174c78..5a47ded22 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -9,7 +9,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -27,7 +27,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -46,7 +46,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -62,14 +62,14 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 18:24:11,226 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + "[mlrun] 2020-01-27 08:12:26,749 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" ] } ], @@ -78,14 +78,14 @@ "arctoparq = mlrun.code_to_function(\n", " filename=os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.py'), \n", " kind='job')\n", - "arctoparq.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", + "arctoparq.build_config(base_image='yjbds/mlrun-files:latest', commands=[])\n", "yaml_name = os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", "arctoparq.export(yaml_name)" ] }, { "cell_type": "code", - 
"execution_count": 12, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -110,7 +110,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -119,7 +119,7 @@ "'ready'" ] }, - "execution_count": 13, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } @@ -130,21 +130,21 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-26 18:24:16,244 starting run arc2parq uid=ddb58fa1dfb644b7875bc4d92033a1e4 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-26 18:24:16,326 Job is running in the background, pod: arc2parq-42tm7\n", - "[mlrun] 2020-01-26 18:24:21,234 destination file does not exist, downloading\n", - "[mlrun] 2020-01-26 18:29:29,278 saved table to /User/mlrun/models/higgs.pqt\n", - "[mlrun] 2020-01-26 18:29:29,294 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-26 18:29:29,307 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-27 08:13:34,366 starting run arc2parq uid=ca75db580ec146038a8a932e85b64ac1 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-27 08:13:34,456 Job is running in the background, pod: arc2parq-2rtrg\n", + "[mlrun] 2020-01-27 08:13:42,564 destination file does not exist, downloading\n", + "[mlrun] 2020-01-27 08:18:45,530 saved table to /User/mlrun/models/higgs.pqt\n", + "[mlrun] 2020-01-27 08:18:45,545 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-27 08:18:45,558 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-26 18:29:29,332 run executed, status=completed\n", + "[mlrun] 2020-01-27 08:18:45,581 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -317,12 +317,12 @@ " \n", " \n", " \n", - "
...33a1e4
\n", + "
...b64ac1
\n", " 0\n", - " Jan 26 18:24:21\n", + " Jan 27 08:13:42\n", " completed\n", " arc-to-parquet\n", - "
host=arc2parq-42tm7
kind=job
owner=admin
\n", + "
host=arc2parq-2rtrg
kind=job
owner=admin
\n", " \n", "
archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
header=['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi', 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt', 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
key=higgs
name=higgs.pqt
target_path=/User/mlrun/models
\n", " \n", @@ -331,12 +331,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -352,8 +352,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run ddb58fa1dfb644b7875bc4d92033a1e4 , !mlrun logs ddb58fa1dfb644b7875bc4d92033a1e4 \n", - "[mlrun] 2020-01-26 18:29:36,692 run executed, status=completed\n" + "!mlrun get run ca75db580ec146038a8a932e85b64ac1 , !mlrun logs ca75db580ec146038a8a932e85b64ac1 \n", + "[mlrun] 2020-01-27 08:18:54,929 run executed, status=completed\n" ] } ], @@ -390,7 +390,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -401,7 +401,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -411,7 +411,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -421,7 +421,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -430,7 +430,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -635,7 +635,7 @@ "[5 rows x 29 columns]" ] }, - "execution_count": 19, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -646,7 +646,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -655,7 +655,7 @@ "(11000000, 29)" ] }, - "execution_count": 20, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } diff --git a/tests/features_engineer.ipynb b/tests/features_engineer.ipynb new file mode 100644 index 000000000..9fa519d53 --- /dev/null +++ b/tests/features_engineer.ipynb @@ -0,0 +1,203 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "import os\n", + "import numpy as np\n", + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" + ] + }, + 
{ + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "CODE_BASE = '/User/repos/functions/' \n", + "TARGET_PATH = '/User/mlrun/models'\n", + "\n", + "SRC_FILE = 'higgs.pqt'\n", + "RNG = 1\n", + "\n", + "MODEL_KEY = 'model'\n", + "FEATURES_KEY = 'lgb-classifier.pkl'\n", + "VERBOSE = False" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 21:04:35,571 function spec saved to path: /User/repos/functions/datagen/features/features-engineer.yaml\n" + ] + } + ], + "source": [ + "testfn = mlrun.code_to_function(\n", + " kind='job', \n", + " filename=os.path.join(CODE_BASE, 'datagen/features', 'features-engineer.py'))\n", + "testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", + "testfn.export(os.path.join(CODE_BASE, 'datagen/features', 'features-engineer.yaml'))" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "engineer = mlrun.import_function(\n", + " os.path.join(CODE_BASE, 'datagen/features', 'features-engineer.yaml')\n", + ").apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ready'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "engineer.deploy(skip_deployed=True, with_mlrun=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-26 21:05:17,440 starting run features_engineer uid=2f1baafc36b44bbea796fe5276c0e27d -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-26 21:05:17,525 Job is running in the background, pod: features-engineer-wpkc5\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + 
"ERROR:root:Internal Python error in the inspect module.\n", + "Below is the traceback from this internal error.\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py\", line 3319, in run_code\n", + " exec(code_obj, self.user_global_ns, self.user_ns)\n", + " File \"\", line 6, in \n", + " engtsk = engineer.run(task, handler='features_engineer')\n", + " File \"/User/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\", line 262, in run\n", + " runspec.logs(True, self._get_db())\n", + " File \"/User/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/model.py\", line 352, in logs\n", + " watch=watch)\n", + " File \"/User/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/db/httpdb.py\", line 115, in watch_log\n", + " time.sleep(10)\n", + "KeyboardInterrupt\n", + "\n", + "During handling of the above exception, another exception occurred:\n", + "\n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py\", line 2034, in showtraceback\n", + " stb = value._render_traceback_()\n", + "AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'\n", + "\n", + "During handling of the above exception, another exception occurred:\n", + "\n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/IPython/core/ultratb.py\", line 1151, in get_records\n", + " return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)\n", + " File \"/conda/lib/python3.6/site-packages/IPython/core/ultratb.py\", line 319, in wrapped\n", + " return f(*args, **kwargs)\n", + " File \"/conda/lib/python3.6/site-packages/IPython/core/ultratb.py\", line 353, in _fixed_getinnerframes\n", + " records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))\n", + " File 
\"/conda/lib/python3.6/inspect.py\", line 1490, in getinnerframes\n", + " frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)\n", + " File \"/conda/lib/python3.6/inspect.py\", line 1448, in getframeinfo\n", + " filename = getsourcefile(frame) or getfile(frame)\n", + " File \"/conda/lib/python3.6/inspect.py\", line 696, in getsourcefile\n", + " if getattr(getmodule(object, filename), '__loader__', None) is not None:\n", + " File \"/conda/lib/python3.6/inspect.py\", line 742, in getmodule\n", + " os.path.realpath(f)] = module.__name__\n", + " File \"/conda/lib/python3.6/posixpath.py\", line 395, in realpath\n", + " path, ok = _joinrealpath(filename[:0], filename, {})\n", + " File \"/conda/lib/python3.6/posixpath.py\", line 429, in _joinrealpath\n", + " if not islink(newpath):\n", + " File \"/conda/lib/python3.6/posixpath.py\", line 171, in islink\n", + " st = os.lstat(path)\n", + "KeyboardInterrupt\n" + ] + }, + { + "ename": "KeyboardInterrupt", + "evalue": "", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m" + ] + } + ], + "source": [ + "task = mlrun.NewTask()\n", + "task.with_params(\n", + " X='higgs.pqt',\n", + " target_path=TARGET_PATH)\n", + "\n", + "engtsk = engineer.run(task, handler='features_engineer')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/train/sklearn-classifier.py b/train/sklearn-classifier.py index 35f9bb297..eae4def1d 100644 --- 
a/train/sklearn-classifier.py +++ b/train/sklearn-classifier.py @@ -124,3 +124,41 @@ def plot_validation( plt.cla() plt.clf() plt.close() + + + +def keras_classifier_generator( + metrics: list = [], + input_size: int = 20, + dropout: float = 0.5, + output_bias: float = None, + learning_rate: float = 1e-3 +): + """Generate a super simple classifier + + :param metrics: select metrics to be evaluated + :param output_bias: layer initializer + :param input_size: number of features, size of input + :param dropout: dropout rate + :param learning_rate: learning rate passed to the Adam optimizer + + returns a compiled keras model used as input to the KerasClassifier wrapper + """ + if output_bias is not None: + output_bias = Constant(output_bias) + + model = Sequential( + [ + Dense(16, activation="relu", input_shape=(input_size,)), + Dropout(dropout), + Dense(1, activation="sigmoid", bias_initializer=output_bias), + ] + ) + + model.compile( + optimizer=Adam(lr=learning_rate), + loss=BinaryCrossentropy(), + metrics=metrics + ) + + return model \ No newline at end of file From eb009dac39c64611adac33d24c7e33ba0856c941 Mon Sep 17 00:00:00 2001 From: yasha Date: Mon, 27 Jan 2020 08:29:41 +0000 Subject: [PATCH 23/32] fix image source arc-to-parquet --- datagen/features/features-engineer.py | 110 ------------------------ datagen/features/features-engineer.yaml | 18 ---- 2 files changed, 128 deletions(-) delete mode 100644 datagen/features/features-engineer.py delete mode 100644 datagen/features/features-engineer.yaml diff --git a/datagen/features/features-engineer.py b/datagen/features/features-engineer.py deleted file mode 100644 index c234b96a7..000000000 --- a/datagen/features/features-engineer.py +++ /dev/null @@ -1,110 +0,0 @@ -# Copyright 2019 Iguazio -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import os -from cloudpickle import dump, load - -import numpy as np -from sklearn.base import BaseEstimator, TransformerMixin - -from typing import Optional, Union - -rom mlrun.execution import MLClientCtx -from mlrun.datastore import DataItem -from mlrun.artifacts import TableArtifact, PlotArtifact - -import warnings -warnings.simplefilter(action="ignore", category=FutureWarning) - -class FeaturesEngineer(BaseEstimator, TransformerMixin): - """Engineer features from raw input. - A standard transformer mixin that can be inserted into a scikit learn Pipeline. - - To use, - >>> ffg = FeaturesEngineer() - >>> ffg.fit(X) - >>> x_transformed = ffg.transform(X) - or - >>> ffg = FeaturesEngineer() - >>> x_transformed = ffg.fit_transform(X) - - In a pipeline: - >>> from sklearn.pipeline import Pipeline - >>> from sklearn.preprocessing import StandardScaler - >>> transformers = [('feature_gen', FeaturesEngineerFeature()), - ('scaler', StandardScaler())] - >>> transformer_pipe = Pipeline(transformers) - """ - def fit(self, X, y=None): - """fit is unused here, but ANY model can be inserted here - """ - return self - - def transform(self, X, y=None): - """Transform raw input data as a preprocessing step, (if fit - estimates a model, then run transform only after calling fit). - - :param X: Raw input features - - Returns a DataFrame of features. 
- """ - x = X.copy() - - # do some cool feature engineering:here we replace by a N(2,2) series - m = 2.0 - s = 2.0 - n, f = x.shape - - if type(x) == np.ndarray: - x[:, f - 1] = np.random.normal(m, s, n) - else: - x.values[:, f - 1] = np.random.normal(m, s, n) - - x = x.astype("float") - - return x - -def features_engineer( - context: MLClientCtx, - X: Union[DataItem, str], - target_path: str = '', - model_key: str = 'features-model' - features_key: str = 'features' -): - """Generate features from an input array - - The features model will be saved for reuse in an inference pipeline or when - testing the model. In addition, the transformed features array is made available - through the artifact store. - - :param context: the function context - :param X: input array - :param target_path: destination folder for artifacts - :param model_key estimated models are saved under this key in the - artifact store - :param features_key: transformed features matrix - """ - feng = FeaturesEngineer() - - X = pd.read_parquet(str(X), engine='pyarrow') - - feng.fit(X) - Xt = feng.transform(X) - - filepath = os.path.join(target_path, model_key+'.pkl') - dump(feng, open(filepath, 'wb')) - context.log_artifact(model_key, target_path=filepath) - - filepath = os.path.join(target_path, features_key+'.pkl') - dump(feng, open(filepath, 'wb')) - context.log_artifact(features_key, target_path=filepath) \ No newline at end of file diff --git a/datagen/features/features-engineer.yaml b/datagen/features/features-engineer.yaml deleted file mode 100644 index 06dff7eef..000000000 --- a/datagen/features/features-engineer.yaml +++ /dev/null @@ -1,18 +0,0 @@ -kind: job -metadata: - name: features-engineer - tag: '' - hash: be29081b9a995b0b6e6bcd3f5d2bc53f8168670a - project: '' -spec: - command: '' - args: [] - volumes: [] - volume_mounts: [] - env: [] - description: '' - build: - functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOSBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKaW1wb3J0IG51bXB5IGFzIG5wCmZyb20gc2tsZWFybi5iYXNlIGltcG9ydCBCYXNlRXN0aW1hdG9yLCBUcmFuc2Zvcm1lck1peGluCgpmcm9tIHR5cGluZyBpbXBvcnQgT3B0aW9uYWwsIFVuaW9uCgpyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KZnJvbSBtbHJ1bi5hcnRpZmFjdHMgaW1wb3J0IFRhYmxlQXJ0aWZhY3QsIFBsb3RBcnRpZmFjdAoKaW1wb3J0IHdhcm5pbmdzCndhcm5pbmdzLnNpbXBsZWZpbHRlcihhY3Rpb249Imlnbm9yZSIsIGNhdGVnb3J5PUZ1dHVyZVdhcm5pbmcpCgpjbGFzcyBGZWF0dXJlc0VuZ2luZWVyKEJhc2VFc3RpbWF0b3IsIFRyYW5zZm9ybWVyTWl4aW4pOgogICAgIiIiRW5naW5lZXIgZmVhdHVyZXMgZnJvbSByYXcgaW5wdXQuCiAgICBBIHN0YW5kYXJkIHRyYW5zZm9ybWVyIG1peGluIHRoYXQgY2FuIGJlIGluc2VydGVkIGludG8gYSBzY2lraXQgbGVhcm4gUGlwZWxpbmUuCiAgICAKICAgIFRvIHVzZSwgCiAgICA+Pj4gZmZnID0gRmVhdHVyZXNFbmdpbmVlcigpCiAgICA+Pj4gZmZnLmZpdChYKQogICAgPj4+IHhfdHJhbnNmb3JtZWQgPSBmZmcudHJhbnNmb3JtKFgpCiAgICBvcgogICAgPj4+IGZmZyA9IEZlYXR1cmVzRW5naW5lZXIoKQogICAgPj4+IHhfdHJhbnNmb3JtZWQgPSBmZmcuZml0X3RyYW5zZm9ybShYKQogICAgCiAgICBJbiBhIHBpcGVsaW5lOgogICAgPj4+IGZyb20gc2tsZWFybi5waXBlbGluZSBpbXBvcnQgUGlwZWxpbmUKICAgID4+PiBmcm9tIHNrbGVhcm4ucHJlcHJvY2Vzc2luZyBpbXBvcnQgU3RhbmRhcmRTY2FsZXIKICAgID4+PiB0cmFuc2Zvcm1lcnMgPSBbKCdmZWF0dXJlX2dlbicsIEZlYXR1cmVzRW5naW5l
ZXJGZWF0dXJlKCkpLCAKICAgICAgICAgICAgICAgICAgICAgICAgKCdzY2FsZXInLCBTdGFuZGFyZFNjYWxlcigpKV0KICAgID4+PiB0cmFuc2Zvcm1lcl9waXBlID0gUGlwZWxpbmUodHJhbnNmb3JtZXJzKQogICAgIiIiCiAgICBkZWYgZml0KHNlbGYsIFgsIHk9Tm9uZSk6CiAgICAgICAgIiIiZml0IGlzIHVudXNlZCBoZXJlLCBidXQgQU5ZIG1vZGVsIGNhbiBiZSBpbnNlcnRlZCBoZXJlCiAgICAgICAgIiIiCiAgICAgICAgcmV0dXJuIHNlbGYKCiAgICBkZWYgdHJhbnNmb3JtKHNlbGYsIFgsIHk9Tm9uZSk6CiAgICAgICAgIiIiVHJhbnNmb3JtIHJhdyBpbnB1dCBkYXRhIGFzIGEgcHJlcHJvY2Vzc2luZyBzdGVwLCAoaWYgZml0CiAgICAgICAgZXN0aW1hdGVzIGEgbW9kZWwsIHRoZW4gcnVuIHRyYW5zZm9ybSBvbmx5IGFmdGVyIGNhbGxpbmcgZml0KS4KICAgICAgICAKICAgICAgICA6cGFyYW0gWDogUmF3IGlucHV0IGZlYXR1cmVzCiAgICAgICAgCiAgICAgICAgUmV0dXJucyBhIERhdGFGcmFtZSBvZiBmZWF0dXJlcy4KICAgICAgICAiIiIKICAgICAgICB4ID0gWC5jb3B5KCkKCiAgICAgICAgIyBkbyBzb21lIGNvb2wgZmVhdHVyZSBlbmdpbmVlcmluZzpoZXJlIHdlIHJlcGxhY2UgYnkgYSBOKDIsMikgc2VyaWVzCiAgICAgICAgbSA9IDIuMAogICAgICAgIHMgPSAyLjAKICAgICAgICBuLCBmID0geC5zaGFwZQoKICAgICAgICBpZiB0eXBlKHgpID09IG5wLm5kYXJyYXk6CiAgICAgICAgICAgIHhbOiwgZiAtIDFdID0gbnAucmFuZG9tLm5vcm1hbChtLCBzLCBuKQogICAgICAgIGVsc2U6CiAgICAgICAgICAgIHgudmFsdWVzWzosIGYgLSAxXSA9IG5wLnJhbmRvbS5ub3JtYWwobSwgcywgbikKCiAgICAgICAgeCA9IHguYXN0eXBlKCJmbG9hdCIpCgogICAgICAgIHJldHVybiB4CiAgICAKZGVmIGZlYXR1cmVzX2VuZ2luZWVyKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICBYOiBVbmlvbltEYXRhSXRlbSwgc3RyXSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG1vZGVsX2tleTogc3RyID0gJ2ZlYXR1cmVzLW1vZGVsJwogICAgZmVhdHVyZXNfa2V5OiBzdHIgPSAnZmVhdHVyZXMnCik6CiAgICAiIiJHZW5lcmF0ZSBmZWF0dXJlcyBmcm9tIGFuIGlucHV0IGFycmF5CiAgICAKICAgIFRoZSBmZWF0dXJlcyBtb2RlbCB3aWxsIGJlIHNhdmVkIGZvciByZXVzZSBpbiBhbiBpbmZlcmVuY2UgcGlwZWxpbmUgb3Igd2hlbgogICAgdGVzdGluZyB0aGUgbW9kZWwuIEluIGFkZGl0aW9uLCB0aGUgdHJhbnNmb3JtZWQgZmVhdHVyZXMgYXJyYXkgaXMgbWFkZSBhdmFpbGFibGUKICAgIHRocm91Z2ggdGhlIGFydGlmYWN0IHN0b3JlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIFg6ICAgICAgICAgICAgICAgaW5wdXQgYXJyYXkKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGRlc3RpbmF0aW9uIGZvbGRlciBmb3IgYXJ0aWZhY3RzCiAgICA6cGFy
YW0gbW9kZWxfa2V5ICAgICAgICBlc3RpbWF0ZWQgbW9kZWxzIGFyZSBzYXZlZCB1bmRlciB0aGlzIGtleSBpbiB0aGUgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBhcnRpZmFjdCBzdG9yZQogICAgOnBhcmFtIGZlYXR1cmVzX2tleTogICAgdHJhbnNmb3JtZWQgZmVhdHVyZXMgbWF0cml4CiAgICAiIiIKICAgIGZlbmcgPSBGZWF0dXJlc0VuZ2luZWVyKCkKICAgIAogICAgWCA9IHBkLnJlYWRfcGFycXVldChzdHIoWCksIGVuZ2luZT0ncHlhcnJvdycpCiAgICAKICAgIGZlbmcuZml0KFgpCiAgICBYdCA9IGZlbmcudHJhbnNmb3JtKFgpCiAgICAKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBtb2RlbF9rZXkrJy5wa2wnKQogICAgZHVtcChmZW5nLCBvcGVuKGZpbGVwYXRoLCAnd2InKSkKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KG1vZGVsX2tleSwgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpCiAgICAKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBmZWF0dXJlc19rZXkrJy5wa2wnKQogICAgZHVtcChmZW5nLCBvcGVuKGZpbGVwYXRoLCAnd2InKSkKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KGZlYXR1cmVzX2tleSwgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgp - base_image: yjbds/mlrun-ds:latest - commands: [] - code_origin: https://github.com/yjb-ds/functions.git#9f080fed89f32412dcebe59bb1d248921aa8aad2:/User/repos/functions/datagen/features/features-engineer.py From e67bbbba46a2b5767f5ed561da882d90a979e670 Mon Sep 17 00:00:00 2001 From: yasha Date: Mon, 27 Jan 2020 15:35:36 +0000 Subject: [PATCH 24/32] add partitioning to parquet save, arc-to-parq --- fileutils/arc_to_parquet/arc_to_parquet.py | 13 +- fileutils/arc_to_parquet/arc_to_parquet.yaml | 6 +- tests/arc_to_parquet-airlines.ipynb | 553 +++++++++++++++++++ 3 files changed, 567 insertions(+), 5 deletions(-) create mode 100644 tests/arc_to_parquet-airlines.ipynb diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py b/fileutils/arc_to_parquet/arc_to_parquet.py index 1b7e9887b..9bc2ef361 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/arc_to_parquet.py @@ -45,6 +45,8 @@ def arc_to_parquet( log_data: bool = True, add_uid: bool = False, key: str = "raw_data", + dataset: bool = False, + partition_cols = [] ) -> None: """Open a file/object archive and save as a parquet file. 
@@ -57,6 +59,9 @@ def arc_to_parquet( :param name: name file to be saved locally, also :param chunksize: (0) row size retrieved per iteration :param key: key in artifact store (when log_data=True) + :param dataset: (False) if True then target_path is folder for + partitioned files + :param partition_cols: ([]) list of partitioning columns """ if not name.endswith(".pqt"): name += ".pqt" @@ -70,8 +75,11 @@ def arc_to_parquet( table = pa.Table.from_pandas(df) if i == 0: pqwriter = pq.ParquetWriter(dest_path, table.schema) - pqwriter.write_table(table) - + if dataset: + pq.write_to_dataset(table, root_path=target_path, partition_cols=partition_cols) + else: + pqwriter.write_table(table) + if pqwriter: pqwriter.close() @@ -80,6 +88,7 @@ def arc_to_parquet( context.logger.info("destination file already exists") context.log_artifact(key, target_path=dest_path) + # log header filepath = os.path.join(target_path, 'header.pkl') dump(header, open(filepath, 'wb')) diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index 2dd5dc24f..2e3f78c21 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: c0e13d89c6e78a46c42fea8b96e7a4604bc51ba1 + hash: 4f8217a2058dde23e82f681c5dc8d686b844ceb2 project: '' spec: command: '' @@ -12,7 +12,7 @@ spec: env: [] description: '' build: - functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFz
IGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBxdCIpOgogICAgICAgIG5hbWUgKz0gIi5wcXQiCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCBjaHVua3NpemU9Y2h1bmtzaXplLCBuYW1lcz1oZWFkZXIpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBwcXdyaXRlci53cml0ZV90YWJsZSh0YWJsZSkKCiAgICAgICAgaWYgcHF3cml0ZXI6CiAgICAgICAgICAgIHBxd3JpdGVyLmNsb3NlKCkKCiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbyhmInNhdmVkIHRhYmxlIHRvIHtkZXN0X3BhdGh9IikKICAgIGVsc2U6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygiZGVzdGluYXRpb24gZmlsZSBhbHJlYWR5IGV4aXN0cyIpCgogICAgY29udGV4dC5sb2dfYXJ0aWZhY3Qoa2V5LCB0YXJnZXRfcGF0aD1kZXN0X3BhdGgpCiAgICAjIGxvZyBoZWFkZXIKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCAnaGVhZGVyLnBrbCcpCiAgICBkdW1wKGhlYWRlciwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBjb250ZXh0
LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZmlsZXBhdGgpCg== + functionSourceCode: IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyY
XdfZGF0YSIsCiAgICBkYXRhc2V0OiBib29sID0gRmFsc2UsCiAgICBwYXJ0aXRpb25fY29scyA9IFtdCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgOnBhcmFtIGRhdGFzZXQ6ICAgICAoRmFsc2UpIGlmIFRydWUgdGhlbiB0YXJnZXRfcGF0aCBpcyBmb2xkZXIgZm9yCiAgICAgICAgICAgICAgICAgICAgICAgIHBhcnRpdGlvbmVkIGZpbGVzCiAgICA6cGFyYW0gcGFydF9jb2xzOiAgIChbXSkgbGlzdCBvZiBwYXJ0aXRpb25pbmcgY29sdW1ucwogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBxdCIpOgogICAgICAgIG5hbWUgKz0gIi5wcXQiCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCBjaHVua3NpemU9Y2h1bmtzaXplLCBuYW1lcz1oZWFkZXIpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBpZiBkYXRhc2V0OgogICAgICAgICAgICAgICAgcHEud3JpdGVfdG9fZGF0YXNldCh0YWJsZSwgcm9vdF9wYXRoPXRhcmdldF9wYXRoLCBwYXJ0aXRpb25fY29scz1wYXJ0aXRpb
25fY29scykKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRhYmxlKQogICAgICAgICAgICAKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIAogICAgIyBsb2cgaGVhZGVyCiAgICBmaWxlcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgJ2hlYWRlci5wa2wnKQogICAgZHVtcChoZWFkZXIsIG9wZW4oZmlsZXBhdGgsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQo= base_image: yjbds/mlrun-files:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#9f080fed89f32412dcebe59bb1d248921aa8aad2:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py + code_origin: https://github.com/yjb-ds/functions.git#eb009dac39c64611adac33d24c7e33ba0856c941:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py diff --git a/tests/arc_to_parquet-airlines.ipynb b/tests/arc_to_parquet-airlines.ipynb new file mode 100644 index 000000000..dca07524d --- /dev/null +++ b/tests/arc_to_parquet-airlines.ipynb @@ -0,0 +1,553 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# archive to parquet\n", + "\n", + "Convert a remote archive or csv file (or local file://), to parquet format" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "import os\n", + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## parameters\n" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [], + "source": [ + "BASE_IMAGE = 'yjbds/mlrun-files:latest'\n", + "\n", + "CODE_BASE = '/User/repos/functions/'\n", + "TARGET_PATH = '/User/mlrun/airlines/dataset'\n", + "\n", + 
"ARCHIVE_BIG = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears_10.csv\"\n", + "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", + "ARCHIVE_SMALL = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**For testing and development use ARCHIVE_SMALL:**" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [], + "source": [ + "USE_ARCHIVE = ARCHIVE" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [], + "source": [ + "FILE_NAME = 'airlines.pqt'\n", + "KEY = 'airlines'\n", + "\n", + "# no need for this as the files contain a header:\n", + "HEADER = ['Year','Month','DayofMonth','DayOfWeek','DepTime','CRSDepTime','ArrTime','CRSArrTime',\n", + " 'UniqueCarrier','FlightNum','TailNum','ActualElapsedTime','CRSElapsedTime','AirTime',\n", + " 'ArrDelay','DepDelay','Origin','Dest','Distance','TaxiIn','TaxiOut','Cancelled',\n", + " 'CancellationCode','Diverted','CarrierDelay','WeatherDelay','NASDelay','SecurityDelay',\n", + " 'LateAircraftDelay']" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [], + "source": [ + "os.makedirs(TARGET_PATH, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## load and configure function\n", + "\n", + "**If run the first time, create the function:**" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-27 15:27:39,471 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + ] + } + ], + "source": [ + "# load function from a local Python file\n", + "arctoparq = mlrun.code_to_function(\n", + " filename=os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.py'), \n", + " 
kind='job')\n", + "arctoparq.build_config(base_image=BASE_IMAGE, commands=[])\n", + "yaml_name = os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", + "arctoparq.export(yaml_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**otherwise load it:**" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [], + "source": [ + "arctoparq = mlrun.import_function(\n", + " os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", + ").apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## deploy / build" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following triggers a build when run for the first time using specs found in the yaml file above." + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ready'" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "arctoparq.deploy(skip_deployed=True, with_mlrun=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-27 15:27:41,838 starting run arc2parq uid=d3774f73733c4abeb45c9dd7a7a4862c -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-27 15:27:41,973 Job is running in the background, pod: arc2parq-6d86t\n" + ] + } + ], + "source": [ + "# create and run the task\n", + "arc_to_parq_task = mlrun.NewTask(\n", + " 'arc2parq', \n", + " handler='arc_to_parquet', \n", + " params={\n", + " 'target_path': TARGET_PATH,\n", + " 'name' : FILE_NAME, \n", + " 'key' : KEY,\n", + " 'archive_url': USE_ARCHIVE,\n", + " 'dataset' : True,\n", + " 'partition_cols': ['Year', 'Month']\n", + " })\n", + "# run\n", + "run = arctoparq.run(arc_to_parq_task)" + ] + }, + { + "cell_type": "markdown", +
"metadata": {}, + "source": [ + "___" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## tests" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### single parquet file\n", + "\n", + "run this only when `dataset=False`" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [], + "source": [ + "# add more context tests\n", + "# convert these to real tests" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [], + "source": [ + "assert KEY in run.outputs.keys(), f\"mlrun.functions: key {KEY} not found in outputs\"\n", + "assert os.path.isfile(TARGET_PATH+'/'+ FILE_NAME), f\"mlrun.functions: artifact source not found at {TARGET_PATH+'/'+ FILE_NAME}\"" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [], + "source": [ + "copied = pd.read_parquet(TARGET_PATH+'/'+ FILE_NAME, engine=\"pyarrow\")" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
YearMonthDayofMonthDayOfWeekDepTimeCRSDepTimeArrTimeCRSArrTimeUniqueCarrierFlightNum...CancelledCancellationCodeDivertedCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelayIsArrDelayedIsDepDelayed
\n", + "

0 rows × 31 columns

\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: [Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay, IsArrDelayed, IsDepDelayed]\n", + "Index: []\n", + "\n", + "[0 rows x 31 columns]" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "copied.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(0, 31)" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "copied.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "run this only when `dataset=True`" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [], + "source": [ + "import pyarrow.parquet as pq\n", + "import pyarrow as pa" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ + { + "ename": "ValueError", + "evalue": "Schema in /User/mlrun/airlines/dataset/a3f9441653c14708ad69f63694749301.parquet was different. 
\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: double\nActualElapsedTime: double\nCRSElapsedTime: int64\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: double\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'0, \"stop\": 10000, \"step\": 1}], \"column_indexes\": [{\"name\": n'\n b'ull, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_ty'\n b'pe\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns'\n b'\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\": \"i'\n b'nt64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"M'\n b'onth\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"numpy'\n b'_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth\", \"'\n b'field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"numpy_ty'\n b'pe\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"fiel'\n b'd_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_name\"'\n b': \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_name\": '\n b'\"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n b' \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"ArrTi'\n b'me\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": 
\"CRSArrT'\n b'ime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metada'\n b'ta\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"UniqueC'\n b'arrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"'\n b'metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"Fligh'\n b'tNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n b'ata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"ActualE'\n b'lapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_name\": \"A'\n b'rrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\",'\n b' \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\": \"DepD'\n b'elay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"m'\n b'etadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\", '\n b'\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\"'\n b': null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_type'\n b'\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"n'\n b'ame\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\": \"'\n b'float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name'\n b'\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"float64'\n b'\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Tax'\n b'iOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\", \"n'\n b'umpy_type\": \"float64\", 
\"metadata\": null}, {\"name\": \"Cancelle'\n b'd\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"CancellationC'\n b'ode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carrier'\n b'Delay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"float6'\n b'4\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"We'\n b'atherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_type\"'\n b': \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"n'\n b'ame\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDelay\"'\n b', \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metada'\n b'ta\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArrDel'\n b'ayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"IsDe'\n b'pDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",'\n b' \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"vers'\n b'ion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////6AWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKQPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMDAw'\n 
b'LCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51'\n b'bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNv'\n b'ZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVu'\n b'Y29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogIlll'\n b'YXIiLCAiZmllbGRfbmFtZSI6ICJZZWFyIiwgInBhbmRhc190eXBlIjogImlu'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk1vbnRoIiwgImZpZWxkX25hbWUiOiAiTW9udGgiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRGF5b2ZNb250aCIsICJm'\n b'aWVsZF9uYW1lIjogIkRheW9mTW9udGgiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiRGF5T2ZXZWVrIiwgImZpZWxkX25hbWUiOiAiRGF5T2ZX'\n b'ZWVrIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlcFRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJEZXBUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJDUlNEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAi'\n b'Q1JTRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90'\n b'eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'cnJUaW1lIiwgImZpZWxkX25hbWUiOiAiQXJyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJmaWVsZF9u'\n b'YW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVuaXF1ZUNh'\n b'cnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBl'\n b'IjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRmxp'\n b'Z2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIlRhaWxOdW0iLCAiZmllbGRfbmFtZSI6'\n 
b'ICJUYWlsTnVtIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVhbEVs'\n b'YXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRU'\n b'aW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJBcnJEZWxheSIsICJmaWVsZF9uYW1lIjogIkFy'\n b'ckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJE'\n b'ZXBEZWxheSIsICJmaWVsZF9uYW1lIjogIkRlcERlbGF5IiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJPcmlnaW4iLCAiZmllbGRfbmFt'\n b'ZSI6ICJPcmlnaW4iLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRGVzdCIsICJmaWVsZF9uYW1lIjogIkRlc3QiLCAicGFuZGFzX3R5cGUi'\n b'OiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiRGlzdGFuY2UiLCAiZmllbGRfbmFtZSI6'\n b'ICJEaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25h'\n b'bWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwg'\n b'InBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQi'\n 
b'LCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlvbkNv'\n b'ZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJmaWVs'\n b'ZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJpZXJE'\n b'ZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiV2Vh'\n b'dGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIsICJm'\n b'aWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVs'\n b'bH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'U2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJMYXRl'\n b'QWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJEZWxh'\n b'eWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6'\n b'ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklzRGVw'\n b'RGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJwYW5k'\n b'YXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJw'\n b'eWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lv'\n b'biI6ICIwLjI1LjMifQAAAAAfAAAAWAYAABQGAADcBQAApAUAAGwFAAA0BQAA'\n b'BAUAAMwEAACUBAAAXAQAACwEAADwAwAAtAMAAIQDAABQAwAAHAMAAPACAADE'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n 
b'AAA4AAAABAAAADT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMj7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPj7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJT6//8AAAEDGAAAAAwAAAAEAAAAAAAAAF77//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAAzPr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAlvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAD7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAMr7//8AAAIACAAAAE5BU0RlbGF5AAAAADD7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAPr7//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABk'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAu/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAmPv//wAAAQIcAAAADAAAAAQAAAAAAAAAiPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAMz7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJb8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAABPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA9Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADj8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAL9//8AAAIABwAAAFRheGlPdXQAZPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAALv3//wAAAgAGAAAAVGF4aUluAACQ/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAABa/f//AAACAAgAAABEaXN0YW5jZQAAAADA/P//AAABBRQAAAAM'\n b'AAAABAAAAAAAAABU/v//BAAAAERlc3QAAAAA6Pz//wAAAQUUAAAADAAAAAQA'\n b'AAAAAAAAfP7//wYAAABPcmlnaW4AABD9//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'ANr9//8AAAIACAAAAERlcERlbGF5AAAAAED9//8AAAEDGAAAAAwAAAAEAAAA'\n b'AAAAAAr+//8AAAIACAAAAEFyckRlbGF5AAAAAHD9//8AAAEDGAAAAAwAAAAE'\n b'AAAAAAAAADr+//8AAAIABwAAAEFpclRpbWUAnP3//wAAAQIcAAAADAAAAAQA'\n b'AAAAAAAAjP3//wAAAAFAAAAADgAAAENSU0VsYXBzZWRUaW1lAADU/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACe/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAAAz+//8AAAEDGAAAAAwAAAAEAAAAAAAAANb+//8AAAIABwAAAFRhaWxO'\n b'dW0AOP7//wAAAQIcAAAADAAAAAQAAAAAAAAAKP7//wAAAAFAAAAACQAAAEZs'\n b'aWdodE51bQAAAGz+//8AAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAADQAA'\n b'AFVuaXF1ZUNhcnJpZXIAAACg/v//AAABAhwAAAAMAAAABAAAAAAAAACQ/v//'\n b'AAAAAUAAAAAKAAAAQ1JTQXJyVGltZQAA1P7//wAAAQMYAAAADAAAAAQAAAAA'\n b'AAAAnv///wAAAgAHAAAAQXJyVGltZQAA////AAABAhwAAAAMAAAABAAAAAAA'\n b'AADw/v//AAAAAUAAAAAKAAAAQ1JTRGVwVGltZQAANP///wAAAQMgAAAAFAAA'\n b'AAQAAAAAAAAAAAAGAAgABgAGAAAAAAACAAcAAABEZXBUaW1lAGj///8AAAEC'\n 
b'HAAAAAwAAAAEAAAAAAAAAFj///8AAAABQAAAAAkAAABEYXlPZldlZWsAAACc'\n b'////AAABAhwAAAAMAAAABAAAAAAAAACM////AAAAAUAAAAAKAAAARGF5b2ZN'\n b'b250aAAA0P///wAAAQIcAAAADAAAAAQAAAAAAAAAwP///wAAAAFAAAAABQAA'\n b'AE1vbnRoAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAA'\n b'AAAAAAgADAAIAAcACAAAAAAAAAFAAAAABAAAAFllYXIAAAAAAAAAAA==')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: string\nActualElapsedTime: double\nCRSElapsedTime: double\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: int64\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'10000, \"stop\": 20000, \"step\": 1}], \"column_indexes\": [{\"name'\n b'\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"nump'\n b'y_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"col'\n b'umns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\"'\n b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n b': \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth'\n b'\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"'\n b'field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_typ'\n b'e\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_n'\n b'ame\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, 
{\"name\": \"CRSDepTime\", \"field_nam'\n b'e\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int'\n b'64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"A'\n b'rrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", '\n b'\"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRS'\n b'ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"me'\n b'tadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"Uni'\n b'queCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object'\n b'\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"F'\n b'lightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"m'\n b'etadata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"Actu'\n b'alElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"flo'\n b'at64\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_'\n b'name\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_ty'\n b'pe\": \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"fiel'\n b'd_name\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": '\n b'\"float64\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_na'\n b'me\": \"ArrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\"'\n b': \"DepDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float'\n b'64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Or'\n b'igin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pand'\n b'as_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": nul'\n b'l}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_t'\n 
b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n b'ame\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\",'\n b' \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cance'\n b'lled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Cancellati'\n b'onCode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArr'\n b'Delayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"I'\n b'sDepDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"objec'\n b't\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": 
\"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////5gWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKcPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAxMDAwMCwgInN0b3AiOiAy'\n b'MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUi'\n b'OiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1'\n b'bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjog'\n b'eyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6'\n b'ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNfdHlwZSI6'\n b'ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjogIk1vbnRo'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9mTW9udGgi'\n b'LCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1lIjogIkRh'\n b'eU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBl'\n b'IjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEZXBU'\n b'aW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6'\n b'ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVsZF9uYW1l'\n b'IjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUiLCAiZmll'\n b'bGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6ICJVbmlx'\n 
b'dWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIsICJwYW5k'\n b'YXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYWlsTnVtIiwgImZpZWxkX25h'\n b'bWUiOiAiVGFpbE51bSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51'\n b'bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVh'\n b'bEVsYXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBz'\n b'ZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'aXJUaW1lIiwgImZpZWxkX25hbWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQXJyRGVsYXkiLCAiZmllbGRfbmFt'\n b'ZSI6ICJBcnJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiRGVwRGVsYXkiLCAiZmllbGRfbmFtZSI6ICJEZXBEZWxheSIsICJw'\n b'YW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiT3JpZ2luIiwgImZp'\n b'ZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUi'\n b'LCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFtZSI6ICJEZXN0IiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRpc3RhbmNlIiwgImZpZWxk'\n b'X25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n 
b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxk'\n b'X25hbWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVk'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlv'\n b'bkNvZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJm'\n b'aWVsZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJp'\n b'ZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5Iiwg'\n b'InBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIs'\n b'ICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUi'\n b'OiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJM'\n b'YXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJE'\n b'ZWxheWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlw'\n b'ZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklz'\n b'RGVwRGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJw'\n b'YW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0'\n 
b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAfAAAAVAYAABAGAADYBQAAoAUAAGgFAAAwBQAA'\n b'AAUAAMgEAACQBAAAWAQAACwEAADwAwAAuAMAAIgDAABUAwAAIAMAAPQCAADI'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMz7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPz7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJj6//8AAAEDGAAAAAwAAAAEAAAAAAAAAGL7//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAA0Pr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAmvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAT7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAM77//8AAAIACAAAAE5BU0RlbGF5AAAAADT7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAP77//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABo'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAy/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAnPv//wAAAQIcAAAADAAAAAQAAAAAAAAAjPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAND7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJr8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAACPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA+Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADz8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAb9//8AAAIABwAAAFRheGlPdXQAaPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAAMv3//wAAAgAGAAAAVGF4aUluAACU/P//AAABAhwAAAAMAAAA'\n b'BAAAAAAAAACE/P//AAAAAUAAAAAIAAAARGlzdGFuY2UAAAAAyPz//wAAAQUU'\n b'AAAADAAAAAQAAAAAAAAAXP7//wQAAABEZXN0AAAAAPD8//8AAAEFFAAAAAwA'\n b'AAAEAAAAAAAAAIT+//8GAAAAT3JpZ2luAAAY/f//AAABAxgAAAAMAAAABAAA'\n b'AAAAAADi/f//AAACAAgAAABEZXBEZWxheQAAAABI/f//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAS/v//AAACAAgAAABBcnJEZWxheQAAAAB4/f//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABC/v//AAACAAcAAABBaXJUaW1lAKT9//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAG7+//8AAAIADgAAAENSU0VsYXBzZWRUaW1lAADY/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACi/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAABD+//8AAAEFFAAAAAwAAAAEAAAAAAAAAKT///8HAAAAVGFpbE51bQA4'\n b'/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAAAUAAAAAJAAAARmxpZ2h0'\n b'TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAABAAEAAQAAAANAAAAVW5p'\n 
b'cXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAEAAAAAAAAAJD+//8AAAAB'\n b'QAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgAAAAMAAAABAAAAAAAAACe'\n b'////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAAAAwAAAAEAAAAAAAAAPD+'\n b'//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////AAABAyAAAAAUAAAABAAA'\n b'AAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRpbWUAaP///wAAAQIcAAAA'\n b'DAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERheU9mV2VlawAAAJz///8A'\n b'AAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAAAAoAAABEYXlvZk1vbnRo'\n b'AADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////AAAAAUAAAAAFAAAATW9u'\n b'dGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIkAAAAFAAAAAQAAAAAAAAA'\n b'CAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAAAAA=')])", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mParquetDataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTARGET_PATH\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size)\u001b[0m\n\u001b[1;32m 1058\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1059\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mvalidate_schema\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1060\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalidate_schemas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1061\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1062\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mequals\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36mvalidate_schemas\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1111\u001b[0m \u001b[0;34m'{1!s}\\n\\nvs\\n\\n{2!s}'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1112\u001b[0m .format(piece, file_schema,\n\u001b[0;32m-> 1113\u001b[0;31m dataset_schema))\n\u001b[0m\u001b[1;32m 1114\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1115\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_threads\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_pandas_metadata\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mValueError\u001b[0m: Schema in /User/mlrun/airlines/dataset/a3f9441653c14708ad69f63694749301.parquet was different. 
\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: double\nActualElapsedTime: double\nCRSElapsedTime: int64\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: double\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'0, \"stop\": 10000, \"step\": 1}], \"column_indexes\": [{\"name\": n'\n b'ull, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_ty'\n b'pe\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns'\n b'\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\": \"i'\n b'nt64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"M'\n b'onth\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"numpy'\n b'_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth\", \"'\n b'field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"numpy_ty'\n b'pe\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"fiel'\n b'd_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_name\"'\n b': \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_name\": '\n b'\"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n b' \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"ArrTi'\n b'me\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": 
\"CRSArrT'\n b'ime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metada'\n b'ta\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"UniqueC'\n b'arrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"'\n b'metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"Fligh'\n b'tNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n b'ata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"ActualE'\n b'lapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_name\": \"A'\n b'rrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\",'\n b' \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\": \"DepD'\n b'elay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"m'\n b'etadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\", '\n b'\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\"'\n b': null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_type'\n b'\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"n'\n b'ame\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\": \"'\n b'float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name'\n b'\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"float64'\n b'\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Tax'\n b'iOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\", \"n'\n b'umpy_type\": \"float64\", 
\"metadata\": null}, {\"name\": \"Cancelle'\n b'd\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"CancellationC'\n b'ode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carrier'\n b'Delay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"float6'\n b'4\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"We'\n b'atherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_type\"'\n b': \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"n'\n b'ame\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDelay\"'\n b', \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metada'\n b'ta\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArrDel'\n b'ayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"IsDe'\n b'pDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",'\n b' \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"vers'\n b'ion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////6AWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKQPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMDAw'\n 
b'LCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51'\n b'bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNv'\n b'ZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVu'\n b'Y29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogIlll'\n b'YXIiLCAiZmllbGRfbmFtZSI6ICJZZWFyIiwgInBhbmRhc190eXBlIjogImlu'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk1vbnRoIiwgImZpZWxkX25hbWUiOiAiTW9udGgiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRGF5b2ZNb250aCIsICJm'\n b'aWVsZF9uYW1lIjogIkRheW9mTW9udGgiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiRGF5T2ZXZWVrIiwgImZpZWxkX25hbWUiOiAiRGF5T2ZX'\n b'ZWVrIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlcFRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJEZXBUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJDUlNEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAi'\n b'Q1JTRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90'\n b'eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'cnJUaW1lIiwgImZpZWxkX25hbWUiOiAiQXJyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJmaWVsZF9u'\n b'YW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVuaXF1ZUNh'\n b'cnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBl'\n b'IjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRmxp'\n b'Z2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIlRhaWxOdW0iLCAiZmllbGRfbmFtZSI6'\n 
b'ICJUYWlsTnVtIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVhbEVs'\n b'YXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRU'\n b'aW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJBcnJEZWxheSIsICJmaWVsZF9uYW1lIjogIkFy'\n b'ckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJE'\n b'ZXBEZWxheSIsICJmaWVsZF9uYW1lIjogIkRlcERlbGF5IiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJPcmlnaW4iLCAiZmllbGRfbmFt'\n b'ZSI6ICJPcmlnaW4iLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRGVzdCIsICJmaWVsZF9uYW1lIjogIkRlc3QiLCAicGFuZGFzX3R5cGUi'\n b'OiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiRGlzdGFuY2UiLCAiZmllbGRfbmFtZSI6'\n b'ICJEaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25h'\n b'bWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwg'\n b'InBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQi'\n 
b'LCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlvbkNv'\n b'ZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJmaWVs'\n b'ZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJpZXJE'\n b'ZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiV2Vh'\n b'dGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIsICJm'\n b'aWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVs'\n b'bH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'U2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJMYXRl'\n b'QWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJEZWxh'\n b'eWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6'\n b'ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklzRGVw'\n b'RGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJwYW5k'\n b'YXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJw'\n b'eWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lv'\n b'biI6ICIwLjI1LjMifQAAAAAfAAAAWAYAABQGAADcBQAApAUAAGwFAAA0BQAA'\n b'BAUAAMwEAACUBAAAXAQAACwEAADwAwAAtAMAAIQDAABQAwAAHAMAAPACAADE'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n 
b'AAA4AAAABAAAADT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMj7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPj7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJT6//8AAAEDGAAAAAwAAAAEAAAAAAAAAF77//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAAzPr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAlvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAD7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAMr7//8AAAIACAAAAE5BU0RlbGF5AAAAADD7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAPr7//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABk'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAu/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAmPv//wAAAQIcAAAADAAAAAQAAAAAAAAAiPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAMz7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJb8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAABPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA9Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADj8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAL9//8AAAIABwAAAFRheGlPdXQAZPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAALv3//wAAAgAGAAAAVGF4aUluAACQ/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAABa/f//AAACAAgAAABEaXN0YW5jZQAAAADA/P//AAABBRQAAAAM'\n b'AAAABAAAAAAAAABU/v//BAAAAERlc3QAAAAA6Pz//wAAAQUUAAAADAAAAAQA'\n b'AAAAAAAAfP7//wYAAABPcmlnaW4AABD9//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'ANr9//8AAAIACAAAAERlcERlbGF5AAAAAED9//8AAAEDGAAAAAwAAAAEAAAA'\n b'AAAAAAr+//8AAAIACAAAAEFyckRlbGF5AAAAAHD9//8AAAEDGAAAAAwAAAAE'\n b'AAAAAAAAADr+//8AAAIABwAAAEFpclRpbWUAnP3//wAAAQIcAAAADAAAAAQA'\n b'AAAAAAAAjP3//wAAAAFAAAAADgAAAENSU0VsYXBzZWRUaW1lAADU/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACe/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAAAz+//8AAAEDGAAAAAwAAAAEAAAAAAAAANb+//8AAAIABwAAAFRhaWxO'\n b'dW0AOP7//wAAAQIcAAAADAAAAAQAAAAAAAAAKP7//wAAAAFAAAAACQAAAEZs'\n b'aWdodE51bQAAAGz+//8AAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAADQAA'\n b'AFVuaXF1ZUNhcnJpZXIAAACg/v//AAABAhwAAAAMAAAABAAAAAAAAACQ/v//'\n b'AAAAAUAAAAAKAAAAQ1JTQXJyVGltZQAA1P7//wAAAQMYAAAADAAAAAQAAAAA'\n b'AAAAnv///wAAAgAHAAAAQXJyVGltZQAA////AAABAhwAAAAMAAAABAAAAAAA'\n b'AADw/v//AAAAAUAAAAAKAAAAQ1JTRGVwVGltZQAANP///wAAAQMgAAAAFAAA'\n b'AAQAAAAAAAAAAAAGAAgABgAGAAAAAAACAAcAAABEZXBUaW1lAGj///8AAAEC'\n 
b'HAAAAAwAAAAEAAAAAAAAAFj///8AAAABQAAAAAkAAABEYXlPZldlZWsAAACc'\n b'////AAABAhwAAAAMAAAABAAAAAAAAACM////AAAAAUAAAAAKAAAARGF5b2ZN'\n b'b250aAAA0P///wAAAQIcAAAADAAAAAQAAAAAAAAAwP///wAAAAFAAAAABQAA'\n b'AE1vbnRoAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAA'\n b'AAAAAAgADAAIAAcACAAAAAAAAAFAAAAABAAAAFllYXIAAAAAAAAAAA==')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: string\nActualElapsedTime: double\nCRSElapsedTime: double\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: int64\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'10000, \"stop\": 20000, \"step\": 1}], \"column_indexes\": [{\"name'\n b'\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"nump'\n b'y_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"col'\n b'umns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\"'\n b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n b': \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth'\n b'\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"'\n b'field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_typ'\n b'e\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_n'\n b'ame\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, 
{\"name\": \"CRSDepTime\", \"field_nam'\n b'e\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int'\n b'64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"A'\n b'rrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", '\n b'\"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRS'\n b'ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"me'\n b'tadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"Uni'\n b'queCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object'\n b'\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"F'\n b'lightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"m'\n b'etadata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"Actu'\n b'alElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"flo'\n b'at64\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_'\n b'name\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_ty'\n b'pe\": \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"fiel'\n b'd_name\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": '\n b'\"float64\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_na'\n b'me\": \"ArrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\"'\n b': \"DepDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float'\n b'64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Or'\n b'igin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pand'\n b'as_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": nul'\n b'l}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_t'\n 
b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n b'ame\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\",'\n b' \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cance'\n b'lled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Cancellati'\n b'onCode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArr'\n b'Delayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"I'\n b'sDepDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"objec'\n b't\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": 
\"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////5gWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKcPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAxMDAwMCwgInN0b3AiOiAy'\n b'MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUi'\n b'OiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1'\n b'bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjog'\n b'eyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6'\n b'ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNfdHlwZSI6'\n b'ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjogIk1vbnRo'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9mTW9udGgi'\n b'LCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1lIjogIkRh'\n b'eU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBl'\n b'IjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEZXBU'\n b'aW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6'\n b'ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVsZF9uYW1l'\n b'IjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUiLCAiZmll'\n b'bGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6ICJVbmlx'\n 
b'dWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIsICJwYW5k'\n b'YXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYWlsTnVtIiwgImZpZWxkX25h'\n b'bWUiOiAiVGFpbE51bSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51'\n b'bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVh'\n b'bEVsYXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBz'\n b'ZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'aXJUaW1lIiwgImZpZWxkX25hbWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQXJyRGVsYXkiLCAiZmllbGRfbmFt'\n b'ZSI6ICJBcnJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiRGVwRGVsYXkiLCAiZmllbGRfbmFtZSI6ICJEZXBEZWxheSIsICJw'\n b'YW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiT3JpZ2luIiwgImZp'\n b'ZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUi'\n b'LCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFtZSI6ICJEZXN0IiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRpc3RhbmNlIiwgImZpZWxk'\n b'X25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n 
b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxk'\n b'X25hbWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVk'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlv'\n b'bkNvZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJm'\n b'aWVsZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJp'\n b'ZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5Iiwg'\n b'InBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIs'\n b'ICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUi'\n b'OiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJM'\n b'YXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJE'\n b'ZWxheWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlw'\n b'ZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklz'\n b'RGVwRGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJw'\n b'YW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0'\n 
b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAfAAAAVAYAABAGAADYBQAAoAUAAGgFAAAwBQAA'\n b'AAUAAMgEAACQBAAAWAQAACwEAADwAwAAuAMAAIgDAABUAwAAIAMAAPQCAADI'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMz7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPz7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJj6//8AAAEDGAAAAAwAAAAEAAAAAAAAAGL7//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAA0Pr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAmvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAT7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAM77//8AAAIACAAAAE5BU0RlbGF5AAAAADT7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAP77//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABo'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAy/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAnPv//wAAAQIcAAAADAAAAAQAAAAAAAAAjPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAND7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJr8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAACPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA+Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADz8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAb9//8AAAIABwAAAFRheGlPdXQAaPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAAMv3//wAAAgAGAAAAVGF4aUluAACU/P//AAABAhwAAAAMAAAA'\n b'BAAAAAAAAACE/P//AAAAAUAAAAAIAAAARGlzdGFuY2UAAAAAyPz//wAAAQUU'\n b'AAAADAAAAAQAAAAAAAAAXP7//wQAAABEZXN0AAAAAPD8//8AAAEFFAAAAAwA'\n b'AAAEAAAAAAAAAIT+//8GAAAAT3JpZ2luAAAY/f//AAABAxgAAAAMAAAABAAA'\n b'AAAAAADi/f//AAACAAgAAABEZXBEZWxheQAAAABI/f//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAS/v//AAACAAgAAABBcnJEZWxheQAAAAB4/f//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABC/v//AAACAAcAAABBaXJUaW1lAKT9//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAG7+//8AAAIADgAAAENSU0VsYXBzZWRUaW1lAADY/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACi/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAABD+//8AAAEFFAAAAAwAAAAEAAAAAAAAAKT///8HAAAAVGFpbE51bQA4'\n b'/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAAAUAAAAAJAAAARmxpZ2h0'\n b'TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAABAAEAAQAAAANAAAAVW5p'\n 
b'cXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAEAAAAAAAAAJD+//8AAAAB'\n b'QAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgAAAAMAAAABAAAAAAAAACe'\n b'////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAAAAwAAAAEAAAAAAAAAPD+'\n b'//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////AAABAyAAAAAUAAAABAAA'\n b'AAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRpbWUAaP///wAAAQIcAAAA'\n b'DAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERheU9mV2VlawAAAJz///8A'\n b'AAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAAAAoAAABEYXlvZk1vbnRo'\n b'AADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////AAAAAUAAAAAFAAAATW9u'\n b'dGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIkAAAAFAAAAAQAAAAAAAAA'\n b'CAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAAAAA=')])" + ] + } + ], + "source": [ + "table = pq.read_table(TARGET_PATH)" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\u001b[0;31mInit signature:\u001b[0m\n", + "\u001b[0mpq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mParquetDataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mpath_or_paths\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mfilesystem\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mmetadata\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0msplit_row_groups\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mvalidate_schema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m 
\u001b[0mfilters\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mmetadata_nthreads\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mread_dictionary\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mmemory_map\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m \u001b[0mbuffer_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", + "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mDocstring:\u001b[0m \n", + "Encapsulates details of reading a complete Parquet dataset possibly\n", + "consisting of multiple files and partitions in subdirectories\n", + "\n", + "Parameters\n", + "----------\n", + "path_or_paths : str or List[str]\n", + " A directory name, single file name, or list of file names\n", + "filesystem : FileSystem, default None\n", + " If nothing passed, paths assumed to be found in the local on-disk\n", + " filesystem\n", + "metadata : pyarrow.parquet.FileMetaData\n", + " Use metadata obtained elsewhere to validate file schemas\n", + "schema : pyarrow.parquet.Schema\n", + " Use schema obtained elsewhere to validate file schemas. Alternative to\n", + " metadata parameter\n", + "split_row_groups : boolean, default False\n", + " Divide files into pieces for each row group in the file\n", + "validate_schema : boolean, default True\n", + " Check that individual file schemas are all the same / compatible\n", + "filters : List[Tuple] or List[List[Tuple]] or None (default)\n", + " List of filters to apply, like ``[[('x', '=', 0), ...], ...]``. 
This\n", + " implements partition-level (hive) filtering only, i.e., to prevent the\n", + " loading of some files of the dataset.\n", + "\n", + " Predicates are expressed in disjunctive normal form (DNF). This means\n", + " that the innermost tuple describe a single column predicate. These\n", + " inner predicate make are all combined with a conjunction (AND) into a\n", + " larger predicate. The most outer list then combines all filters\n", + " with a disjunction (OR). By this, we should be able to express all\n", + " kinds of filters that are possible using boolean logic.\n", + "\n", + " This function also supports passing in as List[Tuple]. These predicates\n", + " are evaluated as a conjunction. To express OR in predictates, one must\n", + " use the (preferred) List[List[Tuple]] notation.\n", + "metadata_nthreads: int, default 1\n", + " How many threads to allow the thread pool which is used to read the\n", + " dataset metadata. Increasing this is helpful to read partitioned\n", + " datasets.\n", + "read_dictionary : list, default None\n", + " List of names or column paths (for nested types) to read directly\n", + " as DictionaryArray. Only supported for BYTE_ARRAY storage. To read\n", + " a flat column as dictionary-encoded pass the column name. For\n", + " nested types, you must pass the full column \"path\", which could be\n", + " something like level1.level2.list.item. Refer to the Parquet\n", + " file's schema to obtain the paths.\n", + "memory_map : boolean, default False\n", + " If the source is a file path, use a memory map to read file, which can\n", + " improve performance in some environments\n", + "buffer_size : int, default 0\n", + " If positive, perform read buffering when deserializing individual\n", + " column chunks. 
Otherwise IO calls are unbuffered.\n", + "\u001b[0;31mFile:\u001b[0m ~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\n", + "\u001b[0;31mType:\u001b[0m type\n", + "\u001b[0;31mSubclasses:\u001b[0m \n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "dataset = pq.ParquetDataset(TARGET_PATH)" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "ename": "ValueError", + "evalue": "Schema in /User/mlrun/airlines/dataset/a3f9441653c14708ad69f63694749301.parquet was different. \nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: double\nActualElapsedTime: double\nCRSElapsedTime: int64\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: double\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'0, \"stop\": 10000, \"step\": 1}], \"column_indexes\": [{\"name\": n'\n b'ull, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_ty'\n b'pe\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns'\n b'\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\": \"i'\n b'nt64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"M'\n b'onth\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"numpy'\n b'_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth\", \"'\n b'field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"numpy_ty'\n b'pe\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"fiel'\n b'd_name\": 
\"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_name\"'\n b': \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_name\": '\n b'\"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n b' \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"ArrTi'\n b'me\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRSArrT'\n b'ime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metada'\n b'ta\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"UniqueC'\n b'arrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"'\n b'metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"Fligh'\n b'tNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n b'ata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"ActualE'\n b'lapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_name\": \"A'\n b'rrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\",'\n b' \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\": \"DepD'\n b'elay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"m'\n b'etadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\", '\n b'\"pandas_type\": \"unicode\", \"numpy_type\": 
\"object\", \"metadata\"'\n b': null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_type'\n b'\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"n'\n b'ame\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\": \"'\n b'float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name'\n b'\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"float64'\n b'\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Tax'\n b'iOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\", \"n'\n b'umpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cancelle'\n b'd\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"CancellationC'\n b'ode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carrier'\n b'Delay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"float6'\n b'4\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"We'\n b'atherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_type\"'\n b': \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"n'\n b'ame\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDelay\"'\n b', \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metada'\n b'ta\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArrDel'\n b'ayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, 
{\"name\": \"IsDepDelayed\", \"field_name\": \"IsDe'\n b'pDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",'\n b' \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"vers'\n b'ion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////6AWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKQPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMDAw'\n b'LCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51'\n b'bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNv'\n b'ZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVu'\n b'Y29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogIlll'\n b'YXIiLCAiZmllbGRfbmFtZSI6ICJZZWFyIiwgInBhbmRhc190eXBlIjogImlu'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk1vbnRoIiwgImZpZWxkX25hbWUiOiAiTW9udGgiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRGF5b2ZNb250aCIsICJm'\n b'aWVsZF9uYW1lIjogIkRheW9mTW9udGgiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiRGF5T2ZXZWVrIiwgImZpZWxkX25hbWUiOiAiRGF5T2ZX'\n b'ZWVrIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlcFRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJEZXBUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJDUlNEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAi'\n b'Q1JTRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90'\n b'eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'cnJUaW1lIiwgImZpZWxkX25hbWUiOiAiQXJyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJmaWVsZF9u'\n 
b'YW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVuaXF1ZUNh'\n b'cnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBl'\n b'IjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRmxp'\n b'Z2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIlRhaWxOdW0iLCAiZmllbGRfbmFtZSI6'\n b'ICJUYWlsTnVtIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVhbEVs'\n b'YXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRU'\n b'aW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJBcnJEZWxheSIsICJmaWVsZF9uYW1lIjogIkFy'\n b'ckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJE'\n b'ZXBEZWxheSIsICJmaWVsZF9uYW1lIjogIkRlcERlbGF5IiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJPcmlnaW4iLCAiZmllbGRfbmFt'\n b'ZSI6ICJPcmlnaW4iLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRGVzdCIsICJmaWVsZF9uYW1lIjogIkRlc3QiLCAicGFuZGFzX3R5cGUi'\n b'OiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiRGlzdGFuY2UiLCAiZmllbGRfbmFtZSI6'\n b'ICJEaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5'\n 
b'X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25h'\n b'bWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwg'\n b'InBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQi'\n b'LCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlvbkNv'\n b'ZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJmaWVs'\n b'ZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJpZXJE'\n b'ZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiV2Vh'\n b'dGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIsICJm'\n b'aWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVs'\n b'bH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'U2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJMYXRl'\n b'QWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJEZWxh'\n b'eWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6'\n 
b'ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklzRGVw'\n b'RGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJwYW5k'\n b'YXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJw'\n b'eWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lv'\n b'biI6ICIwLjI1LjMifQAAAAAfAAAAWAYAABQGAADcBQAApAUAAGwFAAA0BQAA'\n b'BAUAAMwEAACUBAAAXAQAACwEAADwAwAAtAMAAIQDAABQAwAAHAMAAPACAADE'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMj7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPj7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJT6//8AAAEDGAAAAAwAAAAEAAAAAAAAAF77//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAAzPr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAlvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAD7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAMr7//8AAAIACAAAAE5BU0RlbGF5AAAAADD7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAPr7//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABk'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAu/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAmPv//wAAAQIcAAAADAAAAAQAAAAAAAAAiPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAMz7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJb8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAABPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA9Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADj8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAL9//8AAAIABwAAAFRheGlPdXQAZPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAALv3//wAAAgAGAAAAVGF4aUluAACQ/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAABa/f//AAACAAgAAABEaXN0YW5jZQAAAADA/P//AAABBRQAAAAM'\n b'AAAABAAAAAAAAABU/v//BAAAAERlc3QAAAAA6Pz//wAAAQUUAAAADAAAAAQA'\n b'AAAAAAAAfP7//wYAAABPcmlnaW4AABD9//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'ANr9//8AAAIACAAAAERlcERlbGF5AAAAAED9//8AAAEDGAAAAAwAAAAEAAAA'\n b'AAAAAAr+//8AAAIACAAAAEFyckRlbGF5AAAAAHD9//8AAAEDGAAAAAwAAAAE'\n b'AAAAAAAAADr+//8AAAIABwAAAEFpclRpbWUAnP3//wAAAQIcAAAADAAAAAQA'\n b'AAAAAAAAjP3//wAAAAFAAAAADgAAAENSU0VsYXBzZWRUaW1lAADU/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACe/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n 
b'ZQAAAAz+//8AAAEDGAAAAAwAAAAEAAAAAAAAANb+//8AAAIABwAAAFRhaWxO'\n b'dW0AOP7//wAAAQIcAAAADAAAAAQAAAAAAAAAKP7//wAAAAFAAAAACQAAAEZs'\n b'aWdodE51bQAAAGz+//8AAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAADQAA'\n b'AFVuaXF1ZUNhcnJpZXIAAACg/v//AAABAhwAAAAMAAAABAAAAAAAAACQ/v//'\n b'AAAAAUAAAAAKAAAAQ1JTQXJyVGltZQAA1P7//wAAAQMYAAAADAAAAAQAAAAA'\n b'AAAAnv///wAAAgAHAAAAQXJyVGltZQAA////AAABAhwAAAAMAAAABAAAAAAA'\n b'AADw/v//AAAAAUAAAAAKAAAAQ1JTRGVwVGltZQAANP///wAAAQMgAAAAFAAA'\n b'AAQAAAAAAAAAAAAGAAgABgAGAAAAAAACAAcAAABEZXBUaW1lAGj///8AAAEC'\n b'HAAAAAwAAAAEAAAAAAAAAFj///8AAAABQAAAAAkAAABEYXlPZldlZWsAAACc'\n b'////AAABAhwAAAAMAAAABAAAAAAAAACM////AAAAAUAAAAAKAAAARGF5b2ZN'\n b'b250aAAA0P///wAAAQIcAAAADAAAAAQAAAAAAAAAwP///wAAAAFAAAAABQAA'\n b'AE1vbnRoAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAA'\n b'AAAAAAgADAAIAAcACAAAAAAAAAFAAAAABAAAAFllYXIAAAAAAAAAAA==')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: string\nActualElapsedTime: double\nCRSElapsedTime: double\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: int64\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'10000, \"stop\": 20000, \"step\": 1}], \"column_indexes\": [{\"name'\n b'\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"nump'\n b'y_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"col'\n b'umns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\"'\n b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n b': \"Month\", \"field_name\": \"Month\", 
\"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth'\n b'\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"'\n b'field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_typ'\n b'e\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_n'\n b'ame\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_nam'\n b'e\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int'\n b'64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"A'\n b'rrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", '\n b'\"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRS'\n b'ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"me'\n b'tadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"Uni'\n b'queCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object'\n b'\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"F'\n b'lightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"m'\n b'etadata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"Actu'\n b'alElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"flo'\n b'at64\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_'\n b'name\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_ty'\n b'pe\": \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"fiel'\n b'd_name\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": '\n b'\"float64\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_na'\n b'me\": \"ArrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", 
\"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\"'\n b': \"DepDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float'\n b'64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Or'\n b'igin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pand'\n b'as_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": nul'\n b'l}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_t'\n b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n b'ame\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\",'\n b' \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cance'\n b'lled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Cancellati'\n b'onCode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": 
\"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArr'\n b'Delayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"I'\n b'sDepDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"objec'\n b't\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////5gWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKcPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAxMDAwMCwgInN0b3AiOiAy'\n b'MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUi'\n b'OiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1'\n b'bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjog'\n b'eyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6'\n b'ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNfdHlwZSI6'\n b'ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjogIk1vbnRo'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9mTW9udGgi'\n b'LCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1lIjogIkRh'\n b'eU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBl'\n b'IjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEZXBU'\n b'aW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6'\n b'ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVsZF9uYW1l'\n 
b'IjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUiLCAiZmll'\n b'bGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6ICJVbmlx'\n b'dWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIsICJwYW5k'\n b'YXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYWlsTnVtIiwgImZpZWxkX25h'\n b'bWUiOiAiVGFpbE51bSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51'\n b'bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVh'\n b'bEVsYXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBz'\n b'ZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'aXJUaW1lIiwgImZpZWxkX25hbWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQXJyRGVsYXkiLCAiZmllbGRfbmFt'\n b'ZSI6ICJBcnJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiRGVwRGVsYXkiLCAiZmllbGRfbmFtZSI6ICJEZXBEZWxheSIsICJw'\n b'YW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiT3JpZ2luIiwgImZp'\n b'ZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUi'\n 
b'LCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFtZSI6ICJEZXN0IiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRpc3RhbmNlIiwgImZpZWxk'\n b'X25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxk'\n b'X25hbWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVk'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlv'\n b'bkNvZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJm'\n b'aWVsZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJp'\n b'ZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5Iiwg'\n b'InBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIs'\n b'ICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUi'\n b'OiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n 
b'Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJM'\n b'YXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJE'\n b'ZWxheWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlw'\n b'ZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklz'\n b'RGVwRGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJw'\n b'YW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAfAAAAVAYAABAGAADYBQAAoAUAAGgFAAAwBQAA'\n b'AAUAAMgEAACQBAAAWAQAACwEAADwAwAAuAMAAIgDAABUAwAAIAMAAPQCAADI'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMz7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPz7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJj6//8AAAEDGAAAAAwAAAAEAAAAAAAAAGL7//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAA0Pr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAmvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAT7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAM77//8AAAIACAAAAE5BU0RlbGF5AAAAADT7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAP77//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABo'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAy/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAnPv//wAAAQIcAAAADAAAAAQAAAAAAAAAjPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAND7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJr8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAACPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA+Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADz8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAb9//8AAAIABwAAAFRheGlPdXQAaPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAAMv3//wAAAgAGAAAAVGF4aUluAACU/P//AAABAhwAAAAMAAAA'\n b'BAAAAAAAAACE/P//AAAAAUAAAAAIAAAARGlzdGFuY2UAAAAAyPz//wAAAQUU'\n b'AAAADAAAAAQAAAAAAAAAXP7//wQAAABEZXN0AAAAAPD8//8AAAEFFAAAAAwA'\n b'AAAEAAAAAAAAAIT+//8GAAAAT3JpZ2luAAAY/f//AAABAxgAAAAMAAAABAAA'\n 
b'AAAAAADi/f//AAACAAgAAABEZXBEZWxheQAAAABI/f//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAS/v//AAACAAgAAABBcnJEZWxheQAAAAB4/f//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABC/v//AAACAAcAAABBaXJUaW1lAKT9//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAG7+//8AAAIADgAAAENSU0VsYXBzZWRUaW1lAADY/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACi/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAABD+//8AAAEFFAAAAAwAAAAEAAAAAAAAAKT///8HAAAAVGFpbE51bQA4'\n b'/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAAAUAAAAAJAAAARmxpZ2h0'\n b'TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAABAAEAAQAAAANAAAAVW5p'\n b'cXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAEAAAAAAAAAJD+//8AAAAB'\n b'QAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgAAAAMAAAABAAAAAAAAACe'\n b'////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAAAAwAAAAEAAAAAAAAAPD+'\n b'//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////AAABAyAAAAAUAAAABAAA'\n b'AAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRpbWUAaP///wAAAQIcAAAA'\n b'DAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERheU9mV2VlawAAAJz///8A'\n b'AAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAAAAoAAABEYXlvZk1vbnRo'\n b'AADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////AAAAAUAAAAAFAAAATW9u'\n b'dGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIkAAAAFAAAAAQAAAAAAAAA'\n b'CAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAAAAA=')])", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mParquetDataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTARGET_PATH\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_line_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'pinfo'\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m'dataset.read'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size)\u001b[0m\n\u001b[1;32m 1058\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1059\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mvalidate_schema\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1060\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalidate_schemas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1061\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1062\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mequals\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36mvalidate_schemas\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1111\u001b[0m \u001b[0;34m'{1!s}\\n\\nvs\\n\\n{2!s}'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1112\u001b[0m .format(piece, file_schema,\n\u001b[0;32m-> 1113\u001b[0;31m dataset_schema))\n\u001b[0m\u001b[1;32m 1114\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1115\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_threads\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0muse_pandas_metadata\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mValueError\u001b[0m: Schema in /User/mlrun/airlines/dataset/a3f9441653c14708ad69f63694749301.parquet was different. \nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: double\nActualElapsedTime: double\nCRSElapsedTime: int64\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: double\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'0, \"stop\": 10000, \"step\": 1}], \"column_indexes\": [{\"name\": n'\n b'ull, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_ty'\n b'pe\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns'\n b'\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\": \"i'\n b'nt64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"M'\n b'onth\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"numpy'\n b'_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth\", \"'\n b'field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"numpy_ty'\n b'pe\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"fiel'\n b'd_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_name\"'\n b': \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSDepTime\", 
\"field_name\": '\n b'\"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n b' \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"ArrTi'\n b'me\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRSArrT'\n b'ime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metada'\n b'ta\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"UniqueC'\n b'arrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"'\n b'metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"Fligh'\n b'tNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n b'ata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"ActualE'\n b'lapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_name\": \"A'\n b'rrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\",'\n b' \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\": \"DepD'\n b'elay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"m'\n b'etadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\", '\n b'\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\"'\n b': null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_type'\n b'\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"n'\n b'ame\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\": \"'\n b'float64\", 
\"numpy_type\": \"float64\", \"metadata\": null}, {\"name'\n b'\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"float64'\n b'\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Tax'\n b'iOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\", \"n'\n b'umpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cancelle'\n b'd\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"CancellationC'\n b'ode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carrier'\n b'Delay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"float6'\n b'4\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"We'\n b'atherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_type\"'\n b': \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"n'\n b'ame\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDelay\"'\n b', \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metada'\n b'ta\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArrDel'\n b'ayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"IsDe'\n b'pDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",'\n b' \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"vers'\n b'ion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n 
(b'ARROW:schema',\n b'/////6AWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKQPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMDAw'\n b'LCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51'\n b'bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNv'\n b'ZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVu'\n b'Y29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogIlll'\n b'YXIiLCAiZmllbGRfbmFtZSI6ICJZZWFyIiwgInBhbmRhc190eXBlIjogImlu'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk1vbnRoIiwgImZpZWxkX25hbWUiOiAiTW9udGgiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRGF5b2ZNb250aCIsICJm'\n b'aWVsZF9uYW1lIjogIkRheW9mTW9udGgiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiRGF5T2ZXZWVrIiwgImZpZWxkX25hbWUiOiAiRGF5T2ZX'\n b'ZWVrIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlcFRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJEZXBUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJDUlNEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAi'\n b'Q1JTRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90'\n b'eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'cnJUaW1lIiwgImZpZWxkX25hbWUiOiAiQXJyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJmaWVsZF9u'\n b'YW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVuaXF1ZUNh'\n b'cnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBl'\n 
b'IjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRmxp'\n b'Z2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIlRhaWxOdW0iLCAiZmllbGRfbmFtZSI6'\n b'ICJUYWlsTnVtIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVhbEVs'\n b'YXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRU'\n b'aW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJBcnJEZWxheSIsICJmaWVsZF9uYW1lIjogIkFy'\n b'ckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJE'\n b'ZXBEZWxheSIsICJmaWVsZF9uYW1lIjogIkRlcERlbGF5IiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJPcmlnaW4iLCAiZmllbGRfbmFt'\n b'ZSI6ICJPcmlnaW4iLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRGVzdCIsICJmaWVsZF9uYW1lIjogIkRlc3QiLCAicGFuZGFzX3R5cGUi'\n b'OiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiRGlzdGFuY2UiLCAiZmllbGRfbmFtZSI6'\n b'ICJEaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25h'\n 
b'bWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwg'\n b'InBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQi'\n b'LCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlvbkNv'\n b'ZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJmaWVs'\n b'ZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJpZXJE'\n b'ZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiV2Vh'\n b'dGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIsICJm'\n b'aWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVs'\n b'bH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'U2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJMYXRl'\n b'QWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJEZWxh'\n b'eWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6'\n b'ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklzRGVw'\n b'RGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJwYW5k'\n b'YXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJw'\n 
b'eWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lv'\n b'biI6ICIwLjI1LjMifQAAAAAfAAAAWAYAABQGAADcBQAApAUAAGwFAAA0BQAA'\n b'BAUAAMwEAACUBAAAXAQAACwEAADwAwAAtAMAAIQDAABQAwAAHAMAAPACAADE'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMj7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPj7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJT6//8AAAEDGAAAAAwAAAAEAAAAAAAAAF77//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAAzPr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAlvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAD7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAMr7//8AAAIACAAAAE5BU0RlbGF5AAAAADD7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAPr7//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABk'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAu/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAmPv//wAAAQIcAAAADAAAAAQAAAAAAAAAiPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAMz7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJb8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAABPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA9Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADj8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAL9//8AAAIABwAAAFRheGlPdXQAZPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAALv3//wAAAgAGAAAAVGF4aUluAACQ/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAABa/f//AAACAAgAAABEaXN0YW5jZQAAAADA/P//AAABBRQAAAAM'\n b'AAAABAAAAAAAAABU/v//BAAAAERlc3QAAAAA6Pz//wAAAQUUAAAADAAAAAQA'\n b'AAAAAAAAfP7//wYAAABPcmlnaW4AABD9//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'ANr9//8AAAIACAAAAERlcERlbGF5AAAAAED9//8AAAEDGAAAAAwAAAAEAAAA'\n b'AAAAAAr+//8AAAIACAAAAEFyckRlbGF5AAAAAHD9//8AAAEDGAAAAAwAAAAE'\n b'AAAAAAAAADr+//8AAAIABwAAAEFpclRpbWUAnP3//wAAAQIcAAAADAAAAAQA'\n b'AAAAAAAAjP3//wAAAAFAAAAADgAAAENSU0VsYXBzZWRUaW1lAADU/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACe/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAAAz+//8AAAEDGAAAAAwAAAAEAAAAAAAAANb+//8AAAIABwAAAFRhaWxO'\n b'dW0AOP7//wAAAQIcAAAADAAAAAQAAAAAAAAAKP7//wAAAAFAAAAACQAAAEZs'\n b'aWdodE51bQAAAGz+//8AAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAADQAA'\n b'AFVuaXF1ZUNhcnJpZXIAAACg/v//AAABAhwAAAAMAAAABAAAAAAAAACQ/v//'\n 
b'AAAAAUAAAAAKAAAAQ1JTQXJyVGltZQAA1P7//wAAAQMYAAAADAAAAAQAAAAA'\n b'AAAAnv///wAAAgAHAAAAQXJyVGltZQAA////AAABAhwAAAAMAAAABAAAAAAA'\n b'AADw/v//AAAAAUAAAAAKAAAAQ1JTRGVwVGltZQAANP///wAAAQMgAAAAFAAA'\n b'AAQAAAAAAAAAAAAGAAgABgAGAAAAAAACAAcAAABEZXBUaW1lAGj///8AAAEC'\n b'HAAAAAwAAAAEAAAAAAAAAFj///8AAAABQAAAAAkAAABEYXlPZldlZWsAAACc'\n b'////AAABAhwAAAAMAAAABAAAAAAAAACM////AAAAAUAAAAAKAAAARGF5b2ZN'\n b'b250aAAA0P///wAAAQIcAAAADAAAAAQAAAAAAAAAwP///wAAAAFAAAAABQAA'\n b'AE1vbnRoAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAA'\n b'AAAAAAgADAAIAAcACAAAAAAAAAFAAAAABAAAAFllYXIAAAAAAAAAAA==')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: string\nActualElapsedTime: double\nCRSElapsedTime: double\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: int64\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'10000, \"stop\": 20000, \"step\": 1}], \"column_indexes\": [{\"name'\n b'\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"nump'\n b'y_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"col'\n b'umns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\"'\n b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n b': \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth'\n b'\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"'\n 
b'field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_typ'\n b'e\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_n'\n b'ame\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_nam'\n b'e\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int'\n b'64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"A'\n b'rrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", '\n b'\"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRS'\n b'ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"me'\n b'tadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"Uni'\n b'queCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object'\n b'\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"F'\n b'lightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"m'\n b'etadata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"Actu'\n b'alElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"flo'\n b'at64\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_'\n b'name\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_ty'\n b'pe\": \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"fiel'\n b'd_name\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": '\n b'\"float64\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_na'\n b'me\": \"ArrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\"'\n b': \"DepDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float'\n b'64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Or'\n b'igin\", \"pandas_type\": \"unicode\", 
\"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pand'\n b'as_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": nul'\n b'l}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_t'\n b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n b'ame\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\",'\n b' \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cance'\n b'lled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Cancellati'\n b'onCode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArr'\n b'Delayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n 
b'\"metadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"I'\n b'sDepDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"objec'\n b't\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////5gWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKcPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAxMDAwMCwgInN0b3AiOiAy'\n b'MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUi'\n b'OiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1'\n b'bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjog'\n b'eyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6'\n b'ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNfdHlwZSI6'\n b'ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjogIk1vbnRo'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9mTW9udGgi'\n b'LCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1lIjogIkRh'\n b'eU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBl'\n b'IjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEZXBU'\n b'aW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6'\n b'ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVsZF9uYW1l'\n b'IjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n 
b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUiLCAiZmll'\n b'bGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6ICJVbmlx'\n b'dWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIsICJwYW5k'\n b'YXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYWlsTnVtIiwgImZpZWxkX25h'\n b'bWUiOiAiVGFpbE51bSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51'\n b'bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVh'\n b'bEVsYXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBz'\n b'ZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'aXJUaW1lIiwgImZpZWxkX25hbWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQXJyRGVsYXkiLCAiZmllbGRfbmFt'\n b'ZSI6ICJBcnJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiRGVwRGVsYXkiLCAiZmllbGRfbmFtZSI6ICJEZXBEZWxheSIsICJw'\n b'YW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiT3JpZ2luIiwgImZp'\n b'ZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUi'\n b'LCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFtZSI6ICJEZXN0IiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRpc3RhbmNlIiwgImZpZWxk'\n 
b'X25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxk'\n b'X25hbWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVk'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlv'\n b'bkNvZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJm'\n b'aWVsZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJp'\n b'ZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5Iiwg'\n b'InBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIs'\n b'ICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUi'\n b'OiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJM'\n b'YXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJE'\n 
b'ZWxheWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlw'\n b'ZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklz'\n b'RGVwRGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJw'\n b'YW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAfAAAAVAYAABAGAADYBQAAoAUAAGgFAAAwBQAA'\n b'AAUAAMgEAACQBAAAWAQAACwEAADwAwAAuAMAAIgDAABUAwAAIAMAAPQCAADI'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMz7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPz7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJj6//8AAAEDGAAAAAwAAAAEAAAAAAAAAGL7//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAA0Pr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAmvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAT7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAM77//8AAAIACAAAAE5BU0RlbGF5AAAAADT7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAP77//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABo'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAy/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAnPv//wAAAQIcAAAADAAAAAQAAAAAAAAAjPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAND7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJr8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAACPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA+Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADz8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAb9//8AAAIABwAAAFRheGlPdXQAaPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAAMv3//wAAAgAGAAAAVGF4aUluAACU/P//AAABAhwAAAAMAAAA'\n b'BAAAAAAAAACE/P//AAAAAUAAAAAIAAAARGlzdGFuY2UAAAAAyPz//wAAAQUU'\n b'AAAADAAAAAQAAAAAAAAAXP7//wQAAABEZXN0AAAAAPD8//8AAAEFFAAAAAwA'\n b'AAAEAAAAAAAAAIT+//8GAAAAT3JpZ2luAAAY/f//AAABAxgAAAAMAAAABAAA'\n b'AAAAAADi/f//AAACAAgAAABEZXBEZWxheQAAAABI/f//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAS/v//AAACAAgAAABBcnJEZWxheQAAAAB4/f//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABC/v//AAACAAcAAABBaXJUaW1lAKT9//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAG7+//8AAAIADgAAAENSU0VsYXBzZWRUaW1lAADY/f//AAAB'\n 
b'AxgAAAAMAAAABAAAAAAAAACi/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAABD+//8AAAEFFAAAAAwAAAAEAAAAAAAAAKT///8HAAAAVGFpbE51bQA4'\n b'/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAAAUAAAAAJAAAARmxpZ2h0'\n b'TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAABAAEAAQAAAANAAAAVW5p'\n b'cXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAEAAAAAAAAAJD+//8AAAAB'\n b'QAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgAAAAMAAAABAAAAAAAAACe'\n b'////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAAAAwAAAAEAAAAAAAAAPD+'\n b'//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////AAABAyAAAAAUAAAABAAA'\n b'AAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRpbWUAaP///wAAAQIcAAAA'\n b'DAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERheU9mV2VlawAAAJz///8A'\n b'AAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAAAAoAAABEYXlvZk1vbnRo'\n b'AADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////AAAAAUAAAAAFAAAATW9u'\n b'dGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIkAAAAFAAAAAQAAAAAAAAA'\n b'CAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAAAAA=')])" + ] + } + ], + "source": [ + "dataset.read?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## cleanup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# os.remove(parquet_file_path)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 5a80835a812cf0cd8e106721aac3dde4989cbead Mon Sep 17 00:00:00 2001 From: yasha Date: Mon, 27 Jan 2020 19:32:59 +0000 Subject: [PATCH 25/32] add dtype param for partitioning --- fileutils/arc_to_parquet/arc_to_parquet.py | 19 +- 
fileutils/arc_to_parquet/arc_to_parquet.yaml | 6 +- tests/arc_to_parquet-airlines.ipynb | 366 +++++-------------- 3 files changed, 107 insertions(+), 284 deletions(-) diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py b/fileutils/arc_to_parquet/arc_to_parquet.py index 9bc2ef361..853dc7234 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/arc_to_parquet.py @@ -26,6 +26,7 @@ import os import json from pathlib import Path +import numpy as np import pandas as pd import pyarrow.parquet as pq import pyarrow as pa @@ -42,14 +43,19 @@ def arc_to_parquet( target_path: str = "", name: str = "", chunksize: int = 10_000, + dtype=None, + encoding: str = 'latin-1', log_data: bool = True, add_uid: bool = False, key: str = "raw_data", dataset: bool = False, - partition_cols = [] + partition_cols = [], + inc_cols: Optional[List[str]] = None ) -> None: """Open a file/object archive and save as a parquet file. + Partitioning requires precise specification of column types. 
+ :param context: function context :param archive_url: any valid string path consistent with the path variable of pandas.read_csv, including strings as file paths, as urls, @@ -58,6 +64,8 @@ def arc_to_parquet( :param target_path: destination folder of table :param name: name file to be saved locally, also :param chunksize: (0) row size retrieved per iteration + :param inc_cols: include only these columns + :param dtype: destination data type of specified columns :param key: key in artifact store (when log_data=True) :param dataset: (False) if True then target_path is folder for partitioned files @@ -65,13 +73,18 @@ def arc_to_parquet( """ if not name.endswith(".pqt"): name += ".pqt" - + dest_path = os.path.join(target_path, name) os.makedirs(os.path.join(target_path), exist_ok=True) if not os.path.isfile(dest_path): context.logger.info("destination file does not exist, downloading") pqwriter = None - for i, df in enumerate(pd.read_csv(archive_url, chunksize=chunksize, names=header)): + for i, df in enumerate(pd.read_csv(archive_url, + chunksize=chunksize, + names=header, + encoding=encoding, + usecols=inc_cols, + dtype=dtype)): table = pa.Table.from_pandas(df) if i == 0: pqwriter = pq.ParquetWriter(dest_path, table.schema) diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index 2e3f78c21..194d37b87 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: 4f8217a2058dde23e82f681c5dc8d686b844ceb2 + hash: eca47e4446d7c75e2096b8dd4803aa557aad7d6e project: '' spec: command: '' @@ -12,7 +12,7 @@ spec: env: [] description: '' build: - functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQpmcm9tIHBpY2tsZSBpbXBvcnQgZHVtcCwgbG9hZAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gdHlwaW5nIGltcG9ydCBJTywgQW55U3RyLCBVbmlvbiwgTGlzdCwgT3B0aW9uYWwKCgpkZWYgYXJjX3RvX3BhcnF1ZXQoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwKICAgIGFyY2hpdmVfdXJsOiBVbmlvbltzdHIsIFBhdGgsIElPW0FueVN0cl1dLAogICAgaGVhZGVyOiBPcHRpb25hbFtMaXN0W3N0cl1dID0gTm9uZSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAiIiwKICAgIG5hbWU6IHN0ciA9ICIiLAogICAgY2h1bmtzaXplOiBpbnQgPSAxMF8wMDAsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCiAgICBkYXRhc2V0OiBib29sID0gRmFsc2UsCiAgICBwYXJ0aXRpb25fY29scyA9IFtdCikgLT4g
Tm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBhcmNoaXZlX3VybDogYW55IHZhbGlkIHN0cmluZyBwYXRoIGNvbnNpc3RlbnQgd2l0aCB0aGUgcGF0aCB2YXJpYWJsZQogICAgICAgICAgICAgICAgICAgICAgICBvZiBwYW5kYXMucmVhZF9jc3YsIGluY2x1ZGluZyBzdHJpbmdzIGFzIGZpbGUgcGF0aHMsIGFzIHVybHMsIAogICAgICAgICAgICAgICAgICAgICAgICBwYXRobGliLlBhdGggb2JqZWN0cywgZXRjLi4uCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgIGNvbHVtbiBuYW1lcwogICAgOnBhcmFtIHRhcmdldF9wYXRoOiBkZXN0aW5hdGlvbiBmb2xkZXIgb2YgdGFibGUKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgbmFtZSBmaWxlIHRvIGJlIHNhdmVkIGxvY2FsbHksIGFsc28KICAgIDpwYXJhbSBjaHVua3NpemU6ICAgKDApIHJvdyBzaXplIHJldHJpZXZlZCBwZXIgaXRlcmF0aW9uCiAgICA6cGFyYW0ga2V5OiAgICAgICAgIGtleSBpbiBhcnRpZmFjdCBzdG9yZSAod2hlbiBsb2dfZGF0YT1UcnVlKQogICAgOnBhcmFtIGRhdGFzZXQ6ICAgICAoRmFsc2UpIGlmIFRydWUgdGhlbiB0YXJnZXRfcGF0aCBpcyBmb2xkZXIgZm9yCiAgICAgICAgICAgICAgICAgICAgICAgIHBhcnRpdGlvbmVkIGZpbGVzCiAgICA6cGFyYW0gcGFydF9jb2xzOiAgIChbXSkgbGlzdCBvZiBwYXJ0aXRpb25pbmcgY29sdW1ucwogICAgIiIiCiAgICBpZiBub3QgbmFtZS5lbmRzd2l0aCgiLnBxdCIpOgogICAgICAgIG5hbWUgKz0gIi5wcXQiCgogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCBjaHVua3NpemU9Y2h1bmtzaXplLCBuYW1lcz1oZWFkZXIpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBpZiBkYXRhc2V0OgogICAgICAgICAgICAgICAgcHEud3JpdGVfdG9fZGF0YXNldCh0YWJsZSwgcm9vdF9wYXRoPXRhcmdldF9wYXRoLCBwYXJ0aXRpb25fY29scz1wYXJ0aXRpb25fY29scykKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRh
YmxlKQogICAgICAgICAgICAKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIAogICAgIyBsb2cgaGVhZGVyCiAgICBmaWxlcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgJ2hlYWRlci5wa2wnKQogICAgZHVtcChoZWFkZXIsIG9wZW4oZmlsZXBhdGgsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQo= + functionSourceCode: IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBudW1weSBhcyBucAppbXBvcnQgcGFuZGFzIGFzIHBkCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBwa
WNrbGUgaW1wb3J0IGR1bXAsIGxvYWQKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIHR5cGluZyBpbXBvcnQgSU8sIEFueVN0ciwgVW5pb24sIExpc3QsIE9wdGlvbmFsCgoKZGVmIGFyY190b19wYXJxdWV0KAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICBhcmNoaXZlX3VybDogVW5pb25bc3RyLCBQYXRoLCBJT1tBbnlTdHJdXSwKICAgIGhlYWRlcjogT3B0aW9uYWxbTGlzdFtzdHJdXSA9IE5vbmUsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gIiIsCiAgICBuYW1lOiBzdHIgPSAiIiwKICAgIGNodW5rc2l6ZTogaW50ID0gMTBfMDAwLAogICAgZHR5cGU9Tm9uZSwKICAgIGVuY29kaW5nOiBzdHIgPSAnbGF0aW4tMScsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCiAgICBkYXRhc2V0OiBib29sID0gRmFsc2UsCiAgICBwYXJ0aXRpb25fY29scyA9IFtdLAogICAgaW5jX2NvbHM6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICBQYXJ0aXRpb25pbmcgcmVxdWlyZXMgcHJlY2lzZSBzcGVjaWZpY2F0aW9uIG9mIGNvbHVtbiB0eXBlcy4KICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gYXJjaGl2ZV91cmw6IGFueSB2YWxpZCBzdHJpbmcgcGF0aCBjb25zaXN0ZW50IHdpdGggdGhlIHBhdGggdmFyaWFibGUKICAgICAgICAgICAgICAgICAgICAgICAgb2YgcGFuZGFzLnJlYWRfY3N2LCBpbmNsdWRpbmcgc3RyaW5ncyBhcyBmaWxlIHBhdGhzLCBhcyB1cmxzLCAKICAgICAgICAgICAgICAgICAgICAgICAgcGF0aGxpYi5QYXRoIG9iamVjdHMsIGV0Yy4uLgogICAgOnBhcmFtIGhlYWRlcjogICAgICBjb2x1bW4gbmFtZXMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogZGVzdGluYXRpb24gZm9sZGVyIG9mIHRhYmxlCiAgICA6cGFyYW0gbmFtZTogICAgICAgIG5hbWUgZmlsZSB0byBiZSBzYXZlZCBsb2NhbGx5LCBhbHNvCiAgICA6cGFyYW0gY2h1bmtzaXplOiAgICgwKSByb3cgc2l6ZSByZXRyaWV2ZWQgcGVyIGl0ZXJhdGlvbgogICAgOnBhcmFtIGluY19jb2xzOiAgICBpbmNsdWRlIG9ubHkgdGhlc2UgY29sdW1ucwogICAgOnBhcmFtIGR0eXBlICAgICAgICBkZXN0aW5hdGlvbiBkYXRhIHR5cGUgb2Ygc3BlY2lmaWVkIGNvbHVtbnMKICAgIDpwYXJhbSBrZXk6ICAgICAgICAga2V5IGluIGFydGlmYWN0IHN0b3JlICh3aGVuIGxvZ19kYXRhPVRydWUpCiAgICA6cGFyYW0gZGF0YXNldDogICAgIChGYWxzZSkgaWYgVHJ1ZSB0aGVuIHRhcmdldF9wYXRoIGlzIGZvbGRlciBmb3IKICAgICAgICAgICAgICAgICAgICAgICAgcGFydGl0aW9uZWQgZmlsZXMKICAgIDpwYXJhbSBwYXJ0X2NvbHM6ICAgKFtdKSBsaXN0I
G9mIHBhcnRpdGlvbmluZyBjb2x1bW5zCiAgICAiIiIKICAgIGlmIG5vdCBuYW1lLmVuZHN3aXRoKCIucHF0Iik6CiAgICAgICAgbmFtZSArPSAiLnBxdCIKICAgIAogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGNodW5rc2l6ZT1jaHVua3NpemUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbmFtZXM9aGVhZGVyLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGVuY29kaW5nPWVuY29kaW5nLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHVzZWNvbHM9aW5jX2NvbHMsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZHR5cGU9ZHR5cGUpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBpZiBkYXRhc2V0OgogICAgICAgICAgICAgICAgcHEud3JpdGVfdG9fZGF0YXNldCh0YWJsZSwgcm9vdF9wYXRoPXRhcmdldF9wYXRoLCBwYXJ0aXRpb25fY29scz1wYXJ0aXRpb25fY29scykKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRhYmxlKQogICAgICAgICAgICAKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIAogICAgIyBsb2cgaGVhZGVyCiAgICBmaWxlcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgJ2hlYWRlci5wa2wnKQogICAgZHVtcChoZWFkZXIsIG9wZW4oZmlsZXBhdGgsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQo= base_image: yjbds/mlrun-files:latest commands: [] - code_origin: 
https://github.com/yjb-ds/functions.git#eb009dac39c64611adac33d24c7e33ba0856c941:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py + code_origin: https://github.com/yjb-ds/functions.git#e67bbbba46a2b5767f5ed561da882d90a979e670:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py diff --git a/tests/arc_to_parquet-airlines.ipynb b/tests/arc_to_parquet-airlines.ipynb index dca07524d..c6da8182b 100644 --- a/tests/arc_to_parquet-airlines.ipynb +++ b/tests/arc_to_parquet-airlines.ipynb @@ -11,7 +11,7 @@ }, { "cell_type": "code", - "execution_count": 56, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -20,6 +20,18 @@ "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" ] }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import pyarrow as pa\n", + "import pyarrow.parquet as pq" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -29,17 +41,19 @@ }, { "cell_type": "code", - "execution_count": 67, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "BASE_IMAGE = 'yjbds/mlrun-files:latest'\n", "\n", "CODE_BASE = '/User/repos/functions/'\n", + "PROJECT = 'fileutils/arc_to_parquet'\n", + "\n", "TARGET_PATH = '/User/mlrun/airlines/dataset'\n", "\n", "ARCHIVE_BIG = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears_10.csv\"\n", - "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", + "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", "ARCHIVE_SMALL = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv\"" ] }, @@ -52,7 +66,7 @@ }, { "cell_type": "code", - "execution_count": 68, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -61,7 +75,7 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ @@ -73,12 +87,35 @@ " 
'UniqueCarrier','FlightNum','TailNum','ActualElapsedTime','CRSElapsedTime','AirTime',\n", " 'ArrDelay','DepDelay','Origin','Dest','Distance','TaxiIn','TaxiOut','Cancelled',\n", " 'CancellationCode','Diverted','CarrierDelay','WeatherDelay','NASDelay','SecurityDelay',\n", - " 'LateAircraftDelay']" + " 'LateAircraftDelay']\n", + "INC_COLS = ['Year','Month','DayofMonth','DayOfWeek','DepTime','CRSDepTime','ArrTime','CRSArrTime',\n", + " 'UniqueCarrier','FlightNum', 'CRSElapsedTime','AirTime',\n", + " 'Origin','Dest','Distance','TaxiOut','Cancelled',\n", + " 'CarrierDelay','WeatherDelay','NASDelay','SecurityDelay',\n", + " 'LateAircraftDelay']\n", + "\n", + "ENCODING = 'latin-1'\n", + "\n", + "DTYPES_COLS = {\n", + " 'CRSElapsedTime': 'float64', \n", + " 'TailNum': 'str', \n", + " 'Distance': 'float64', \n", + " 'TaxiOut': 'float64',\n", + " 'ArrTime': 'float64',\n", + " 'DepTime':'float64', \n", + " 'CarrierDelay': 'float64', \n", + " 'WeatherDelay': 'float64', \n", + " 'NASDelay':'float64', \n", + " 'SecurityDelay':'float64', \n", + " 'LateAircraftDelay':'float64'}\n", + "\n", + "USE_PARTITIONS = True\n", + "PARTITION_COLS = ['Year', 'Month']" ] }, { "cell_type": "code", - "execution_count": 70, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -96,24 +133,24 @@ }, { "cell_type": "code", - "execution_count": 71, + "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 15:27:39,471 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + "[mlrun] 2020-01-27 19:05:25,830 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" ] } ], "source": [ "# load function from a local Python file\n", "arctoparq = mlrun.code_to_function(\n", - " filename=os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.py'), \n", + " filename=os.path.join(CODE_BASE, PROJECT, 'arc_to_parquet.py'), \n", " 
kind='job')\n", "arctoparq.build_config(base_image=BASE_IMAGE, commands=[])\n", - "yaml_name = os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", + "yaml_name = os.path.join(CODE_BASE, PROJECT, 'arc_to_parquet.yaml')\n", "arctoparq.export(yaml_name)" ] }, @@ -126,12 +163,12 @@ }, { "cell_type": "code", - "execution_count": 72, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "arctoparq = mlrun.import_function(\n", - " os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", + " os.path.join(CODE_BASE, PROJECT, 'arc_to_parquet.yaml')\n", ").apply(mlrun.mount_v3io())" ] }, @@ -151,7 +188,7 @@ }, { "cell_type": "code", - "execution_count": 73, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -160,7 +197,7 @@ "'ready'" ] }, - "execution_count": 73, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -178,12 +215,13 @@ "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 15:27:41,838 starting run arc2parq uid=d3774f73733c4abeb45c9dd7a7a4862c -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-27 15:27:41,973 Job is running in the background, pod: arc2parq-6d86t\n" + "[mlrun] 2020-01-27 19:21:20,254 starting run arc2parq uid=d3a5446edb94436d91efe4b2d7c64c2b -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-27 19:21:20,379 Job is running in the background, pod: arc2parq-5qs84\n" ] } ], "source": [ + "%%time\n", "# create and run the task\n", "arc_to_parq_task = mlrun.NewTask(\n", " 'arc2parq', \n", @@ -193,9 +231,11 @@ " 'name' : FILE_NAME, \n", " 'key' : KEY,\n", " 'archive_url': USE_ARCHIVE,\n", - " 'dataset' : True,\n", - " 'part_cols' : ['Year', 'Month']\n", - " })\n", + " 'dataset' : USE_PARTITIONS,\n", + " 'partition_cols': PARTITION_COLS,\n", + " 'encoding' : ENCODING,\n", + " 'inc_cols' : INC_COLS,\n", + " 'dtype' : DTYPES_COLS})\n", "# run\n", "run = arctoparq.run(arc_to_parq_task)" ] @@ -218,298 +258,65 @@ "cell_type": "markdown", "metadata": {}, 
"source": [ - "### single parquet file\n", - "\n", - "run this only when `dataset=False`" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import numpy as np\n", - "import pandas as pd" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [], - "source": [ - "# add more context tests\n", - "# convert these to real tests" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [], - "source": [ - "assert KEY in run.outputs.keys(), f\"mlrun.functions: key {KEY} not found in outputs\"\n", - "assert os.path.isfile(TARGET_PATH+'/'+ FILE_NAME), f\"mlrun.functions: artifact source not found at {TARGET_PATH+'/'+ FILE_NAME}\"" - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [], - "source": [ - "copied = pd.read_parquet(TARGET_PATH+'/'+ FILE_NAME, engine=\"pyarrow\")" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
YearMonthDayofMonthDayOfWeekDepTimeCRSDepTimeArrTimeCRSArrTimeUniqueCarrierFlightNum...CancelledCancellationCodeDivertedCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelayIsArrDelayedIsDepDelayed
\n", - "

0 rows × 31 columns

\n", - "
" - ], - "text/plain": [ - "Empty DataFrame\n", - "Columns: [Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay, IsArrDelayed, IsDepDelayed]\n", - "Index: []\n", - "\n", - "[0 rows x 31 columns]" - ] - }, - "execution_count": 44, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "copied.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(0, 31)" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "copied.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "run this only when `dataset=True`" + "### a partitioned parquet table" ] }, { "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [], - "source": [ - "import pyarrow.parquet as pq\n", - "import pyarrow as pa" - ] - }, - { - "cell_type": "code", - "execution_count": 65, + "execution_count": 16, "metadata": {}, "outputs": [ { "ename": "ValueError", - "evalue": "Schema in /User/mlrun/airlines/dataset/a3f9441653c14708ad69f63694749301.parquet was different. 
\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: double\nActualElapsedTime: double\nCRSElapsedTime: int64\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: double\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'0, \"stop\": 10000, \"step\": 1}], \"column_indexes\": [{\"name\": n'\n b'ull, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_ty'\n b'pe\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns'\n b'\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\": \"i'\n b'nt64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"M'\n b'onth\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"numpy'\n b'_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth\", \"'\n b'field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"numpy_ty'\n b'pe\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"fiel'\n b'd_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_name\"'\n b': \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_name\": '\n b'\"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n b' \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"ArrTi'\n b'me\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": 
\"CRSArrT'\n b'ime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metada'\n b'ta\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"UniqueC'\n b'arrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"'\n b'metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"Fligh'\n b'tNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n b'ata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"ActualE'\n b'lapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_name\": \"A'\n b'rrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\",'\n b' \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\": \"DepD'\n b'elay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"m'\n b'etadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\", '\n b'\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\"'\n b': null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_type'\n b'\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"n'\n b'ame\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\": \"'\n b'float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name'\n b'\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"float64'\n b'\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Tax'\n b'iOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\", \"n'\n b'umpy_type\": \"float64\", 
\"metadata\": null}, {\"name\": \"Cancelle'\n b'd\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"CancellationC'\n b'ode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carrier'\n b'Delay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"float6'\n b'4\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"We'\n b'atherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_type\"'\n b': \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"n'\n b'ame\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDelay\"'\n b', \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metada'\n b'ta\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArrDel'\n b'ayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"IsDe'\n b'pDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",'\n b' \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"vers'\n b'ion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////6AWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKQPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMDAw'\n 
b'LCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51'\n b'bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNv'\n b'ZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVu'\n b'Y29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogIlll'\n b'YXIiLCAiZmllbGRfbmFtZSI6ICJZZWFyIiwgInBhbmRhc190eXBlIjogImlu'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk1vbnRoIiwgImZpZWxkX25hbWUiOiAiTW9udGgiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRGF5b2ZNb250aCIsICJm'\n b'aWVsZF9uYW1lIjogIkRheW9mTW9udGgiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiRGF5T2ZXZWVrIiwgImZpZWxkX25hbWUiOiAiRGF5T2ZX'\n b'ZWVrIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlcFRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJEZXBUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJDUlNEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAi'\n b'Q1JTRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90'\n b'eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'cnJUaW1lIiwgImZpZWxkX25hbWUiOiAiQXJyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJmaWVsZF9u'\n b'YW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVuaXF1ZUNh'\n b'cnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBl'\n b'IjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRmxp'\n b'Z2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIlRhaWxOdW0iLCAiZmllbGRfbmFtZSI6'\n 
b'ICJUYWlsTnVtIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVhbEVs'\n b'YXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRU'\n b'aW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJBcnJEZWxheSIsICJmaWVsZF9uYW1lIjogIkFy'\n b'ckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJE'\n b'ZXBEZWxheSIsICJmaWVsZF9uYW1lIjogIkRlcERlbGF5IiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJPcmlnaW4iLCAiZmllbGRfbmFt'\n b'ZSI6ICJPcmlnaW4iLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRGVzdCIsICJmaWVsZF9uYW1lIjogIkRlc3QiLCAicGFuZGFzX3R5cGUi'\n b'OiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiRGlzdGFuY2UiLCAiZmllbGRfbmFtZSI6'\n b'ICJEaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25h'\n b'bWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwg'\n b'InBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQi'\n 
b'LCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlvbkNv'\n b'ZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJmaWVs'\n b'ZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJpZXJE'\n b'ZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiV2Vh'\n b'dGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIsICJm'\n b'aWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVs'\n b'bH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'U2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJMYXRl'\n b'QWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJEZWxh'\n b'eWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6'\n b'ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklzRGVw'\n b'RGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJwYW5k'\n b'YXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJw'\n b'eWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lv'\n b'biI6ICIwLjI1LjMifQAAAAAfAAAAWAYAABQGAADcBQAApAUAAGwFAAA0BQAA'\n b'BAUAAMwEAACUBAAAXAQAACwEAADwAwAAtAMAAIQDAABQAwAAHAMAAPACAADE'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n 
b'AAA4AAAABAAAADT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMj7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPj7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJT6//8AAAEDGAAAAAwAAAAEAAAAAAAAAF77//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAAzPr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAlvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAD7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAMr7//8AAAIACAAAAE5BU0RlbGF5AAAAADD7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAPr7//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABk'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAu/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAmPv//wAAAQIcAAAADAAAAAQAAAAAAAAAiPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAMz7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJb8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAABPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA9Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADj8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAL9//8AAAIABwAAAFRheGlPdXQAZPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAALv3//wAAAgAGAAAAVGF4aUluAACQ/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAABa/f//AAACAAgAAABEaXN0YW5jZQAAAADA/P//AAABBRQAAAAM'\n b'AAAABAAAAAAAAABU/v//BAAAAERlc3QAAAAA6Pz//wAAAQUUAAAADAAAAAQA'\n b'AAAAAAAAfP7//wYAAABPcmlnaW4AABD9//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'ANr9//8AAAIACAAAAERlcERlbGF5AAAAAED9//8AAAEDGAAAAAwAAAAEAAAA'\n b'AAAAAAr+//8AAAIACAAAAEFyckRlbGF5AAAAAHD9//8AAAEDGAAAAAwAAAAE'\n b'AAAAAAAAADr+//8AAAIABwAAAEFpclRpbWUAnP3//wAAAQIcAAAADAAAAAQA'\n b'AAAAAAAAjP3//wAAAAFAAAAADgAAAENSU0VsYXBzZWRUaW1lAADU/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACe/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAAAz+//8AAAEDGAAAAAwAAAAEAAAAAAAAANb+//8AAAIABwAAAFRhaWxO'\n b'dW0AOP7//wAAAQIcAAAADAAAAAQAAAAAAAAAKP7//wAAAAFAAAAACQAAAEZs'\n b'aWdodE51bQAAAGz+//8AAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAADQAA'\n b'AFVuaXF1ZUNhcnJpZXIAAACg/v//AAABAhwAAAAMAAAABAAAAAAAAACQ/v//'\n b'AAAAAUAAAAAKAAAAQ1JTQXJyVGltZQAA1P7//wAAAQMYAAAADAAAAAQAAAAA'\n b'AAAAnv///wAAAgAHAAAAQXJyVGltZQAA////AAABAhwAAAAMAAAABAAAAAAA'\n b'AADw/v//AAAAAUAAAAAKAAAAQ1JTRGVwVGltZQAANP///wAAAQMgAAAAFAAA'\n b'AAQAAAAAAAAAAAAGAAgABgAGAAAAAAACAAcAAABEZXBUaW1lAGj///8AAAEC'\n 
b'HAAAAAwAAAAEAAAAAAAAAFj///8AAAABQAAAAAkAAABEYXlPZldlZWsAAACc'\n b'////AAABAhwAAAAMAAAABAAAAAAAAACM////AAAAAUAAAAAKAAAARGF5b2ZN'\n b'b250aAAA0P///wAAAQIcAAAADAAAAAQAAAAAAAAAwP///wAAAAFAAAAABQAA'\n b'AE1vbnRoAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAA'\n b'AAAAAAgADAAIAAcACAAAAAAAAAFAAAAABAAAAFllYXIAAAAAAAAAAA==')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: string\nActualElapsedTime: double\nCRSElapsedTime: double\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: int64\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'10000, \"stop\": 20000, \"step\": 1}], \"column_indexes\": [{\"name'\n b'\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"nump'\n b'y_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"col'\n b'umns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\"'\n b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n b': \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth'\n b'\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"'\n b'field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_typ'\n b'e\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_n'\n b'ame\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, 
{\"name\": \"CRSDepTime\", \"field_nam'\n b'e\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int'\n b'64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"A'\n b'rrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", '\n b'\"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRS'\n b'ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"me'\n b'tadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"Uni'\n b'queCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object'\n b'\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"F'\n b'lightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"m'\n b'etadata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"Actu'\n b'alElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"flo'\n b'at64\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_'\n b'name\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_ty'\n b'pe\": \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"fiel'\n b'd_name\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": '\n b'\"float64\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_na'\n b'me\": \"ArrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\"'\n b': \"DepDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float'\n b'64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Or'\n b'igin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pand'\n b'as_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": nul'\n b'l}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_t'\n 
b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n b'ame\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\",'\n b' \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cance'\n b'lled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Cancellati'\n b'onCode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArr'\n b'Delayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"I'\n b'sDepDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"objec'\n b't\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": 
\"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////5gWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKcPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAxMDAwMCwgInN0b3AiOiAy'\n b'MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUi'\n b'OiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1'\n b'bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjog'\n b'eyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6'\n b'ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNfdHlwZSI6'\n b'ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjogIk1vbnRo'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9mTW9udGgi'\n b'LCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1lIjogIkRh'\n b'eU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBl'\n b'IjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEZXBU'\n b'aW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6'\n b'ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVsZF9uYW1l'\n b'IjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUiLCAiZmll'\n b'bGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6ICJVbmlx'\n 
b'dWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIsICJwYW5k'\n b'YXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYWlsTnVtIiwgImZpZWxkX25h'\n b'bWUiOiAiVGFpbE51bSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51'\n b'bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVh'\n b'bEVsYXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBz'\n b'ZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'aXJUaW1lIiwgImZpZWxkX25hbWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQXJyRGVsYXkiLCAiZmllbGRfbmFt'\n b'ZSI6ICJBcnJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiRGVwRGVsYXkiLCAiZmllbGRfbmFtZSI6ICJEZXBEZWxheSIsICJw'\n b'YW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiT3JpZ2luIiwgImZp'\n b'ZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUi'\n b'LCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFtZSI6ICJEZXN0IiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRpc3RhbmNlIiwgImZpZWxk'\n b'X25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n 
b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxk'\n b'X25hbWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVk'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlv'\n b'bkNvZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJm'\n b'aWVsZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJp'\n b'ZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5Iiwg'\n b'InBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIs'\n b'ICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUi'\n b'OiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJM'\n b'YXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJE'\n b'ZWxheWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlw'\n b'ZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklz'\n b'RGVwRGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJw'\n b'YW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0'\n 
b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAfAAAAVAYAABAGAADYBQAAoAUAAGgFAAAwBQAA'\n b'AAUAAMgEAACQBAAAWAQAACwEAADwAwAAuAMAAIgDAABUAwAAIAMAAPQCAADI'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMz7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPz7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJj6//8AAAEDGAAAAAwAAAAEAAAAAAAAAGL7//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAA0Pr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAmvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAT7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAM77//8AAAIACAAAAE5BU0RlbGF5AAAAADT7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAP77//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABo'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAy/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAnPv//wAAAQIcAAAADAAAAAQAAAAAAAAAjPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAND7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJr8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAACPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA+Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADz8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAb9//8AAAIABwAAAFRheGlPdXQAaPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAAMv3//wAAAgAGAAAAVGF4aUluAACU/P//AAABAhwAAAAMAAAA'\n b'BAAAAAAAAACE/P//AAAAAUAAAAAIAAAARGlzdGFuY2UAAAAAyPz//wAAAQUU'\n b'AAAADAAAAAQAAAAAAAAAXP7//wQAAABEZXN0AAAAAPD8//8AAAEFFAAAAAwA'\n b'AAAEAAAAAAAAAIT+//8GAAAAT3JpZ2luAAAY/f//AAABAxgAAAAMAAAABAAA'\n b'AAAAAADi/f//AAACAAgAAABEZXBEZWxheQAAAABI/f//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAS/v//AAACAAgAAABBcnJEZWxheQAAAAB4/f//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABC/v//AAACAAcAAABBaXJUaW1lAKT9//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAG7+//8AAAIADgAAAENSU0VsYXBzZWRUaW1lAADY/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACi/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAABD+//8AAAEFFAAAAAwAAAAEAAAAAAAAAKT///8HAAAAVGFpbE51bQA4'\n b'/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAAAUAAAAAJAAAARmxpZ2h0'\n b'TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAABAAEAAQAAAANAAAAVW5p'\n 
b'cXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAEAAAAAAAAAJD+//8AAAAB'\n b'QAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgAAAAMAAAABAAAAAAAAACe'\n b'////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAAAAwAAAAEAAAAAAAAAPD+'\n b'//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////AAABAyAAAAAUAAAABAAA'\n b'AAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRpbWUAaP///wAAAQIcAAAA'\n b'DAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERheU9mV2VlawAAAJz///8A'\n b'AAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAAAAoAAABEYXlvZk1vbnRo'\n b'AADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////AAAAAUAAAAAFAAAATW9u'\n b'dGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIkAAAAFAAAAAQAAAAAAAAA'\n b'CAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAAAAA=')])", + "evalue": "Schema in /User/mlrun/airlines/dataset/1278d2c85afc40cabc8e5add8d12892e.parquet was different. \nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: int64\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nCRSElapsedTime: double\nAirTime: int64\nOrigin: string\nDest: string\nDistance: double\nTaxiOut: double\nCancelled: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'95550000, \"stop\": 95560000, \"step\": 1}], \"column_indexes\": ['\n b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n b', \"columns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas'\n b'_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {'\n b'\"name\": \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Dayo'\n b'fMonth\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWe'\n b'ek\", \"field_name\": 
\"DayOfWeek\", \"pandas_type\": \"int64\", \"num'\n b'py_type\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"f'\n b'ield_name\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type'\n b'\": \"float64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"fie'\n b'ld_name\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\"'\n b': \"int64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_nam'\n b'e\": \"ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\"'\n b', \"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"C'\n b'RSArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"'\n b'metadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"U'\n b'niqueCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"obje'\n b'ct\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": '\n b'\"FlightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", '\n b'\"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_name\": '\n b'\"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"f'\n b'loat64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", '\n b'\"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_ty'\n b'pe\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {'\n b'\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"Cancelled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": 
\"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}], \"creator\": {\"library\": \"pyarrow\", \"version\":'\n b' \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////4AQAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAAJwLAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAGcLAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiA5NTU1MDAwMCwgInN0b3Ai'\n b'OiA5NTU2MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7'\n b'Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlw'\n b'ZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFk'\n b'YXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3si'\n b'bmFtZSI6ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNf'\n b'dHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjog'\n b'Ik1vbnRoIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9m'\n b'TW9udGgiLCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1l'\n 
b'IjogIkRheU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1w'\n b'eV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJt'\n b'ZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQi'\n b'LCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJm'\n b'aWVsZF9uYW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVu'\n b'aXF1ZUNhcnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRmxpZ2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBh'\n b'bmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0VsYXBzZWRUaW1lIiwg'\n b'ImZpZWxkX25hbWUiOiAiQ1JTRWxhcHNlZFRpbWUiLCAicGFuZGFzX3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUiLCAiZmllbGRfbmFtZSI6'\n b'ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIk9y'\n b'aWdpbiIsICJmaWVsZF9uYW1lIjogIk9yaWdpbiIsICJwYW5kYXNfdHlwZSI6'\n b'ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRh'\n b'IjogbnVsbH0sIHsibmFtZSI6ICJEZXN0IiwgImZpZWxkX25hbWUiOiAiRGVz'\n b'dCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAi'\n b'b2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXN0YW5j'\n b'ZSIsICJmaWVsZF9uYW1lIjogIkRpc3RhbmNlIiwgInBhbmRhc190eXBlIjog'\n b'ImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRh'\n 
b'IjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25hbWUiOiAi'\n b'VGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'Q2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwgInBhbmRh'\n b'c190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0'\n b'YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNhcnJpZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51'\n b'bGx9LCB7Im5hbWUiOiAiV2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJOQVNEZWxheSIsICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5'\n b'IiwgImZpZWxkX25hbWUiOiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAi'\n b'ZmllbGRfbmFtZSI6ICJMYXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJv'\n b'dyIsICJ2ZXJzaW9uIjogIjAuMTUuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAi'\n b'MC4yNS4zIn0AFgAAAHwEAAA4BAAAAAQAAMgDAACQAwAAWAMAACQDAADsAgAA'\n b'tAIAAHwCAABEAgAAEAIAAOQBAAC4AQAAhAEAAFQBAAAcAQAA5AAAAKwAAAB4'\n b'AAAAQAAAAAQAAADs+///AAABAxgAAAAMAAAABAAAAAAAAAC2/P//AAACABEA'\n b'AABMYXRlQWlyY3JhZnREZWxheQAAACT8//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'AO78//8AAAIADQAAAFNlY3VyaXR5RGVsYXkAAABY/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAi/f//AAACAAgAAABOQVNEZWxheQAAAACI/P//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABS/f//AAACAAwAAABXZWF0aGVyRGVsYXkAAAAAvPz//wAA'\n b'AQMYAAAADAAAAAQAAAAAAAAAhv3//wAAAgAMAAAAQ2FycmllckRlbGF5AAAA'\n b'APD8//8AAAECHAAAAAwAAAAEAAAAAAAAAOD8//8AAAABQAAAAAkAAABDYW5j'\n 
b'ZWxsZWQAAAAk/f//AAABAxgAAAAMAAAABAAAAAAAAADu/f//AAACAAcAAABU'\n b'YXhpT3V0AFD9//8AAAEDGAAAAAwAAAAEAAAAAAAAABr+//8AAAIACAAAAERp'\n b'c3RhbmNlAAAAAID9//8AAAEFFAAAAAwAAAAEAAAAAAAAABj///8EAAAARGVz'\n b'dAAAAACo/f//AAABBRQAAAAMAAAABAAAAAAAAABA////BgAAAE9yaWdpbgAA'\n b'0P3//wAAAQIcAAAADAAAAAQAAAAAAAAAwP3//wAAAAFAAAAABwAAAEFpclRp'\n b'bWUAAP7//wAAAQMYAAAADAAAAAQAAAAAAAAAyv7//wAAAgAOAAAAQ1JTRWxh'\n b'cHNlZFRpbWUAADT+//8AAAECHAAAAAwAAAAEAAAAAAAAACT+//8AAAABQAAA'\n b'AAkAAABGbGlnaHROdW0AAABo/v//AAABBRgAAAAQAAAABAAAAAAAAAAEAAQA'\n b'BAAAAA0AAABVbmlxdWVDYXJyaWVyAAAAnP7//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAAjP7//wAAAAFAAAAACgAAAENSU0FyclRpbWUAAND+//8AAAECHAAAAAwA'\n b'AAAEAAAAAAAAAMD+//8AAAABQAAAAAcAAABBcnJUaW1lAAD///8AAAECHAAA'\n b'AAwAAAAEAAAAAAAAAPD+//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////'\n b'AAABAyAAAAAUAAAABAAAAAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRp'\n b'bWUAaP///wAAAQIcAAAADAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERh'\n b'eU9mV2VlawAAAJz///8AAAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAA'\n b'AAoAAABEYXlvZk1vbnRoAADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////'\n b'AAAAAUAAAAAFAAAATW9udGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIk'\n b'AAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAA'\n b'AAA=')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nCRSElapsedTime: double\nAirTime: double\nOrigin: string\nDest: string\nDistance: double\nTaxiOut: double\nCancelled: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'90730000, \"stop\": 90740000, \"step\": 1}], \"column_indexes\": ['\n b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n b', \"columns\": [{\"name\": 
\"Year\", \"field_name\": \"Year\", \"pandas'\n b'_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {'\n b'\"name\": \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Dayo'\n b'fMonth\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWe'\n b'ek\", \"field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"num'\n b'py_type\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"f'\n b'ield_name\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type'\n b'\": \"float64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"fie'\n b'ld_name\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\"'\n b': \"int64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_nam'\n b'e\": \"ArrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"floa'\n b't64\", \"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\"'\n b': \"CRSArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64'\n b'\", \"metadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\"'\n b': \"UniqueCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"'\n b'object\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_nam'\n b'e\": \"FlightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\"'\n b': \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_n'\n b'ame\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": '\n b'\"Origin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"p'\n b'andas_type\": \"unicode\", \"numpy_type\": \"object\", 
\"metadata\": '\n b'null}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"panda'\n b's_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": nul'\n b'l}, {\"name\": \"TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_typ'\n b'e\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {'\n b'\"name\": \"Cancelled\", \"field_name\": \"Cancelled\", \"pandas_type'\n b'\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name'\n b'\": \"CarrierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_typ'\n b'e\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {'\n b'\"name\": \"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"panda'\n b's_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": nul'\n b'l}, {\"name\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_t'\n b'ype\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null},'\n b' {\"name\": \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAir'\n b'craftDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////4AQAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAAKQLAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAG8LAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiA5MDczMDAwMCwgInN0b3Ai'\n b'OiA5MDc0MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7'\n b'Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlw'\n b'ZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFk'\n b'YXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3si'\n b'bmFtZSI6ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNf'\n 
b'dHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjog'\n b'Ik1vbnRoIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9m'\n b'TW9udGgiLCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1l'\n b'IjogIkRheU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1w'\n b'eV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJt'\n b'ZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQi'\n b'LCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6'\n b'ICJVbmlxdWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAi'\n b'bnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIs'\n b'ICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJDUlNFbGFwc2VkVGlt'\n b'ZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRUaW1lIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJBaXJUaW1lIiwgImZpZWxkX25h'\n b'bWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n 
b'bWUiOiAiT3JpZ2luIiwgImZpZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFt'\n b'ZSI6ICJEZXN0IiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkRpc3RhbmNlIiwgImZpZWxkX25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlRheGlPdXQiLCAiZmllbGRf'\n b'bmFtZSI6ICJUYXhpT3V0IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsi'\n b'bmFtZSI6ICJDYW5jZWxsZWQiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsZWQi'\n b'LCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ2FycmllckRlbGF5'\n b'IiwgImZpZWxkX25hbWUiOiAiQ2FycmllckRlbGF5IiwgInBhbmRhc190eXBl'\n b'IjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJXZWF0aGVyRGVsYXkiLCAiZmllbGRf'\n b'bmFtZSI6ICJXZWF0aGVyRGVsYXkiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk5BU0RlbGF5IiwgImZpZWxkX25hbWUiOiAiTkFTRGVs'\n b'YXkiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjog'\n b'ImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlNlY3Vy'\n b'aXR5RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJTZWN1cml0eURlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJMYXRlQWlyY3JhZnRE'\n b'ZWxheSIsICJmaWVsZF9uYW1lIjogIkxhdGVBaXJjcmFmdERlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAWAAAAdAQAADAEAAD4AwAAwAMAAIgDAABQAwAA'\n b'IAMAAOgCAACwAgAAeAIAAEACAAAQAgAA5AEAALgBAACEAQAAVAEAABwBAADk'\n 
b'AAAArAAAAHgAAABAAAAABAAAAPT7//8AAAEDGAAAAAwAAAAEAAAAAAAAAL78'\n b'//8AAAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAALPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAA9vz//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAGD8//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAACr9//8AAAIACAAAAE5BU0RlbGF5AAAAAJD8//8A'\n b'AAEDGAAAAAwAAAAEAAAAAAAAAFr9//8AAAIADAAAAFdlYXRoZXJEZWxheQAA'\n b'AADE/P//AAABAxgAAAAMAAAABAAAAAAAAACO/f//AAACAAwAAABDYXJyaWVy'\n b'RGVsYXkAAAAA+Pz//wAAAQIcAAAADAAAAAQAAAAAAAAA6Pz//wAAAAFAAAAA'\n b'CQAAAENhbmNlbGxlZAAAACz9//8AAAEDGAAAAAwAAAAEAAAAAAAAAPb9//8A'\n b'AAIABwAAAFRheGlPdXQAWP3//wAAAQMYAAAADAAAAAQAAAAAAAAAIv7//wAA'\n b'AgAIAAAARGlzdGFuY2UAAAAAiP3//wAAAQUUAAAADAAAAAQAAAAAAAAAHP//'\n b'/wQAAABEZXN0AAAAALD9//8AAAEFFAAAAAwAAAAEAAAAAAAAAET///8GAAAA'\n b'T3JpZ2luAADY/f//AAABAxgAAAAMAAAABAAAAAAAAACi/v//AAACAAcAAABB'\n b'aXJUaW1lAAT+//8AAAEDGAAAAAwAAAAEAAAAAAAAAM7+//8AAAIADgAAAENS'\n b'U0VsYXBzZWRUaW1lAAA4/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAA'\n b'AUAAAAAJAAAARmxpZ2h0TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAA'\n b'BAAEAAQAAAANAAAAVW5pcXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAE'\n b'AAAAAAAAAJD+//8AAAABQAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgA'\n b'AAAMAAAABAAAAAAAAACe////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAA'\n b'AAwAAAAEAAAAAAAAAPD+//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////'\n b'AAABAyAAAAAUAAAABAAAAAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRp'\n b'bWUAaP///wAAAQIcAAAADAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERh'\n b'eU9mV2VlawAAAJz///8AAAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAA'\n b'AAoAAABEYXlvZk1vbnRoAADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////'\n b'AAAAAUAAAAAFAAAATW9udGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIk'\n b'AAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAA'\n b'AAA=')])", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m 
\u001b[0;34m=\u001b[0m \u001b[0mpq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mParquetDataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTARGET_PATH\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mParquetDataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTARGET_PATH\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size)\u001b[0m\n\u001b[1;32m 1058\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1059\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mvalidate_schema\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1060\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalidate_schemas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1061\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1062\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mequals\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36mvalidate_schemas\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1111\u001b[0m \u001b[0;34m'{1!s}\\n\\nvs\\n\\n{2!s}'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1112\u001b[0m .format(piece, file_schema,\n\u001b[0;32m-> 
1113\u001b[0;31m dataset_schema))\n\u001b[0m\u001b[1;32m 1114\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1115\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_threads\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_pandas_metadata\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mValueError\u001b[0m: Schema in /User/mlrun/airlines/dataset/a3f9441653c14708ad69f63694749301.parquet was different. \nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: double\nActualElapsedTime: double\nCRSElapsedTime: int64\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: double\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'0, \"stop\": 10000, \"step\": 1}], \"column_indexes\": [{\"name\": n'\n b'ull, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_ty'\n b'pe\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns'\n b'\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\": \"i'\n b'nt64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"M'\n b'onth\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"numpy'\n b'_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth\", 
\"'\n b'field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"numpy_ty'\n b'pe\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"fiel'\n b'd_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_name\"'\n b': \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_name\": '\n b'\"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n b' \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"ArrTi'\n b'me\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRSArrT'\n b'ime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metada'\n b'ta\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"UniqueC'\n b'arrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"'\n b'metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"Fligh'\n b'tNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n b'ata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"ActualE'\n b'lapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_name\": \"A'\n b'rrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\",'\n b' \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\": \"DepD'\n b'elay\", \"pandas_type\": \"float64\", 
\"numpy_type\": \"float64\", \"m'\n b'etadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\", '\n b'\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\"'\n b': null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_type'\n b'\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"n'\n b'ame\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\": \"'\n b'float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name'\n b'\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"float64'\n b'\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Tax'\n b'iOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\", \"n'\n b'umpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cancelle'\n b'd\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"CancellationC'\n b'ode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carrier'\n b'Delay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"float6'\n b'4\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"We'\n b'atherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_type\"'\n b': \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"n'\n b'ame\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDelay\"'\n b', \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metada'\n b'ta\": 
null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArrDel'\n b'ayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"IsDe'\n b'pDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",'\n b' \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"vers'\n b'ion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////6AWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKQPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMDAw'\n b'LCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51'\n b'bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNv'\n b'ZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVu'\n b'Y29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogIlll'\n b'YXIiLCAiZmllbGRfbmFtZSI6ICJZZWFyIiwgInBhbmRhc190eXBlIjogImlu'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk1vbnRoIiwgImZpZWxkX25hbWUiOiAiTW9udGgiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRGF5b2ZNb250aCIsICJm'\n b'aWVsZF9uYW1lIjogIkRheW9mTW9udGgiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiRGF5T2ZXZWVrIiwgImZpZWxkX25hbWUiOiAiRGF5T2ZX'\n b'ZWVrIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlcFRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJEZXBUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJDUlNEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAi'\n b'Q1JTRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90'\n b'eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n 
b'cnJUaW1lIiwgImZpZWxkX25hbWUiOiAiQXJyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJmaWVsZF9u'\n b'YW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVuaXF1ZUNh'\n b'cnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBl'\n b'IjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRmxp'\n b'Z2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIlRhaWxOdW0iLCAiZmllbGRfbmFtZSI6'\n b'ICJUYWlsTnVtIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVhbEVs'\n b'YXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRU'\n b'aW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJBcnJEZWxheSIsICJmaWVsZF9uYW1lIjogIkFy'\n b'ckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJE'\n b'ZXBEZWxheSIsICJmaWVsZF9uYW1lIjogIkRlcERlbGF5IiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJPcmlnaW4iLCAiZmllbGRfbmFt'\n b'ZSI6ICJPcmlnaW4iLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRGVzdCIsICJmaWVsZF9uYW1lIjogIkRlc3QiLCAicGFuZGFzX3R5cGUi'\n 
b'OiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiRGlzdGFuY2UiLCAiZmllbGRfbmFtZSI6'\n b'ICJEaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25h'\n b'bWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwg'\n b'InBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQi'\n b'LCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlvbkNv'\n b'ZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJmaWVs'\n b'ZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJpZXJE'\n b'ZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiV2Vh'\n b'dGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIsICJm'\n b'aWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVs'\n b'bH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'U2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJMYXRl'\n b'QWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n 
b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJEZWxh'\n b'eWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6'\n b'ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklzRGVw'\n b'RGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJwYW5k'\n b'YXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJw'\n b'eWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lv'\n b'biI6ICIwLjI1LjMifQAAAAAfAAAAWAYAABQGAADcBQAApAUAAGwFAAA0BQAA'\n b'BAUAAMwEAACUBAAAXAQAACwEAADwAwAAtAMAAIQDAABQAwAAHAMAAPACAADE'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMj7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPj7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJT6//8AAAEDGAAAAAwAAAAEAAAAAAAAAF77//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAAzPr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAlvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAD7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAMr7//8AAAIACAAAAE5BU0RlbGF5AAAAADD7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAPr7//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABk'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAu/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAmPv//wAAAQIcAAAADAAAAAQAAAAAAAAAiPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAMz7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJb8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAABPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA9Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADj8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAL9//8AAAIABwAAAFRheGlPdXQAZPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAALv3//wAAAgAGAAAAVGF4aUluAACQ/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAABa/f//AAACAAgAAABEaXN0YW5jZQAAAADA/P//AAABBRQAAAAM'\n b'AAAABAAAAAAAAABU/v//BAAAAERlc3QAAAAA6Pz//wAAAQUUAAAADAAAAAQA'\n b'AAAAAAAAfP7//wYAAABPcmlnaW4AABD9//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'ANr9//8AAAIACAAAAERlcERlbGF5AAAAAED9//8AAAEDGAAAAAwAAAAEAAAA'\n b'AAAAAAr+//8AAAIACAAAAEFyckRlbGF5AAAAAHD9//8AAAEDGAAAAAwAAAAE'\n 
b'AAAAAAAAADr+//8AAAIABwAAAEFpclRpbWUAnP3//wAAAQIcAAAADAAAAAQA'\n b'AAAAAAAAjP3//wAAAAFAAAAADgAAAENSU0VsYXBzZWRUaW1lAADU/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACe/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAAAz+//8AAAEDGAAAAAwAAAAEAAAAAAAAANb+//8AAAIABwAAAFRhaWxO'\n b'dW0AOP7//wAAAQIcAAAADAAAAAQAAAAAAAAAKP7//wAAAAFAAAAACQAAAEZs'\n b'aWdodE51bQAAAGz+//8AAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAADQAA'\n b'AFVuaXF1ZUNhcnJpZXIAAACg/v//AAABAhwAAAAMAAAABAAAAAAAAACQ/v//'\n b'AAAAAUAAAAAKAAAAQ1JTQXJyVGltZQAA1P7//wAAAQMYAAAADAAAAAQAAAAA'\n b'AAAAnv///wAAAgAHAAAAQXJyVGltZQAA////AAABAhwAAAAMAAAABAAAAAAA'\n b'AADw/v//AAAAAUAAAAAKAAAAQ1JTRGVwVGltZQAANP///wAAAQMgAAAAFAAA'\n b'AAQAAAAAAAAAAAAGAAgABgAGAAAAAAACAAcAAABEZXBUaW1lAGj///8AAAEC'\n b'HAAAAAwAAAAEAAAAAAAAAFj///8AAAABQAAAAAkAAABEYXlPZldlZWsAAACc'\n b'////AAABAhwAAAAMAAAABAAAAAAAAACM////AAAAAUAAAAAKAAAARGF5b2ZN'\n b'b250aAAA0P///wAAAQIcAAAADAAAAAQAAAAAAAAAwP///wAAAAFAAAAABQAA'\n b'AE1vbnRoAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAA'\n b'AAAAAAgADAAIAAcACAAAAAAAAAFAAAAABAAAAFllYXIAAAAAAAAAAA==')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: string\nActualElapsedTime: double\nCRSElapsedTime: double\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: int64\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'10000, \"stop\": 20000, \"step\": 1}], \"column_indexes\": [{\"name'\n b'\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"nump'\n b'y_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], 
\"col'\n b'umns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\"'\n b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n b': \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth'\n b'\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"'\n b'field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_typ'\n b'e\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_n'\n b'ame\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_nam'\n b'e\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int'\n b'64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"A'\n b'rrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", '\n b'\"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRS'\n b'ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"me'\n b'tadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"Uni'\n b'queCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object'\n b'\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"F'\n b'lightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"m'\n b'etadata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"Actu'\n b'alElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"flo'\n b'at64\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_'\n b'name\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_ty'\n b'pe\": \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"fiel'\n b'd_name\": \"AirTime\", 
\"pandas_type\": \"float64\", \"numpy_type\": '\n b'\"float64\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_na'\n b'me\": \"ArrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\"'\n b': \"DepDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float'\n b'64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Or'\n b'igin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pand'\n b'as_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": nul'\n b'l}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_t'\n b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n b'ame\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\",'\n b' \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cance'\n b'lled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Cancellati'\n b'onCode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, 
{\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArr'\n b'Delayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"I'\n b'sDepDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"objec'\n b't\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////5gWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKcPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAxMDAwMCwgInN0b3AiOiAy'\n b'MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUi'\n b'OiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1'\n b'bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjog'\n b'eyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6'\n b'ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNfdHlwZSI6'\n b'ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjogIk1vbnRo'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9mTW9udGgi'\n b'LCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1lIjogIkRh'\n b'eU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBl'\n b'IjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEZXBU'\n 
b'aW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6'\n b'ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVsZF9uYW1l'\n b'IjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUiLCAiZmll'\n b'bGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6ICJVbmlx'\n b'dWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIsICJwYW5k'\n b'YXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYWlsTnVtIiwgImZpZWxkX25h'\n b'bWUiOiAiVGFpbE51bSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51'\n b'bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVh'\n b'bEVsYXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBz'\n b'ZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'aXJUaW1lIiwgImZpZWxkX25hbWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQXJyRGVsYXkiLCAiZmllbGRfbmFt'\n b'ZSI6ICJBcnJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiRGVwRGVsYXkiLCAiZmllbGRfbmFtZSI6ICJEZXBEZWxheSIsICJw'\n 
b'YW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiT3JpZ2luIiwgImZp'\n b'ZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUi'\n b'LCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFtZSI6ICJEZXN0IiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRpc3RhbmNlIiwgImZpZWxk'\n b'X25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxk'\n b'X25hbWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVk'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlv'\n b'bkNvZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJm'\n b'aWVsZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJp'\n b'ZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5Iiwg'\n b'InBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIs'\n b'ICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n 
b'bnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUi'\n b'OiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJM'\n b'YXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJE'\n b'ZWxheWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlw'\n b'ZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklz'\n b'RGVwRGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJw'\n b'YW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAfAAAAVAYAABAGAADYBQAAoAUAAGgFAAAwBQAA'\n b'AAUAAMgEAACQBAAAWAQAACwEAADwAwAAuAMAAIgDAABUAwAAIAMAAPQCAADI'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMz7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPz7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJj6//8AAAEDGAAAAAwAAAAEAAAAAAAAAGL7//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAA0Pr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAmvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAT7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAM77//8AAAIACAAAAE5BU0RlbGF5AAAAADT7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAP77//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABo'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAy/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAnPv//wAAAQIcAAAADAAAAAQAAAAAAAAAjPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAND7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJr8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAACPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA+Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADz8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAb9//8AAAIABwAAAFRheGlPdXQAaPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAAMv3//wAAAgAGAAAAVGF4aUluAACU/P//AAABAhwAAAAMAAAA'\n 
b'BAAAAAAAAACE/P//AAAAAUAAAAAIAAAARGlzdGFuY2UAAAAAyPz//wAAAQUU'\n b'AAAADAAAAAQAAAAAAAAAXP7//wQAAABEZXN0AAAAAPD8//8AAAEFFAAAAAwA'\n b'AAAEAAAAAAAAAIT+//8GAAAAT3JpZ2luAAAY/f//AAABAxgAAAAMAAAABAAA'\n b'AAAAAADi/f//AAACAAgAAABEZXBEZWxheQAAAABI/f//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAS/v//AAACAAgAAABBcnJEZWxheQAAAAB4/f//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABC/v//AAACAAcAAABBaXJUaW1lAKT9//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAG7+//8AAAIADgAAAENSU0VsYXBzZWRUaW1lAADY/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACi/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAABD+//8AAAEFFAAAAAwAAAAEAAAAAAAAAKT///8HAAAAVGFpbE51bQA4'\n b'/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAAAUAAAAAJAAAARmxpZ2h0'\n b'TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAABAAEAAQAAAANAAAAVW5p'\n b'cXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAEAAAAAAAAAJD+//8AAAAB'\n b'QAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgAAAAMAAAABAAAAAAAAACe'\n b'////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAAAAwAAAAEAAAAAAAAAPD+'\n b'//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////AAABAyAAAAAUAAAABAAA'\n b'AAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRpbWUAaP///wAAAQIcAAAA'\n b'DAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERheU9mV2VlawAAAJz///8A'\n b'AAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAAAAoAAABEYXlvZk1vbnRo'\n b'AADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////AAAAAUAAAAAFAAAATW9u'\n b'dGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIkAAAAFAAAAAQAAAAAAAAA'\n b'CAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAAAAA=')])" + "\u001b[0;31mValueError\u001b[0m: Schema in /User/mlrun/airlines/dataset/1278d2c85afc40cabc8e5add8d12892e.parquet was different. 
\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: int64\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nCRSElapsedTime: double\nAirTime: int64\nOrigin: string\nDest: string\nDistance: double\nTaxiOut: double\nCancelled: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'95550000, \"stop\": 95560000, \"step\": 1}], \"column_indexes\": ['\n b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n b', \"columns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas'\n b'_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {'\n b'\"name\": \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Dayo'\n b'fMonth\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWe'\n b'ek\", \"field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"num'\n b'py_type\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"f'\n b'ield_name\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type'\n b'\": \"float64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"fie'\n b'ld_name\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\"'\n b': \"int64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_nam'\n b'e\": \"ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\"'\n b', \"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"C'\n b'RSArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"'\n b'metadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"U'\n b'niqueCarrier\", 
\"pandas_type\": \"unicode\", \"numpy_type\": \"obje'\n b'ct\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": '\n b'\"FlightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", '\n b'\"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_name\": '\n b'\"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"f'\n b'loat64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", '\n b'\"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_ty'\n b'pe\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {'\n b'\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"Cancelled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", 
\"met'\n b'adata\": null}], \"creator\": {\"library\": \"pyarrow\", \"version\":'\n b' \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////4AQAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAAJwLAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAGcLAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiA5NTU1MDAwMCwgInN0b3Ai'\n b'OiA5NTU2MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7'\n b'Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlw'\n b'ZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFk'\n b'YXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3si'\n b'bmFtZSI6ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNf'\n b'dHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjog'\n b'Ik1vbnRoIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9m'\n b'TW9udGgiLCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1l'\n b'IjogIkRheU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1w'\n b'eV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJt'\n b'ZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQi'\n b'LCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJm'\n b'aWVsZF9uYW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n 
b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVu'\n b'aXF1ZUNhcnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRmxpZ2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBh'\n b'bmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0VsYXBzZWRUaW1lIiwg'\n b'ImZpZWxkX25hbWUiOiAiQ1JTRWxhcHNlZFRpbWUiLCAicGFuZGFzX3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUiLCAiZmllbGRfbmFtZSI6'\n b'ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIk9y'\n b'aWdpbiIsICJmaWVsZF9uYW1lIjogIk9yaWdpbiIsICJwYW5kYXNfdHlwZSI6'\n b'ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRh'\n b'IjogbnVsbH0sIHsibmFtZSI6ICJEZXN0IiwgImZpZWxkX25hbWUiOiAiRGVz'\n b'dCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAi'\n b'b2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXN0YW5j'\n b'ZSIsICJmaWVsZF9uYW1lIjogIkRpc3RhbmNlIiwgInBhbmRhc190eXBlIjog'\n b'ImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRh'\n b'IjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25hbWUiOiAi'\n b'VGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'Q2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwgInBhbmRh'\n b'c190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0'\n b'YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNhcnJpZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51'\n b'bGx9LCB7Im5hbWUiOiAiV2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n 
b'ZSI6ICJOQVNEZWxheSIsICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5'\n b'IiwgImZpZWxkX25hbWUiOiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAi'\n b'ZmllbGRfbmFtZSI6ICJMYXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJv'\n b'dyIsICJ2ZXJzaW9uIjogIjAuMTUuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAi'\n b'MC4yNS4zIn0AFgAAAHwEAAA4BAAAAAQAAMgDAACQAwAAWAMAACQDAADsAgAA'\n b'tAIAAHwCAABEAgAAEAIAAOQBAAC4AQAAhAEAAFQBAAAcAQAA5AAAAKwAAAB4'\n b'AAAAQAAAAAQAAADs+///AAABAxgAAAAMAAAABAAAAAAAAAC2/P//AAACABEA'\n b'AABMYXRlQWlyY3JhZnREZWxheQAAACT8//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'AO78//8AAAIADQAAAFNlY3VyaXR5RGVsYXkAAABY/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAi/f//AAACAAgAAABOQVNEZWxheQAAAACI/P//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABS/f//AAACAAwAAABXZWF0aGVyRGVsYXkAAAAAvPz//wAA'\n b'AQMYAAAADAAAAAQAAAAAAAAAhv3//wAAAgAMAAAAQ2FycmllckRlbGF5AAAA'\n b'APD8//8AAAECHAAAAAwAAAAEAAAAAAAAAOD8//8AAAABQAAAAAkAAABDYW5j'\n b'ZWxsZWQAAAAk/f//AAABAxgAAAAMAAAABAAAAAAAAADu/f//AAACAAcAAABU'\n b'YXhpT3V0AFD9//8AAAEDGAAAAAwAAAAEAAAAAAAAABr+//8AAAIACAAAAERp'\n b'c3RhbmNlAAAAAID9//8AAAEFFAAAAAwAAAAEAAAAAAAAABj///8EAAAARGVz'\n b'dAAAAACo/f//AAABBRQAAAAMAAAABAAAAAAAAABA////BgAAAE9yaWdpbgAA'\n b'0P3//wAAAQIcAAAADAAAAAQAAAAAAAAAwP3//wAAAAFAAAAABwAAAEFpclRp'\n b'bWUAAP7//wAAAQMYAAAADAAAAAQAAAAAAAAAyv7//wAAAgAOAAAAQ1JTRWxh'\n b'cHNlZFRpbWUAADT+//8AAAECHAAAAAwAAAAEAAAAAAAAACT+//8AAAABQAAA'\n b'AAkAAABGbGlnaHROdW0AAABo/v//AAABBRgAAAAQAAAABAAAAAAAAAAEAAQA'\n b'BAAAAA0AAABVbmlxdWVDYXJyaWVyAAAAnP7//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAAjP7//wAAAAFAAAAACgAAAENSU0FyclRpbWUAAND+//8AAAECHAAAAAwA'\n b'AAAEAAAAAAAAAMD+//8AAAABQAAAAAcAAABBcnJUaW1lAAD///8AAAECHAAA'\n 
b'AAwAAAAEAAAAAAAAAPD+//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////'\n b'AAABAyAAAAAUAAAABAAAAAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRp'\n b'bWUAaP///wAAAQIcAAAADAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERh'\n b'eU9mV2VlawAAAJz///8AAAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAA'\n b'AAoAAABEYXlvZk1vbnRoAADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////'\n b'AAAAAUAAAAAFAAAATW9udGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIk'\n b'AAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAA'\n b'AAA=')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nCRSElapsedTime: double\nAirTime: double\nOrigin: string\nDest: string\nDistance: double\nTaxiOut: double\nCancelled: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'90730000, \"stop\": 90740000, \"step\": 1}], \"column_indexes\": ['\n b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n b', \"columns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas'\n b'_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {'\n b'\"name\": \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Dayo'\n b'fMonth\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWe'\n b'ek\", \"field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"num'\n b'py_type\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"f'\n b'ield_name\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type'\n b'\": \"float64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"fie'\n 
b'ld_name\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\"'\n b': \"int64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_nam'\n b'e\": \"ArrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"floa'\n b't64\", \"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\"'\n b': \"CRSArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64'\n b'\", \"metadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\"'\n b': \"UniqueCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"'\n b'object\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_nam'\n b'e\": \"FlightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\"'\n b': \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_n'\n b'ame\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": '\n b'\"Origin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"p'\n b'andas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": '\n b'null}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"panda'\n b's_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": nul'\n b'l}, {\"name\": \"TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_typ'\n b'e\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {'\n b'\"name\": \"Cancelled\", \"field_name\": \"Cancelled\", \"pandas_type'\n b'\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name'\n b'\": \"CarrierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_typ'\n b'e\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {'\n b'\"name\": \"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"panda'\n b's_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": nul'\n b'l}, {\"name\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_t'\n b'ype\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null},'\n b' {\"name\": \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAir'\n b'craftDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////4AQAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAAKQLAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAG8LAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiA5MDczMDAwMCwgInN0b3Ai'\n b'OiA5MDc0MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7'\n b'Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlw'\n b'ZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFk'\n b'YXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3si'\n b'bmFtZSI6ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNf'\n b'dHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjog'\n b'Ik1vbnRoIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9m'\n b'TW9udGgiLCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1l'\n b'IjogIkRheU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1w'\n b'eV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJt'\n 
b'ZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQi'\n b'LCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6'\n b'ICJVbmlxdWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAi'\n b'bnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIs'\n b'ICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJDUlNFbGFwc2VkVGlt'\n b'ZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRUaW1lIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJBaXJUaW1lIiwgImZpZWxkX25h'\n b'bWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiT3JpZ2luIiwgImZpZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFt'\n b'ZSI6ICJEZXN0IiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkRpc3RhbmNlIiwgImZpZWxkX25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlRheGlPdXQiLCAiZmllbGRf'\n b'bmFtZSI6ICJUYXhpT3V0IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsi'\n b'bmFtZSI6ICJDYW5jZWxsZWQiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsZWQi'\n 
b'LCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ2FycmllckRlbGF5'\n b'IiwgImZpZWxkX25hbWUiOiAiQ2FycmllckRlbGF5IiwgInBhbmRhc190eXBl'\n b'IjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJXZWF0aGVyRGVsYXkiLCAiZmllbGRf'\n b'bmFtZSI6ICJXZWF0aGVyRGVsYXkiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk5BU0RlbGF5IiwgImZpZWxkX25hbWUiOiAiTkFTRGVs'\n b'YXkiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjog'\n b'ImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlNlY3Vy'\n b'aXR5RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJTZWN1cml0eURlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJMYXRlQWlyY3JhZnRE'\n b'ZWxheSIsICJmaWVsZF9uYW1lIjogIkxhdGVBaXJjcmFmdERlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAWAAAAdAQAADAEAAD4AwAAwAMAAIgDAABQAwAA'\n b'IAMAAOgCAACwAgAAeAIAAEACAAAQAgAA5AEAALgBAACEAQAAVAEAABwBAADk'\n b'AAAArAAAAHgAAABAAAAABAAAAPT7//8AAAEDGAAAAAwAAAAEAAAAAAAAAL78'\n b'//8AAAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAALPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAA9vz//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAGD8//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAACr9//8AAAIACAAAAE5BU0RlbGF5AAAAAJD8//8A'\n b'AAEDGAAAAAwAAAAEAAAAAAAAAFr9//8AAAIADAAAAFdlYXRoZXJEZWxheQAA'\n b'AADE/P//AAABAxgAAAAMAAAABAAAAAAAAACO/f//AAACAAwAAABDYXJyaWVy'\n b'RGVsYXkAAAAA+Pz//wAAAQIcAAAADAAAAAQAAAAAAAAA6Pz//wAAAAFAAAAA'\n b'CQAAAENhbmNlbGxlZAAAACz9//8AAAEDGAAAAAwAAAAEAAAAAAAAAPb9//8A'\n b'AAIABwAAAFRheGlPdXQAWP3//wAAAQMYAAAADAAAAAQAAAAAAAAAIv7//wAA'\n b'AgAIAAAARGlzdGFuY2UAAAAAiP3//wAAAQUUAAAADAAAAAQAAAAAAAAAHP//'\n b'/wQAAABEZXN0AAAAALD9//8AAAEFFAAAAAwAAAAEAAAAAAAAAET///8GAAAA'\n 
b'T3JpZ2luAADY/f//AAABAxgAAAAMAAAABAAAAAAAAACi/v//AAACAAcAAABB'\n b'aXJUaW1lAAT+//8AAAEDGAAAAAwAAAAEAAAAAAAAAM7+//8AAAIADgAAAENS'\n b'U0VsYXBzZWRUaW1lAAA4/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAA'\n b'AUAAAAAJAAAARmxpZ2h0TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAA'\n b'BAAEAAQAAAANAAAAVW5pcXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAE'\n b'AAAAAAAAAJD+//8AAAABQAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgA'\n b'AAAMAAAABAAAAAAAAACe////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAA'\n b'AAwAAAAEAAAAAAAAAPD+//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////'\n b'AAABAyAAAAAUAAAABAAAAAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRp'\n b'bWUAaP///wAAAQIcAAAADAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERh'\n b'eU9mV2VlawAAAJz///8AAAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAA'\n b'AAoAAABEYXlvZk1vbnRoAADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////'\n b'AAAAAUAAAAAFAAAATW9udGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIk'\n b'AAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAA'\n b'AAA=')])" ] } ], "source": [ - "table = pq.read_table(TARGET_PATH)" + "dataset = pq.ParquetDataset(TARGET_PATH)" ] }, { "cell_type": "code", - "execution_count": 53, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\u001b[0;31mInit signature:\u001b[0m\n", - "\u001b[0mpq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mParquetDataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mpath_or_paths\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mfilesystem\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mmetadata\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m 
\u001b[0msplit_row_groups\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mvalidate_schema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mfilters\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mmetadata_nthreads\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mread_dictionary\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mmemory_map\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m \u001b[0mbuffer_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", - "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mDocstring:\u001b[0m \n", - "Encapsulates details of reading a complete Parquet dataset possibly\n", - "consisting of multiple files and partitions in subdirectories\n", - "\n", - "Parameters\n", - "----------\n", - "path_or_paths : str or List[str]\n", - " A directory name, single file name, or list of file names\n", - "filesystem : FileSystem, default None\n", - " If nothing passed, paths assumed to be found in the local on-disk\n", - " filesystem\n", - "metadata : pyarrow.parquet.FileMetaData\n", - " Use metadata obtained elsewhere to validate file schemas\n", - "schema : pyarrow.parquet.Schema\n", - " Use schema obtained elsewhere to validate file schemas. 
Alternative to\n", - " metadata parameter\n", - "split_row_groups : boolean, default False\n", - " Divide files into pieces for each row group in the file\n", - "validate_schema : boolean, default True\n", - " Check that individual file schemas are all the same / compatible\n", - "filters : List[Tuple] or List[List[Tuple]] or None (default)\n", - " List of filters to apply, like ``[[('x', '=', 0), ...], ...]``. This\n", - " implements partition-level (hive) filtering only, i.e., to prevent the\n", - " loading of some files of the dataset.\n", - "\n", - " Predicates are expressed in disjunctive normal form (DNF). This means\n", - " that the innermost tuple describe a single column predicate. These\n", - " inner predicate make are all combined with a conjunction (AND) into a\n", - " larger predicate. The most outer list then combines all filters\n", - " with a disjunction (OR). By this, we should be able to express all\n", - " kinds of filters that are possible using boolean logic.\n", - "\n", - " This function also supports passing in as List[Tuple]. These predicates\n", - " are evaluated as a conjunction. To express OR in predictates, one must\n", - " use the (preferred) List[List[Tuple]] notation.\n", - "metadata_nthreads: int, default 1\n", - " How many threads to allow the thread pool which is used to read the\n", - " dataset metadata. Increasing this is helpful to read partitioned\n", - " datasets.\n", - "read_dictionary : list, default None\n", - " List of names or column paths (for nested types) to read directly\n", - " as DictionaryArray. Only supported for BYTE_ARRAY storage. To read\n", - " a flat column as dictionary-encoded pass the column name. For\n", - " nested types, you must pass the full column \"path\", which could be\n", - " something like level1.level2.list.item. 
Refer to the Parquet\n", - " file's schema to obtain the paths.\n", - "memory_map : boolean, default False\n", - " If the source is a file path, use a memory map to read file, which can\n", - " improve performance in some environments\n", - "buffer_size : int, default 0\n", - " If positive, perform read buffering when deserializing individual\n", - " column chunks. Otherwise IO calls are unbuffered.\n", - "\u001b[0;31mFile:\u001b[0m ~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\n", - "\u001b[0;31mType:\u001b[0m type\n", - "\u001b[0;31mSubclasses:\u001b[0m \n" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ - "dataset = pq.ParquetDataset(TARGET_PATH)" + "df = dataset.read().to_pandas().set_index(['Year', 'Month'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## cleanup" ] }, { "cell_type": "code", - "execution_count": 51, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "Schema in /User/mlrun/airlines/dataset/a3f9441653c14708ad69f63694749301.parquet was different. 
\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: double\nActualElapsedTime: double\nCRSElapsedTime: int64\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: double\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'0, \"stop\": 10000, \"step\": 1}], \"column_indexes\": [{\"name\": n'\n b'ull, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_ty'\n b'pe\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns'\n b'\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\": \"i'\n b'nt64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"M'\n b'onth\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"numpy'\n b'_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth\", \"'\n b'field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"numpy_ty'\n b'pe\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"fiel'\n b'd_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_name\"'\n b': \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_name\": '\n b'\"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n b' \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"ArrTi'\n b'me\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": 
\"CRSArrT'\n b'ime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metada'\n b'ta\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"UniqueC'\n b'arrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"'\n b'metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"Fligh'\n b'tNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n b'ata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"ActualE'\n b'lapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_name\": \"A'\n b'rrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\",'\n b' \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\": \"DepD'\n b'elay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"m'\n b'etadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\", '\n b'\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\"'\n b': null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_type'\n b'\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"n'\n b'ame\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\": \"'\n b'float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name'\n b'\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"float64'\n b'\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Tax'\n b'iOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\", \"n'\n b'umpy_type\": \"float64\", 
\"metadata\": null}, {\"name\": \"Cancelle'\n b'd\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"CancellationC'\n b'ode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carrier'\n b'Delay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"float6'\n b'4\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"We'\n b'atherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_type\"'\n b': \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"n'\n b'ame\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDelay\"'\n b', \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metada'\n b'ta\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArrDel'\n b'ayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"IsDe'\n b'pDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",'\n b' \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"vers'\n b'ion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////6AWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKQPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMDAw'\n 
b'LCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51'\n b'bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNv'\n b'ZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVu'\n b'Y29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogIlll'\n b'YXIiLCAiZmllbGRfbmFtZSI6ICJZZWFyIiwgInBhbmRhc190eXBlIjogImlu'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk1vbnRoIiwgImZpZWxkX25hbWUiOiAiTW9udGgiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRGF5b2ZNb250aCIsICJm'\n b'aWVsZF9uYW1lIjogIkRheW9mTW9udGgiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiRGF5T2ZXZWVrIiwgImZpZWxkX25hbWUiOiAiRGF5T2ZX'\n b'ZWVrIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlcFRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJEZXBUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJDUlNEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAi'\n b'Q1JTRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90'\n b'eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'cnJUaW1lIiwgImZpZWxkX25hbWUiOiAiQXJyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJmaWVsZF9u'\n b'YW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVuaXF1ZUNh'\n b'cnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBl'\n b'IjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRmxp'\n b'Z2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIlRhaWxOdW0iLCAiZmllbGRfbmFtZSI6'\n 
b'ICJUYWlsTnVtIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVhbEVs'\n b'YXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRU'\n b'aW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJBcnJEZWxheSIsICJmaWVsZF9uYW1lIjogIkFy'\n b'ckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJE'\n b'ZXBEZWxheSIsICJmaWVsZF9uYW1lIjogIkRlcERlbGF5IiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJPcmlnaW4iLCAiZmllbGRfbmFt'\n b'ZSI6ICJPcmlnaW4iLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRGVzdCIsICJmaWVsZF9uYW1lIjogIkRlc3QiLCAicGFuZGFzX3R5cGUi'\n b'OiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiRGlzdGFuY2UiLCAiZmllbGRfbmFtZSI6'\n b'ICJEaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25h'\n b'bWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwg'\n b'InBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQi'\n 
b'LCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlvbkNv'\n b'ZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJmaWVs'\n b'ZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJpZXJE'\n b'ZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiV2Vh'\n b'dGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIsICJm'\n b'aWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVs'\n b'bH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'U2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJMYXRl'\n b'QWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJEZWxh'\n b'eWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6'\n b'ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklzRGVw'\n b'RGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJwYW5k'\n b'YXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJw'\n b'eWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lv'\n b'biI6ICIwLjI1LjMifQAAAAAfAAAAWAYAABQGAADcBQAApAUAAGwFAAA0BQAA'\n b'BAUAAMwEAACUBAAAXAQAACwEAADwAwAAtAMAAIQDAABQAwAAHAMAAPACAADE'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n 
b'AAA4AAAABAAAADT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMj7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPj7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJT6//8AAAEDGAAAAAwAAAAEAAAAAAAAAF77//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAAzPr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAlvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAD7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAMr7//8AAAIACAAAAE5BU0RlbGF5AAAAADD7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAPr7//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABk'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAu/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAmPv//wAAAQIcAAAADAAAAAQAAAAAAAAAiPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAMz7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJb8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAABPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA9Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADj8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAL9//8AAAIABwAAAFRheGlPdXQAZPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAALv3//wAAAgAGAAAAVGF4aUluAACQ/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAABa/f//AAACAAgAAABEaXN0YW5jZQAAAADA/P//AAABBRQAAAAM'\n b'AAAABAAAAAAAAABU/v//BAAAAERlc3QAAAAA6Pz//wAAAQUUAAAADAAAAAQA'\n b'AAAAAAAAfP7//wYAAABPcmlnaW4AABD9//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'ANr9//8AAAIACAAAAERlcERlbGF5AAAAAED9//8AAAEDGAAAAAwAAAAEAAAA'\n b'AAAAAAr+//8AAAIACAAAAEFyckRlbGF5AAAAAHD9//8AAAEDGAAAAAwAAAAE'\n b'AAAAAAAAADr+//8AAAIABwAAAEFpclRpbWUAnP3//wAAAQIcAAAADAAAAAQA'\n b'AAAAAAAAjP3//wAAAAFAAAAADgAAAENSU0VsYXBzZWRUaW1lAADU/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACe/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAAAz+//8AAAEDGAAAAAwAAAAEAAAAAAAAANb+//8AAAIABwAAAFRhaWxO'\n b'dW0AOP7//wAAAQIcAAAADAAAAAQAAAAAAAAAKP7//wAAAAFAAAAACQAAAEZs'\n b'aWdodE51bQAAAGz+//8AAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAADQAA'\n b'AFVuaXF1ZUNhcnJpZXIAAACg/v//AAABAhwAAAAMAAAABAAAAAAAAACQ/v//'\n b'AAAAAUAAAAAKAAAAQ1JTQXJyVGltZQAA1P7//wAAAQMYAAAADAAAAAQAAAAA'\n b'AAAAnv///wAAAgAHAAAAQXJyVGltZQAA////AAABAhwAAAAMAAAABAAAAAAA'\n b'AADw/v//AAAAAUAAAAAKAAAAQ1JTRGVwVGltZQAANP///wAAAQMgAAAAFAAA'\n b'AAQAAAAAAAAAAAAGAAgABgAGAAAAAAACAAcAAABEZXBUaW1lAGj///8AAAEC'\n 
b'HAAAAAwAAAAEAAAAAAAAAFj///8AAAABQAAAAAkAAABEYXlPZldlZWsAAACc'\n b'////AAABAhwAAAAMAAAABAAAAAAAAACM////AAAAAUAAAAAKAAAARGF5b2ZN'\n b'b250aAAA0P///wAAAQIcAAAADAAAAAQAAAAAAAAAwP///wAAAAFAAAAABQAA'\n b'AE1vbnRoAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAA'\n b'AAAAAAgADAAIAAcACAAAAAAAAAFAAAAABAAAAFllYXIAAAAAAAAAAA==')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: string\nActualElapsedTime: double\nCRSElapsedTime: double\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: int64\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'10000, \"stop\": 20000, \"step\": 1}], \"column_indexes\": [{\"name'\n b'\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"nump'\n b'y_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"col'\n b'umns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\"'\n b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n b': \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth'\n b'\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"'\n b'field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_typ'\n b'e\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_n'\n b'ame\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, 
{\"name\": \"CRSDepTime\", \"field_nam'\n b'e\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int'\n b'64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"A'\n b'rrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", '\n b'\"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRS'\n b'ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"me'\n b'tadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"Uni'\n b'queCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object'\n b'\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"F'\n b'lightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"m'\n b'etadata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"Actu'\n b'alElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"flo'\n b'at64\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_'\n b'name\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_ty'\n b'pe\": \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"fiel'\n b'd_name\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": '\n b'\"float64\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_na'\n b'me\": \"ArrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\"'\n b': \"DepDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float'\n b'64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Or'\n b'igin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pand'\n b'as_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": nul'\n b'l}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_t'\n 
b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n b'ame\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\",'\n b' \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cance'\n b'lled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Cancellati'\n b'onCode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArr'\n b'Delayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"I'\n b'sDepDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"objec'\n b't\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": 
\"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////5gWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKcPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAxMDAwMCwgInN0b3AiOiAy'\n b'MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUi'\n b'OiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1'\n b'bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjog'\n b'eyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6'\n b'ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNfdHlwZSI6'\n b'ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjogIk1vbnRo'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9mTW9udGgi'\n b'LCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1lIjogIkRh'\n b'eU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBl'\n b'IjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEZXBU'\n b'aW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6'\n b'ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVsZF9uYW1l'\n b'IjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUiLCAiZmll'\n b'bGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6ICJVbmlx'\n 
b'dWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIsICJwYW5k'\n b'YXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYWlsTnVtIiwgImZpZWxkX25h'\n b'bWUiOiAiVGFpbE51bSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51'\n b'bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVh'\n b'bEVsYXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBz'\n b'ZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'aXJUaW1lIiwgImZpZWxkX25hbWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQXJyRGVsYXkiLCAiZmllbGRfbmFt'\n b'ZSI6ICJBcnJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiRGVwRGVsYXkiLCAiZmllbGRfbmFtZSI6ICJEZXBEZWxheSIsICJw'\n b'YW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiT3JpZ2luIiwgImZp'\n b'ZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUi'\n b'LCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFtZSI6ICJEZXN0IiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRpc3RhbmNlIiwgImZpZWxk'\n b'X25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n 
b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxk'\n b'X25hbWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVk'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlv'\n b'bkNvZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJm'\n b'aWVsZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJp'\n b'ZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5Iiwg'\n b'InBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIs'\n b'ICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUi'\n b'OiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJM'\n b'YXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJE'\n b'ZWxheWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlw'\n b'ZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklz'\n b'RGVwRGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJw'\n b'YW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0'\n 
b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAfAAAAVAYAABAGAADYBQAAoAUAAGgFAAAwBQAA'\n b'AAUAAMgEAACQBAAAWAQAACwEAADwAwAAuAMAAIgDAABUAwAAIAMAAPQCAADI'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMz7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPz7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJj6//8AAAEDGAAAAAwAAAAEAAAAAAAAAGL7//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAA0Pr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAmvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAT7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAM77//8AAAIACAAAAE5BU0RlbGF5AAAAADT7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAP77//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABo'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAy/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAnPv//wAAAQIcAAAADAAAAAQAAAAAAAAAjPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAND7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJr8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAACPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA+Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADz8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAb9//8AAAIABwAAAFRheGlPdXQAaPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAAMv3//wAAAgAGAAAAVGF4aUluAACU/P//AAABAhwAAAAMAAAA'\n b'BAAAAAAAAACE/P//AAAAAUAAAAAIAAAARGlzdGFuY2UAAAAAyPz//wAAAQUU'\n b'AAAADAAAAAQAAAAAAAAAXP7//wQAAABEZXN0AAAAAPD8//8AAAEFFAAAAAwA'\n b'AAAEAAAAAAAAAIT+//8GAAAAT3JpZ2luAAAY/f//AAABAxgAAAAMAAAABAAA'\n b'AAAAAADi/f//AAACAAgAAABEZXBEZWxheQAAAABI/f//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAS/v//AAACAAgAAABBcnJEZWxheQAAAAB4/f//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABC/v//AAACAAcAAABBaXJUaW1lAKT9//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAG7+//8AAAIADgAAAENSU0VsYXBzZWRUaW1lAADY/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACi/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAABD+//8AAAEFFAAAAAwAAAAEAAAAAAAAAKT///8HAAAAVGFpbE51bQA4'\n b'/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAAAUAAAAAJAAAARmxpZ2h0'\n b'TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAABAAEAAQAAAANAAAAVW5p'\n 
b'cXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAEAAAAAAAAAJD+//8AAAAB'\n b'QAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgAAAAMAAAABAAAAAAAAACe'\n b'////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAAAAwAAAAEAAAAAAAAAPD+'\n b'//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////AAABAyAAAAAUAAAABAAA'\n b'AAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRpbWUAaP///wAAAQIcAAAA'\n b'DAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERheU9mV2VlawAAAJz///8A'\n b'AAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAAAAoAAABEYXlvZk1vbnRo'\n b'AADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////AAAAAUAAAAAFAAAATW9u'\n b'dGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIkAAAAFAAAAAQAAAAAAAAA'\n b'CAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAAAAA=')])", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mParquetDataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTARGET_PATH\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_line_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'pinfo'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'dataset.read'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size)\u001b[0m\n\u001b[1;32m 1058\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1059\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0mvalidate_schema\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1060\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalidate_schemas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1061\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1062\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mequals\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36mvalidate_schemas\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1111\u001b[0m \u001b[0;34m'{1!s}\\n\\nvs\\n\\n{2!s}'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1112\u001b[0m .format(piece, file_schema,\n\u001b[0;32m-> 1113\u001b[0;31m dataset_schema))\n\u001b[0m\u001b[1;32m 1114\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1115\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_threads\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_pandas_metadata\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mValueError\u001b[0m: Schema in /User/mlrun/airlines/dataset/a3f9441653c14708ad69f63694749301.parquet was different. 
\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: double\nActualElapsedTime: double\nCRSElapsedTime: int64\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: double\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'0, \"stop\": 10000, \"step\": 1}], \"column_indexes\": [{\"name\": n'\n b'ull, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_ty'\n b'pe\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns'\n b'\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\": \"i'\n b'nt64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"M'\n b'onth\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"numpy'\n b'_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth\", \"'\n b'field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"numpy_ty'\n b'pe\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"fiel'\n b'd_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_name\"'\n b': \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"field_name\": '\n b'\"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n b' \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"ArrTi'\n b'me\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": 
\"CRSArrT'\n b'ime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metada'\n b'ta\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"UniqueC'\n b'arrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"'\n b'metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"Fligh'\n b'tNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n b'ata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"ActualE'\n b'lapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"int64\", \"numpy_type\": '\n b'\"int64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_name\": \"A'\n b'rrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\",'\n b' \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\": \"DepD'\n b'elay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"m'\n b'etadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\", '\n b'\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\"'\n b': null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_type'\n b'\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"n'\n b'ame\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\": \"'\n b'float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name'\n b'\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"float64'\n b'\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Tax'\n b'iOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\", \"n'\n b'umpy_type\": \"float64\", 
\"metadata\": null}, {\"name\": \"Cancelle'\n b'd\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"CancellationC'\n b'ode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carrier'\n b'Delay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"float6'\n b'4\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"We'\n b'atherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_type\"'\n b': \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"n'\n b'ame\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDelay\"'\n b', \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metada'\n b'ta\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArrDel'\n b'ayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"IsDe'\n b'pDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",'\n b' \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"vers'\n b'ion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////6AWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKQPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMDAw'\n 
b'LCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51'\n b'bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNv'\n b'ZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVu'\n b'Y29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogIlll'\n b'YXIiLCAiZmllbGRfbmFtZSI6ICJZZWFyIiwgInBhbmRhc190eXBlIjogImlu'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk1vbnRoIiwgImZpZWxkX25hbWUiOiAiTW9udGgiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRGF5b2ZNb250aCIsICJm'\n b'aWVsZF9uYW1lIjogIkRheW9mTW9udGgiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiRGF5T2ZXZWVrIiwgImZpZWxkX25hbWUiOiAiRGF5T2ZX'\n b'ZWVrIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlcFRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJEZXBUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJDUlNEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAi'\n b'Q1JTRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90'\n b'eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'cnJUaW1lIiwgImZpZWxkX25hbWUiOiAiQXJyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJmaWVsZF9u'\n b'YW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVuaXF1ZUNh'\n b'cnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBl'\n b'IjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiRmxp'\n b'Z2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIlRhaWxOdW0iLCAiZmllbGRfbmFtZSI6'\n 
b'ICJUYWlsTnVtIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVhbEVs'\n b'YXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRU'\n b'aW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAi'\n b'aW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJBcnJEZWxheSIsICJmaWVsZF9uYW1lIjogIkFy'\n b'ckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJE'\n b'ZXBEZWxheSIsICJmaWVsZF9uYW1lIjogIkRlcERlbGF5IiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJPcmlnaW4iLCAiZmllbGRfbmFt'\n b'ZSI6ICJPcmlnaW4iLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRGVzdCIsICJmaWVsZF9uYW1lIjogIkRlc3QiLCAicGFuZGFzX3R5cGUi'\n b'OiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiRGlzdGFuY2UiLCAiZmllbGRfbmFtZSI6'\n b'ICJEaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25h'\n b'bWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwg'\n b'InBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQi'\n 
b'LCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlvbkNv'\n b'ZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJmaWVs'\n b'ZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJpZXJE'\n b'ZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiV2Vh'\n b'dGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIsICJm'\n b'aWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0'\n b'NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVs'\n b'bH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'U2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJMYXRl'\n b'QWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJEZWxh'\n b'eWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6'\n b'ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklzRGVw'\n b'RGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJwYW5k'\n b'YXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0Iiwg'\n b'Im1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJw'\n b'eWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lv'\n b'biI6ICIwLjI1LjMifQAAAAAfAAAAWAYAABQGAADcBQAApAUAAGwFAAA0BQAA'\n b'BAUAAMwEAACUBAAAXAQAACwEAADwAwAAtAMAAIQDAABQAwAAHAMAAPACAADE'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n 
b'AAA4AAAABAAAADT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMj7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGT6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPj7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJT6//8AAAEDGAAAAAwAAAAEAAAAAAAAAF77//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAAzPr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAlvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAD7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAMr7//8AAAIACAAAAE5BU0RlbGF5AAAAADD7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAPr7//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABk'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAu/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAmPv//wAAAQIcAAAADAAAAAQAAAAAAAAAiPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAMz7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJb8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAABPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA9Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADj8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAL9//8AAAIABwAAAFRheGlPdXQAZPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAALv3//wAAAgAGAAAAVGF4aUluAACQ/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAABa/f//AAACAAgAAABEaXN0YW5jZQAAAADA/P//AAABBRQAAAAM'\n b'AAAABAAAAAAAAABU/v//BAAAAERlc3QAAAAA6Pz//wAAAQUUAAAADAAAAAQA'\n b'AAAAAAAAfP7//wYAAABPcmlnaW4AABD9//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'ANr9//8AAAIACAAAAERlcERlbGF5AAAAAED9//8AAAEDGAAAAAwAAAAEAAAA'\n b'AAAAAAr+//8AAAIACAAAAEFyckRlbGF5AAAAAHD9//8AAAEDGAAAAAwAAAAE'\n b'AAAAAAAAADr+//8AAAIABwAAAEFpclRpbWUAnP3//wAAAQIcAAAADAAAAAQA'\n b'AAAAAAAAjP3//wAAAAFAAAAADgAAAENSU0VsYXBzZWRUaW1lAADU/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACe/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAAAz+//8AAAEDGAAAAAwAAAAEAAAAAAAAANb+//8AAAIABwAAAFRhaWxO'\n b'dW0AOP7//wAAAQIcAAAADAAAAAQAAAAAAAAAKP7//wAAAAFAAAAACQAAAEZs'\n b'aWdodE51bQAAAGz+//8AAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAADQAA'\n b'AFVuaXF1ZUNhcnJpZXIAAACg/v//AAABAhwAAAAMAAAABAAAAAAAAACQ/v//'\n b'AAAAAUAAAAAKAAAAQ1JTQXJyVGltZQAA1P7//wAAAQMYAAAADAAAAAQAAAAA'\n b'AAAAnv///wAAAgAHAAAAQXJyVGltZQAA////AAABAhwAAAAMAAAABAAAAAAA'\n b'AADw/v//AAAAAUAAAAAKAAAAQ1JTRGVwVGltZQAANP///wAAAQMgAAAAFAAA'\n b'AAQAAAAAAAAAAAAGAAgABgAGAAAAAAACAAcAAABEZXBUaW1lAGj///8AAAEC'\n 
b'HAAAAAwAAAAEAAAAAAAAAFj///8AAAABQAAAAAkAAABEYXlPZldlZWsAAACc'\n b'////AAABAhwAAAAMAAAABAAAAAAAAACM////AAAAAUAAAAAKAAAARGF5b2ZN'\n b'b250aAAA0P///wAAAQIcAAAADAAAAAQAAAAAAAAAwP///wAAAAFAAAAABQAA'\n b'AE1vbnRoAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECJAAAABQAAAAEAAAA'\n b'AAAAAAgADAAIAAcACAAAAAAAAAFAAAAABAAAAFllYXIAAAAAAAAAAA==')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nTailNum: string\nActualElapsedTime: double\nCRSElapsedTime: double\nAirTime: double\nArrDelay: double\nDepDelay: double\nOrigin: string\nDest: string\nDistance: int64\nTaxiIn: double\nTaxiOut: double\nCancelled: int64\nCancellationCode: double\nDiverted: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nIsArrDelayed: string\nIsDepDelayed: string\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'10000, \"stop\": 20000, \"step\": 1}], \"column_indexes\": [{\"name'\n b'\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"nump'\n b'y_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"col'\n b'umns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas_type\"'\n b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n b': \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayofMonth'\n b'\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\", \"nump'\n b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWeek\", \"'\n b'field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"numpy_typ'\n b'e\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"field_n'\n b'ame\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, 
{\"name\": \"CRSDepTime\", \"field_nam'\n b'e\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int'\n b'64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_name\": \"A'\n b'rrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", '\n b'\"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"CRS'\n b'ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"me'\n b'tadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"Uni'\n b'queCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object'\n b'\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": \"F'\n b'lightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"m'\n b'etadata\": null}, {\"name\": \"TailNum\", \"field_name\": \"TailNum\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"ActualElapsedTime\", \"field_name\": \"Actu'\n b'alElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"flo'\n b'at64\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_'\n b'name\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_ty'\n b'pe\": \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"fiel'\n b'd_name\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": '\n b'\"float64\", \"metadata\": null}, {\"name\": \"ArrDelay\", \"field_na'\n b'me\": \"ArrDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"DepDelay\", \"field_name\"'\n b': \"DepDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float'\n b'64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Or'\n b'igin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"me'\n b'tadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pand'\n b'as_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": nul'\n b'l}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_t'\n 
b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n b'ame\": \"TaxiIn\", \"field_name\": \"TaxiIn\", \"pandas_type\": \"floa'\n b't64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"'\n b'TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"float64\",'\n b' \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"Cance'\n b'lled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int64\", \"n'\n b'umpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Cancellati'\n b'onCode\", \"field_name\": \"CancellationCode\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"Diverted\", \"field_name\": \"Diverted\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"met'\n b'adata\": null}, {\"name\": \"IsArrDelayed\", \"field_name\": \"IsArr'\n b'Delayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"IsDepDelayed\", \"field_name\": \"I'\n b'sDepDelayed\", \"pandas_type\": \"unicode\", \"numpy_type\": \"objec'\n b't\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": 
\"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////5gWAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAANwPAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAKcPAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAxMDAwMCwgInN0b3AiOiAy'\n b'MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUi'\n b'OiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1'\n b'bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjog'\n b'eyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6'\n b'ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNfdHlwZSI6'\n b'ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjogIk1vbnRo'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9mTW9udGgi'\n b'LCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1lIjogIkRh'\n b'eU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBl'\n b'IjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEZXBU'\n b'aW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNfdHlwZSI6'\n b'ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0'\n b'YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVsZF9uYW1l'\n b'IjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUiLCAiZmll'\n b'bGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6ICJVbmlx'\n 
b'dWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIsICJwYW5k'\n b'YXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYWlsTnVtIiwgImZpZWxkX25h'\n b'bWUiOiAiVGFpbE51bSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51'\n b'bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJBY3R1YWxFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkFjdHVh'\n b'bEVsYXBzZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n b'ZSI6ICJDUlNFbGFwc2VkVGltZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBz'\n b'ZWRUaW1lIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJB'\n b'aXJUaW1lIiwgImZpZWxkX25hbWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQXJyRGVsYXkiLCAiZmllbGRfbmFt'\n b'ZSI6ICJBcnJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiRGVwRGVsYXkiLCAiZmllbGRfbmFtZSI6ICJEZXBEZWxheSIsICJw'\n b'YW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiT3JpZ2luIiwgImZp'\n b'ZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUi'\n b'LCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFtZSI6ICJEZXN0IiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRpc3RhbmNlIiwgImZpZWxk'\n b'X25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiVGF4aUluIiwgImZpZWxkX25hbWUiOiAiVGF4aUluIiwgInBhbmRh'\n b'c190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0Iiwg'\n 
b'Im1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxk'\n b'X25hbWUiOiAiVGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQ2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVk'\n b'IiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhbmNlbGxhdGlv'\n b'bkNvZGUiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsYXRpb25Db2RlIiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXZlcnRlZCIsICJm'\n b'aWVsZF9uYW1lIjogIkRpdmVydGVkIiwgInBhbmRhc190eXBlIjogImludDY0'\n b'IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwg'\n b'eyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVsZF9uYW1lIjogIkNhcnJp'\n b'ZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAiV2VhdGhlckRlbGF5Iiwg'\n b'InBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJOQVNEZWxheSIs'\n b'ICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBhbmRhc190eXBlIjogImZs'\n b'b2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjog'\n b'bnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5IiwgImZpZWxkX25hbWUi'\n b'OiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJM'\n b'YXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0Iiwg'\n b'Im51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiSXNBcnJEZWxheWVkIiwgImZpZWxkX25hbWUiOiAiSXNBcnJE'\n b'ZWxheWVkIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlw'\n b'ZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIklz'\n b'RGVwRGVsYXllZCIsICJmaWVsZF9uYW1lIjogIklzRGVwRGVsYXllZCIsICJw'\n b'YW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0'\n 
b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAfAAAAVAYAABAGAADYBQAAoAUAAGgFAAAwBQAA'\n b'AAUAAMgEAACQBAAAWAQAACwEAADwAwAAuAMAAIgDAABUAwAAIAMAAPQCAADI'\n b'AgAAkAIAAGACAAAwAgAA+AEAALwBAACEAQAATAEAABQBAADgAAAAqAAAAGwA'\n b'AAA4AAAABAAAADj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAMz7//8MAAAASXNE'\n b'ZXBEZWxheWVkAAAAAGj6//8AAAEFFAAAAAwAAAAEAAAAAAAAAPz7//8MAAAA'\n b'SXNBcnJEZWxheWVkAAAAAJj6//8AAAEDGAAAAAwAAAAEAAAAAAAAAGL7//8A'\n b'AAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAA0Pr//wAAAQMYAAAADAAAAAQA'\n b'AAAAAAAAmvv//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAAT7//8AAAEDGAAA'\n b'AAwAAAAEAAAAAAAAAM77//8AAAIACAAAAE5BU0RlbGF5AAAAADT7//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAAP77//8AAAIADAAAAFdlYXRoZXJEZWxheQAAAABo'\n b'+///AAABAxgAAAAMAAAABAAAAAAAAAAy/P//AAACAAwAAABDYXJyaWVyRGVs'\n b'YXkAAAAAnPv//wAAAQIcAAAADAAAAAQAAAAAAAAAjPv//wAAAAFAAAAACAAA'\n b'AERpdmVydGVkAAAAAND7//8AAAEDGAAAAAwAAAAEAAAAAAAAAJr8//8AAAIA'\n b'EAAAAENhbmNlbGxhdGlvbkNvZGUAAAAACPz//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAA+Pv//wAAAAFAAAAACQAAAENhbmNlbGxlZAAAADz8//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAAb9//8AAAIABwAAAFRheGlPdXQAaPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAAMv3//wAAAgAGAAAAVGF4aUluAACU/P//AAABAhwAAAAMAAAA'\n b'BAAAAAAAAACE/P//AAAAAUAAAAAIAAAARGlzdGFuY2UAAAAAyPz//wAAAQUU'\n b'AAAADAAAAAQAAAAAAAAAXP7//wQAAABEZXN0AAAAAPD8//8AAAEFFAAAAAwA'\n b'AAAEAAAAAAAAAIT+//8GAAAAT3JpZ2luAAAY/f//AAABAxgAAAAMAAAABAAA'\n b'AAAAAADi/f//AAACAAgAAABEZXBEZWxheQAAAABI/f//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAS/v//AAACAAgAAABBcnJEZWxheQAAAAB4/f//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABC/v//AAACAAcAAABBaXJUaW1lAKT9//8AAAEDGAAAAAwA'\n b'AAAEAAAAAAAAAG7+//8AAAIADgAAAENSU0VsYXBzZWRUaW1lAADY/f//AAAB'\n b'AxgAAAAMAAAABAAAAAAAAACi/v//AAACABEAAABBY3R1YWxFbGFwc2VkVGlt'\n b'ZQAAABD+//8AAAEFFAAAAAwAAAAEAAAAAAAAAKT///8HAAAAVGFpbE51bQA4'\n b'/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAAAUAAAAAJAAAARmxpZ2h0'\n b'TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAABAAEAAQAAAANAAAAVW5p'\n 
b'cXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAEAAAAAAAAAJD+//8AAAAB'\n b'QAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgAAAAMAAAABAAAAAAAAACe'\n b'////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAAAAwAAAAEAAAAAAAAAPD+'\n b'//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////AAABAyAAAAAUAAAABAAA'\n b'AAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRpbWUAaP///wAAAQIcAAAA'\n b'DAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERheU9mV2VlawAAAJz///8A'\n b'AAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAAAAoAAABEYXlvZk1vbnRo'\n b'AADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////AAAAAUAAAAAFAAAATW9u'\n b'dGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIkAAAAFAAAAAQAAAAAAAAA'\n b'CAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAAAAA=')])" - ] - } - ], + "outputs": [], "source": [ - "dataset.read?" + "import shutil\n", + "shutil.rmtree(TARGET_PATH)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## cleanup" + "### single parquet file\n", + "\n", + "run this only when `dataset=False`" ] }, { @@ -518,7 +325,8 @@ "metadata": {}, "outputs": [], "source": [ - "# os.remove(parquet_file_path)" + "assert KEY in run.outputs.keys(), f\"mlrun.functions: key {KEY} not found in outputs\"\n", + "assert os.path.isfile(TARGET_PATH+'/'+ FILE_NAME), f\"mlrun.functions: artifact source not found at {TARGET_PATH+'/'+ FILE_NAME}\"" ] }, { @@ -526,7 +334,9 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "copied = pd.read_parquet(TARGET_PATH+'/'+ FILE_NAME, engine=\"pyarrow\")" + ] } ], "metadata": { From dba3bc120edc5711a9ee1ceaff9e557ced4d0aa1 Mon Sep 17 00:00:00 2001 From: yasha Date: Mon, 27 Jan 2020 19:57:58 +0000 Subject: [PATCH 26/32] parquet partitioning passes test --- fileutils/arc_to_parquet/arc_to_parquet.yaml | 4 +- tests/arc_to_parquet-airlines.ipynb | 525 +++++++++++++++++-- 2 files changed, 495 insertions(+), 34 deletions(-) diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index 194d37b87..c1b83a1d4 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ 
b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: eca47e4446d7c75e2096b8dd4803aa557aad7d6e + hash: 8a02024de5fc9c1e0876488700a604ca8551e991 project: '' spec: command: '' @@ -15,4 +15,4 @@ spec: functionSourceCode: IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBudW1weSBhcyBucAppbXBvcnQgcGFuZGFzIGFzIHBkCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBwaWNrbGUgaW1wb3J0IGR1bXAsIGxvYWQKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIHR5cGluZyBpbXBvcnQgSU8sIEFueVN0ciwgVW5pb24sIExpc3QsIE9wdGlvbmFsCgoKZGVmIGFyY190b19wYXJxdWV0KAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICBhcmNoaXZlX3VybDogVW5pb25bc3RyLCBQYXRoLCBJT1tBbnlTdHJdXSwKICAgIGhlYWRlcjogT3B0aW9uYWxb
TGlzdFtzdHJdXSA9IE5vbmUsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gIiIsCiAgICBuYW1lOiBzdHIgPSAiIiwKICAgIGNodW5rc2l6ZTogaW50ID0gMTBfMDAwLAogICAgZHR5cGU9Tm9uZSwKICAgIGVuY29kaW5nOiBzdHIgPSAnbGF0aW4tMScsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJyYXdfZGF0YSIsCiAgICBkYXRhc2V0OiBib29sID0gRmFsc2UsCiAgICBwYXJ0aXRpb25fY29scyA9IFtdLAogICAgaW5jX2NvbHM6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICBQYXJ0aXRpb25pbmcgcmVxdWlyZXMgcHJlY2lzZSBzcGVjaWZpY2F0aW9uIG9mIGNvbHVtbiB0eXBlcy4KICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gYXJjaGl2ZV91cmw6IGFueSB2YWxpZCBzdHJpbmcgcGF0aCBjb25zaXN0ZW50IHdpdGggdGhlIHBhdGggdmFyaWFibGUKICAgICAgICAgICAgICAgICAgICAgICAgb2YgcGFuZGFzLnJlYWRfY3N2LCBpbmNsdWRpbmcgc3RyaW5ncyBhcyBmaWxlIHBhdGhzLCBhcyB1cmxzLCAKICAgICAgICAgICAgICAgICAgICAgICAgcGF0aGxpYi5QYXRoIG9iamVjdHMsIGV0Yy4uLgogICAgOnBhcmFtIGhlYWRlcjogICAgICBjb2x1bW4gbmFtZXMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogZGVzdGluYXRpb24gZm9sZGVyIG9mIHRhYmxlCiAgICA6cGFyYW0gbmFtZTogICAgICAgIG5hbWUgZmlsZSB0byBiZSBzYXZlZCBsb2NhbGx5LCBhbHNvCiAgICA6cGFyYW0gY2h1bmtzaXplOiAgICgwKSByb3cgc2l6ZSByZXRyaWV2ZWQgcGVyIGl0ZXJhdGlvbgogICAgOnBhcmFtIGluY19jb2xzOiAgICBpbmNsdWRlIG9ubHkgdGhlc2UgY29sdW1ucwogICAgOnBhcmFtIGR0eXBlICAgICAgICBkZXN0aW5hdGlvbiBkYXRhIHR5cGUgb2Ygc3BlY2lmaWVkIGNvbHVtbnMKICAgIDpwYXJhbSBrZXk6ICAgICAgICAga2V5IGluIGFydGlmYWN0IHN0b3JlICh3aGVuIGxvZ19kYXRhPVRydWUpCiAgICA6cGFyYW0gZGF0YXNldDogICAgIChGYWxzZSkgaWYgVHJ1ZSB0aGVuIHRhcmdldF9wYXRoIGlzIGZvbGRlciBmb3IKICAgICAgICAgICAgICAgICAgICAgICAgcGFydGl0aW9uZWQgZmlsZXMKICAgIDpwYXJhbSBwYXJ0X2NvbHM6ICAgKFtdKSBsaXN0IG9mIHBhcnRpdGlvbmluZyBjb2x1bW5zCiAgICAiIiIKICAgIGlmIG5vdCBuYW1lLmVuZHN3aXRoKCIucHF0Iik6CiAgICAgICAgbmFtZSArPSAiLnBxdCIKICAgIAogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3Bh
dGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGNodW5rc2l6ZT1jaHVua3NpemUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbmFtZXM9aGVhZGVyLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGVuY29kaW5nPWVuY29kaW5nLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHVzZWNvbHM9aW5jX2NvbHMsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZHR5cGU9ZHR5cGUpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBpZiBkYXRhc2V0OgogICAgICAgICAgICAgICAgcHEud3JpdGVfdG9fZGF0YXNldCh0YWJsZSwgcm9vdF9wYXRoPXRhcmdldF9wYXRoLCBwYXJ0aXRpb25fY29scz1wYXJ0aXRpb25fY29scykKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRhYmxlKQogICAgICAgICAgICAKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIAogICAgIyBsb2cgaGVhZGVyCiAgICBmaWxlcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgJ2hlYWRlci5wa2wnKQogICAgZHVtcChoZWFkZXIsIG9wZW4oZmlsZXBhdGgsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQo= base_image: yjbds/mlrun-files:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#e67bbbba46a2b5767f5ed561da882d90a979e670:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py + code_origin: https://github.com/yjb-ds/functions.git#5a80835a812cf0cd8e106721aac3dde4989cbead:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py diff --git 
a/tests/arc_to_parquet-airlines.ipynb b/tests/arc_to_parquet-airlines.ipynb index c6da8182b..cb463cde1 100644 --- a/tests/arc_to_parquet-airlines.ipynb +++ b/tests/arc_to_parquet-airlines.ipynb @@ -41,13 +41,13 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "BASE_IMAGE = 'yjbds/mlrun-files:latest'\n", "\n", - "CODE_BASE = '/User/repos/functions/'\n", + "CODE_BASE = '/User/repos/functions/' # 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/'\n", "PROJECT = 'fileutils/arc_to_parquet'\n", "\n", "TARGET_PATH = '/User/mlrun/airlines/dataset'\n", @@ -75,7 +75,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -90,24 +90,26 @@ " 'LateAircraftDelay']\n", "INC_COLS = ['Year','Month','DayofMonth','DayOfWeek','DepTime','CRSDepTime','ArrTime','CRSArrTime',\n", " 'UniqueCarrier','FlightNum', 'CRSElapsedTime','AirTime',\n", - " 'Origin','Dest','Distance','TaxiOut','Cancelled',\n", + " 'Origin','Dest','Distance', 'TaxiIn', 'TaxiOut','Cancelled',\n", " 'CarrierDelay','WeatherDelay','NASDelay','SecurityDelay',\n", " 'LateAircraftDelay']\n", "\n", "ENCODING = 'latin-1'\n", "\n", "DTYPES_COLS = {\n", - " 'CRSElapsedTime': 'float64', \n", + " 'CRSElapsedTime': 'float32', \n", " 'TailNum': 'str', \n", - " 'Distance': 'float64', \n", - " 'TaxiOut': 'float64',\n", - " 'ArrTime': 'float64',\n", - " 'DepTime':'float64', \n", - " 'CarrierDelay': 'float64', \n", - " 'WeatherDelay': 'float64', \n", - " 'NASDelay':'float64', \n", - " 'SecurityDelay':'float64', \n", - " 'LateAircraftDelay':'float64'}\n", + " 'Distance': 'float32',\n", + " 'TaxiIn' : 'float32',\n", + " 'TaxiOut': 'float32',\n", + " 'ArrTime': 'float32',\n", + " 'AirTime': 'float32',\n", + " 'DepTime':'float32', \n", + " 'CarrierDelay': 'float32', \n", + " 'WeatherDelay': 'float32', \n", + " 'NASDelay':'float32', \n", + " 'SecurityDelay':'float32', \n", + " 
'LateAircraftDelay':'float32'}\n", "\n", "USE_PARTITIONS = True\n", "PARTITION_COLS = ['Year', 'Month']" @@ -140,7 +142,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 19:05:25,830 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + "[mlrun] 2020-01-27 19:37:40,654 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" ] } ], @@ -208,15 +210,232 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 19:21:20,254 starting run arc2parq uid=d3a5446edb94436d91efe4b2d7c64c2b -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-27 19:21:20,379 Job is running in the background, pod: arc2parq-5qs84\n" + "[mlrun] 2020-01-27 19:37:40,720 starting run arc2parq uid=647251d1ef46416bb2a1dc9a76310e54 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-27 19:37:40,821 Job is running in the background, pod: arc2parq-mgrvp\n", + "[mlrun] 2020-01-27 19:37:45,590 destination file does not exist, downloading\n", + "[mlrun] 2020-01-27 19:50:05,061 saved table to /User/mlrun/airlines/dataset/airlines.pqt\n", + "[mlrun] 2020-01-27 19:50:05,076 log artifact airlines at /User/mlrun/airlines/dataset/airlines.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-27 19:50:05,095 log artifact header at /User/mlrun/airlines/dataset/header.pkl, size: None, db: Y\n", + "\n", + "[mlrun] 2020-01-27 19:50:05,114 run executed, status=completed\n", + "final state: succeeded\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...310e54
0Jan 27 19:37:45completedarc-to-parquet
host=arc2parq-mgrvp
kind=job
owner=admin
archive_url=https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv
dataset=True
dtype={'AirTime': 'float32', 'ArrTime': 'float32', 'CRSElapsedTime': 'float32', 'CarrierDelay': 'float32', 'DepTime': 'float32', 'Distance': 'float32', 'LateAircraftDelay': 'float32', 'NASDelay': 'float32', 'SecurityDelay': 'float32', 'TailNum': 'str', 'TaxiOut': 'float32', 'WeatherDelay': 'float32'}
encoding=latin-1
inc_cols=['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'CRSElapsedTime', 'AirTime', 'Origin', 'Dest', 'Distance', 'TaxiOut', 'Cancelled', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
key=airlines
name=airlines.pqt
part_cols=['Year', 'Month']
target_path=/User/mlrun/airlines/dataset
airlines
header
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run 647251d1ef46416bb2a1dc9a76310e54 , !mlrun logs 647251d1ef46416bb2a1dc9a76310e54 \n", + "[mlrun] 2020-01-27 19:50:13,248 run executed, status=completed\n", + "CPU times: user 400 ms, sys: 45.2 ms, total: 445 ms\n", + "Wall time: 12min 32s\n" ] } ], @@ -261,36 +480,278 @@ "### a partitioned parquet table" ] }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = pq.ParquetDataset(TARGET_PATH)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "df = dataset.read().to_pandas()" + ] + }, { "cell_type": "code", "execution_count": 16, "metadata": {}, + "outputs": [], + "source": [ + "df.set_index(['Year', 'Month'], inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, "outputs": [ { - "ename": "ValueError", - "evalue": "Schema in /User/mlrun/airlines/dataset/1278d2c85afc40cabc8e5add8d12892e.parquet was different. 
\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: int64\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nCRSElapsedTime: double\nAirTime: int64\nOrigin: string\nDest: string\nDistance: double\nTaxiOut: double\nCancelled: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'95550000, \"stop\": 95560000, \"step\": 1}], \"column_indexes\": ['\n b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n b', \"columns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas'\n b'_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {'\n b'\"name\": \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Dayo'\n b'fMonth\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWe'\n b'ek\", \"field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"num'\n b'py_type\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"f'\n b'ield_name\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type'\n b'\": \"float64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"fie'\n b'ld_name\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\"'\n b': \"int64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_nam'\n b'e\": \"ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\"'\n b', \"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"C'\n b'RSArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"'\n b'metadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"U'\n b'niqueCarrier\", 
\"pandas_type\": \"unicode\", \"numpy_type\": \"obje'\n b'ct\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": '\n b'\"FlightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", '\n b'\"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_name\": '\n b'\"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"f'\n b'loat64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", '\n b'\"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_ty'\n b'pe\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {'\n b'\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"Cancelled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", 
\"met'\n b'adata\": null}], \"creator\": {\"library\": \"pyarrow\", \"version\":'\n b' \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////4AQAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAAJwLAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAGcLAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiA5NTU1MDAwMCwgInN0b3Ai'\n b'OiA5NTU2MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7'\n b'Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlw'\n b'ZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFk'\n b'YXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3si'\n b'bmFtZSI6ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNf'\n b'dHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjog'\n b'Ik1vbnRoIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9m'\n b'TW9udGgiLCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1l'\n b'IjogIkRheU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1w'\n b'eV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJt'\n b'ZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQi'\n b'LCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJm'\n b'aWVsZF9uYW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n 
b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVu'\n b'aXF1ZUNhcnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRmxpZ2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBh'\n b'bmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0VsYXBzZWRUaW1lIiwg'\n b'ImZpZWxkX25hbWUiOiAiQ1JTRWxhcHNlZFRpbWUiLCAicGFuZGFzX3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUiLCAiZmllbGRfbmFtZSI6'\n b'ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIk9y'\n b'aWdpbiIsICJmaWVsZF9uYW1lIjogIk9yaWdpbiIsICJwYW5kYXNfdHlwZSI6'\n b'ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRh'\n b'IjogbnVsbH0sIHsibmFtZSI6ICJEZXN0IiwgImZpZWxkX25hbWUiOiAiRGVz'\n b'dCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAi'\n b'b2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXN0YW5j'\n b'ZSIsICJmaWVsZF9uYW1lIjogIkRpc3RhbmNlIiwgInBhbmRhc190eXBlIjog'\n b'ImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRh'\n b'IjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25hbWUiOiAi'\n b'VGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'Q2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwgInBhbmRh'\n b'c190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0'\n b'YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNhcnJpZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51'\n b'bGx9LCB7Im5hbWUiOiAiV2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n 
b'ZSI6ICJOQVNEZWxheSIsICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5'\n b'IiwgImZpZWxkX25hbWUiOiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAi'\n b'ZmllbGRfbmFtZSI6ICJMYXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJv'\n b'dyIsICJ2ZXJzaW9uIjogIjAuMTUuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAi'\n b'MC4yNS4zIn0AFgAAAHwEAAA4BAAAAAQAAMgDAACQAwAAWAMAACQDAADsAgAA'\n b'tAIAAHwCAABEAgAAEAIAAOQBAAC4AQAAhAEAAFQBAAAcAQAA5AAAAKwAAAB4'\n b'AAAAQAAAAAQAAADs+///AAABAxgAAAAMAAAABAAAAAAAAAC2/P//AAACABEA'\n b'AABMYXRlQWlyY3JhZnREZWxheQAAACT8//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'AO78//8AAAIADQAAAFNlY3VyaXR5RGVsYXkAAABY/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAi/f//AAACAAgAAABOQVNEZWxheQAAAACI/P//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABS/f//AAACAAwAAABXZWF0aGVyRGVsYXkAAAAAvPz//wAA'\n b'AQMYAAAADAAAAAQAAAAAAAAAhv3//wAAAgAMAAAAQ2FycmllckRlbGF5AAAA'\n b'APD8//8AAAECHAAAAAwAAAAEAAAAAAAAAOD8//8AAAABQAAAAAkAAABDYW5j'\n b'ZWxsZWQAAAAk/f//AAABAxgAAAAMAAAABAAAAAAAAADu/f//AAACAAcAAABU'\n b'YXhpT3V0AFD9//8AAAEDGAAAAAwAAAAEAAAAAAAAABr+//8AAAIACAAAAERp'\n b'c3RhbmNlAAAAAID9//8AAAEFFAAAAAwAAAAEAAAAAAAAABj///8EAAAARGVz'\n b'dAAAAACo/f//AAABBRQAAAAMAAAABAAAAAAAAABA////BgAAAE9yaWdpbgAA'\n b'0P3//wAAAQIcAAAADAAAAAQAAAAAAAAAwP3//wAAAAFAAAAABwAAAEFpclRp'\n b'bWUAAP7//wAAAQMYAAAADAAAAAQAAAAAAAAAyv7//wAAAgAOAAAAQ1JTRWxh'\n b'cHNlZFRpbWUAADT+//8AAAECHAAAAAwAAAAEAAAAAAAAACT+//8AAAABQAAA'\n b'AAkAAABGbGlnaHROdW0AAABo/v//AAABBRgAAAAQAAAABAAAAAAAAAAEAAQA'\n b'BAAAAA0AAABVbmlxdWVDYXJyaWVyAAAAnP7//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAAjP7//wAAAAFAAAAACgAAAENSU0FyclRpbWUAAND+//8AAAECHAAAAAwA'\n b'AAAEAAAAAAAAAMD+//8AAAABQAAAAAcAAABBcnJUaW1lAAD///8AAAECHAAA'\n 
b'AAwAAAAEAAAAAAAAAPD+//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////'\n b'AAABAyAAAAAUAAAABAAAAAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRp'\n b'bWUAaP///wAAAQIcAAAADAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERh'\n b'eU9mV2VlawAAAJz///8AAAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAA'\n b'AAoAAABEYXlvZk1vbnRoAADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////'\n b'AAAAAUAAAAAFAAAATW9udGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIk'\n b'AAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAA'\n b'AAA=')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nCRSElapsedTime: double\nAirTime: double\nOrigin: string\nDest: string\nDistance: double\nTaxiOut: double\nCancelled: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'90730000, \"stop\": 90740000, \"step\": 1}], \"column_indexes\": ['\n b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n b', \"columns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas'\n b'_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {'\n b'\"name\": \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Dayo'\n b'fMonth\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWe'\n b'ek\", \"field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"num'\n b'py_type\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"f'\n b'ield_name\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type'\n b'\": \"float64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"fie'\n 
b'ld_name\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\"'\n b': \"int64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_nam'\n b'e\": \"ArrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"floa'\n b't64\", \"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\"'\n b': \"CRSArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64'\n b'\", \"metadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\"'\n b': \"UniqueCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"'\n b'object\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_nam'\n b'e\": \"FlightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\"'\n b': \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_n'\n b'ame\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": '\n b'\"Origin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"p'\n b'andas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": '\n b'null}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"panda'\n b's_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": nul'\n b'l}, {\"name\": \"TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_typ'\n b'e\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {'\n b'\"name\": \"Cancelled\", \"field_name\": \"Cancelled\", \"pandas_type'\n b'\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name'\n b'\": \"CarrierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_typ'\n b'e\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {'\n b'\"name\": \"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"panda'\n b's_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": nul'\n b'l}, {\"name\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_t'\n b'ype\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null},'\n b' {\"name\": \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAir'\n b'craftDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////4AQAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAAKQLAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAG8LAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiA5MDczMDAwMCwgInN0b3Ai'\n b'OiA5MDc0MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7'\n b'Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlw'\n b'ZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFk'\n b'YXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3si'\n b'bmFtZSI6ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNf'\n b'dHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjog'\n b'Ik1vbnRoIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9m'\n b'TW9udGgiLCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1l'\n b'IjogIkRheU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1w'\n b'eV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJt'\n 
b'ZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQi'\n b'LCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6'\n b'ICJVbmlxdWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAi'\n b'bnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIs'\n b'ICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJDUlNFbGFwc2VkVGlt'\n b'ZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRUaW1lIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJBaXJUaW1lIiwgImZpZWxkX25h'\n b'bWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiT3JpZ2luIiwgImZpZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFt'\n b'ZSI6ICJEZXN0IiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkRpc3RhbmNlIiwgImZpZWxkX25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlRheGlPdXQiLCAiZmllbGRf'\n b'bmFtZSI6ICJUYXhpT3V0IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsi'\n b'bmFtZSI6ICJDYW5jZWxsZWQiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsZWQi'\n 
b'LCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ2FycmllckRlbGF5'\n b'IiwgImZpZWxkX25hbWUiOiAiQ2FycmllckRlbGF5IiwgInBhbmRhc190eXBl'\n b'IjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJXZWF0aGVyRGVsYXkiLCAiZmllbGRf'\n b'bmFtZSI6ICJXZWF0aGVyRGVsYXkiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk5BU0RlbGF5IiwgImZpZWxkX25hbWUiOiAiTkFTRGVs'\n b'YXkiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjog'\n b'ImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlNlY3Vy'\n b'aXR5RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJTZWN1cml0eURlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJMYXRlQWlyY3JhZnRE'\n b'ZWxheSIsICJmaWVsZF9uYW1lIjogIkxhdGVBaXJjcmFmdERlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAWAAAAdAQAADAEAAD4AwAAwAMAAIgDAABQAwAA'\n b'IAMAAOgCAACwAgAAeAIAAEACAAAQAgAA5AEAALgBAACEAQAAVAEAABwBAADk'\n b'AAAArAAAAHgAAABAAAAABAAAAPT7//8AAAEDGAAAAAwAAAAEAAAAAAAAAL78'\n b'//8AAAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAALPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAA9vz//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAGD8//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAACr9//8AAAIACAAAAE5BU0RlbGF5AAAAAJD8//8A'\n b'AAEDGAAAAAwAAAAEAAAAAAAAAFr9//8AAAIADAAAAFdlYXRoZXJEZWxheQAA'\n b'AADE/P//AAABAxgAAAAMAAAABAAAAAAAAACO/f//AAACAAwAAABDYXJyaWVy'\n b'RGVsYXkAAAAA+Pz//wAAAQIcAAAADAAAAAQAAAAAAAAA6Pz//wAAAAFAAAAA'\n b'CQAAAENhbmNlbGxlZAAAACz9//8AAAEDGAAAAAwAAAAEAAAAAAAAAPb9//8A'\n b'AAIABwAAAFRheGlPdXQAWP3//wAAAQMYAAAADAAAAAQAAAAAAAAAIv7//wAA'\n b'AgAIAAAARGlzdGFuY2UAAAAAiP3//wAAAQUUAAAADAAAAAQAAAAAAAAAHP//'\n b'/wQAAABEZXN0AAAAALD9//8AAAEFFAAAAAwAAAAEAAAAAAAAAET///8GAAAA'\n 
b'T3JpZ2luAADY/f//AAABAxgAAAAMAAAABAAAAAAAAACi/v//AAACAAcAAABB'\n b'aXJUaW1lAAT+//8AAAEDGAAAAAwAAAAEAAAAAAAAAM7+//8AAAIADgAAAENS'\n b'U0VsYXBzZWRUaW1lAAA4/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAA'\n b'AUAAAAAJAAAARmxpZ2h0TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAA'\n b'BAAEAAQAAAANAAAAVW5pcXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAE'\n b'AAAAAAAAAJD+//8AAAABQAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgA'\n b'AAAMAAAABAAAAAAAAACe////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAA'\n b'AAwAAAAEAAAAAAAAAPD+//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////'\n b'AAABAyAAAAAUAAAABAAAAAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRp'\n b'bWUAaP///wAAAQIcAAAADAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERh'\n b'eU9mV2VlawAAAJz///8AAAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAA'\n b'AAoAAABEYXlvZk1vbnRoAADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////'\n b'AAAAAUAAAAAFAAAATW9udGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIk'\n b'AAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAA'\n b'AAA=')])", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mParquetDataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTARGET_PATH\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size)\u001b[0m\n\u001b[1;32m 1058\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1059\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0mvalidate_schema\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1060\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalidate_schemas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1061\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1062\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mequals\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/pyarrow/parquet.py\u001b[0m in \u001b[0;36mvalidate_schemas\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1111\u001b[0m \u001b[0;34m'{1!s}\\n\\nvs\\n\\n{2!s}'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1112\u001b[0m .format(piece, file_schema,\n\u001b[0;32m-> 1113\u001b[0;31m dataset_schema))\n\u001b[0m\u001b[1;32m 1114\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1115\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_threads\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muse_pandas_metadata\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mValueError\u001b[0m: Schema in /User/mlrun/airlines/dataset/1278d2c85afc40cabc8e5add8d12892e.parquet was different. 
\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: int64\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nCRSElapsedTime: double\nAirTime: int64\nOrigin: string\nDest: string\nDistance: double\nTaxiOut: double\nCancelled: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'95550000, \"stop\": 95560000, \"step\": 1}], \"column_indexes\": ['\n b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n b', \"columns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas'\n b'_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {'\n b'\"name\": \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Dayo'\n b'fMonth\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWe'\n b'ek\", \"field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"num'\n b'py_type\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"f'\n b'ield_name\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type'\n b'\": \"float64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"fie'\n b'ld_name\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\"'\n b': \"int64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_nam'\n b'e\": \"ArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\"'\n b', \"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\": \"C'\n b'RSArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"'\n b'metadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\": \"U'\n b'niqueCarrier\", 
\"pandas_type\": \"unicode\", \"numpy_type\": \"obje'\n b'ct\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_name\": '\n b'\"FlightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", '\n b'\"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_name\": '\n b'\"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"f'\n b'loat64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_name\"'\n b': \"AirTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", '\n b'\"metadata\": null}, {\"name\": \"Origin\", \"field_name\": \"Origin\"'\n b', \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadat'\n b'a\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"pandas_ty'\n b'pe\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {'\n b'\"name\": \"Distance\", \"field_name\": \"Distance\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"Cancelled\", \"field_name\": \"Cancelled\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Carr'\n b'ierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_type\": \"flo'\n b'at64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": '\n b'\"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"pandas_type\":'\n b' \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"na'\n b'me\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_type\": \"f'\n b'loat64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\"'\n b': \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"pandas_ty'\n b'pe\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, '\n b'{\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAircraftDel'\n b'ay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", 
\"met'\n b'adata\": null}], \"creator\": {\"library\": \"pyarrow\", \"version\":'\n b' \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////4AQAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAAJwLAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAGcLAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiA5NTU1MDAwMCwgInN0b3Ai'\n b'OiA5NTU2MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7'\n b'Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlw'\n b'ZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFk'\n b'YXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3si'\n b'bmFtZSI6ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNf'\n b'dHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjog'\n b'Ik1vbnRoIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9m'\n b'TW9udGgiLCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1l'\n b'IjogIkRheU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1w'\n b'eV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJt'\n b'ZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQi'\n b'LCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIs'\n b'ICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTQXJyVGltZSIsICJm'\n b'aWVsZF9uYW1lIjogIkNSU0FyclRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50'\n 
b'NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9'\n b'LCB7Im5hbWUiOiAiVW5pcXVlQ2FycmllciIsICJmaWVsZF9uYW1lIjogIlVu'\n b'aXF1ZUNhcnJpZXIiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1w'\n b'eV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUi'\n b'OiAiRmxpZ2h0TnVtIiwgImZpZWxkX25hbWUiOiAiRmxpZ2h0TnVtIiwgInBh'\n b'bmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0VsYXBzZWRUaW1lIiwg'\n b'ImZpZWxkX25hbWUiOiAiQ1JTRWxhcHNlZFRpbWUiLCAicGFuZGFzX3R5cGUi'\n b'OiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkFpclRpbWUiLCAiZmllbGRfbmFtZSI6'\n b'ICJBaXJUaW1lIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIk9y'\n b'aWdpbiIsICJmaWVsZF9uYW1lIjogIk9yaWdpbiIsICJwYW5kYXNfdHlwZSI6'\n b'ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRh'\n b'IjogbnVsbH0sIHsibmFtZSI6ICJEZXN0IiwgImZpZWxkX25hbWUiOiAiRGVz'\n b'dCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAi'\n b'b2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJEaXN0YW5j'\n b'ZSIsICJmaWVsZF9uYW1lIjogIkRpc3RhbmNlIiwgInBhbmRhc190eXBlIjog'\n b'ImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRh'\n b'IjogbnVsbH0sIHsibmFtZSI6ICJUYXhpT3V0IiwgImZpZWxkX25hbWUiOiAi'\n b'VGF4aU91dCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5'\n b'cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAi'\n b'Q2FuY2VsbGVkIiwgImZpZWxkX25hbWUiOiAiQ2FuY2VsbGVkIiwgInBhbmRh'\n b'c190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0'\n b'YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNhcnJpZXJEZWxheSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNhcnJpZXJEZWxheSIsICJwYW5kYXNfdHlwZSI6ICJmbG9h'\n b'dDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51'\n b'bGx9LCB7Im5hbWUiOiAiV2VhdGhlckRlbGF5IiwgImZpZWxkX25hbWUiOiAi'\n b'V2VhdGhlckRlbGF5IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVt'\n b'cHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFt'\n 
b'ZSI6ICJOQVNEZWxheSIsICJmaWVsZF9uYW1lIjogIk5BU0RlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJTZWN1cml0eURlbGF5'\n b'IiwgImZpZWxkX25hbWUiOiAiU2VjdXJpdHlEZWxheSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiTGF0ZUFpcmNyYWZ0RGVsYXkiLCAi'\n b'ZmllbGRfbmFtZSI6ICJMYXRlQWlyY3JhZnREZWxheSIsICJwYW5kYXNfdHlw'\n b'ZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRh'\n b'ZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJv'\n b'dyIsICJ2ZXJzaW9uIjogIjAuMTUuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAi'\n b'MC4yNS4zIn0AFgAAAHwEAAA4BAAAAAQAAMgDAACQAwAAWAMAACQDAADsAgAA'\n b'tAIAAHwCAABEAgAAEAIAAOQBAAC4AQAAhAEAAFQBAAAcAQAA5AAAAKwAAAB4'\n b'AAAAQAAAAAQAAADs+///AAABAxgAAAAMAAAABAAAAAAAAAC2/P//AAACABEA'\n b'AABMYXRlQWlyY3JhZnREZWxheQAAACT8//8AAAEDGAAAAAwAAAAEAAAAAAAA'\n b'AO78//8AAAIADQAAAFNlY3VyaXR5RGVsYXkAAABY/P//AAABAxgAAAAMAAAA'\n b'BAAAAAAAAAAi/f//AAACAAgAAABOQVNEZWxheQAAAACI/P//AAABAxgAAAAM'\n b'AAAABAAAAAAAAABS/f//AAACAAwAAABXZWF0aGVyRGVsYXkAAAAAvPz//wAA'\n b'AQMYAAAADAAAAAQAAAAAAAAAhv3//wAAAgAMAAAAQ2FycmllckRlbGF5AAAA'\n b'APD8//8AAAECHAAAAAwAAAAEAAAAAAAAAOD8//8AAAABQAAAAAkAAABDYW5j'\n b'ZWxsZWQAAAAk/f//AAABAxgAAAAMAAAABAAAAAAAAADu/f//AAACAAcAAABU'\n b'YXhpT3V0AFD9//8AAAEDGAAAAAwAAAAEAAAAAAAAABr+//8AAAIACAAAAERp'\n b'c3RhbmNlAAAAAID9//8AAAEFFAAAAAwAAAAEAAAAAAAAABj///8EAAAARGVz'\n b'dAAAAACo/f//AAABBRQAAAAMAAAABAAAAAAAAABA////BgAAAE9yaWdpbgAA'\n b'0P3//wAAAQIcAAAADAAAAAQAAAAAAAAAwP3//wAAAAFAAAAABwAAAEFpclRp'\n b'bWUAAP7//wAAAQMYAAAADAAAAAQAAAAAAAAAyv7//wAAAgAOAAAAQ1JTRWxh'\n b'cHNlZFRpbWUAADT+//8AAAECHAAAAAwAAAAEAAAAAAAAACT+//8AAAABQAAA'\n b'AAkAAABGbGlnaHROdW0AAABo/v//AAABBRgAAAAQAAAABAAAAAAAAAAEAAQA'\n b'BAAAAA0AAABVbmlxdWVDYXJyaWVyAAAAnP7//wAAAQIcAAAADAAAAAQAAAAA'\n b'AAAAjP7//wAAAAFAAAAACgAAAENSU0FyclRpbWUAAND+//8AAAECHAAAAAwA'\n b'AAAEAAAAAAAAAMD+//8AAAABQAAAAAcAAABBcnJUaW1lAAD///8AAAECHAAA'\n 
b'AAwAAAAEAAAAAAAAAPD+//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////'\n b'AAABAyAAAAAUAAAABAAAAAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRp'\n b'bWUAaP///wAAAQIcAAAADAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERh'\n b'eU9mV2VlawAAAJz///8AAAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAA'\n b'AAoAAABEYXlvZk1vbnRoAADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////'\n b'AAAAAUAAAAAFAAAATW9udGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIk'\n b'AAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAA'\n b'AAA=')])\n\nvs\n\nYear: int64\nMonth: int64\nDayofMonth: int64\nDayOfWeek: int64\nDepTime: double\nCRSDepTime: int64\nArrTime: double\nCRSArrTime: int64\nUniqueCarrier: string\nFlightNum: int64\nCRSElapsedTime: double\nAirTime: double\nOrigin: string\nDest: string\nDistance: double\nTaxiOut: double\nCancelled: int64\nCarrierDelay: double\nWeatherDelay: double\nNASDelay: double\nSecurityDelay: double\nLateAircraftDelay: double\nmetadata\n--------\nOrderedDict([(b'pandas',\n b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": '\n b'90730000, \"stop\": 90740000, \"step\": 1}], \"column_indexes\": ['\n b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n b', \"columns\": [{\"name\": \"Year\", \"field_name\": \"Year\", \"pandas'\n b'_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {'\n b'\"name\": \"Month\", \"field_name\": \"Month\", \"pandas_type\": \"int6'\n b'4\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"Dayo'\n b'fMonth\", \"field_name\": \"DayofMonth\", \"pandas_type\": \"int64\",'\n b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DayOfWe'\n b'ek\", \"field_name\": \"DayOfWeek\", \"pandas_type\": \"int64\", \"num'\n b'py_type\": \"int64\", \"metadata\": null}, {\"name\": \"DepTime\", \"f'\n b'ield_name\": \"DepTime\", \"pandas_type\": \"float64\", \"numpy_type'\n b'\": \"float64\", \"metadata\": null}, {\"name\": \"CRSDepTime\", \"fie'\n 
b'ld_name\": \"CRSDepTime\", \"pandas_type\": \"int64\", \"numpy_type\"'\n b': \"int64\", \"metadata\": null}, {\"name\": \"ArrTime\", \"field_nam'\n b'e\": \"ArrTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"floa'\n b't64\", \"metadata\": null}, {\"name\": \"CRSArrTime\", \"field_name\"'\n b': \"CRSArrTime\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64'\n b'\", \"metadata\": null}, {\"name\": \"UniqueCarrier\", \"field_name\"'\n b': \"UniqueCarrier\", \"pandas_type\": \"unicode\", \"numpy_type\": \"'\n b'object\", \"metadata\": null}, {\"name\": \"FlightNum\", \"field_nam'\n b'e\": \"FlightNum\", \"pandas_type\": \"int64\", \"numpy_type\": \"int6'\n b'4\", \"metadata\": null}, {\"name\": \"CRSElapsedTime\", \"field_nam'\n b'e\": \"CRSElapsedTime\", \"pandas_type\": \"float64\", \"numpy_type\"'\n b': \"float64\", \"metadata\": null}, {\"name\": \"AirTime\", \"field_n'\n b'ame\": \"AirTime\", \"pandas_type\": \"float64\", \"numpy_type\": \"fl'\n b'oat64\", \"metadata\": null}, {\"name\": \"Origin\", \"field_name\": '\n b'\"Origin\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n b'\"metadata\": null}, {\"name\": \"Dest\", \"field_name\": \"Dest\", \"p'\n b'andas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": '\n b'null}, {\"name\": \"Distance\", \"field_name\": \"Distance\", \"panda'\n b's_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": nul'\n b'l}, {\"name\": \"TaxiOut\", \"field_name\": \"TaxiOut\", \"pandas_typ'\n b'e\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {'\n b'\"name\": \"Cancelled\", \"field_name\": \"Cancelled\", \"pandas_type'\n b'\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name'\n b'\": \"CarrierDelay\", \"field_name\": \"CarrierDelay\", \"pandas_typ'\n b'e\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {'\n b'\"name\": \"WeatherDelay\", \"field_name\": \"WeatherDelay\", \"panda'\n b's_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": nul'\n b'l}, {\"name\": \"NASDelay\", \"field_name\": \"NASDelay\", \"pandas_t'\n b'ype\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null},'\n b' {\"name\": \"SecurityDelay\", \"field_name\": \"SecurityDelay\", \"p'\n b'andas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":'\n b' null}, {\"name\": \"LateAircraftDelay\", \"field_name\": \"LateAir'\n b'craftDelay\", \"pandas_type\": \"float64\", \"numpy_type\": \"float6'\n b'4\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"v'\n b'ersion\": \"0.15.1\"}, \"pandas_version\": \"0.25.3\"}'),\n (b'ARROW:schema',\n b'/////4AQAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAE'\n b'AAgACgAAAKQLAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYA'\n b'AABwYW5kYXMAAG8LAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJh'\n b'bmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiA5MDczMDAwMCwgInN0b3Ai'\n b'OiA5MDc0MDAwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7'\n b'Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlw'\n b'ZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFk'\n b'YXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3si'\n b'bmFtZSI6ICJZZWFyIiwgImZpZWxkX25hbWUiOiAiWWVhciIsICJwYW5kYXNf'\n b'dHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJNb250aCIsICJmaWVsZF9uYW1lIjog'\n b'Ik1vbnRoIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUi'\n b'OiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRheW9m'\n b'TW9udGgiLCAiZmllbGRfbmFtZSI6ICJEYXlvZk1vbnRoIiwgInBhbmRhc190'\n b'eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRh'\n b'dGEiOiBudWxsfSwgeyJuYW1lIjogIkRheU9mV2VlayIsICJmaWVsZF9uYW1l'\n b'IjogIkRheU9mV2VlayIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1w'\n b'eV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6'\n b'ICJEZXBUaW1lIiwgImZpZWxkX25hbWUiOiAiRGVwVGltZSIsICJwYW5kYXNf'\n b'dHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJt'\n 
b'ZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ1JTRGVwVGltZSIsICJmaWVs'\n b'ZF9uYW1lIjogIkNSU0RlcFRpbWUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQi'\n b'LCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7'\n b'Im5hbWUiOiAiQXJyVGltZSIsICJmaWVsZF9uYW1lIjogIkFyclRpbWUiLCAi'\n b'cGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0'\n b'NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkNSU0FyclRpbWUi'\n b'LCAiZmllbGRfbmFtZSI6ICJDUlNBcnJUaW1lIiwgInBhbmRhc190eXBlIjog'\n b'ImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBu'\n b'dWxsfSwgeyJuYW1lIjogIlVuaXF1ZUNhcnJpZXIiLCAiZmllbGRfbmFtZSI6'\n b'ICJVbmlxdWVDYXJyaWVyIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAi'\n b'bnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJu'\n b'YW1lIjogIkZsaWdodE51bSIsICJmaWVsZF9uYW1lIjogIkZsaWdodE51bSIs'\n b'ICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJDUlNFbGFwc2VkVGlt'\n b'ZSIsICJmaWVsZF9uYW1lIjogIkNSU0VsYXBzZWRUaW1lIiwgInBhbmRhc190'\n b'eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1l'\n b'dGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJBaXJUaW1lIiwgImZpZWxkX25h'\n b'bWUiOiAiQWlyVGltZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51'\n b'bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5h'\n b'bWUiOiAiT3JpZ2luIiwgImZpZWxkX25hbWUiOiAiT3JpZ2luIiwgInBhbmRh'\n b'c190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRlc3QiLCAiZmllbGRfbmFt'\n b'ZSI6ICJEZXN0IiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlf'\n b'dHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjog'\n b'IkRpc3RhbmNlIiwgImZpZWxkX25hbWUiOiAiRGlzdGFuY2UiLCAicGFuZGFz'\n b'X3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAi'\n b'bWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlRheGlPdXQiLCAiZmllbGRf'\n b'bmFtZSI6ICJUYXhpT3V0IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAi'\n b'bnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsi'\n b'bmFtZSI6ICJDYW5jZWxsZWQiLCAiZmllbGRfbmFtZSI6ICJDYW5jZWxsZWQi'\n 
b'LCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2'\n b'NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiQ2FycmllckRlbGF5'\n b'IiwgImZpZWxkX25hbWUiOiAiQ2FycmllckRlbGF5IiwgInBhbmRhc190eXBl'\n b'IjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFk'\n b'YXRhIjogbnVsbH0sIHsibmFtZSI6ICJXZWF0aGVyRGVsYXkiLCAiZmllbGRf'\n b'bmFtZSI6ICJXZWF0aGVyRGVsYXkiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2'\n b'NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxs'\n b'fSwgeyJuYW1lIjogIk5BU0RlbGF5IiwgImZpZWxkX25hbWUiOiAiTkFTRGVs'\n b'YXkiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjog'\n b'ImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlNlY3Vy'\n b'aXR5RGVsYXkiLCAiZmllbGRfbmFtZSI6ICJTZWN1cml0eURlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJMYXRlQWlyY3JhZnRE'\n b'ZWxheSIsICJmaWVsZF9uYW1lIjogIkxhdGVBaXJjcmFmdERlbGF5IiwgInBh'\n b'bmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0'\n b'IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6'\n b'ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVy'\n b'c2lvbiI6ICIwLjI1LjMifQAWAAAAdAQAADAEAAD4AwAAwAMAAIgDAABQAwAA'\n b'IAMAAOgCAACwAgAAeAIAAEACAAAQAgAA5AEAALgBAACEAQAAVAEAABwBAADk'\n b'AAAArAAAAHgAAABAAAAABAAAAPT7//8AAAEDGAAAAAwAAAAEAAAAAAAAAL78'\n b'//8AAAIAEQAAAExhdGVBaXJjcmFmdERlbGF5AAAALPz//wAAAQMYAAAADAAA'\n b'AAQAAAAAAAAA9vz//wAAAgANAAAAU2VjdXJpdHlEZWxheQAAAGD8//8AAAED'\n b'GAAAAAwAAAAEAAAAAAAAACr9//8AAAIACAAAAE5BU0RlbGF5AAAAAJD8//8A'\n b'AAEDGAAAAAwAAAAEAAAAAAAAAFr9//8AAAIADAAAAFdlYXRoZXJEZWxheQAA'\n b'AADE/P//AAABAxgAAAAMAAAABAAAAAAAAACO/f//AAACAAwAAABDYXJyaWVy'\n b'RGVsYXkAAAAA+Pz//wAAAQIcAAAADAAAAAQAAAAAAAAA6Pz//wAAAAFAAAAA'\n b'CQAAAENhbmNlbGxlZAAAACz9//8AAAEDGAAAAAwAAAAEAAAAAAAAAPb9//8A'\n b'AAIABwAAAFRheGlPdXQAWP3//wAAAQMYAAAADAAAAAQAAAAAAAAAIv7//wAA'\n b'AgAIAAAARGlzdGFuY2UAAAAAiP3//wAAAQUUAAAADAAAAAQAAAAAAAAAHP//'\n b'/wQAAABEZXN0AAAAALD9//8AAAEFFAAAAAwAAAAEAAAAAAAAAET///8GAAAA'\n 
b'T3JpZ2luAADY/f//AAABAxgAAAAMAAAABAAAAAAAAACi/v//AAACAAcAAABB'\n b'aXJUaW1lAAT+//8AAAEDGAAAAAwAAAAEAAAAAAAAAM7+//8AAAIADgAAAENS'\n b'U0VsYXBzZWRUaW1lAAA4/v//AAABAhwAAAAMAAAABAAAAAAAAAAo/v//AAAA'\n b'AUAAAAAJAAAARmxpZ2h0TnVtAAAAbP7//wAAAQUYAAAAEAAAAAQAAAAAAAAA'\n b'BAAEAAQAAAANAAAAVW5pcXVlQ2FycmllcgAAAKD+//8AAAECHAAAAAwAAAAE'\n b'AAAAAAAAAJD+//8AAAABQAAAAAoAAABDUlNBcnJUaW1lAADU/v//AAABAxgA'\n b'AAAMAAAABAAAAAAAAACe////AAACAAcAAABBcnJUaW1lAAD///8AAAECHAAA'\n b'AAwAAAAEAAAAAAAAAPD+//8AAAABQAAAAAoAAABDUlNEZXBUaW1lAAA0////'\n b'AAABAyAAAAAUAAAABAAAAAAAAAAAAAYACAAGAAYAAAAAAAIABwAAAERlcFRp'\n b'bWUAaP///wAAAQIcAAAADAAAAAQAAAAAAAAAWP///wAAAAFAAAAACQAAAERh'\n b'eU9mV2VlawAAAJz///8AAAECHAAAAAwAAAAEAAAAAAAAAIz///8AAAABQAAA'\n b'AAoAAABEYXlvZk1vbnRoAADQ////AAABAhwAAAAMAAAABAAAAAAAAADA////'\n b'AAAAAUAAAAAFAAAATW9udGgAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIk'\n b'AAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAUAAAAAEAAAAWWVhcgAA'\n b'AAA=')])" - ] + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DayofMonthDayOfWeekDepTimeCRSDepTimeArrTimeCRSArrTimeUniqueCarrierFlightNumCRSElapsedTimeAirTimeOriginDestDistanceTaxiOutCancelledCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelay
YearMonth
200792151951.018152058.01901EV431846.025.0CSGATL83.05.000.00.021.00.096.0
92371826.018151915.01901EV431846.022.0CSGATL83.08.000.00.00.00.00.0
92411827.018151906.01901EV431846.019.0CSGATL83.07.000.00.00.00.00.0
92521840.018151915.01901EV431846.022.0CSGATL83.03.000.00.00.00.00.0
92631815.018151847.01901EV431846.017.0CSGATL83.05.000.00.00.00.00.0
\n", + "
" + ], + "text/plain": [ + " DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime \\\n", + "Year Month \n", + "2007 9 21 5 1951.0 1815 2058.0 1901 \n", + " 9 23 7 1826.0 1815 1915.0 1901 \n", + " 9 24 1 1827.0 1815 1906.0 1901 \n", + " 9 25 2 1840.0 1815 1915.0 1901 \n", + " 9 26 3 1815.0 1815 1847.0 1901 \n", + "\n", + " UniqueCarrier FlightNum CRSElapsedTime AirTime Origin Dest \\\n", + "Year Month \n", + "2007 9 EV 4318 46.0 25.0 CSG ATL \n", + " 9 EV 4318 46.0 22.0 CSG ATL \n", + " 9 EV 4318 46.0 19.0 CSG ATL \n", + " 9 EV 4318 46.0 22.0 CSG ATL \n", + " 9 EV 4318 46.0 17.0 CSG ATL \n", + "\n", + " Distance TaxiOut Cancelled CarrierDelay WeatherDelay \\\n", + "Year Month \n", + "2007 9 83.0 5.0 0 0.0 0.0 \n", + " 9 83.0 8.0 0 0.0 0.0 \n", + " 9 83.0 7.0 0 0.0 0.0 \n", + " 9 83.0 3.0 0 0.0 0.0 \n", + " 9 83.0 5.0 0 0.0 0.0 \n", + "\n", + " NASDelay SecurityDelay LateAircraftDelay \n", + "Year Month \n", + "2007 9 21.0 0.0 96.0 \n", + " 9 0.0 0.0 0.0 \n", + " 9 0.0 0.0 0.0 \n", + " 9 0.0 0.0 0.0 \n", + " 9 0.0 0.0 0.0 " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "dataset = pq.ParquetDataset(TARGET_PATH)" + "df.head()" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ - "df = dataset.read().to_pandas().set_index(['Year', 'Month'], inplace=True)" + "if USE_ARCHIVE == ARCHIVE:\n", + " assert df.shape==(123_534_969, 20)" ] }, { From 25610594599a666d554804a1cbb2e5ecacceaa53 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 28 Jan 2020 00:09:52 +0000 Subject: [PATCH 27/32] added fileutils/parquet-to-dask function --- fileutils/arc_to_parquet/arc_to_parquet.py | 39 +- fileutils/arc_to_parquet/arc_to_parquet.yaml | 6 +- .../__pycache__/function.cpython-36.pyc | Bin 0 -> 1251 bytes fileutils/parq_to_dask/function.py | 62 +++ fileutils/parq_to_dask/function.yaml | 23 + tests/arc_to_parquet-airlines.ipynb | 429 ++++++++++-------- 
tests/parq_to_dask.ipynb | 303 +++++++++++++ 7 files changed, 646 insertions(+), 216 deletions(-) create mode 100644 fileutils/parq_to_dask/__pycache__/function.cpython-36.pyc create mode 100644 fileutils/parq_to_dask/function.py create mode 100644 fileutils/parq_to_dask/function.yaml create mode 100644 tests/parq_to_dask.ipynb diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py b/fileutils/arc_to_parquet/arc_to_parquet.py index 853dc7234..9400616e0 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/arc_to_parquet.py @@ -40,17 +40,15 @@ def arc_to_parquet( context: MLClientCtx, archive_url: Union[str, Path, IO[AnyStr]], header: Optional[List[str]] = None, + inc_cols: Optional[List[str]] = None, target_path: str = "", name: str = "", chunksize: int = 10_000, dtype=None, encoding: str = 'latin-1', - log_data: bool = True, - add_uid: bool = False, - key: str = "raw_data", - dataset: bool = False, + key: str = 'data', + dataset: str = 'dataset', partition_cols = [], - inc_cols: Optional[List[str]] = None ) -> None: """Open a file/object archive and save as a parquet file. @@ -61,21 +59,28 @@ def arc_to_parquet( of pandas.read_csv, including strings as file paths, as urls, pathlib.Path objects, etc... 
:param header: column names + :param inc_cols: include only these columns :param target_path: destination folder of table :param name: name file to be saved locally, also :param chunksize: (0) row size retrieved per iteration - :param inc_cols: include only these columns :param dtype destination data type of specified columns + :param encoding: ('latin-1') file encoding :param key: key in artifact store (when log_data=True) - :param dataset: (False) if True then target_path is folder for - partitioned files + :param dataset: ('dataset') if non-empty then 'target_path/dataset' + is folder for partitioned files :param part_cols: ([]) list of partitioning columns + """ if not name.endswith(".pqt"): name += ".pqt" - dest_path = os.path.join(target_path, name) - os.makedirs(os.path.join(target_path), exist_ok=True) + if dataset: + os.makedirs(os.path.join(target_path, dataset), exist_ok=True) + dest_path = os.path.join(target_path, dataset) + else: + os.makedirs(os.path.join(target_path), exist_ok=True) + dest_path = os.path.join(target_path, name) + if not os.path.isfile(dest_path): context.logger.info("destination file does not exist, downloading") pqwriter = None @@ -87,10 +92,13 @@ def arc_to_parquet( dtype=dtype)): table = pa.Table.from_pandas(df) if i == 0: - pqwriter = pq.ParquetWriter(dest_path, table.schema) + # write the header to target_path...
+ pqwriter = pq.ParquetWriter(os.path.join(target_path,'header-only.pqt'), table.schema) if dataset: - pq.write_to_dataset(table, root_path=target_path, partition_cols=partition_cols) + # ...and files to subfolder dataset + pq.write_to_dataset(table, root_path=dest_path, partition_cols=partition_cols) else: + # ...and file to a parquet file pqwriter.write_table(table) if pqwriter: @@ -99,10 +107,5 @@ def arc_to_parquet( context.logger.info(f"saved table to {dest_path}") else: context.logger.info("destination file already exists") - - context.log_artifact(key, target_path=dest_path) - # log header - filepath = os.path.join(target_path, 'header.pkl') - dump(header, open(filepath, 'wb')) - context.log_artifact('header', target_path=filepath) + # context.log_artifact(key, target_path=dest_path) diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml index c1b83a1d4..ec6bb4f65 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ b/fileutils/arc_to_parquet/arc_to_parquet.yaml @@ -2,7 +2,7 @@ kind: job metadata: name: arc-to-parquet tag: '' - hash: 8a02024de5fc9c1e0876488700a604ca8551e991 + hash: 54c52b0bc70a6b44d8df0126500c6527ee749b02 project: '' spec: command: '' @@ -12,7 +12,7 @@ spec: env: [] description: '' build: - functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBudW1weSBhcyBucAppbXBvcnQgcGFuZGFzIGFzIHBkCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBwaWNrbGUgaW1wb3J0IGR1bXAsIGxvYWQKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIHR5cGluZyBpbXBvcnQgSU8sIEFueVN0ciwgVW5pb24sIExpc3QsIE9wdGlvbmFsCgoKZGVmIGFyY190b19wYXJxdWV0KAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICBhcmNoaXZlX3VybDogVW5pb25bc3RyLCBQYXRoLCBJT1tBbnlTdHJdXSwKICAgIGhlYWRlcjogT3B0aW9uYWxbTGlzdFtzdHJdXSA9IE5vbmUsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gIiIsCiAgICBuYW1lOiBzdHIgPSAiIiwKICAgIGNodW5rc2l6ZTogaW50ID0gMTBfMDAwLAogICAgZHR5cGU9Tm9uZSwKICAgIGVuY29kaW5nOiBzdHIgPSAnbGF0aW4tMScsCiAgICBsb2dfZGF0YTogYm9vbCA9IFRydWUsCiAgICBhZGRfdWlkOiBib29sID0gRmFsc2UsCiAgICBrZXk6IHN0ciA9ICJy
YXdfZGF0YSIsCiAgICBkYXRhc2V0OiBib29sID0gRmFsc2UsCiAgICBwYXJ0aXRpb25fY29scyA9IFtdLAogICAgaW5jX2NvbHM6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICBQYXJ0aXRpb25pbmcgcmVxdWlyZXMgcHJlY2lzZSBzcGVjaWZpY2F0aW9uIG9mIGNvbHVtbiB0eXBlcy4KICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gYXJjaGl2ZV91cmw6IGFueSB2YWxpZCBzdHJpbmcgcGF0aCBjb25zaXN0ZW50IHdpdGggdGhlIHBhdGggdmFyaWFibGUKICAgICAgICAgICAgICAgICAgICAgICAgb2YgcGFuZGFzLnJlYWRfY3N2LCBpbmNsdWRpbmcgc3RyaW5ncyBhcyBmaWxlIHBhdGhzLCBhcyB1cmxzLCAKICAgICAgICAgICAgICAgICAgICAgICAgcGF0aGxpYi5QYXRoIG9iamVjdHMsIGV0Yy4uLgogICAgOnBhcmFtIGhlYWRlcjogICAgICBjb2x1bW4gbmFtZXMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogZGVzdGluYXRpb24gZm9sZGVyIG9mIHRhYmxlCiAgICA6cGFyYW0gbmFtZTogICAgICAgIG5hbWUgZmlsZSB0byBiZSBzYXZlZCBsb2NhbGx5LCBhbHNvCiAgICA6cGFyYW0gY2h1bmtzaXplOiAgICgwKSByb3cgc2l6ZSByZXRyaWV2ZWQgcGVyIGl0ZXJhdGlvbgogICAgOnBhcmFtIGluY19jb2xzOiAgICBpbmNsdWRlIG9ubHkgdGhlc2UgY29sdW1ucwogICAgOnBhcmFtIGR0eXBlICAgICAgICBkZXN0aW5hdGlvbiBkYXRhIHR5cGUgb2Ygc3BlY2lmaWVkIGNvbHVtbnMKICAgIDpwYXJhbSBrZXk6ICAgICAgICAga2V5IGluIGFydGlmYWN0IHN0b3JlICh3aGVuIGxvZ19kYXRhPVRydWUpCiAgICA6cGFyYW0gZGF0YXNldDogICAgIChGYWxzZSkgaWYgVHJ1ZSB0aGVuIHRhcmdldF9wYXRoIGlzIGZvbGRlciBmb3IKICAgICAgICAgICAgICAgICAgICAgICAgcGFydGl0aW9uZWQgZmlsZXMKICAgIDpwYXJhbSBwYXJ0X2NvbHM6ICAgKFtdKSBsaXN0IG9mIHBhcnRpdGlvbmluZyBjb2x1bW5zCiAgICAiIiIKICAgIGlmIG5vdCBuYW1lLmVuZHN3aXRoKCIucHF0Iik6CiAgICAgICAgbmFtZSArPSAiLnBxdCIKICAgIAogICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgb3MubWFrZWRpcnMob3MucGF0aC5qb2luKHRhcmdldF9wYXRoKSwgZXhpc3Rfb2s9VHJ1ZSkKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGNodW5rc2l6
ZT1jaHVua3NpemUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbmFtZXM9aGVhZGVyLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGVuY29kaW5nPWVuY29kaW5nLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHVzZWNvbHM9aW5jX2NvbHMsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZHR5cGU9ZHR5cGUpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgcHF3cml0ZXIgPSBwcS5QYXJxdWV0V3JpdGVyKGRlc3RfcGF0aCwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBpZiBkYXRhc2V0OgogICAgICAgICAgICAgICAgcHEud3JpdGVfdG9fZGF0YXNldCh0YWJsZSwgcm9vdF9wYXRoPXRhcmdldF9wYXRoLCBwYXJ0aXRpb25fY29scz1wYXJ0aXRpb25fY29scykKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRhYmxlKQogICAgICAgICAgICAKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkKICAgIAogICAgIyBsb2cgaGVhZGVyCiAgICBmaWxlcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgJ2hlYWRlci5wa2wnKQogICAgZHVtcChoZWFkZXIsIG9wZW4oZmlsZXBhdGgsICd3YicpKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ2hlYWRlcicsIHRhcmdldF9wYXRoPWZpbGVwYXRoKQo= + functionSourceCode: 
IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBudW1weSBhcyBucAppbXBvcnQgcGFuZGFzIGFzIHBkCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBwaWNrbGUgaW1wb3J0IGR1bXAsIGxvYWQKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIHR5cGluZyBpbXBvcnQgSU8sIEFueVN0ciwgVW5pb24sIExpc3QsIE9wdGlvbmFsCgoKZGVmIGFyY190b19wYXJxdWV0KAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICBhcmNoaXZlX3VybDogVW5pb25bc3RyLCBQYXRoLCBJT1tBbnlTdHJdXSwKICAgIGhlYWRlcjogT3B0aW9uYWxbTGlzdFtzdHJdXSA9IE5vbmUsCiAgICBpbmNfY29sczogT3B0aW9uYWxbTGlzdFtzdHJdXSA9IE5vbmUsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gIiIsCiAgICBuYW1lOiBzdHIgPSAiIiwKICAgIGNodW5rc2l6ZTogaW50ID0gMTBfMDAwLAogICAgZHR5cGU9Tm9uZSwKICAgIGVuY29kaW5nOiBzdHIgPSAnbGF0aW4tMScsCiAgICBrZXk6IHN0ciA9ICdkYXRhJywKICAgIGRh
dGFzZXQ6IHN0ciA9ICdkYXRhc2V0JywKICAgIHBhcnRpdGlvbl9jb2xzID0gW10sCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICBQYXJ0aXRpb25pbmcgcmVxdWlyZXMgcHJlY2lzZSBzcGVjaWZpY2F0aW9uIG9mIGNvbHVtbiB0eXBlcy4KICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gYXJjaGl2ZV91cmw6IGFueSB2YWxpZCBzdHJpbmcgcGF0aCBjb25zaXN0ZW50IHdpdGggdGhlIHBhdGggdmFyaWFibGUKICAgICAgICAgICAgICAgICAgICAgICAgb2YgcGFuZGFzLnJlYWRfY3N2LCBpbmNsdWRpbmcgc3RyaW5ncyBhcyBmaWxlIHBhdGhzLCBhcyB1cmxzLCAKICAgICAgICAgICAgICAgICAgICAgICAgcGF0aGxpYi5QYXRoIG9iamVjdHMsIGV0Yy4uLgogICAgOnBhcmFtIGhlYWRlcjogICAgICBjb2x1bW4gbmFtZXMKICAgIDpwYXJhbSBpbmNfY29sczogICAgaW5jbHVkZSBvbmx5IHRoZXNlIGNvbHVtbnMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogZGVzdGluYXRpb24gZm9sZGVyIG9mIHRhYmxlCiAgICA6cGFyYW0gbmFtZTogICAgICAgIG5hbWUgZmlsZSB0byBiZSBzYXZlZCBsb2NhbGx5LCBhbHNvCiAgICA6cGFyYW0gY2h1bmtzaXplOiAgICgwKSByb3cgc2l6ZSByZXRyaWV2ZWQgcGVyIGl0ZXJhdGlvbgogICAgOnBhcmFtIGR0eXBlICAgICAgICBkZXN0aW5hdGlvbiBkYXRhIHR5cGUgb2Ygc3BlY2lmaWVkIGNvbHVtbnMKICAgIDpwYXJhbSBlbmNvZGluZyAgICAgKCdsYXRpbi04JykgZmlsZSBlbmNvZGluZwogICAgOnBhcmFtIGtleTogICAgICAgICBrZXkgaW4gYXJ0aWZhY3Qgc3RvcmUgKHdoZW4gbG9nX2RhdGE9VHJ1ZSkKICAgIDpwYXJhbSBkYXRhc2V0OiAgICAgKE5vbmUpIGlmIG5vdCBOb25lIHRoZW4gJ3RhcmdldF9wYXRoL2RhdGFzZXQnCiAgICAgICAgICAgICAgICAgICAgICAgIGlzIGZvbGRlciBmb3IgcGFydGl0aW9uZWQgZmlsZXMKICAgIDpwYXJhbSBwYXJ0X2NvbHM6ICAgKFtdKSBsaXN0IG9mIHBhcnRpdGlvbmluZyBjb2x1bW5zCiAgICAKICAgICIiIgogICAgaWYgbm90IG5hbWUuZW5kc3dpdGgoIi5wcXQiKToKICAgICAgICBuYW1lICs9ICIucHF0IgogICAgCiAgICBpZiBkYXRhc2V0OgogICAgICAgIG9zLm1ha2VkaXJzKG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgZGF0YXNldCksIGV4aXN0X29rPVRydWUpCiAgICAgICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBkYXRhc2V0KQogICAgZWxzZToKICAgICAgICBvcy5tYWtlZGlycyhvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgpLCBleGlzdF9vaz1UcnVlKQogICAgICAgIGRlc3RfcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSkKICAgICAgICAKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmlu
Zm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGNodW5rc2l6ZT1jaHVua3NpemUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbmFtZXM9aGVhZGVyLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGVuY29kaW5nPWVuY29kaW5nLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHVzZWNvbHM9aW5jX2NvbHMsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZHR5cGU9ZHR5cGUpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgIyB3cml0ZSB0aGUgaGVhZGVyIHRvIHRhcmdldF9wYXRoLi4uCiAgICAgICAgICAgICAgICBwcXdyaXRlciA9IHBxLlBhcnF1ZXRXcml0ZXIob3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCdoZWFkZXItb25seS5wcXQnKSwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBpZiBkYXRhc2V0OgogICAgICAgICAgICAgICAgICMgLi4uYW5kIGZpbGVzIHRvIHN1YmZvbGRlciBkYXRhc2V0CiAgICAgICAgICAgICAgICBwcS53cml0ZV90b19kYXRhc2V0KHRhYmxlLCByb290X3BhdGg9ZGVzdF9wYXRoLCBwYXJ0aXRpb25fY29scz1wYXJ0aXRpb25fY29scykKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgICMgLi4uYW5kIGZpbGUgdG8gYSBwYXJxdWV0IGZpbGUKICAgICAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRhYmxlKQogICAgICAgICAgICAKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKICAgIAogICAgIyBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkK base_image: yjbds/mlrun-files:latest commands: [] - code_origin: https://github.com/yjb-ds/functions.git#5a80835a812cf0cd8e106721aac3dde4989cbead:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py + code_origin: https://github.com/yjb-ds/functions.git#dba3bc120edc5711a9ee1ceaff9e557ced4d0aa1:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py diff --git 
a/fileutils/parq_to_dask/__pycache__/function.cpython-36.pyc b/fileutils/parq_to_dask/__pycache__/function.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..99a9d32655e583a8131645cba76d2abc6c54400a GIT binary patch literal 1251 zcmZuxOK;pZ5T>|#t@dHZvE%OAPB%RiP!8ThfTAecqJi5(0M`kS1Z{bUAd9oH))b|X zRBN}JlY8wy=pX1`;7&Os zd?+uU8fAm$vaF@`k4tTA{J1g5<>HNxZ+VtRHY#dq;KJBJldD%Pn5q88dF zrQ~q2rbB&}E3R$SfYQ!gsKB&J7NLt7SB%>^KEkcPKrm#54ZNaiYlZE59m=txrnmAr z{lmLL-uTO4<)~;CsLxi>GQ1+MBi_S8_`iO2T<3gt9oB473YgVu#xYF|W>!j5Bcsn} z#kwTP4A(ob11HRiu6U`(`ZD{aR_7S`iA(z|*SRrDZDW+G=fDMSDd%>mAm`cjd({?R zr?DD2m9c{j2b&Juq;~So3)(LJB5Hz{|9(Dyu0hQeG_{^DT3NWNBZWD%rWAVayl1A) zT#RoO4x3B6anputVT$hb5P^&mFQ#LfP~TH~$fy3viE@%>=_j_QwXoByz|ak3t^65O zxhR2SY`#*hVE( zD-YoU3anu*Re%;Zx7}2jT7h!2XhU<^l=7v*e+yA=DplV>;$VP)^} None: + """Load parquet file or dataset into dask cluster + + + """ + # Setup Dask + if hasattr(context, 'dask_client'): + dask_client = context.dask_client + else: + dask_client = Client(LocalCluster(n_workers=shards)) + + df = dd.read_parquet(parquet_url) + + if persist: + df = df.persist() diff --git a/fileutils/parq_to_dask/function.yaml b/fileutils/parq_to_dask/function.yaml new file mode 100644 index 000000000..5dfbb8336 --- /dev/null +++ b/fileutils/parq_to_dask/function.yaml @@ -0,0 +1,23 @@ +kind: dask +metadata: + name: function + hash: 722d874e8d00ff106a34b876c9700fc4f0bc2994 + project: default +spec: + command: /User/repos/functions/fileutils/parq_to_dask/function.py + args: [] + image: '' + volumes: [] + volume_mounts: [] + env: [] + build: + base_image: yjbds/mlrun-ds:latest + commands: [] + description: '' + replicas: 4 + image_pull_policy: Always + remote: true + service_type: NodePort + nthreads: 1 + min_replicas: 0 + max_replicas: 4 diff --git a/tests/arc_to_parquet-airlines.ipynb b/tests/arc_to_parquet-airlines.ipynb index cb463cde1..58414b1ae 100644 --- a/tests/arc_to_parquet-airlines.ipynb +++ b/tests/arc_to_parquet-airlines.ipynb @@ -11,7 +11,7 
@@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -36,7 +36,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## parameters\n" + "## parameters\n", + "from **[h20ai](https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O)**:" ] }, { @@ -45,37 +46,62 @@ "metadata": {}, "outputs": [], "source": [ - "BASE_IMAGE = 'yjbds/mlrun-files:latest'\n", - "\n", - "CODE_BASE = '/User/repos/functions/' # 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/'\n", - "PROJECT = 'fileutils/arc_to_parquet'\n", - "\n", - "TARGET_PATH = '/User/mlrun/airlines/dataset'\n", - "\n", "ARCHIVE_BIG = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears_10.csv\"\n", "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", "ARCHIVE_SMALL = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv\"" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 26, "metadata": {}, + "outputs": [], "source": [ - "**For testing and development use ARCHIVE_SMALL:**" + "USE_ARCHIVE = ARCHIVE\n", + "TARGET_PATH = '/User/mlrun/airlines/dataset'\n", + "\n", + "PARTITIONS_DEST = 'partitions'\n", + "PARTITION_COLS = ['Year', 'Month']" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 27, "metadata": {}, "outputs": [], "source": [ - "USE_ARCHIVE = ARCHIVE" + "os.makedirs(os.path.join(TARGET_PATH, PARTITIONS_DEST), exist_ok=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "BASE_IMAGE = 'yjbds/mlrun-files:latest'" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "CODE_BASE = '/User/repos/functions/' # 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/'\n", + "FUNCTION = 'fileutils/arc_to_parquet'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**For 
testing and development use ARCHIVE_SMALL:**" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 30, "metadata": {}, "outputs": [], "source": [ @@ -109,19 +135,16 @@ " 'WeatherDelay': 'float32', \n", " 'NASDelay':'float32', \n", " 'SecurityDelay':'float32', \n", - " 'LateAircraftDelay':'float32'}\n", - "\n", - "USE_PARTITIONS = True\n", - "PARTITION_COLS = ['Year', 'Month']" + " 'LateAircraftDelay':'float32'}" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 31, "metadata": {}, "outputs": [], "source": [ - "os.makedirs(TARGET_PATH, exist_ok=True)" + "LABEL_COLUMN = \"IsArrDelayed\"" ] }, { @@ -135,24 +158,24 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 19:37:40,654 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + "[mlrun] 2020-01-27 23:25:02,696 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" ] } ], "source": [ "# load function from a local Python file\n", "arctoparq = mlrun.code_to_function(\n", - " filename=os.path.join(CODE_BASE, PROJECT, 'arc_to_parquet.py'), \n", + " filename=os.path.join(CODE_BASE, FUNCTION, 'arc_to_parquet.py'), \n", " kind='job')\n", "arctoparq.build_config(base_image=BASE_IMAGE, commands=[])\n", - "yaml_name = os.path.join(CODE_BASE, PROJECT, 'arc_to_parquet.yaml')\n", + "yaml_name = os.path.join(CODE_BASE, FUNCTION, 'arc_to_parquet.yaml')\n", "arctoparq.export(yaml_name)" ] }, @@ -165,12 +188,12 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "arctoparq = mlrun.import_function(\n", - " os.path.join(CODE_BASE, PROJECT, 'arc_to_parquet.yaml')\n", + " os.path.join(CODE_BASE, FUNCTION, 'arc_to_parquet.yaml')\n", ").apply(mlrun.mount_v3io())" ] }, @@ -190,7 +213,7 @@ }, { "cell_type": 
"code", - "execution_count": 9, + "execution_count": 34, "metadata": {}, "outputs": [ { @@ -199,7 +222,7 @@ "'ready'" ] }, - "execution_count": 9, + "execution_count": 34, "metadata": {}, "output_type": "execute_result" } @@ -210,21 +233,19 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 19:37:40,720 starting run arc2parq uid=647251d1ef46416bb2a1dc9a76310e54 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-27 19:37:40,821 Job is running in the background, pod: arc2parq-mgrvp\n", - "[mlrun] 2020-01-27 19:37:45,590 destination file does not exist, downloading\n", - "[mlrun] 2020-01-27 19:50:05,061 saved table to /User/mlrun/airlines/dataset/airlines.pqt\n", - "[mlrun] 2020-01-27 19:50:05,076 log artifact airlines at /User/mlrun/airlines/dataset/airlines.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-27 19:50:05,095 log artifact header at /User/mlrun/airlines/dataset/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-27 23:25:09,450 starting run arc2parq uid=c8f9525e5258489ea1211312348b21e1 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-27 23:25:09,545 Job is running in the background, pod: arc2parq-lw6ww\n", + "[mlrun] 2020-01-27 23:25:14,326 destination file does not exist, downloading\n", + "[mlrun] 2020-01-27 23:36:53,211 saved table to /User/mlrun/airlines/dataset/partitions\n", "\n", - "[mlrun] 2020-01-27 19:50:05,114 run executed, status=completed\n", + "[mlrun] 2020-01-27 23:36:53,223 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -397,26 +418,26 @@ " \n", " \n", " \n", - "
...310e54
\n", + "
...8b21e1
\n", " 0\n", - " Jan 27 19:37:45\n", + " Jan 27 23:25:14\n", " completed\n", " arc-to-parquet\n", - "
host=arc2parq-mgrvp
kind=job
owner=admin
\n", + "
host=arc2parq-lw6ww
kind=job
owner=admin
\n", + " \n", + "
archive_url=https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv
dataset=partitions
dtype={'AirTime': 'float32', 'ArrTime': 'float32', 'CRSElapsedTime': 'float32', 'CarrierDelay': 'float32', 'DepTime': 'float32', 'Distance': 'float32', 'LateAircraftDelay': 'float32', 'NASDelay': 'float32', 'SecurityDelay': 'float32', 'TailNum': 'str', 'TaxiIn': 'float32', 'TaxiOut': 'float32', 'WeatherDelay': 'float32'}
encoding=latin-1
inc_cols=['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'CRSElapsedTime', 'AirTime', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
key=airlines
name=airlines.pqt
part_cols=['Year', 'Month']
target_path=/User/mlrun/airlines/dataset
\n", " \n", - "
archive_url=https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv
dataset=True
dtype={'AirTime': 'float32', 'ArrTime': 'float32', 'CRSElapsedTime': 'float32', 'CarrierDelay': 'float32', 'DepTime': 'float32', 'Distance': 'float32', 'LateAircraftDelay': 'float32', 'NASDelay': 'float32', 'SecurityDelay': 'float32', 'TailNum': 'str', 'TaxiOut': 'float32', 'WeatherDelay': 'float32'}
encoding=latin-1
inc_cols=['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'CRSElapsedTime', 'AirTime', 'Origin', 'Dest', 'Distance', 'TaxiOut', 'Cancelled', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
key=airlines
name=airlines.pqt
part_cols=['Year', 'Month']
target_path=/User/mlrun/airlines/dataset
\n", " \n", - "
airlines
header
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -432,15 +453,12 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 647251d1ef46416bb2a1dc9a76310e54 , !mlrun logs 647251d1ef46416bb2a1dc9a76310e54 \n", - "[mlrun] 2020-01-27 19:50:13,248 run executed, status=completed\n", - "CPU times: user 400 ms, sys: 45.2 ms, total: 445 ms\n", - "Wall time: 12min 32s\n" + "!mlrun get run c8f9525e5258489ea1211312348b21e1 , !mlrun logs c8f9525e5258489ea1211312348b21e1 \n", + "[mlrun] 2020-01-27 23:37:01,852 run executed, status=completed\n" ] } ], "source": [ - "%%time\n", "# create and run the task\n", "arc_to_parq_task = mlrun.NewTask(\n", " 'arc2parq', \n", @@ -450,7 +468,7 @@ " 'name' : FILE_NAME, \n", " 'key' : KEY,\n", " 'archive_url': USE_ARCHIVE,\n", - " 'dataset' : USE_PARTITIONS,\n", + " 'dataset' : PARTITIONS_DEST,\n", " 'part_cols' : PARTITION_COLS,\n", " 'encoding' : ENCODING,\n", " 'inc_cols' : INC_COLS,\n", @@ -482,34 +500,26 @@ }, { "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = pq.ParquetDataset(TARGET_PATH)" - ] - }, - { - "cell_type": "code", - "execution_count": 15, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ + "dataset = pq.ParquetDataset(os.path.join(TARGET_PATH, PARTITIONS_DEST))\n", "df = dataset.read().to_pandas()" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ - "df.set_index(['Year', 'Month'], inplace=True)" + "df.set_index(PARTITION_COLS, inplace=True)" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 23, "metadata": {}, "outputs": [ { @@ -544,9 +554,10 @@ " FlightNum\n", " CRSElapsedTime\n", " AirTime\n", - " Origin\n", + " ...\n", " Dest\n", " Distance\n", + " TaxiIn\n", " TaxiOut\n", " Cancelled\n", " CarrierDelay\n", @@ -578,164 +589,173 @@ " \n", " \n", " \n", + " \n", " \n", " \n", " \n", " \n", - " 2007\n", - " 9\n", - " 21\n", - " 5\n", - " 
1951.0\n", - " 1815\n", - " 2058.0\n", - " 1901\n", - " EV\n", - " 4318\n", - " 46.0\n", - " 25.0\n", - " CSG\n", - " ATL\n", - " 83.0\n", - " 5.0\n", + " 1992\n", + " 1\n", + " 7\n", + " 2\n", + " 640.0\n", + " 640\n", + " 851.0\n", + " 853\n", + " US\n", + " 53\n", + " 133.0\n", + " NaN\n", + " ...\n", + " IND\n", + " 644.0\n", + " NaN\n", + " NaN\n", " 0\n", - " 0.0\n", - " 0.0\n", - " 21.0\n", - " 0.0\n", - " 96.0\n", + " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", " \n", " \n", - " 9\n", - " 23\n", - " 7\n", - " 1826.0\n", - " 1815\n", - " 1915.0\n", - " 1901\n", - " EV\n", - " 4318\n", - " 46.0\n", - " 22.0\n", - " CSG\n", - " ATL\n", - " 83.0\n", - " 8.0\n", + " 1\n", + " 8\n", + " 3\n", + " 639.0\n", + " 640\n", + " 837.0\n", + " 853\n", + " US\n", + " 53\n", + " 133.0\n", + " NaN\n", + " ...\n", + " IND\n", + " 644.0\n", + " NaN\n", + " NaN\n", " 0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", + " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", " \n", " \n", - " 9\n", - " 24\n", - " 1\n", - " 1827.0\n", - " 1815\n", - " 1906.0\n", - " 1901\n", - " EV\n", - " 4318\n", - " 46.0\n", - " 19.0\n", - " CSG\n", - " ATL\n", - " 83.0\n", - " 7.0\n", + " 1\n", + " 9\n", + " 4\n", + " 644.0\n", + " 640\n", + " 905.0\n", + " 853\n", + " US\n", + " 53\n", + " 133.0\n", + " NaN\n", + " ...\n", + " IND\n", + " 644.0\n", + " NaN\n", + " NaN\n", " 0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", + " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", " \n", " \n", - " 9\n", - " 25\n", - " 2\n", - " 1840.0\n", - " 1815\n", - " 1915.0\n", - " 1901\n", - " EV\n", - " 4318\n", - " 46.0\n", - " 22.0\n", - " CSG\n", - " ATL\n", - " 83.0\n", - " 3.0\n", + " 1\n", + " 11\n", + " 6\n", + " 640.0\n", + " 640\n", + " 834.0\n", + " 853\n", + " US\n", + " 53\n", + " 133.0\n", + " NaN\n", + " ...\n", + " IND\n", + " 644.0\n", + " NaN\n", + " NaN\n", " 0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", + " NaN\n", 
+ " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", " \n", " \n", - " 9\n", - " 26\n", - " 3\n", - " 1815.0\n", - " 1815\n", - " 1847.0\n", - " 1901\n", - " EV\n", - " 4318\n", - " 46.0\n", - " 17.0\n", - " CSG\n", - " ATL\n", - " 83.0\n", - " 5.0\n", + " 1\n", + " 12\n", + " 7\n", + " 639.0\n", + " 640\n", + " 832.0\n", + " 853\n", + " US\n", + " 53\n", + " 133.0\n", + " NaN\n", + " ...\n", + " IND\n", + " 644.0\n", + " NaN\n", + " NaN\n", " 0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", - " 0.0\n", + " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", + " NaN\n", " \n", " \n", "\n", + "

5 rows × 21 columns

\n", "" ], "text/plain": [ " DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime \\\n", "Year Month \n", - "2007 9 21 5 1951.0 1815 2058.0 1901 \n", - " 9 23 7 1826.0 1815 1915.0 1901 \n", - " 9 24 1 1827.0 1815 1906.0 1901 \n", - " 9 25 2 1840.0 1815 1915.0 1901 \n", - " 9 26 3 1815.0 1815 1847.0 1901 \n", + "1992 1 7 2 640.0 640 851.0 853 \n", + " 1 8 3 639.0 640 837.0 853 \n", + " 1 9 4 644.0 640 905.0 853 \n", + " 1 11 6 640.0 640 834.0 853 \n", + " 1 12 7 639.0 640 832.0 853 \n", "\n", - " UniqueCarrier FlightNum CRSElapsedTime AirTime Origin Dest \\\n", - "Year Month \n", - "2007 9 EV 4318 46.0 25.0 CSG ATL \n", - " 9 EV 4318 46.0 22.0 CSG ATL \n", - " 9 EV 4318 46.0 19.0 CSG ATL \n", - " 9 EV 4318 46.0 22.0 CSG ATL \n", - " 9 EV 4318 46.0 17.0 CSG ATL \n", + " UniqueCarrier FlightNum CRSElapsedTime AirTime ... Dest \\\n", + "Year Month ... \n", + "1992 1 US 53 133.0 NaN ... IND \n", + " 1 US 53 133.0 NaN ... IND \n", + " 1 US 53 133.0 NaN ... IND \n", + " 1 US 53 133.0 NaN ... IND \n", + " 1 US 53 133.0 NaN ... 
IND \n", "\n", - " Distance TaxiOut Cancelled CarrierDelay WeatherDelay \\\n", - "Year Month \n", - "2007 9 83.0 5.0 0 0.0 0.0 \n", - " 9 83.0 8.0 0 0.0 0.0 \n", - " 9 83.0 7.0 0 0.0 0.0 \n", - " 9 83.0 3.0 0 0.0 0.0 \n", - " 9 83.0 5.0 0 0.0 0.0 \n", + " Distance TaxiIn TaxiOut Cancelled CarrierDelay WeatherDelay \\\n", + "Year Month \n", + "1992 1 644.0 NaN NaN 0 NaN NaN \n", + " 1 644.0 NaN NaN 0 NaN NaN \n", + " 1 644.0 NaN NaN 0 NaN NaN \n", + " 1 644.0 NaN NaN 0 NaN NaN \n", + " 1 644.0 NaN NaN 0 NaN NaN \n", "\n", " NASDelay SecurityDelay LateAircraftDelay \n", "Year Month \n", - "2007 9 21.0 0.0 96.0 \n", - " 9 0.0 0.0 0.0 \n", - " 9 0.0 0.0 0.0 \n", - " 9 0.0 0.0 0.0 \n", - " 9 0.0 0.0 0.0 " + "1992 1 NaN NaN NaN \n", + " 1 NaN NaN NaN \n", + " 1 NaN NaN NaN \n", + " 1 NaN NaN NaN \n", + " 1 NaN NaN NaN \n", + "\n", + "[5 rows x 21 columns]" ] }, - "execution_count": 17, + "execution_count": 23, "metadata": {}, "output_type": "execute_result" } @@ -746,12 +766,14 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "if USE_ARCHIVE == ARCHIVE:\n", - " assert df.shape==(123_534_969, 20)" + " assert df.shape==(123_534_969, 21)\n", + "if USE_ARCHIVE == ARCHIVE_SMALL:\n", + " assert df.shape==(43_978, 21)" ] }, { @@ -763,7 +785,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -796,8 +818,25 @@ "metadata": {}, "outputs": [], "source": [ - "copied = pd.read_parquet(TARGET_PATH+'/'+ FILE_NAME, engine=\"pyarrow\")" + "copied = pd.read_parquet(TARGET_PATH+'/'+ FILE_NAME, engine=\"pyarrow\")\n", + "copied.set_index(PARTITION_COLS, inplace=True)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "copied.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { diff --git 
a/tests/parq_to_dask.ipynb b/tests/parq_to_dask.ipynb new file mode 100644 index 000000000..ad69b6e2b --- /dev/null +++ b/tests/parq_to_dask.ipynb @@ -0,0 +1,303 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# archive to parquet\n", + "\n", + "Convert a remote archive or csv file (or local file://), to parquet format" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "import os\n", + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import pyarrow as pa\n", + "import pyarrow.parquet as pq" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## parameters\n", + "from **[h20ai](https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O)**:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "ARCHIVE_BIG = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears_10.csv\"\n", + "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", + "ARCHIVE_SMALL = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv\"" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "USE_ARCHIVE = ARCHIVE_SMALL\n", + "SRC_PATH = '/User/mlrun/airlines/dataset-small/partitions/*.parquet'\n", + "\n", + "PARTITIONS_DEST = 'partitions'\n", + "PARTITION_COLS = ['Year', 'Month']" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "BASE_IMAGE = 'yjbds/mlrun-ds:latest'" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "CODE_BASE = '/User/repos/functions/' # 
'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/'\n", + "FUNCTION = 'fileutils/parq_to_dask'\n", + "JOB_KIND = 'dask'" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "LABEL_COLUMN = \"IsArrDelayed\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## load and configure function" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "yaml_name = os.path.join(CODE_BASE, FUNCTION, 'function.yaml')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**If run the first time, create the function:**" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-28 00:08:03,363 function spec saved to path: /User/repos/functions/fileutils/parq_to_dask/function.yaml\n" + ] + } + ], + "source": [ + "# load function from a local Python file\n", + "parq2dask = mlrun.new_function(\n", + " command=os.path.join(CODE_BASE, FUNCTION, 'function.py'), \n", + " kind=JOB_KIND)\n", + "\n", + "parq2dask.spec.remote = True\n", + "parq2dask.spec.replicas = 4 \n", + "parq2dask.spec.max_replicas = 4\n", + "parq2dask.spec.service_type = 'NodePort'\n", + "parq2dask.spec.image_pull_policy = 'Always'\n", + "parq2dask.build_config(base_image=BASE_IMAGE, commands=[])\n", + "\n", + "parq2dask.export(yaml_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**otherwise load it:**" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# parq2dask = mlrun.import_function(yaml_name).apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## deploy / build" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following triggers a build when run for 
the first time using specs found in the yaml file above." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ready'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "parq2dask.deploy(skip_deployed=True, with_mlrun=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-28 00:08:03,437 starting run parq-to-dask uid=687a2c492be3405abcfa85d5430fd42a -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-28 00:08:04,283 saving function: function, tag: latest\n" + ] + }, + { + "ename": "RunDBError", + "evalue": "POST http://mlrun-api:8080/api/start/function, error: HTTPConnectionPool(host='mlrun-api', port=8080): Read timed out. (read timeout=20)", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mtimeout\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36m_make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 383\u001b[0m \u001b[0;31m# otherwise it looks like a programming error was the cause.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 384\u001b[0;31m \u001b[0msix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraise_from\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 385\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mSocketTimeout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mBaseSSLError\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mSocketError\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/packages/six.py\u001b[0m in \u001b[0;36mraise_from\u001b[0;34m(value, from_value)\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36m_make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 379\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 380\u001b[0;31m \u001b[0mhttplib_response\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetresponse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 381\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/http/client.py\u001b[0m in \u001b[0;36mgetresponse\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1330\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1331\u001b[0;31m \u001b[0mresponse\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbegin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1332\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mConnectionError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/http/client.py\u001b[0m in \u001b[0;36mbegin\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 296\u001b[0m \u001b[0;32mwhile\u001b[0m 
\u001b[0;32mTrue\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 297\u001b[0;31m \u001b[0mversion\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstatus\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreason\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_read_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 298\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mstatus\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mCONTINUE\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/http/client.py\u001b[0m in \u001b[0;36m_read_status\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 257\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_read_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 258\u001b[0;31m \u001b[0mline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreadline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_MAXLINE\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"iso-8859-1\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 259\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mline\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0m_MAXLINE\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/socket.py\u001b[0m in \u001b[0;36mreadinto\u001b[0;34m(self, b)\u001b[0m\n\u001b[1;32m 585\u001b[0m 
\u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 586\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sock\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecv_into\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 587\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mtimeout\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mtimeout\u001b[0m: timed out", + "\nDuring handling of the above exception, another exception occurred:\n", + "\u001b[0;31mReadTimeoutError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/adapters.py\u001b[0m in \u001b[0;36msend\u001b[0;34m(self, request, stream, timeout, verify, cert, proxies)\u001b[0m\n\u001b[1;32m 448\u001b[0m \u001b[0mretries\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmax_retries\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 449\u001b[0;31m \u001b[0mtimeout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtimeout\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 450\u001b[0m )\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36murlopen\u001b[0;34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)\u001b[0m\n\u001b[1;32m 637\u001b[0m retries = retries.increment(method, url, error=e, _pool=self,\n\u001b[0;32m--> 638\u001b[0;31m _stacktrace=sys.exc_info()[2])\n\u001b[0m\u001b[1;32m 639\u001b[0m \u001b[0mretries\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msleep\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
+ "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/util/retry.py\u001b[0m in \u001b[0;36mincrement\u001b[0;34m(self, method, url, response, error, _pool, _stacktrace)\u001b[0m\n\u001b[1;32m 367\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mread\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mFalse\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_method_retryable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 368\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0msix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreraise\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merror\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacktrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 369\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mread\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/packages/six.py\u001b[0m in \u001b[0;36mreraise\u001b[0;34m(tp, value, tb)\u001b[0m\n\u001b[1;32m 685\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 686\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 687\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in 
\u001b[0;36murlopen\u001b[0;34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)\u001b[0m\n\u001b[1;32m 599\u001b[0m \u001b[0mbody\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mbody\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mheaders\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mheaders\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 600\u001b[0;31m chunked=chunked)\n\u001b[0m\u001b[1;32m 601\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36m_make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 385\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mSocketTimeout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mBaseSSLError\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mSocketError\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 386\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_raise_timeout\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtimeout_value\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mread_timeout\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 387\u001b[0m \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36m_raise_timeout\u001b[0;34m(self, err, url, timeout_value)\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mSocketTimeout\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 306\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mReadTimeoutError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"Read timed out. (read timeout=%s)\"\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mtimeout_value\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 307\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mReadTimeoutError\u001b[0m: HTTPConnectionPool(host='mlrun-api', port=8080): Read timed out. (read timeout=20)", + "\nDuring handling of the above exception, another exception occurred:\n", + "\u001b[0;31mReadTimeout\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/db/httpdb.py\u001b[0m in \u001b[0;36mapi_call\u001b[0;34m(self, method, path, error, params, body, json, timeout)\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 70\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrequests\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtimeout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtimeout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 71\u001b[0m \u001b[0mresp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraise_for_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + 
"\u001b[0;32m/conda/lib/python3.6/site-packages/requests/api.py\u001b[0m in \u001b[0;36mrequest\u001b[0;34m(method, url, **kwargs)\u001b[0m\n\u001b[1;32m 59\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0msessions\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSession\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0msession\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 60\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0msession\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 61\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36mrequest\u001b[0;34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001b[0m\n\u001b[1;32m 532\u001b[0m \u001b[0msend_kwargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msettings\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 533\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mprep\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0msend_kwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 534\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36msend\u001b[0;34m(self, request, 
**kwargs)\u001b[0m\n\u001b[1;32m 667\u001b[0m \u001b[0;31m# Resolve redirects if allowed.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 668\u001b[0;31m \u001b[0mhistory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mresp\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mgen\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mallow_redirects\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 669\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 667\u001b[0m \u001b[0;31m# Resolve redirects if allowed.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 668\u001b[0;31m \u001b[0mhistory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mresp\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mgen\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mallow_redirects\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 669\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36mresolve_redirects\u001b[0;34m(self, resp, req, stream, timeout, verify, cert, proxies, yield_requests, **adapter_kwargs)\u001b[0m\n\u001b[1;32m 246\u001b[0m \u001b[0mallow_redirects\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 247\u001b[0;31m 
\u001b[0;34m**\u001b[0m\u001b[0madapter_kwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 248\u001b[0m )\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36msend\u001b[0;34m(self, request, **kwargs)\u001b[0m\n\u001b[1;32m 645\u001b[0m \u001b[0;31m# Send the request\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 646\u001b[0;31m \u001b[0mr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0madapter\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrequest\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 647\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/adapters.py\u001b[0m in \u001b[0;36msend\u001b[0;34m(self, request, stream, timeout, verify, cert, proxies)\u001b[0m\n\u001b[1;32m 528\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mReadTimeoutError\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 529\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mReadTimeout\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrequest\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrequest\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 530\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mReadTimeout\u001b[0m: HTTPConnectionPool(host='mlrun-api', port=8080): Read timed out. 
(read timeout=20)", + "\nThe above exception was the direct cause of the following exception:\n", + "\u001b[0;31mRunDBError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 9\u001b[0m 'persist' : True})\n\u001b[1;32m 10\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mrun\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mparq2dask\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparq_to_dask_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 292\u001b[0m \u001b[0;31m# single run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 293\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 294\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexecution\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 295\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mwatch\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkind\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m''\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'handler'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'local'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 296\u001b[0m 
\u001b[0mstate\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlogs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_db\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/daskjob.py\u001b[0m in \u001b[0;36m_run\u001b[0;34m(self, runobj, execution)\u001b[0m\n\u001b[1;32m 246\u001b[0m \u001b[0mautocommit\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 247\u001b[0m host=socket.gethostname())\n\u001b[0;32m--> 248\u001b[0;31m \u001b[0mclient\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclient\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 249\u001b[0m \u001b[0msetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcontext\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'dask_client'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mclient\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 250\u001b[0m \u001b[0msout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mserr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mexec_from_params\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhandler\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/daskjob.py\u001b[0m in \u001b[0;36mclient\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 193\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mremote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscheduler_address\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 194\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_load_db_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 195\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_start\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 196\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 197\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscheduler_address\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/daskjob.py\u001b[0m in \u001b[0;36m_start\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 142\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 143\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mversioned\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 144\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mremote_start\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_function_uri\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 145\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;34m'status'\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mresp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 146\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresp\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'status'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/db/httpdb.py\u001b[0m in \u001b[0;36mremote_start\u001b[0;34m(self, func_url)\u001b[0m\n\u001b[1;32m 306\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 307\u001b[0m \u001b[0mreq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m'functionUrl'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mfunc_url\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 308\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapi_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'POST'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'start/function'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mjson\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreq\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 309\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mOSError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 310\u001b[0m \u001b[0mlogger\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'error starting function: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/db/httpdb.py\u001b[0m in 
\u001b[0;36mapi_call\u001b[0;34m(self, method, path, error, params, body, json, timeout)\u001b[0m\n\u001b[1;32m 73\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mrequests\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mRequestException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 74\u001b[0m \u001b[0merror\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0merror\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;34m'{} {}, error: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 75\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunDBError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 76\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 77\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_path_of\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprefix\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mproject\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mRunDBError\u001b[0m: POST http://mlrun-api:8080/api/start/function, error: HTTPConnectionPool(host='mlrun-api', port=8080): Read timed out. 
(read timeout=20)" + ] + } + ], + "source": [ + "# create and run the task\n", + "parq_to_dask_task = mlrun.NewTask(\n", + " 'parq-to-dask', \n", + " handler='parquet_to_dask', \n", + " params={\n", + " 'parquet_url': SRC_PATH,\n", + " 'index_cols' : PARTITION_COLS,\n", + " 'shards' : 4,\n", + " 'persist' : True})\n", + "# run\n", + "run = parq2dask.run(parq_to_dask_task)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From c16b08976c2916ebdc18e6aa9ef8aa45bbd738f9 Mon Sep 17 00:00:00 2001 From: yasha Date: Tue, 28 Jan 2020 00:11:24 +0000 Subject: [PATCH 28/32] added fileutils/parquet-to-dask function --- .gitignore | 1 + .../__pycache__/function.cpython-36.pyc | Bin 1251 -> 0 bytes 2 files changed, 1 insertion(+) delete mode 100644 fileutils/parq_to_dask/__pycache__/function.cpython-36.pyc diff --git a/.gitignore b/.gitignore index f09384efb..4dc138fa2 100644 --- a/.gitignore +++ b/.gitignore @@ -2,4 +2,5 @@ models/ .ipynb_checkpoints *.gz *.csv +*.pyc *.swp diff --git a/fileutils/parq_to_dask/__pycache__/function.cpython-36.pyc b/fileutils/parq_to_dask/__pycache__/function.cpython-36.pyc deleted file mode 100644 index 99a9d32655e583a8131645cba76d2abc6c54400a..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 1251 zcmZuxOK;pZ5T>|#t@dHZvE%OAPB%RiP!8ThfTAecqJi5(0M`kS1Z{bUAd9oH))b|X zRBN}JlY8wy=pX1`;7&Os zd?+uU8fAm$vaF@`k4tTA{J1g5<>HNxZ+VtRHY#dq;KJBJldD%Pn5q88dF zrQ~q2rbB&}E3R$SfYQ!gsKB&J7NLt7SB%>^KEkcPKrm#54ZNaiYlZE59m=txrnmAr 
z{lmLL-uTO4<)~;CsLxi>GQ1+MBi_S8_`iO2T<3gt9oB473YgVu#xYF|W>!j5Bcsn} z#kwTP4A(ob11HRiu6U`(`ZD{aR_7S`iA(z|*SRrDZDW+G=fDMSDd%>mAm`cjd({?R zr?DD2m9c{j2b&Juq;~So3)(LJB5Hz{|9(Dyu0hQeG_{^DT3NWNBZWD%rWAVayl1A) zT#RoO4x3B6anputVT$hb5P^&mFQ#LfP~TH~$fy3viE@%>=_j_QwXoByz|ak3t^65O zxhR2SY`#*hVE( zD-YoU3anu*Re%;Zx7}2jT7h!2XhU<^l=7v*e+yA=DplV>;$VP)^} Date: Wed, 29 Jan 2020 01:51:49 +0000 Subject: [PATCH 29/32] update mlrunapi, parq-to-dask dask job running --- .gitignore | 1 + fileutils/arc_to_parquet/arc_to_parquet.yaml | 18 - .../{arc_to_parquet.py => function.py} | 22 +- fileutils/arc_to_parquet/function.yaml | 16 + .../function.py | 34 +- .../function.yaml | 7 +- tests/arc_to_parquet-airlines.ipynb | 428 +++----- tests/arc_to_parquet.ipynb | 174 +++- tests/describe.py | 45 + tests/describe.yaml | 22 + tests/parq_to_dask.ipynb | 303 ------ tests/parquet_to_dask.ipynb | 952 ++++++++++++++++++ 12 files changed, 1349 insertions(+), 673 deletions(-) delete mode 100644 fileutils/arc_to_parquet/arc_to_parquet.yaml rename fileutils/arc_to_parquet/{arc_to_parquet.py => function.py} (84%) create mode 100644 fileutils/arc_to_parquet/function.yaml rename fileutils/{parq_to_dask => parquet_to_dask}/function.py (65%) rename fileutils/{parq_to_dask => parquet_to_dask}/function.yaml (60%) create mode 100644 tests/describe.py create mode 100644 tests/describe.yaml delete mode 100644 tests/parq_to_dask.ipynb create mode 100644 tests/parquet_to_dask.ipynb diff --git a/.gitignore b/.gitignore index 4dc138fa2..eb4858686 100644 --- a/.gitignore +++ b/.gitignore @@ -4,3 +4,4 @@ models/ *.csv *.pyc *.swp +dask-worker-space diff --git a/fileutils/arc_to_parquet/arc_to_parquet.yaml b/fileutils/arc_to_parquet/arc_to_parquet.yaml deleted file mode 100644 index ec6bb4f65..000000000 --- a/fileutils/arc_to_parquet/arc_to_parquet.yaml +++ /dev/null @@ -1,18 +0,0 @@ -kind: job -metadata: - name: arc-to-parquet - tag: '' - hash: 54c52b0bc70a6b44d8df0126500c6527ee749b02 - project: '' -spec: - command: '' - 
args: [] - volumes: [] - volume_mounts: [] - env: [] - description: '' - build: - functionSourceCode: IyBDb3B5cmlnaHQgMjAxOCBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgoKaW1wb3J0IHNzbAoKdHJ5OgogICAgX2NyZWF0ZV91bnZlcmlmaWVkX2h0dHBzX2NvbnRleHQgPSBzc2wuX2NyZWF0ZV91bnZlcmlmaWVkX2NvbnRleHQKZXhjZXB0IEF0dHJpYnV0ZUVycm9yOgogICAgIyBMZWdhY3kgUHl0aG9uIHRoYXQgZG9lc24ndCB2ZXJpZnkgSFRUUFMgY2VydGlmaWNhdGVzIGJ5IGRlZmF1bHQKICAgIHBhc3MKZWxzZToKICAgICMgSGFuZGxlIHRhcmdldCBlbnZpcm9ubWVudCB0aGF0IGRvZXNuJ3Qgc3VwcG9ydCBIVFRQUyB2ZXJpZmljYXRpb24KICAgIHNzbC5fY3JlYXRlX2RlZmF1bHRfaHR0cHNfY29udGV4dCA9IF9jcmVhdGVfdW52ZXJpZmllZF9odHRwc19jb250ZXh0CgppbXBvcnQgb3MKaW1wb3J0IGpzb24KZnJvbSBwYXRobGliIGltcG9ydCBQYXRoCmltcG9ydCBudW1weSBhcyBucAppbXBvcnQgcGFuZGFzIGFzIHBkCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBwaWNrbGUgaW1wb3J0IGR1bXAsIGxvYWQKCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIHR5cGluZyBpbXBvcnQgSU8sIEFueVN0ciwgVW5pb24sIExpc3QsIE9wdGlvbmFsCgoKZGVmIGFyY190b19wYXJxdWV0KAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICBhcmNoaXZlX3VybDogVW5pb25bc3RyLCBQYXRoLCBJT1tBbnlTdHJdXSwKICAgIGhlYWRlcjogT3B0aW9uYWxbTGlzdFtzdHJdXSA9IE5vbmUsCiAgICBpbmNfY29sczogT3B0aW9uYWxbTGlzdFtzdHJdXSA9IE5vbmUsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gIiIsCiAgICBuYW1lOiBzdHIgPSAiIiwKICAgIGNodW5rc2l6ZTogaW50ID0gMTBfMDAwLA
ogICAgZHR5cGU9Tm9uZSwKICAgIGVuY29kaW5nOiBzdHIgPSAnbGF0aW4tMScsCiAgICBrZXk6IHN0ciA9ICdkYXRhJywKICAgIGRhdGFzZXQ6IHN0ciA9ICdkYXRhc2V0JywKICAgIHBhcnRpdGlvbl9jb2xzID0gW10sCikgLT4gTm9uZToKICAgICIiIk9wZW4gYSBmaWxlL29iamVjdCBhcmNoaXZlIGFuZCBzYXZlIGFzIGEgcGFycXVldCBmaWxlLgogICAgCiAgICBQYXJ0aXRpb25pbmcgcmVxdWlyZXMgcHJlY2lzZSBzcGVjaWZpY2F0aW9uIG9mIGNvbHVtbiB0eXBlcy4KICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gYXJjaGl2ZV91cmw6IGFueSB2YWxpZCBzdHJpbmcgcGF0aCBjb25zaXN0ZW50IHdpdGggdGhlIHBhdGggdmFyaWFibGUKICAgICAgICAgICAgICAgICAgICAgICAgb2YgcGFuZGFzLnJlYWRfY3N2LCBpbmNsdWRpbmcgc3RyaW5ncyBhcyBmaWxlIHBhdGhzLCBhcyB1cmxzLCAKICAgICAgICAgICAgICAgICAgICAgICAgcGF0aGxpYi5QYXRoIG9iamVjdHMsIGV0Yy4uLgogICAgOnBhcmFtIGhlYWRlcjogICAgICBjb2x1bW4gbmFtZXMKICAgIDpwYXJhbSBpbmNfY29sczogICAgaW5jbHVkZSBvbmx5IHRoZXNlIGNvbHVtbnMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogZGVzdGluYXRpb24gZm9sZGVyIG9mIHRhYmxlCiAgICA6cGFyYW0gbmFtZTogICAgICAgIG5hbWUgZmlsZSB0byBiZSBzYXZlZCBsb2NhbGx5LCBhbHNvCiAgICA6cGFyYW0gY2h1bmtzaXplOiAgICgwKSByb3cgc2l6ZSByZXRyaWV2ZWQgcGVyIGl0ZXJhdGlvbgogICAgOnBhcmFtIGR0eXBlICAgICAgICBkZXN0aW5hdGlvbiBkYXRhIHR5cGUgb2Ygc3BlY2lmaWVkIGNvbHVtbnMKICAgIDpwYXJhbSBlbmNvZGluZyAgICAgKCdsYXRpbi04JykgZmlsZSBlbmNvZGluZwogICAgOnBhcmFtIGtleTogICAgICAgICBrZXkgaW4gYXJ0aWZhY3Qgc3RvcmUgKHdoZW4gbG9nX2RhdGE9VHJ1ZSkKICAgIDpwYXJhbSBkYXRhc2V0OiAgICAgKE5vbmUpIGlmIG5vdCBOb25lIHRoZW4gJ3RhcmdldF9wYXRoL2RhdGFzZXQnCiAgICAgICAgICAgICAgICAgICAgICAgIGlzIGZvbGRlciBmb3IgcGFydGl0aW9uZWQgZmlsZXMKICAgIDpwYXJhbSBwYXJ0X2NvbHM6ICAgKFtdKSBsaXN0IG9mIHBhcnRpdGlvbmluZyBjb2x1bW5zCiAgICAKICAgICIiIgogICAgaWYgbm90IG5hbWUuZW5kc3dpdGgoIi5wcXQiKToKICAgICAgICBuYW1lICs9ICIucHF0IgogICAgCiAgICBpZiBkYXRhc2V0OgogICAgICAgIG9zLm1ha2VkaXJzKG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgZGF0YXNldCksIGV4aXN0X29rPVRydWUpCiAgICAgICAgZGVzdF9wYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBkYXRhc2V0KQogICAgZWxzZToKICAgICAgICBvcy5tYWtlZGlycyhvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgpLCBleGlzdF9vaz1UcnVlKQogICAgICAgIGRlc3RfcGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbm
FtZSkKICAgICAgICAKICAgIGlmIG5vdCBvcy5wYXRoLmlzZmlsZShkZXN0X3BhdGgpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oImRlc3RpbmF0aW9uIGZpbGUgZG9lcyBub3QgZXhpc3QsIGRvd25sb2FkaW5nIikKICAgICAgICBwcXdyaXRlciA9IE5vbmUKICAgICAgICBmb3IgaSwgZGYgaW4gZW51bWVyYXRlKHBkLnJlYWRfY3N2KGFyY2hpdmVfdXJsLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGNodW5rc2l6ZT1jaHVua3NpemUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbmFtZXM9aGVhZGVyLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGVuY29kaW5nPWVuY29kaW5nLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHVzZWNvbHM9aW5jX2NvbHMsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZHR5cGU9ZHR5cGUpKToKICAgICAgICAgICAgdGFibGUgPSBwYS5UYWJsZS5mcm9tX3BhbmRhcyhkZikKICAgICAgICAgICAgaWYgaSA9PSAwOgogICAgICAgICAgICAgICAgIyB3cml0ZSB0aGUgaGVhZGVyIHRvIHRhcmdldF9wYXRoLi4uCiAgICAgICAgICAgICAgICBwcXdyaXRlciA9IHBxLlBhcnF1ZXRXcml0ZXIob3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCdoZWFkZXItb25seS5wcXQnKSwgdGFibGUuc2NoZW1hKQogICAgICAgICAgICBpZiBkYXRhc2V0OgogICAgICAgICAgICAgICAgICMgLi4uYW5kIGZpbGVzIHRvIHN1YmZvbGRlciBkYXRhc2V0CiAgICAgICAgICAgICAgICBwcS53cml0ZV90b19kYXRhc2V0KHRhYmxlLCByb290X3BhdGg9ZGVzdF9wYXRoLCBwYXJ0aXRpb25fY29scz1wYXJ0aXRpb25fY29scykKICAgICAgICAgICAgZWxzZToKICAgICAgICAgICAgICAgICMgLi4uYW5kIGZpbGUgdG8gYSBwYXJxdWV0IGZpbGUKICAgICAgICAgICAgICAgIHBxd3JpdGVyLndyaXRlX3RhYmxlKHRhYmxlKQogICAgICAgICAgICAKICAgICAgICBpZiBwcXdyaXRlcjoKICAgICAgICAgICAgcHF3cml0ZXIuY2xvc2UoKQoKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKGYic2F2ZWQgdGFibGUgdG8ge2Rlc3RfcGF0aH0iKQogICAgZWxzZToKICAgICAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCJkZXN0aW5hdGlvbiBmaWxlIGFscmVhZHkgZXhpc3RzIikKICAgIAogICAgIyBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWRlc3RfcGF0aCkK - base_image: yjbds/mlrun-files:latest - commands: [] - code_origin: https://github.com/yjb-ds/functions.git#dba3bc120edc5711a9ee1ceaff9e557ced4d0aa1:/User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py diff --git a/fileutils/arc_to_parquet/arc_to_parquet.py 
b/fileutils/arc_to_parquet/function.py similarity index 84% rename from fileutils/arc_to_parquet/arc_to_parquet.py rename to fileutils/arc_to_parquet/function.py index 9400616e0..396dc0bd9 100644 --- a/fileutils/arc_to_parquet/arc_to_parquet.py +++ b/fileutils/arc_to_parquet/function.py @@ -47,7 +47,7 @@ def arc_to_parquet( dtype=None, encoding: str = 'latin-1', key: str = 'data', - dataset: str = 'dataset', + dataset: Optional[str] = None, partition_cols = [], ) -> None: """Open a file/object archive and save as a parquet file. @@ -74,7 +74,7 @@ def arc_to_parquet( if not name.endswith(".pqt"): name += ".pqt" - if dataset: + if dataset is not None: os.makedirs(os.path.join(target_path, dataset), exist_ok=True) dest_path = os.path.join(target_path, dataset) else: @@ -86,19 +86,25 @@ def arc_to_parquet( pqwriter = None for i, df in enumerate(pd.read_csv(archive_url, chunksize=chunksize, - names=header, + names=header, encoding=encoding, usecols=inc_cols, dtype=dtype)): table = pa.Table.from_pandas(df) if i == 0: - # write the header to target_path... 
- pqwriter = pq.ParquetWriter(os.path.join(target_path,'header-only.pqt'), table.schema) + filepath = os.path.join(target_path,'header-only.pqt') + if dataset: + # just write header here + pq.ParquetWriter(filepath, table.schema) + context.log_artifact('header', target_path=filepath) + else: + # start writing file + context.log_artifact('header', target_path=filepath) + pqwriter = pq.ParquetWriter(dest_path, table.schema) + if dataset: - # ...and files to subfolder dataset pq.write_to_dataset(table, root_path=dest_path, partition_cols=partition_cols) else: - # ...and file to a parquet file pqwriter.write_table(table) if pqwriter: @@ -108,4 +114,4 @@ def arc_to_parquet( else: context.logger.info("destination file already exists") - # context.log_artifact(key, target_path=dest_path) + context.log_artifact(key, target_path=dest_path) diff --git a/fileutils/arc_to_parquet/function.yaml b/fileutils/arc_to_parquet/function.yaml new file mode 100644 index 000000000..8b8bacca2 --- /dev/null +++ b/fileutils/arc_to_parquet/function.yaml @@ -0,0 +1,16 @@ +kind: job +metadata: + name: arc_to_parquet + hash: 30906948185d1e5c2b3a20482a1e97dbb1505f8a + project: default +spec: + command: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py + args: [] + image: '' + volumes: [] + volume_mounts: [] + env: [] + description: retrieve archive table and save as partitioned parquet dataset + build: + base_image: yjbds/mlrun_dev-files:latest + commands: [] diff --git a/fileutils/parq_to_dask/function.py b/fileutils/parquet_to_dask/function.py similarity index 65% rename from fileutils/parq_to_dask/function.py rename to fileutils/parquet_to_dask/function.py index edc5caf13..56e5b4f44 100644 --- a/fileutils/parq_to_dask/function.py +++ b/fileutils/parquet_to_dask/function.py @@ -11,18 +11,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
- -import ssl - -try: - _create_unverified_https_context = ssl._create_unverified_context -except AttributeError: - # Legacy Python that doesn't verify HTTPS certificates by default - pass -else: - # Handle target environment that doesn't support HTTPS verification - ssl._create_default_https_context = _create_unverified_https_context - import os import json from pathlib import Path @@ -44,19 +32,29 @@ def parquet_to_dask( inc_cols: Optional[List[str]] = None, index_cols: Optional[List[str]] = None, shards: int = 4, - persist: bool = True + threads_per: int = 4, + persist: bool = True, + dask_key: str = 'my_dask_dataframe' ) -> None: - """Load parquet file or dataset into dask cluster - + """Load parquet dataset into dask cluster + If no cluster is found loads a new one and persist the data to it """ # Setup Dask if hasattr(context, 'dask_client'): dask_client = context.dask_client else: - dask_client = Client(LocalCluster(n_workers=shards)) + context.dask_client = Client(LocalCluster(n_workers=shards, + threads_per_worker=threads_per)) + context.logger.info(context.dask_client) + + assert context.dask_client df = dd.read_parquet(parquet_url) - if persist: - df = df.persist() + if persist and context: + df = context.dask_client.persist(df) + context.dask_client.datasets[dask_key] = df + print(df.head()) + # or can use: + # context.dask_client.publish_dataset(my_dataset=df) diff --git a/fileutils/parq_to_dask/function.yaml b/fileutils/parquet_to_dask/function.yaml similarity index 60% rename from fileutils/parq_to_dask/function.yaml rename to fileutils/parquet_to_dask/function.yaml index 5dfbb8336..a54ca4e35 100644 --- a/fileutils/parq_to_dask/function.yaml +++ b/fileutils/parquet_to_dask/function.yaml @@ -1,21 +1,20 @@ kind: dask metadata: name: function - hash: 722d874e8d00ff106a34b876c9700fc4f0bc2994 + hash: 2a336714ac18bcc984baaf3b24a9c05dabed9a62 project: default spec: - command: /User/repos/functions/fileutils/parq_to_dask/function.py + command: 
/User/repos/functions/fileutils/parquet_to_dask/function.py args: [] image: '' volumes: [] volume_mounts: [] env: [] build: - base_image: yjbds/mlrun-ds:latest + base_image: yjbds/mlrun_dev-dask-ds:latest commands: [] description: '' replicas: 4 - image_pull_policy: Always remote: true service_type: NodePort nthreads: 1 diff --git a/tests/arc_to_parquet-airlines.ipynb b/tests/arc_to_parquet-airlines.ipynb index 58414b1ae..a74f34d17 100644 --- a/tests/arc_to_parquet-airlines.ipynb +++ b/tests/arc_to_parquet-airlines.ipynb @@ -4,14 +4,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# archive to parquet\n", + "# archive to parquet - partitioned data\n", "\n", - "Convert a remote archive or csv file (or local file://), to parquet format" + "Ailines data" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -20,100 +20,64 @@ "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" ] }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd\n", - "import pyarrow as pa\n", - "import pyarrow.parquet as pq" - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## parameters\n", - "from **[h20ai](https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O)**:" + "## parameters" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ + "FUNCTION = 'arc_to_parquet'\n", + "DESCRIPTION = 'retrieve archive table and save as partitioned parquet dataset'\n", + "\n", + "BASE_IMAGE = 'yjbds/mlrun_dev-files:latest'\n", + "JOB_KIND = 'job'\n", + "TASK_NAME = 'user-task-arc-to-part-parq'\n", + "\n", + "CODE_BASE = '/User/repos/functions/fileutils'\n", + "\n", "ARCHIVE_BIG = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears_10.csv\"\n", "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", - 
"ARCHIVE_SMALL = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv\"" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [], - "source": [ + "ARCHIVE_SMALL = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv\"\n", + "\n", "USE_ARCHIVE = ARCHIVE\n", "TARGET_PATH = '/User/mlrun/airlines/dataset'\n", "\n", - "PARTITIONS_DEST = 'partitions'\n", - "PARTITION_COLS = ['Year', 'Month']" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [], - "source": [ - "os.makedirs(os.path.join(TARGET_PATH, PARTITIONS_DEST), exist_ok=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [], - "source": [ - "BASE_IMAGE = 'yjbds/mlrun-files:latest'" + "FILE_SHAPE = (123_534_969, 21) # (rows, cols)\n", + "SMALL_FILE_SHAPE = (43_978, 21) # (rows, cols)\n", + "\n", + "FILE_NAME = 'airlines.pqt'\n", + "KEY = 'airlines'" ] }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ - "CODE_BASE = '/User/repos/functions/' # 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/'\n", - "FUNCTION = 'fileutils/arc_to_parquet'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**For testing and development use ARCHIVE_SMALL:**" + "PARTITIONS_DEST = 'partitions'\n", + "PARTITION_COLS = ['Year', 'Month']" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ - "FILE_NAME = 'airlines.pqt'\n", - "KEY = 'airlines'\n", - "\n", - "# no need for this as the files contain a header:\n", "HEADER = ['Year','Month','DayofMonth','DayOfWeek','DepTime','CRSDepTime','ArrTime','CRSArrTime',\n", " 'UniqueCarrier','FlightNum','TailNum','ActualElapsedTime','CRSElapsedTime','AirTime',\n", " 'ArrDelay','DepDelay','Origin','Dest','Distance','TaxiIn','TaxiOut','Cancelled',\n", " 
'CancellationCode','Diverted','CarrierDelay','WeatherDelay','NASDelay','SecurityDelay',\n", " 'LateAircraftDelay']\n", + "\n", "INC_COLS = ['Year','Month','DayofMonth','DayOfWeek','DepTime','CRSDepTime','ArrTime','CRSArrTime',\n", " 'UniqueCarrier','FlightNum', 'CRSElapsedTime','AirTime',\n", " 'Origin','Dest','Distance', 'TaxiIn', 'TaxiOut','Cancelled',\n", @@ -140,80 +104,32 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "LABEL_COLUMN = \"IsArrDelayed\"" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## load and configure function\n", - "\n", - "**If run the first time, create the function:**" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-27 23:25:02,696 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" - ] - } - ], - "source": [ - "# load function from a local Python file\n", - "arctoparq = mlrun.code_to_function(\n", - " filename=os.path.join(CODE_BASE, FUNCTION, 'arc_to_parquet.py'), \n", - " kind='job')\n", - "arctoparq.build_config(base_image=BASE_IMAGE, commands=[])\n", - "yaml_name = os.path.join(CODE_BASE, FUNCTION, 'arc_to_parquet.yaml')\n", - "arctoparq.export(yaml_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**otherwise load it:**" - ] - }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ - "arctoparq = mlrun.import_function(\n", - " os.path.join(CODE_BASE, FUNCTION, 'arc_to_parquet.yaml')\n", - ").apply(mlrun.mount_v3io())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## deploy / build" + "os.makedirs(os.path.join(TARGET_PATH, PARTITIONS_DEST), exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The following triggers a 
build when run for the first time using specs found in the yaml file above." + "#### load function" ] }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -222,30 +138,38 @@ "'ready'" ] }, - "execution_count": 34, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "func_yaml = os.path.join(CODE_BASE, FUNCTION, 'arc_to_parquet.yaml')\n", + "\n", + "arctoparq = mlrun.import_function(func_yaml)\n", + "\n", + "arctoparq.apply(mlrun.mount_v3io())\n", + "\n", "arctoparq.deploy(skip_deployed=True, with_mlrun=False)" ] }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 23:25:09,450 starting run arc2parq uid=c8f9525e5258489ea1211312348b21e1 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-27 23:25:09,545 Job is running in the background, pod: arc2parq-lw6ww\n", - "[mlrun] 2020-01-27 23:25:14,326 destination file does not exist, downloading\n", - "[mlrun] 2020-01-27 23:36:53,211 saved table to /User/mlrun/airlines/dataset/partitions\n", + "[mlrun] 2020-01-28 21:58:53,305 starting run arc-to-part-parq-task uid=95e9b64ce8664abaad4cba846b2caa5e -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-28 21:58:53,385 Job is running in the background, pod: arc-to-part-parq-task-kw84v\n", + "[mlrun] 2020-01-28 21:58:57,639 destination file does not exist, downloading\n", + "[mlrun] 2020-01-28 22:02:33,755 log artifact header at /User/mlrun/airlines/dataset/header-only.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-28 22:11:31,709 saved table to /User/mlrun/airlines/dataset/partitions\n", + "[mlrun] 2020-01-28 22:11:31,782 log artifact airlines at /User/mlrun/airlines/dataset/partitions, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-27 23:36:53,223 run executed, status=completed\n", + "[mlrun] 2020-01-28 22:11:31,837 run executed, status=completed\n", "final state: 
succeeded\n" ] }, @@ -418,26 +342,26 @@ " \n", " \n", " \n", - "
...8b21e1
\n", + "
...2caa5e
\n", " 0\n", - " Jan 27 23:25:14\n", + " Jan 28 21:58:57\n", " completed\n", - " arc-to-parquet\n", - "
host=arc2parq-lw6ww
kind=job
owner=admin
\n", + " arc_to_parquet\n", + "
host=arc-to-part-parq-task-kw84v
kind=job
owner=admin
\n", " \n", "
archive_url=https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv
dataset=partitions
dtype={'AirTime': 'float32', 'ArrTime': 'float32', 'CRSElapsedTime': 'float32', 'CarrierDelay': 'float32', 'DepTime': 'float32', 'Distance': 'float32', 'LateAircraftDelay': 'float32', 'NASDelay': 'float32', 'SecurityDelay': 'float32', 'TailNum': 'str', 'TaxiIn': 'float32', 'TaxiOut': 'float32', 'WeatherDelay': 'float32'}
encoding=latin-1
inc_cols=['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'CRSElapsedTime', 'AirTime', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
key=airlines
name=airlines.pqt
part_cols=['Year', 'Month']
target_path=/User/mlrun/airlines/dataset
\n", " \n", - " \n", + "
header
airlines
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -453,16 +377,16 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run c8f9525e5258489ea1211312348b21e1 , !mlrun logs c8f9525e5258489ea1211312348b21e1 \n", - "[mlrun] 2020-01-27 23:37:01,852 run executed, status=completed\n" + "!mlrun get run 95e9b64ce8664abaad4cba846b2caa5e , !mlrun logs 95e9b64ce8664abaad4cba846b2caa5e \n", + "[mlrun] 2020-01-28 22:11:35,919 run executed, status=completed\n" ] } ], "source": [ "# create and run the task\n", "arc_to_parq_task = mlrun.NewTask(\n", - " 'arc2parq', \n", - " handler='arc_to_parquet', \n", + " TASK_NAME, \n", + " handler=FUNCTION, \n", " params={\n", " 'target_path': TARGET_PATH,\n", " 'name' : FILE_NAME, \n", @@ -500,7 +424,18 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "import pyarrow.parquet as pq" + ] + }, + { + "cell_type": "code", + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -510,7 +445,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -519,7 +454,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -594,21 +529,21 @@ " \n", " \n", " \n", - " 1992\n", - " 1\n", + " 1990\n", + " 4\n", + " 29\n", " 7\n", - " 2\n", - " 640.0\n", - " 640\n", - " 851.0\n", - " 853\n", - " US\n", - " 53\n", - " 133.0\n", + " 2030.0\n", + " 2030\n", + " 2205.0\n", + " 2145\n", + " PA (1)\n", + " 548\n", + " 75.0\n", " NaN\n", " ...\n", - " IND\n", - " 644.0\n", + " BOS\n", + " 185.0\n", " NaN\n", " NaN\n", " 0\n", @@ -619,20 +554,20 @@ " NaN\n", " \n", " \n", - " 1\n", - " 8\n", - " 3\n", - " 639.0\n", - " 640\n", - " 837.0\n", - " 853\n", - " US\n", - " 53\n", - " 133.0\n", + " 4\n", + " 30\n", + " 1\n", + " 2031.0\n", + " 2030\n", + " 2146.0\n", + " 2145\n", + " PA (1)\n", + " 
548\n", + " 75.0\n", " NaN\n", " ...\n", - " IND\n", - " 644.0\n", + " BOS\n", + " 185.0\n", " NaN\n", " NaN\n", " 0\n", @@ -643,20 +578,20 @@ " NaN\n", " \n", " \n", - " 1\n", - " 9\n", - " 4\n", - " 644.0\n", - " 640\n", - " 905.0\n", - " 853\n", - " US\n", - " 53\n", - " 133.0\n", + " 4\n", + " 1\n", + " 7\n", + " 2028.0\n", + " 2030\n", + " 2121.0\n", + " 2135\n", + " PA (1)\n", + " 549\n", + " 65.0\n", " NaN\n", " ...\n", - " IND\n", - " 644.0\n", + " LGA\n", + " 185.0\n", " NaN\n", " NaN\n", " 0\n", @@ -667,20 +602,20 @@ " NaN\n", " \n", " \n", - " 1\n", - " 11\n", - " 6\n", - " 640.0\n", - " 640\n", - " 834.0\n", - " 853\n", - " US\n", - " 53\n", - " 133.0\n", + " 4\n", + " 2\n", + " 1\n", + " 2030.0\n", + " 2030\n", + " 2128.0\n", + " 2135\n", + " PA (1)\n", + " 549\n", + " 65.0\n", " NaN\n", " ...\n", - " IND\n", - " 644.0\n", + " LGA\n", + " 185.0\n", " NaN\n", " NaN\n", " 0\n", @@ -691,20 +626,20 @@ " NaN\n", " \n", " \n", - " 1\n", - " 12\n", - " 7\n", - " 639.0\n", - " 640\n", - " 832.0\n", - " 853\n", - " US\n", - " 53\n", - " 133.0\n", + " 4\n", + " 3\n", + " 2\n", + " 2030.0\n", + " 2030\n", + " 2146.0\n", + " 2135\n", + " PA (1)\n", + " 549\n", + " 65.0\n", " NaN\n", " ...\n", - " IND\n", - " 644.0\n", + " LGA\n", + " 185.0\n", " NaN\n", " NaN\n", " 0\n", @@ -722,40 +657,40 @@ "text/plain": [ " DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime \\\n", "Year Month \n", - "1992 1 7 2 640.0 640 851.0 853 \n", - " 1 8 3 639.0 640 837.0 853 \n", - " 1 9 4 644.0 640 905.0 853 \n", - " 1 11 6 640.0 640 834.0 853 \n", - " 1 12 7 639.0 640 832.0 853 \n", + "1990 4 29 7 2030.0 2030 2205.0 2145 \n", + " 4 30 1 2031.0 2030 2146.0 2145 \n", + " 4 1 7 2028.0 2030 2121.0 2135 \n", + " 4 2 1 2030.0 2030 2128.0 2135 \n", + " 4 3 2 2030.0 2030 2146.0 2135 \n", "\n", " UniqueCarrier FlightNum CRSElapsedTime AirTime ... Dest \\\n", "Year Month ... \n", - "1992 1 US 53 133.0 NaN ... IND \n", - " 1 US 53 133.0 NaN ... IND \n", - " 1 US 53 133.0 NaN ... 
IND \n", - " 1 US 53 133.0 NaN ... IND \n", - " 1 US 53 133.0 NaN ... IND \n", + "1990 4 PA (1) 548 75.0 NaN ... BOS \n", + " 4 PA (1) 548 75.0 NaN ... BOS \n", + " 4 PA (1) 549 65.0 NaN ... LGA \n", + " 4 PA (1) 549 65.0 NaN ... LGA \n", + " 4 PA (1) 549 65.0 NaN ... LGA \n", "\n", " Distance TaxiIn TaxiOut Cancelled CarrierDelay WeatherDelay \\\n", "Year Month \n", - "1992 1 644.0 NaN NaN 0 NaN NaN \n", - " 1 644.0 NaN NaN 0 NaN NaN \n", - " 1 644.0 NaN NaN 0 NaN NaN \n", - " 1 644.0 NaN NaN 0 NaN NaN \n", - " 1 644.0 NaN NaN 0 NaN NaN \n", + "1990 4 185.0 NaN NaN 0 NaN NaN \n", + " 4 185.0 NaN NaN 0 NaN NaN \n", + " 4 185.0 NaN NaN 0 NaN NaN \n", + " 4 185.0 NaN NaN 0 NaN NaN \n", + " 4 185.0 NaN NaN 0 NaN NaN \n", "\n", " NASDelay SecurityDelay LateAircraftDelay \n", "Year Month \n", - "1992 1 NaN NaN NaN \n", - " 1 NaN NaN NaN \n", - " 1 NaN NaN NaN \n", - " 1 NaN NaN NaN \n", - " 1 NaN NaN NaN \n", + "1990 4 NaN NaN NaN \n", + " 4 NaN NaN NaN \n", + " 4 NaN NaN NaN \n", + " 4 NaN NaN NaN \n", + " 4 NaN NaN NaN \n", "\n", "[5 rows x 21 columns]" ] }, - "execution_count": 23, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -766,14 +701,14 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "if USE_ARCHIVE == ARCHIVE:\n", - " assert df.shape==(123_534_969, 21)\n", + " assert df.shape==FILE_SHAPE\n", "if USE_ARCHIVE == ARCHIVE_SMALL:\n", - " assert df.shape==(43_978, 21)" + " assert df.shape==SMALL_FILE_SHAPE, f\"{df.shape}\"" ] }, { @@ -789,54 +724,9 @@ "metadata": {}, "outputs": [], "source": [ - "import shutil\n", - "shutil.rmtree(TARGET_PATH)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### single parquet file\n", - "\n", - "run this only when `dataset=False`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "assert KEY in run.outputs.keys(), f\"mlrun.functions: key 
{KEY} not found in outputs\"\n", - "assert os.path.isfile(TARGET_PATH+'/'+ FILE_NAME), f\"mlrun.functions: artifact source not found at {TARGET_PATH+'/'+ FILE_NAME}\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "copied = pd.read_parquet(TARGET_PATH+'/'+ FILE_NAME, engine=\"pyarrow\")\n", - "copied.set_index(PARTITION_COLS, inplace=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "copied.head()" + "# import shutil\n", + "# shutil.rmtree(TARGET_PATH)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index 5a47ded22..c64beef5a 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -4,7 +4,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# archive to parquet" + "# archive to parquet\n", + "\n", + "HIGGS" ] }, { @@ -31,17 +33,28 @@ "metadata": {}, "outputs": [], "source": [ - "CODE_BASE = '/User/repos/functions/'\n", + "FUNCTION = 'arc_to_parquet'\n", + "DESCRIPTION = 'retrieve archive table and save as parquet file'\n", + "\n", + "BASE_IMAGE = 'yjbds/mlrun_dev-files:latest'\n", + "JOB_KIND = 'job'\n", + "TASK_NAME = 'user-task-arc-to-parq'\n", + "\n", + "CODE_BASE = '/User/repos/functions/fileutils'\n", + "\n", "TARGET_PATH = '/User/mlrun/models'\n", - "# ARCHIVE = \"https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz\"\n", + "\n", + "ARCHIVE_SAMPLE = \"https://fpsignals-public.s3.amazonaws.com/higgs-small.tar.gz\"\n", "ARCHIVE = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz\"\n", + "\n", "FILE_NAME = 'higgs.pqt'\n", "KEY = 'higgs'\n", "\n", - "HEADER = ['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi',\n", - " 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 
'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt',\n", - " 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv',\n", - " 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']" + "HEADER = ['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', \n", + " 'missing_energy_phi', 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', \n", + " 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt', 'jet_3_eta',\n", + " 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag',\n", + " 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']" ] }, { @@ -64,34 +77,68 @@ "cell_type": "code", "execution_count": 4, "metadata": {}, + "outputs": [], + "source": [ + "func_py = os.path.join(CODE_BASE, FUNCTION, 'function.py')\n", + "func_yaml = os.path.join(CODE_BASE, FUNCTION, 'function.yaml')\n", + "\n", + "arctoparq = mlrun.new_function(command=func_py, kind=JOB_KIND)\n", + "\n", + "arctoparq.spec.description = DESCRIPTION\n", + "arctoparq.spec.build.base_image = BASE_IMAGE" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 08:12:26,749 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + "[mlrun] 2020-01-28 21:37:17,278 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" ] } ], "source": [ - "# load function from a local Python file\n", - "arctoparq = mlrun.code_to_function(\n", - " filename=os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.py'), \n", - " kind='job')\n", - "arctoparq.build_config(base_image='yjbds/mlrun-files:latest', commands=[])\n", - "yaml_name = os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", - "arctoparq.export(yaml_name)" + "arctoparq.export(func_yaml)" ] }, { 
"cell_type": "code", - "execution_count": 5, + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "arctoparq.apply(mlrun.mount_v3io())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### ...or load from yaml" + ] + }, + { + "cell_type": "code", + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ - "arctoparq = mlrun.import_function(\n", - " os.path.join(CODE_BASE, 'fileutils/arc_to_parquet', 'arc_to_parquet.yaml')\n", - ").apply(mlrun.mount_v3io())" + "# arctoparq = mlrun.import_function(func_yaml).apply(mlrun.mount_v3io())" ] }, { @@ -110,7 +157,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -119,7 +166,7 @@ "'ready'" ] }, - "execution_count": 14, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -130,21 +177,21 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-27 08:13:34,366 starting run arc2parq uid=ca75db580ec146038a8a932e85b64ac1 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-27 08:13:34,456 Job is running in the background, pod: arc2parq-2rtrg\n", - "[mlrun] 2020-01-27 08:13:42,564 destination file does not exist, downloading\n", - "[mlrun] 2020-01-27 08:18:45,530 saved table to /User/mlrun/models/higgs.pqt\n", - "[mlrun] 2020-01-27 08:18:45,545 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-27 08:18:45,558 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y\n", + "[mlrun] 2020-01-28 21:42:59,772 starting run arc-to-parq-task uid=66dd63566a8147a0b7186742f4402af0 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-28 21:42:59,930 Job is running in the background, pod: 
arc-to-parq-task-bzszg\n", + "[mlrun] 2020-01-28 21:43:04,226 destination file does not exist, downloading\n", + "[mlrun] 2020-01-28 21:44:56,004 log artifact header at /User/mlrun/models/header-only.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-28 21:48:03,294 saved table to /User/mlrun/models/higgs.pqt\n", + "[mlrun] 2020-01-28 21:48:03,316 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-27 08:18:45,581 run executed, status=completed\n", + "[mlrun] 2020-01-28 21:48:03,345 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -317,26 +364,26 @@ " \n", " \n", " \n", - "
...b64ac1
\n", + "
...402af0
\n", " 0\n", - " Jan 27 08:13:42\n", + " Jan 28 21:43:04\n", " completed\n", - " arc-to-parquet\n", - "
host=arc2parq-2rtrg
kind=job
owner=admin
\n", + " arc_to_parquet\n", + "
host=arc-to-parq-task-bzszg
kind=job
owner=admin
\n", " \n", "
archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
header=['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi', 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt', 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
key=higgs
name=higgs.pqt
target_path=/User/mlrun/models
\n", " \n", - "
higgs
header
\n", + "
header
higgs
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -352,16 +399,16 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run ca75db580ec146038a8a932e85b64ac1 , !mlrun logs ca75db580ec146038a8a932e85b64ac1 \n", - "[mlrun] 2020-01-27 08:18:54,929 run executed, status=completed\n" + "!mlrun get run 66dd63566a8147a0b7186742f4402af0 , !mlrun logs 66dd63566a8147a0b7186742f4402af0 \n", + "[mlrun] 2020-01-28 21:48:10,286 run executed, status=completed\n" ] } ], "source": [ "# create and run the task\n", "arc_to_parq_task = mlrun.NewTask(\n", - " 'arc2parq', \n", - " handler='arc_to_parquet', \n", + " TASK_NAME,\n", + " handler=FUNCTION, \n", " params={\n", " 'target_path': TARGET_PATH,\n", " 'name' : FILE_NAME, \n", @@ -371,7 +418,28 @@ " outputs=[KEY])\n", "\n", "# run\n", - "run = arctoparq.run(arc_to_parq_task)" + "rn = arctoparq.run(arc_to_parq_task)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'header': '/User/mlrun/models/header-only.pqt',\n", + " 'higgs': '/User/mlrun/models/higgs.pqt'}" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "rn.outputs" ] }, { @@ -390,7 +458,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ @@ -401,7 +469,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -411,17 +479,17 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ - "assert KEY in run.outputs.keys(), f\"mlrun.functions: key {KEY} not found in outputs\"\n", + "assert KEY in rn.outputs.keys(), f\"mlrun.functions: key {KEY} not found in outputs\"\n", "assert os.path.isfile(TARGET_PATH+'/'+ FILE_NAME), f\"mlrun.functions: artifact source not found at {TARGET_PATH+'/'+ FILE_NAME}\"" ] }, { "cell_type": "code", 
- "execution_count": 11, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -430,7 +498,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 26, "metadata": {}, "outputs": [ { @@ -635,7 +703,7 @@ "[5 rows x 29 columns]" ] }, - "execution_count": 12, + "execution_count": 26, "metadata": {}, "output_type": "execute_result" } @@ -646,7 +714,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 27, "metadata": {}, "outputs": [ { @@ -655,7 +723,7 @@ "(11000000, 29)" ] }, - "execution_count": 13, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } @@ -673,7 +741,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ diff --git a/tests/describe.py b/tests/describe.py new file mode 100644 index 000000000..75e52221b --- /dev/null +++ b/tests/describe.py @@ -0,0 +1,45 @@ +# Copyright 2018 Iguazio +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +import json +from pathlib import Path +import numpy as np +import pandas as pd + +import dask +import dask.dataframe as dd +from dask.distributed import Client, LocalCluster + +from mlrun.execution import MLClientCtx +from mlrun.datastore import DataItem + +from typing import IO, AnyStr, Union, List, Optional + +def table_summary( + context: MLClientCtx, + dask_key: str = 'my_dask_dataframe', + target_path: str = '', + name: str = 'table_summary.csv', + key: str = 'table_summary' +) -> None: + """Summarize a table + """ + if hasattr(context, 'dask_client'): + dscr = context.dask_client.datasets[dask_key].describe() + filepath = os.path.join(target_path, name) + dscr.to_csv(filepath) + context.log_artifact(key, target_path=filepath) + else: + context.logger.info('no dask_client found') + \ No newline at end of file diff --git a/tests/describe.yaml b/tests/describe.yaml new file mode 100644 index 000000000..93779b6d2 --- /dev/null +++ b/tests/describe.yaml @@ -0,0 +1,22 @@ +kind: dask +metadata: + name: describe + hash: 931bd0bbc1dde34284e6c604d0b08d5e9666b7a7 + project: default +spec: + command: /User/repos/functions/tests/describe.py + args: [] + image: '' + volumes: [] + volume_mounts: [] + env: [] + build: + base_image: yjbds/mlrun_dev-dask-ds:latest + commands: [] + description: '' + replicas: 4 + remote: true + service_type: NodePort + nthreads: 1 + min_replicas: 0 + max_replicas: 4 diff --git a/tests/parq_to_dask.ipynb b/tests/parq_to_dask.ipynb deleted file mode 100644 index ad69b6e2b..000000000 --- a/tests/parq_to_dask.ipynb +++ /dev/null @@ -1,303 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# archive to parquet\n", - "\n", - "Convert a remote archive or csv file (or local file://), to parquet format" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import mlrun\n", - "import os\n", - "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'" - ] - 
}, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd\n", - "import pyarrow as pa\n", - "import pyarrow.parquet as pq" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## parameters\n", - "from **[h20ai](https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O)**:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "ARCHIVE_BIG = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears_10.csv\"\n", - "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", - "ARCHIVE_SMALL = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv\"" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "USE_ARCHIVE = ARCHIVE_SMALL\n", - "SRC_PATH = '/User/mlrun/airlines/dataset-small/partitions/*.parquet'\n", - "\n", - "PARTITIONS_DEST = 'partitions'\n", - "PARTITION_COLS = ['Year', 'Month']" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "BASE_IMAGE = 'yjbds/mlrun-ds:latest'" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "CODE_BASE = '/User/repos/functions/' # 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/'\n", - "FUNCTION = 'fileutils/parq_to_dask'\n", - "JOB_KIND = 'dask'" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "LABEL_COLUMN = \"IsArrDelayed\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## load and configure function" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "yaml_name = os.path.join(CODE_BASE, FUNCTION, 'function.yaml')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - 
"source": [ - "**If run the first time, create the function:**" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-28 00:08:03,363 function spec saved to path: /User/repos/functions/fileutils/parq_to_dask/function.yaml\n" - ] - } - ], - "source": [ - "# load function from a local Python file\n", - "parq2dask = mlrun.new_function(\n", - " command=os.path.join(CODE_BASE, FUNCTION, 'function.py'), \n", - " kind=JOB_KIND)\n", - "\n", - "parq2dask.spec.remote = True\n", - "parq2dask.spec.replicas = 4 \n", - "parq2dask.spec.max_replicas = 4\n", - "parq2dask.spec.service_type = 'NodePort'\n", - "parq2dask.spec.image_pull_policy = 'Always'\n", - "parq2dask.build_config(base_image=BASE_IMAGE, commands=[])\n", - "\n", - "parq2dask.export(yaml_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**otherwise load it:**" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "# parq2dask = mlrun.import_function(yaml_name).apply(mlrun.mount_v3io())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## deploy / build" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following triggers a build when run for the first time using specs found in the yaml file above." 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'ready'" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "parq2dask.deploy(skip_deployed=True, with_mlrun=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-28 00:08:03,437 starting run parq-to-dask uid=687a2c492be3405abcfa85d5430fd42a -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-28 00:08:04,283 saving function: function, tag: latest\n" - ] - }, - { - "ename": "RunDBError", - "evalue": "POST http://mlrun-api:8080/api/start/function, error: HTTPConnectionPool(host='mlrun-api', port=8080): Read timed out. (read timeout=20)", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mtimeout\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36m_make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 383\u001b[0m \u001b[0;31m# otherwise it looks like a programming error was the cause.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 384\u001b[0;31m \u001b[0msix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraise_from\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 385\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mSocketTimeout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mBaseSSLError\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mSocketError\u001b[0m\u001b[0;34m)\u001b[0m 
\u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/packages/six.py\u001b[0m in \u001b[0;36mraise_from\u001b[0;34m(value, from_value)\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36m_make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 379\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 380\u001b[0;31m \u001b[0mhttplib_response\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetresponse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 381\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/http/client.py\u001b[0m in \u001b[0;36mgetresponse\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1330\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1331\u001b[0;31m \u001b[0mresponse\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbegin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1332\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mConnectionError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/http/client.py\u001b[0m in \u001b[0;36mbegin\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 296\u001b[0m \u001b[0;32mwhile\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 
297\u001b[0;31m \u001b[0mversion\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstatus\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreason\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_read_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 298\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mstatus\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mCONTINUE\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/http/client.py\u001b[0m in \u001b[0;36m_read_status\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 257\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_read_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 258\u001b[0;31m \u001b[0mline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreadline\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_MAXLINE\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"iso-8859-1\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 259\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mline\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0m_MAXLINE\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/socket.py\u001b[0m in \u001b[0;36mreadinto\u001b[0;34m(self, b)\u001b[0m\n\u001b[1;32m 585\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 586\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sock\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecv_into\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 587\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mtimeout\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mtimeout\u001b[0m: timed out", - "\nDuring handling of the above exception, another exception occurred:\n", - "\u001b[0;31mReadTimeoutError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/adapters.py\u001b[0m in \u001b[0;36msend\u001b[0;34m(self, request, stream, timeout, verify, cert, proxies)\u001b[0m\n\u001b[1;32m 448\u001b[0m \u001b[0mretries\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmax_retries\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 449\u001b[0;31m \u001b[0mtimeout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtimeout\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 450\u001b[0m )\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36murlopen\u001b[0;34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)\u001b[0m\n\u001b[1;32m 637\u001b[0m retries = retries.increment(method, url, error=e, _pool=self,\n\u001b[0;32m--> 638\u001b[0;31m _stacktrace=sys.exc_info()[2])\n\u001b[0m\u001b[1;32m 639\u001b[0m \u001b[0mretries\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msleep\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/util/retry.py\u001b[0m in \u001b[0;36mincrement\u001b[0;34m(self, method, url, 
response, error, _pool, _stacktrace)\u001b[0m\n\u001b[1;32m 367\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mread\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mFalse\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_method_retryable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 368\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0msix\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreraise\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merror\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_stacktrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 369\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mread\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/packages/six.py\u001b[0m in \u001b[0;36mreraise\u001b[0;34m(tp, value, tb)\u001b[0m\n\u001b[1;32m 685\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 686\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 687\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36murlopen\u001b[0;34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, 
**response_kw)\u001b[0m\n\u001b[1;32m 599\u001b[0m \u001b[0mbody\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mbody\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mheaders\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mheaders\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 600\u001b[0;31m chunked=chunked)\n\u001b[0m\u001b[1;32m 601\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36m_make_request\u001b[0;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[1;32m 385\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mSocketTimeout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mBaseSSLError\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mSocketError\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 386\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_raise_timeout\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtimeout_value\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mread_timeout\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 387\u001b[0m \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/urllib3/connectionpool.py\u001b[0m in \u001b[0;36m_raise_timeout\u001b[0;34m(self, err, url, timeout_value)\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mSocketTimeout\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 306\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mReadTimeoutError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"Read timed out. (read timeout=%s)\"\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mtimeout_value\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 307\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mReadTimeoutError\u001b[0m: HTTPConnectionPool(host='mlrun-api', port=8080): Read timed out. (read timeout=20)", - "\nDuring handling of the above exception, another exception occurred:\n", - "\u001b[0;31mReadTimeout\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/db/httpdb.py\u001b[0m in \u001b[0;36mapi_call\u001b[0;34m(self, method, path, error, params, body, json, timeout)\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 70\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrequests\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtimeout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtimeout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 71\u001b[0m \u001b[0mresp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraise_for_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/api.py\u001b[0m in \u001b[0;36mrequest\u001b[0;34m(method, url, 
**kwargs)\u001b[0m\n\u001b[1;32m 59\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0msessions\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSession\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0msession\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 60\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0msession\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 61\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36mrequest\u001b[0;34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001b[0m\n\u001b[1;32m 532\u001b[0m \u001b[0msend_kwargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msettings\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 533\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mprep\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0msend_kwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 534\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36msend\u001b[0;34m(self, request, **kwargs)\u001b[0m\n\u001b[1;32m 667\u001b[0m \u001b[0;31m# Resolve redirects if 
allowed.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 668\u001b[0;31m \u001b[0mhistory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mresp\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mgen\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mallow_redirects\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 669\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 667\u001b[0m \u001b[0;31m# Resolve redirects if allowed.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 668\u001b[0;31m \u001b[0mhistory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mresp\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mgen\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mallow_redirects\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 669\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36mresolve_redirects\u001b[0;34m(self, resp, req, stream, timeout, verify, cert, proxies, yield_requests, **adapter_kwargs)\u001b[0m\n\u001b[1;32m 246\u001b[0m \u001b[0mallow_redirects\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 247\u001b[0;31m \u001b[0;34m**\u001b[0m\u001b[0madapter_kwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 248\u001b[0m )\n", - 
"\u001b[0;32m/conda/lib/python3.6/site-packages/requests/sessions.py\u001b[0m in \u001b[0;36msend\u001b[0;34m(self, request, **kwargs)\u001b[0m\n\u001b[1;32m 645\u001b[0m \u001b[0;31m# Send the request\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 646\u001b[0;31m \u001b[0mr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0madapter\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrequest\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 647\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m/conda/lib/python3.6/site-packages/requests/adapters.py\u001b[0m in \u001b[0;36msend\u001b[0;34m(self, request, stream, timeout, verify, cert, proxies)\u001b[0m\n\u001b[1;32m 528\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mReadTimeoutError\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 529\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mReadTimeout\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrequest\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrequest\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 530\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mReadTimeout\u001b[0m: HTTPConnectionPool(host='mlrun-api', port=8080): Read timed out. 
(read timeout=20)", - "\nThe above exception was the direct cause of the following exception:\n", - "\u001b[0;31mRunDBError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 9\u001b[0m 'persist' : True})\n\u001b[1;32m 10\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mrun\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mparq2dask\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mparq_to_dask_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 292\u001b[0m \u001b[0;31m# single run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 293\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 294\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexecution\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 295\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mwatch\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkind\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m''\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'handler'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'local'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 296\u001b[0m 
\u001b[0mstate\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlogs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_db\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/daskjob.py\u001b[0m in \u001b[0;36m_run\u001b[0;34m(self, runobj, execution)\u001b[0m\n\u001b[1;32m 246\u001b[0m \u001b[0mautocommit\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 247\u001b[0m host=socket.gethostname())\n\u001b[0;32m--> 248\u001b[0;31m \u001b[0mclient\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclient\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 249\u001b[0m \u001b[0msetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcontext\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'dask_client'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mclient\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 250\u001b[0m \u001b[0msout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mserr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mexec_from_params\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhandler\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/daskjob.py\u001b[0m in \u001b[0;36mclient\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 193\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mremote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscheduler_address\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 194\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_load_db_status\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 195\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_start\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 196\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 197\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscheduler_address\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/daskjob.py\u001b[0m in \u001b[0;36m_start\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 142\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 143\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mversioned\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 144\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mremote_start\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_function_uri\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 145\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mresp\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;34m'status'\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mresp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 146\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresp\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'status'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/db/httpdb.py\u001b[0m in \u001b[0;36mremote_start\u001b[0;34m(self, func_url)\u001b[0m\n\u001b[1;32m 306\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 307\u001b[0m \u001b[0mreq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m'functionUrl'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mfunc_url\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 308\u001b[0;31m \u001b[0mresp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapi_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'POST'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'start/function'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mjson\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreq\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 309\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mOSError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 310\u001b[0m \u001b[0mlogger\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'error starting function: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/db/httpdb.py\u001b[0m in 
\u001b[0;36mapi_call\u001b[0;34m(self, method, path, error, params, body, json, timeout)\u001b[0m\n\u001b[1;32m 73\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mrequests\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mRequestException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 74\u001b[0m \u001b[0merror\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0merror\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;34m'{} {}, error: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 75\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunDBError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 76\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 77\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_path_of\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprefix\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mproject\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mRunDBError\u001b[0m: POST http://mlrun-api:8080/api/start/function, error: HTTPConnectionPool(host='mlrun-api', port=8080): Read timed out. 
(read timeout=20)" - ] - } - ], - "source": [ - "# create and run the task\n", - "parq_to_dask_task = mlrun.NewTask(\n", - " 'parq-to-dask', \n", - " handler='parquet_to_dask', \n", - " params={\n", - " 'parquet_url': SRC_PATH,\n", - " 'index_cols' : PARTITION_COLS,\n", - " 'shards' : 4,\n", - " 'persist' : True})\n", - "# run\n", - "run = parq2dask.run(parq_to_dask_task)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.8" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/tests/parquet_to_dask.ipynb b/tests/parquet_to_dask.ipynb new file mode 100644 index 000000000..02b2f2011 --- /dev/null +++ b/tests/parquet_to_dask.ipynb @@ -0,0 +1,952 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# parquet to dask\n", + "load a parquet dataset into a dask cluster" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import mlrun\n", + "import os\n", + "mlrun.mlconf.dbpath = 'http://mlrun-api:8080'\n", + "mlrun.mlconf.remote_host = '3.133.8.252' " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## parameters\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "FUNCTION = 'parquet_to_dask'\n", + "DESCRIPTION = 'load parquet dataset into a dask cluster'\n", + "\n", + "BASE_IMAGE = 'yjbds/mlrun_dev-dask-ds:latest'\n", + "JOB_KIND = 'dask'\n", + "TASK_NAME = 'user-task-parq-to-dask'\n", + "\n", + "CODE_BASE = '/User/repos/functions/fileutils'\n", + "\n", + 
"SRC_PATH = '/User/mlrun/airlines/dataset/partitions'\n", + "\n", + "PARTITION_COLS = ['Year', 'Month']\n", + "\n", + "DASK_SHARDS = 4\n", + "DASK_THREADS_PER = 4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## load and configure function" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "func_py = os.path.join(CODE_BASE, FUNCTION, 'function.py')\n", + "func_yaml = os.path.join(CODE_BASE, FUNCTION, 'function.yaml')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**If run the first time, create the function:**" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# load function from a local Python file\n", + "parq2dask = mlrun.new_function(command=func_py, kind=JOB_KIND)\n", + "\n", + "parq2dask.spec.remote = True\n", + "parq2dask.spec.replicas = 4 \n", + "parq2dask.spec.max_replicas = 4\n", + "parq2dask.spec.service_type = 'NodePort'\n", + "parq2dask.spec.build.base_image = BASE_IMAGE" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-29 01:44:34,435 function spec saved to path: /User/repos/functions/fileutils/parquet_to_dask/function.yaml\n" + ] + } + ], + "source": [ + "parq2dask.export(func_yaml)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**otherwise load it:**" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-29 01:44:47,617 starting remote build, image: .mlrun/func-default-function-latest\n" + ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "parq2dask = mlrun.import_function(func_yaml)\n", + 
"\n", + "parq2dask.apply(mlrun.mount_v3io())\n", + "\n", + "parq2dask.deploy(skip_deployed=True, with_mlrun=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-29 01:44:50,780 starting run user-task-parq-to-dask uid=7f501215960146a89447aa29cb21c2a1 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-29 01:44:51,667 saving function: function, tag: latest\n", + "[mlrun] 2020-01-29 01:44:57,827 using remote dask scheduler (mlrun-function-e71085fe-8) at: 3.133.8.252:30113\n", + "[mlrun] 2020-01-29 01:44:57,828 remote dashboard (node) port: 3.133.8.252:31064\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/conda/lib/python3.6/site-packages/distributed/client.py:1074: VersionMismatchWarning: Mismatched versions found\n", + "\n", + "blosc\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | None |\n", + "| scheduler | 1.7.0 |\n", + "+-----------+---------+\n", + "\n", + "lz4\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | None |\n", + "| scheduler | 2.2.1 |\n", + "+-----------+---------+\n", + "\n", + "msgpack\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | 0.6.2 |\n", + "| scheduler | 0.6.1 |\n", + "+-----------+---------+\n", + "\n", + "tornado\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | 5.1.1 |\n", + "| scheduler | 6.0.3 |\n", + "+-----------+---------+\n", + " warnings.warn(version_module.VersionMismatchWarning(msg[0][\"warning\"]))\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime \\\n", + "0 1996 12 10 2 932.0 935 1112.0 \n", + "1 1996 12 11 3 945.0 935 1145.0 \n", + "2 1996 12 7 6 730.0 730 940.0 \n", + "3 1996 12 1 7 
2357.0 2005 212.0 \n", + "4 1996 12 2 1 2006.0 2005 2206.0 \n", + "\n", + " CRSArrTime UniqueCarrier FlightNum ... Dest Distance TaxiIn TaxiOut \\\n", + "0 1140 CO 661 ... CAE 602.0 4.0 10.0 \n", + "1 1140 CO 661 ... CAE 602.0 5.0 18.0 \n", + "2 937 CO 678 ... CAE 602.0 5.0 20.0 \n", + "3 2210 CO 695 ... CAE 602.0 4.0 25.0 \n", + "4 2210 CO 695 ... CAE 602.0 3.0 26.0 \n", + "\n", + " Cancelled CarrierDelay WeatherDelay NASDelay SecurityDelay \\\n", + "0 0 NaN NaN NaN NaN \n", + "1 0 NaN NaN NaN NaN \n", + "2 0 NaN NaN NaN NaN \n", + "3 0 NaN NaN NaN NaN \n", + "4 0 NaN NaN NaN NaN \n", + "\n", + " LateAircraftDelay \n", + "0 NaN \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", + "\n", + "[5 rows x 23 columns]\n", + "\n", + "[mlrun] 2020-01-29 01:47:10,961 run ended with state \n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...21c2a1
0Jan 29 01:44:51completeduser-task-parq-to-dask
kind=dask
owner=admin
host=jupyter-1-6ccccd5fdf-mz2ld
parquet_url=/User/mlrun/airlines/dataset/partitions
index_cols=['Year', 'Month']
shards=4
threads_per=4
persist=True
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run 7f501215960146a89447aa29cb21c2a1 , !mlrun logs 7f501215960146a89447aa29cb21c2a1 \n", + "[mlrun] 2020-01-29 01:47:11,004 run executed, status=completed\n" + ] + } + ], + "source": [ + "# create and run the task\n", + "parq_to_dask_task = mlrun.NewTask(\n", + " TASK_NAME, \n", + " handler=FUNCTION, \n", + " params={\n", + " 'parquet_url': SRC_PATH,\n", + " 'index_cols' : PARTITION_COLS,\n", + " 'shards' : DASK_SHARDS,\n", + " 'threads_per': DASK_THREADS_PER,\n", + " 'persist' : True})\n", + "# run\n", + "rn = parq2dask.run(parq_to_dask_task)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "rn.outputs" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "summ = mlrun.new_function(command='/User/repos/functions/tests/describe.py', kind=JOB_KIND)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "summ.spec.remote = True\n", + "summ.spec.replicas = 4 \n", + "summ.spec.max_replicas = 4\n", + "summ.spec.service_type = 'NodePort'\n", + "summ.spec.build.base_image = BASE_IMAGE" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-29 01:47:51,451 function spec saved to path: /User/repos/functions/tests/describe.yaml\n" + ] + } + ], + "source": [ + "summ.export('/User/repos/functions/tests/describe.yaml')" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + 
"name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-29 01:47:56,228 starting remote build, image: .mlrun/func-default-describe-latest\n" + ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "summ.apply(mlrun.mount_v3io())\n", + "\n", + "summ.deploy(skip_deployed=True, with_mlrun=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-29 01:47:58,445 starting run user-task-my-sum uid=f46e35077e114018836ac247b025e8ae -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-29 01:47:58,559 saving function: describe, tag: latest\n", + "[mlrun] 2020-01-29 01:48:07,794 using remote dask scheduler (mlrun-describe-82c5acd0-e) at: 3.133.8.252:30018\n", + "[mlrun] 2020-01-29 01:48:07,795 remote dashboard (node) port: 3.133.8.252:32596\n", + "[mlrun] 2020-01-29 01:48:07,862 exec error - \"Dataset 'my_dask_dataframe' not found\"\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/conda/lib/python3.6/site-packages/distributed/client.py:1074: VersionMismatchWarning: Mismatched versions found\n", + "\n", + "blosc\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | None |\n", + "| scheduler | 1.7.0 |\n", + "+-----------+---------+\n", + "\n", + "lz4\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | None |\n", + "| scheduler | 2.2.1 |\n", + "+-----------+---------+\n", + "\n", + "msgpack\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | 0.6.2 |\n", + "| scheduler | 0.6.1 |\n", + "+-----------+---------+\n", + "\n", + "tornado\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | 5.1.1 |\n", + "| scheduler | 6.0.3 
|\n", + "+-----------+---------+\n", + " warnings.warn(version_module.VersionMismatchWarning(msg[0][\"warning\"]))\n", + "\"Dataset 'my_dask_dataframe' not found\"\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
uiditerstartstatenamelabelsinputsparametersresultsartifacts
...25e8ae
0Jan 29 01:47:58
error
user-task-my-sum
host=jupyter-1-6ccccd5fdf-mz2ld
kind=dask
owner=admin
data_key=my_dask_dataframe
key=table-summary
name=table-summary.csv
target_path=/User/mlrun/models
\n", + "
\n", + "
\n", + "
\n", + " Title\n", + " ×\n", + "
\n", + " \n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "to track results use .show() or .logs() or in CLI: \n", + "!mlrun get run f46e35077e114018836ac247b025e8ae , !mlrun logs f46e35077e114018836ac247b025e8ae \n", + "[mlrun] 2020-01-29 01:48:07,933 run executed, status=error\n" + ] + }, + { + "ename": "RunError", + "evalue": "\"Dataset 'my_dask_dataframe' not found\"", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 9\u001b[0m 'key' : 'table-summary'})\n\u001b[1;32m 10\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mrn2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msumm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msumm_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 302\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 303\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 304\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlast_err\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 305\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 306\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mdict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mRunObject\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m 
\u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mRunError\u001b[0m: \"Dataset 'my_dask_dataframe' not found\"" + ] + } + ], + "source": [ + "# create and run the task\n", + "summ_task = mlrun.NewTask(\n", + " 'user-task-my-sum', \n", + " handler='table_summary', \n", + " params={\n", + " 'data_key' : 'my_dask_dataframe',\n", + " 'target_path': '/User/mlrun/models',\n", + " 'name' : 'table-summary.csv',\n", + " 'key' : 'table-summary'})\n", + "# run\n", + "rn2 = summ.run(summ_task)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rn2.outputs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "____\n", + "\n", + "# tests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import pyarrow as pa\n", + "import pyarrow.parquet as pq" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import dask\n", + "import dask.dataframe as dd" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = pq.ParquetDataset(os.path.join(SRC_PATH))\n", + "df = dataset.read().to_pandas()\n", + "\n", + "\n", + "ddf = dd.read_parquet(SRC_PATH) 
#+'/*.parquet')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ddf = ddf.persist()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ddf.head()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From dcb6bcb15a78b877f6dadc6a6a4a779b2cce3cf0 Mon Sep 17 00:00:00 2001 From: yasha Date: Thu, 30 Jan 2020 00:46:51 +0000 Subject: [PATCH 30/32] eod, backup --- datagen/{classification => binary}/binary.py | 0 .../{classification => binary}/binary.yaml | 0 .../function.py} | 0 .../function.yaml} | 0 fileutils/arc_to_parquet/function.py | 4 +- fileutils/arc_to_parquet/function.yaml | 8 +- fileutils/parquet_to_dask/function.py | 30 +- fileutils/parquet_to_dask/function.yaml | 15 +- tests/arc_to_parquet-airlines.ipynb | 297 ++--------- tests/arc_to_parquet.ipynb | 274 ++++++----- tests/describe.py | 16 +- tests/describe.yaml | 14 +- tests/parquet_to_dask.ipynb | 464 +++++++++++------- tests/train_valid_test_split.ipynb | 64 ++- 14 files changed, 533 insertions(+), 653 deletions(-) rename datagen/{classification => binary}/binary.py (100%) rename datagen/{classification => binary}/binary.yaml (100%) rename datagen/{splitters/train_valid_test.py => train_valid_test/function.py} (100%) rename datagen/{splitters/train_valid_test.yaml => train_valid_test/function.yaml} (100%) diff --git a/datagen/classification/binary.py b/datagen/binary/binary.py similarity index 100% rename from datagen/classification/binary.py rename to datagen/binary/binary.py diff --git 
a/datagen/classification/binary.yaml b/datagen/binary/binary.yaml similarity index 100% rename from datagen/classification/binary.yaml rename to datagen/binary/binary.yaml diff --git a/datagen/splitters/train_valid_test.py b/datagen/train_valid_test/function.py similarity index 100% rename from datagen/splitters/train_valid_test.py rename to datagen/train_valid_test/function.py diff --git a/datagen/splitters/train_valid_test.yaml b/datagen/train_valid_test/function.yaml similarity index 100% rename from datagen/splitters/train_valid_test.yaml rename to datagen/train_valid_test/function.yaml diff --git a/fileutils/arc_to_parquet/function.py b/fileutils/arc_to_parquet/function.py index 396dc0bd9..8b261c944 100644 --- a/fileutils/arc_to_parquet/function.py +++ b/fileutils/arc_to_parquet/function.py @@ -96,10 +96,10 @@ def arc_to_parquet( if dataset: # just write header here pq.ParquetWriter(filepath, table.schema) - context.log_artifact('header', target_path=filepath) + #context.log_artifact('header', target_path=filepath) else: # start writing file - context.log_artifact('header', target_path=filepath) + #context.log_artifact('header', target_path=filepath) pqwriter = pq.ParquetWriter(dest_path, table.schema) if dataset: diff --git a/fileutils/arc_to_parquet/function.yaml b/fileutils/arc_to_parquet/function.yaml index 8b8bacca2..e4d296a3c 100644 --- a/fileutils/arc_to_parquet/function.yaml +++ b/fileutils/arc_to_parquet/function.yaml @@ -1,16 +1,16 @@ kind: job metadata: - name: arc_to_parquet - hash: 30906948185d1e5c2b3a20482a1e97dbb1505f8a + name: function + hash: 0a17345fa693f3b0fd5671a8f94e09f97676ded2 project: default spec: - command: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.py + command: /User/repos/functions/fileutils/arc_to_parquet/function.py args: [] image: '' volumes: [] volume_mounts: [] env: [] - description: retrieve archive table and save as partitioned parquet dataset + description: retrieve archive table and save as parquet file 
build: base_image: yjbds/mlrun_dev-files:latest commands: [] diff --git a/fileutils/parquet_to_dask/function.py b/fileutils/parquet_to_dask/function.py index 56e5b4f44..02dca4d05 100644 --- a/fileutils/parquet_to_dask/function.py +++ b/fileutils/parquet_to_dask/function.py @@ -34,27 +34,33 @@ def parquet_to_dask( shards: int = 4, threads_per: int = 4, persist: bool = True, - dask_key: str = 'my_dask_dataframe' + dask_key: str = 'my_dask_dataframe', + target_path: str = '' ) -> None: """Load parquet dataset into dask cluster If no cluster is found loads a new one and persist the data to it """ - # Setup Dask if hasattr(context, 'dask_client'): - dask_client = context.dask_client + context.logger.info('found cluster...') + dask_client = context.dask_client else: - context.dask_client = Client(LocalCluster(n_workers=shards, - threads_per_worker=threads_per)) - context.logger.info(context.dask_client) - - assert context.dask_client + context.logger.info('starting new cluster...') + cluster = LocalCluster(n_workers=shards, threads_per_worker=threads_per) + dask_client = Client(cluster) + context.logger.info(dask_client) + df = dd.read_parquet(parquet_url) if persist and context: - df = context.dask_client.persist(df) - context.dask_client.datasets[dask_key] = df + df = dask_client.persist(df) + dask_client.publish_dataset(**{dask_key: df}) + context.dask_client = dask_client + + # share the scheduler + filepath = os.path.join(target_path, 'scheduler.json') + dask_client.write_scheduler_file(filepath) + context.log_artifact('scheduler', target_path=filepath) + print(df.head()) - # or can use: - # context.dask_client.publish_dataset(my_dataset=df) diff --git a/fileutils/parquet_to_dask/function.yaml b/fileutils/parquet_to_dask/function.yaml index a54ca4e35..d05d008a2 100644 --- a/fileutils/parquet_to_dask/function.yaml +++ b/fileutils/parquet_to_dask/function.yaml @@ -1,7 +1,7 @@ -kind: dask +kind: job metadata: name: function - hash: 2a336714ac18bcc984baaf3b24a9c05dabed9a62 + 
hash: ba58f928478016117de9cefbce93054cad9ee263 project: default spec: command: /User/repos/functions/fileutils/parquet_to_dask/function.py @@ -10,13 +10,8 @@ spec: volumes: [] volume_mounts: [] env: [] - build: - base_image: yjbds/mlrun_dev-dask-ds:latest - commands: [] description: '' replicas: 4 - remote: true - service_type: NodePort - nthreads: 1 - min_replicas: 0 - max_replicas: 4 + build: + base_image: yjbds/mlrun-daskboost:dev + commands: [] diff --git a/tests/arc_to_parquet-airlines.ipynb b/tests/arc_to_parquet-airlines.ipynb index a74f34d17..152dd039a 100644 --- a/tests/arc_to_parquet-airlines.ipynb +++ b/tests/arc_to_parquet-airlines.ipynb @@ -46,8 +46,8 @@ "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", "ARCHIVE_SMALL = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv\"\n", "\n", - "USE_ARCHIVE = ARCHIVE\n", - "TARGET_PATH = '/User/mlrun/airlines/dataset'\n", + "USE_ARCHIVE = ARCHIVE_SMALL\n", + "TARGET_PATH = '/User/mlrun/airlines/dataset-small'\n", "\n", "FILE_SHAPE = (123_534_969, 21) # (rows, cols)\n", "SMALL_FILE_SHAPE = (43_978, 21) # (rows, cols)\n", @@ -144,7 +144,7 @@ } ], "source": [ - "func_yaml = os.path.join(CODE_BASE, FUNCTION, 'arc_to_parquet.yaml')\n", + "func_yaml = os.path.join(CODE_BASE, FUNCTION, 'function.yaml')\n", "\n", "arctoparq = mlrun.import_function(func_yaml)\n", "\n", @@ -155,21 +155,20 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-28 21:58:53,305 starting run arc-to-part-parq-task uid=95e9b64ce8664abaad4cba846b2caa5e -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-28 21:58:53,385 Job is running in the background, pod: arc-to-part-parq-task-kw84v\n", - "[mlrun] 2020-01-28 21:58:57,639 destination file does not exist, downloading\n", - "[mlrun] 2020-01-28 22:02:33,755 log artifact header at /User/mlrun/airlines/dataset/header-only.pqt, size: None, 
db: Y\n", - "[mlrun] 2020-01-28 22:11:31,709 saved table to /User/mlrun/airlines/dataset/partitions\n", - "[mlrun] 2020-01-28 22:11:31,782 log artifact airlines at /User/mlrun/airlines/dataset/partitions, size: None, db: Y\n", + "[mlrun] 2020-01-29 12:30:19,645 starting run user-task-arc-to-part-parq uid=963c75c5d76642da9bbae845f527e361 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-29 12:30:19,808 Job is running in the background, pod: user-task-arc-to-part-parq-c8fjx\n", + "[mlrun] 2020-01-29 12:30:24,158 destination file does not exist, downloading\n", + "[mlrun] 2020-01-29 12:30:24,614 saved table to /User/mlrun/airlines/dataset-small/partitions\n", + "[mlrun] 2020-01-29 12:30:24,647 log artifact airlines at /User/mlrun/airlines/dataset-small/partitions, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-28 22:11:31,837 run executed, status=completed\n", + "[mlrun] 2020-01-29 12:30:24,667 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -342,26 +341,26 @@ " \n", " \n", " \n", - "
...2caa5e
\n", + "
...27e361
\n", " 0\n", - " Jan 28 21:58:57\n", + " Jan 29 12:30:24\n", " completed\n", - " arc_to_parquet\n", - "
host=arc-to-part-parq-task-kw84v
kind=job
owner=admin
\n", + " function\n", + "
host=user-task-arc-to-part-parq-c8fjx
kind=job
owner=admin
\n", " \n", - "
archive_url=https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv
dataset=partitions
dtype={'AirTime': 'float32', 'ArrTime': 'float32', 'CRSElapsedTime': 'float32', 'CarrierDelay': 'float32', 'DepTime': 'float32', 'Distance': 'float32', 'LateAircraftDelay': 'float32', 'NASDelay': 'float32', 'SecurityDelay': 'float32', 'TailNum': 'str', 'TaxiIn': 'float32', 'TaxiOut': 'float32', 'WeatherDelay': 'float32'}
encoding=latin-1
inc_cols=['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'CRSElapsedTime', 'AirTime', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
key=airlines
name=airlines.pqt
part_cols=['Year', 'Month']
target_path=/User/mlrun/airlines/dataset
\n", + "
archive_url=https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv
dataset=partitions
dtype={'AirTime': 'float32', 'ArrTime': 'float32', 'CRSElapsedTime': 'float32', 'CarrierDelay': 'float32', 'DepTime': 'float32', 'Distance': 'float32', 'LateAircraftDelay': 'float32', 'NASDelay': 'float32', 'SecurityDelay': 'float32', 'TailNum': 'str', 'TaxiIn': 'float32', 'TaxiOut': 'float32', 'WeatherDelay': 'float32'}
encoding=latin-1
inc_cols=['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'CRSElapsedTime', 'AirTime', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
key=airlines
name=airlines.pqt
part_cols=['Year', 'Month']
target_path=/User/mlrun/airlines/dataset-small
\n", " \n", - "
header
airlines
\n", + "
airlines
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -377,8 +376,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 95e9b64ce8664abaad4cba846b2caa5e , !mlrun logs 95e9b64ce8664abaad4cba846b2caa5e \n", - "[mlrun] 2020-01-28 22:11:35,919 run executed, status=completed\n" + "!mlrun get run 963c75c5d76642da9bbae845f527e361 , !mlrun logs 963c75c5d76642da9bbae845f527e361 \n", + "[mlrun] 2020-01-29 12:30:25,977 run executed, status=completed\n" ] } ], @@ -424,7 +423,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -435,7 +434,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -445,7 +444,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -454,254 +453,16 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
DayofMonthDayOfWeekDepTimeCRSDepTimeArrTimeCRSArrTimeUniqueCarrierFlightNumCRSElapsedTimeAirTime...DestDistanceTaxiInTaxiOutCancelledCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelay
YearMonth
199042972030.020302205.02145PA (1)54875.0NaN...BOS185.0NaNNaN0NaNNaNNaNNaNNaN
43012031.020302146.02145PA (1)54875.0NaN...BOS185.0NaNNaN0NaNNaNNaNNaNNaN
4172028.020302121.02135PA (1)54965.0NaN...LGA185.0NaNNaN0NaNNaNNaNNaNNaN
4212030.020302128.02135PA (1)54965.0NaN...LGA185.0NaNNaN0NaNNaNNaNNaNNaN
4322030.020302146.02135PA (1)54965.0NaN...LGA185.0NaNNaN0NaNNaNNaNNaNNaN
\n", - "

5 rows × 21 columns

\n", - "
" - ], - "text/plain": [ - " DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime \\\n", - "Year Month \n", - "1990 4 29 7 2030.0 2030 2205.0 2145 \n", - " 4 30 1 2031.0 2030 2146.0 2145 \n", - " 4 1 7 2028.0 2030 2121.0 2135 \n", - " 4 2 1 2030.0 2030 2128.0 2135 \n", - " 4 3 2 2030.0 2030 2146.0 2135 \n", - "\n", - " UniqueCarrier FlightNum CRSElapsedTime AirTime ... Dest \\\n", - "Year Month ... \n", - "1990 4 PA (1) 548 75.0 NaN ... BOS \n", - " 4 PA (1) 548 75.0 NaN ... BOS \n", - " 4 PA (1) 549 65.0 NaN ... LGA \n", - " 4 PA (1) 549 65.0 NaN ... LGA \n", - " 4 PA (1) 549 65.0 NaN ... LGA \n", - "\n", - " Distance TaxiIn TaxiOut Cancelled CarrierDelay WeatherDelay \\\n", - "Year Month \n", - "1990 4 185.0 NaN NaN 0 NaN NaN \n", - " 4 185.0 NaN NaN 0 NaN NaN \n", - " 4 185.0 NaN NaN 0 NaN NaN \n", - " 4 185.0 NaN NaN 0 NaN NaN \n", - " 4 185.0 NaN NaN 0 NaN NaN \n", - "\n", - " NASDelay SecurityDelay LateAircraftDelay \n", - "Year Month \n", - "1990 4 NaN NaN NaN \n", - " 4 NaN NaN NaN \n", - " 4 NaN NaN NaN \n", - " 4 NaN NaN NaN \n", - " 4 NaN NaN NaN \n", - "\n", - "[5 rows x 21 columns]" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -720,7 +481,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index c64beef5a..0c87165bf 100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -11,7 +11,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -29,7 +29,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ @@ -59,7 +59,7 @@ }, { "cell_type": 
"code", - "execution_count": 3, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -75,7 +75,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ @@ -90,14 +90,14 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-28 21:37:17,278 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/arc_to_parquet.yaml\n" + "[mlrun] 2020-01-29 12:23:04,377 function spec saved to path: /User/repos/functions/fileutils/arc_to_parquet/function.yaml\n" ] } ], @@ -107,16 +107,16 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 6, + "execution_count": 28, "metadata": {}, "output_type": "execute_result" } @@ -134,7 +134,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 29, "metadata": {}, "outputs": [], "source": [ @@ -157,7 +157,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 30, "metadata": {}, "outputs": [ { @@ -166,7 +166,7 @@ "'ready'" ] }, - "execution_count": 8, + "execution_count": 30, "metadata": {}, "output_type": "execute_result" } @@ -177,21 +177,20 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-28 21:42:59,772 starting run arc-to-parq-task uid=66dd63566a8147a0b7186742f4402af0 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-28 21:42:59,930 Job is running in the background, pod: arc-to-parq-task-bzszg\n", - "[mlrun] 2020-01-28 21:43:04,226 destination file does not exist, downloading\n", - "[mlrun] 2020-01-28 21:44:56,004 log artifact header at /User/mlrun/models/header-only.pqt, size: None, db: Y\n", - "[mlrun] 2020-01-28 21:48:03,294 
saved table to /User/mlrun/models/higgs.pqt\n", - "[mlrun] 2020-01-28 21:48:03,316 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", + "[mlrun] 2020-01-29 12:23:17,789 starting run user-task-arc-to-parq uid=c3c3a9ade23d413781b1f62fba0f7593 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-29 12:23:17,864 Job is running in the background, pod: user-task-arc-to-parq-nx92p\n", + "[mlrun] 2020-01-29 12:23:22,149 destination file does not exist, downloading\n", + "[mlrun] 2020-01-29 12:28:19,478 saved table to /User/mlrun/models/higgs.pqt\n", + "[mlrun] 2020-01-29 12:28:19,492 log artifact higgs at /User/mlrun/models/higgs.pqt, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-28 21:48:03,345 run executed, status=completed\n", + "[mlrun] 2020-01-29 12:28:19,514 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -364,26 +363,26 @@ " \n", " \n", " \n", - "
...402af0
\n", + "
...0f7593
\n", " 0\n", - " Jan 28 21:43:04\n", + " Jan 29 12:23:22\n", " completed\n", - " arc_to_parquet\n", - "
host=arc-to-parq-task-bzszg
kind=job
owner=admin
\n", + " function\n", + "
host=user-task-arc-to-parq-nx92p
kind=job
owner=admin
\n", " \n", - "
archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
header=['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi', 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt', 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
key=higgs
name=higgs.pqt
target_path=/User/mlrun/models
\n", + "
archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
header=None
key=higgs
name=higgs.pqt
target_path=/User/mlrun/models
\n", " \n", - "
header
higgs
\n", + "
higgs
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -399,8 +398,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 66dd63566a8147a0b7186742f4402af0 , !mlrun logs 66dd63566a8147a0b7186742f4402af0 \n", - "[mlrun] 2020-01-28 21:48:10,286 run executed, status=completed\n" + "!mlrun get run c3c3a9ade23d413781b1f62fba0f7593 , !mlrun logs c3c3a9ade23d413781b1f62fba0f7593 \n", + "[mlrun] 2020-01-29 12:28:28,186 run executed, status=completed\n" ] } ], @@ -414,7 +413,7 @@ " 'name' : FILE_NAME, \n", " 'key' : KEY,\n", " 'archive_url': ARCHIVE,\n", - " 'header' : HEADER},\n", + " 'header' : None},\n", " outputs=[KEY])\n", "\n", "# run\n", @@ -423,17 +422,16 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'header': '/User/mlrun/models/header-only.pqt',\n", - " 'higgs': '/User/mlrun/models/higgs.pqt'}" + "{'higgs': '/User/mlrun/models/higgs.pqt'}" ] }, - "execution_count": 21, + "execution_count": 34, "metadata": {}, "output_type": "execute_result" } @@ -458,7 +456,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 35, "metadata": {}, "outputs": [], "source": [ @@ -469,7 +467,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 36, "metadata": {}, "outputs": [], "source": [ @@ -479,7 +477,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 37, "metadata": {}, "outputs": [], "source": [ @@ -489,7 +487,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 38, "metadata": {}, "outputs": [], "source": [ @@ -498,7 +496,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 39, "metadata": {}, "outputs": [ { @@ -522,57 +520,33 @@ " \n", " \n", " \n", - " labels\n", - " lepton_pT\n", - " lepton_eta\n", - " lepton_phi\n", - " missing_energy_magnitude\n", - " missing_energy_phi\n", - " jet_1_pt\n", - " jet_1_eta\n", - " jet_1_phi\n", - " jet_1_b-tag\n", + " 
1.000000000000000000e+00\n", + " 8.692932128906250000e-01\n", + " -6.350818276405334473e-01\n", + " 2.256902605295181274e-01\n", + " 3.274700641632080078e-01\n", + " -6.899932026863098145e-01\n", + " 7.542022466659545898e-01\n", + " -2.485731393098831177e-01\n", + " -1.092063903808593750e+00\n", + " 0.000000000000000000e+00\n", " ...\n", - " jet_4_eta\n", - " jet_4_phi\n", - " jet_4_b-tag\n", - " m_jj\n", - " m_jjj\n", - " m_lv\n", - " m_jlv\n", - " m_bb\n", - " m_wbb\n", - " m_wwbb\n", + " -1.045456994324922562e-02\n", + " -4.576716944575309753e-02\n", + " 3.101961374282836914e+00\n", + " 1.353760004043579102e+00\n", + " 9.795631170272827148e-01\n", + " 9.780761599540710449e-01\n", + " 9.200048446655273438e-01\n", + " 7.216574549674987793e-01\n", + " 9.887509346008300781e-01\n", + " 8.766783475875854492e-01\n", " \n", " \n", " \n", " \n", " 0\n", " 1.0\n", - " 0.869293\n", - " -0.635082\n", - " 0.225690\n", - " 0.327470\n", - " -0.689993\n", - " 0.754202\n", - " -0.248573\n", - " -1.092064\n", - " 0.000000\n", - " ...\n", - " -0.010455\n", - " -0.045767\n", - " 3.101961\n", - " 1.353760\n", - " 0.979563\n", - " 0.978076\n", - " 0.920005\n", - " 0.721657\n", - " 0.988751\n", - " 0.876678\n", - " \n", - " \n", - " 1\n", - " 1.0\n", " 0.907542\n", " 0.329147\n", " 0.359412\n", @@ -595,7 +569,7 @@ " 0.798343\n", " \n", " \n", - " 2\n", + " 1\n", " 1.0\n", " 0.798835\n", " 1.470639\n", @@ -619,7 +593,7 @@ " 0.780118\n", " \n", " \n", - " 3\n", + " 2\n", " 0.0\n", " 1.344385\n", " -0.876626\n", @@ -643,7 +617,7 @@ " 0.957904\n", " \n", " \n", - " 4\n", + " 3\n", " 1.0\n", " 1.105009\n", " 0.321356\n", @@ -666,44 +640,110 @@ " 0.872245\n", " 0.808487\n", " \n", + " \n", + " 4\n", + " 0.0\n", + " 1.595839\n", + " -0.607811\n", + " 0.007075\n", + " 1.818450\n", + " -0.111906\n", + " 0.847550\n", + " -0.566437\n", + " 1.581239\n", + " 2.173076\n", + " ...\n", + " -0.654227\n", + " -1.274345\n", + " 3.101961\n", + " 0.823761\n", + " 0.938191\n", + " 0.971758\n", + " 
0.789176\n", + " 0.430553\n", + " 0.961357\n", + " 0.957818\n", + " \n", " \n", "\n", "

5 rows × 29 columns

\n", "" ], "text/plain": [ - " labels lepton_pT lepton_eta lepton_phi missing_energy_magnitude \\\n", - "0 1.0 0.869293 -0.635082 0.225690 0.327470 \n", - "1 1.0 0.907542 0.329147 0.359412 1.497970 \n", - "2 1.0 0.798835 1.470639 -1.635975 0.453773 \n", - "3 0.0 1.344385 -0.876626 0.935913 1.992050 \n", - "4 1.0 1.105009 0.321356 1.522401 0.882808 \n", + " 1.000000000000000000e+00 8.692932128906250000e-01 \\\n", + "0 1.0 0.907542 \n", + "1 1.0 0.798835 \n", + "2 0.0 1.344385 \n", + "3 1.0 1.105009 \n", + "4 0.0 1.595839 \n", + "\n", + " -6.350818276405334473e-01 2.256902605295181274e-01 \\\n", + "0 0.329147 0.359412 \n", + "1 1.470639 -1.635975 \n", + "2 -0.876626 0.935913 \n", + "3 0.321356 1.522401 \n", + "4 -0.607811 0.007075 \n", + "\n", + " 3.274700641632080078e-01 -6.899932026863098145e-01 \\\n", + "0 1.497970 -0.313010 \n", + "1 0.453773 0.425629 \n", + "2 1.992050 0.882454 \n", + "3 0.882808 -1.205349 \n", + "4 1.818450 -0.111906 \n", + "\n", + " 7.542022466659545898e-01 -2.485731393098831177e-01 \\\n", + "0 1.095531 -0.557525 \n", + "1 1.104875 1.282322 \n", + "2 1.786066 -1.646778 \n", + "3 0.681466 -1.070464 \n", + "4 0.847550 -0.566437 \n", "\n", - " missing_energy_phi jet_1_pt jet_1_eta jet_1_phi jet_1_b-tag ... \\\n", - "0 -0.689993 0.754202 -0.248573 -1.092064 0.000000 ... \n", - "1 -0.313010 1.095531 -0.557525 -1.588230 2.173076 ... \n", - "2 0.425629 1.104875 1.282322 1.381664 0.000000 ... \n", - "3 0.882454 1.786066 -1.646778 -0.942383 0.000000 ... \n", - "4 -1.205349 0.681466 -1.070464 -0.921871 0.000000 ... \n", + " -1.092063903808593750e+00 0.000000000000000000e+00 ... \\\n", + "0 -1.588230 2.173076 ... \n", + "1 1.381664 0.000000 ... \n", + "2 -0.942383 0.000000 ... \n", + "3 -0.921871 0.000000 ... \n", + "4 1.581239 2.173076 ... 
\n", "\n", - " jet_4_eta jet_4_phi jet_4_b-tag m_jj m_jjj m_lv m_jlv \\\n", - "0 -0.010455 -0.045767 3.101961 1.353760 0.979563 0.978076 0.920005 \n", - "1 -1.138930 -0.000819 0.000000 0.302220 0.833048 0.985700 0.978098 \n", - "2 1.128848 0.900461 0.000000 0.909753 1.108330 0.985692 0.951331 \n", - "3 -0.678379 -1.360356 0.000000 0.946652 1.028704 0.998656 0.728281 \n", - "4 -0.373566 0.113041 0.000000 0.755856 1.361057 0.986610 0.838085 \n", + " -1.045456994324922562e-02 -4.576716944575309753e-02 \\\n", + "0 -1.138930 -0.000819 \n", + "1 1.128848 0.900461 \n", + "2 -0.678379 -1.360356 \n", + "3 -0.373566 0.113041 \n", + "4 -0.654227 -1.274345 \n", "\n", - " m_bb m_wbb m_wwbb \n", - "0 0.721657 0.988751 0.876678 \n", - "1 0.779732 0.992356 0.798343 \n", - "2 0.803252 0.865924 0.780118 \n", - "3 0.869200 1.026736 0.957904 \n", - "4 1.133295 0.872245 0.808487 \n", + " 3.101961374282836914e+00 1.353760004043579102e+00 \\\n", + "0 0.000000 0.302220 \n", + "1 0.000000 0.909753 \n", + "2 0.000000 0.946652 \n", + "3 0.000000 0.755856 \n", + "4 3.101961 0.823761 \n", + "\n", + " 9.795631170272827148e-01 9.780761599540710449e-01 \\\n", + "0 0.833048 0.985700 \n", + "1 1.108330 0.985692 \n", + "2 1.028704 0.998656 \n", + "3 1.361057 0.986610 \n", + "4 0.938191 0.971758 \n", + "\n", + " 9.200048446655273438e-01 7.216574549674987793e-01 \\\n", + "0 0.978098 0.779732 \n", + "1 0.951331 0.803252 \n", + "2 0.728281 0.869200 \n", + "3 0.838085 1.133295 \n", + "4 0.789176 0.430553 \n", + "\n", + " 9.887509346008300781e-01 8.766783475875854492e-01 \n", + "0 0.992356 0.798343 \n", + "1 0.865924 0.780118 \n", + "2 1.026736 0.957904 \n", + "3 0.872245 0.808487 \n", + "4 0.961357 0.957818 \n", "\n", "[5 rows x 29 columns]" ] }, - "execution_count": 26, + "execution_count": 39, "metadata": {}, "output_type": "execute_result" } @@ -714,16 +754,16 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - 
"(11000000, 29)" + "(10999999, 29)" ] }, - "execution_count": 27, + "execution_count": 40, "metadata": {}, "output_type": "execute_result" } diff --git a/tests/describe.py b/tests/describe.py index 75e52221b..3b6de642e 100644 --- a/tests/describe.py +++ b/tests/describe.py @@ -28,6 +28,7 @@ def table_summary( context: MLClientCtx, + dask_client: Union[DataItem, str], dask_key: str = 'my_dask_dataframe', target_path: str = '', name: str = 'table_summary.csv', @@ -35,11 +36,12 @@ def table_summary( ) -> None: """Summarize a table """ - if hasattr(context, 'dask_client'): - dscr = context.dask_client.datasets[dask_key].describe() - filepath = os.path.join(target_path, name) - dscr.to_csv(filepath) - context.log_artifact(key, target_path=filepath) - else: - context.logger.info('no dask_client found') + print(str(dask_client)) + + context.dask_client = Client(scheduler_file='/User/mlrun/models/scheduler.json') + dscr = context.dask_client.datasets[dask_key].describe() + filepath = os.path.join(target_path, name) + dscr.to_csv(filepath) + print(dscr) + context.log_artifact(key, target_path=filepath) \ No newline at end of file diff --git a/tests/describe.yaml b/tests/describe.yaml index 93779b6d2..49f602d06 100644 --- a/tests/describe.yaml +++ b/tests/describe.yaml @@ -1,7 +1,7 @@ -kind: dask +kind: job metadata: name: describe - hash: 931bd0bbc1dde34284e6c604d0b08d5e9666b7a7 + hash: 3f3a5547127800b6351fc4bb15198bd7022e7e99 project: default spec: command: /User/repos/functions/tests/describe.py @@ -10,13 +10,7 @@ spec: volumes: [] volume_mounts: [] env: [] + description: '' build: - base_image: yjbds/mlrun_dev-dask-ds:latest + base_image: yjbds/mlrun-daskboost:dev commands: [] - description: '' - replicas: 4 - remote: true - service_type: NodePort - nthreads: 1 - min_replicas: 0 - max_replicas: 4 diff --git a/tests/parquet_to_dask.ipynb b/tests/parquet_to_dask.ipynb index 02b2f2011..ff3af28c7 100644 --- a/tests/parquet_to_dask.ipynb +++ b/tests/parquet_to_dask.ipynb @@ 
-10,7 +10,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ @@ -29,20 +29,20 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "FUNCTION = 'parquet_to_dask'\n", "DESCRIPTION = 'load parquet dataset into a dask cluster'\n", "\n", - "BASE_IMAGE = 'yjbds/mlrun_dev-dask-ds:latest'\n", - "JOB_KIND = 'dask'\n", + "BASE_IMAGE = 'yjbds/mlrun-daskboost:dev'\n", + "JOB_KIND = 'job'\n", "TASK_NAME = 'user-task-parq-to-dask'\n", "\n", "CODE_BASE = '/User/repos/functions/fileutils'\n", "\n", - "SRC_PATH = '/User/mlrun/airlines/dataset/partitions'\n", + "SRC_PATH = '/User/mlrun/airlines/dataset-small/partitions'\n", "\n", "PARTITION_COLS = ['Year', 'Month']\n", "\n", @@ -59,7 +59,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ @@ -76,7 +76,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -92,14 +92,14 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-29 01:44:34,435 function spec saved to path: /User/repos/functions/fileutils/parquet_to_dask/function.yaml\n" + "[mlrun] 2020-01-30 00:27:38,400 function spec saved to path: /User/repos/functions/fileutils/parquet_to_dask/function.yaml\n" ] } ], @@ -116,14 +116,14 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-29 01:44:47,617 starting remote build, image: .mlrun/func-default-function-latest\n" + "[mlrun] 2020-01-30 00:27:38,522 starting remote build, image: .mlrun/func-default-function-latest\n" ] }, { @@ -132,7 +132,7 @@ "True" ] }, - "execution_count": 6, + "execution_count": 27, 
"metadata": {}, "output_type": "execute_result" } @@ -147,94 +147,31 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-29 01:44:50,780 starting run user-task-parq-to-dask uid=7f501215960146a89447aa29cb21c2a1 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-29 01:44:51,667 saving function: function, tag: latest\n", - "[mlrun] 2020-01-29 01:44:57,827 using remote dask scheduler (mlrun-function-e71085fe-8) at: 3.133.8.252:30113\n", - "[mlrun] 2020-01-29 01:44:57,828 remote dashboard (node) port: 3.133.8.252:31064\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/conda/lib/python3.6/site-packages/distributed/client.py:1074: VersionMismatchWarning: Mismatched versions found\n", - "\n", - "blosc\n", - "+-----------+---------+\n", - "| | version |\n", - "+-----------+---------+\n", - "| client | None |\n", - "| scheduler | 1.7.0 |\n", - "+-----------+---------+\n", - "\n", - "lz4\n", - "+-----------+---------+\n", - "| | version |\n", - "+-----------+---------+\n", - "| client | None |\n", - "| scheduler | 2.2.1 |\n", - "+-----------+---------+\n", - "\n", - "msgpack\n", - "+-----------+---------+\n", - "| | version |\n", - "+-----------+---------+\n", - "| client | 0.6.2 |\n", - "| scheduler | 0.6.1 |\n", - "+-----------+---------+\n", - "\n", - "tornado\n", - "+-----------+---------+\n", - "| | version |\n", - "+-----------+---------+\n", - "| client | 5.1.1 |\n", - "| scheduler | 6.0.3 |\n", - "+-----------+---------+\n", - " warnings.warn(version_module.VersionMismatchWarning(msg[0][\"warning\"]))\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - " Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime \\\n", - "0 1996 12 10 2 932.0 935 1112.0 \n", - "1 1996 12 11 3 945.0 935 1145.0 \n", - "2 1996 12 7 6 730.0 730 940.0 \n", - "3 1996 12 1 7 2357.0 2005 212.0 \n", - "4 1996 12 2 1 
2006.0 2005 2206.0 \n", - "\n", - " CRSArrTime UniqueCarrier FlightNum ... Dest Distance TaxiIn TaxiOut \\\n", - "0 1140 CO 661 ... CAE 602.0 4.0 10.0 \n", - "1 1140 CO 661 ... CAE 602.0 5.0 18.0 \n", - "2 937 CO 678 ... CAE 602.0 5.0 20.0 \n", - "3 2210 CO 695 ... CAE 602.0 4.0 25.0 \n", - "4 2210 CO 695 ... CAE 602.0 3.0 26.0 \n", - "\n", - " Cancelled CarrierDelay WeatherDelay NASDelay SecurityDelay \\\n", - "0 0 NaN NaN NaN NaN \n", - "1 0 NaN NaN NaN NaN \n", - "2 0 NaN NaN NaN NaN \n", - "3 0 NaN NaN NaN NaN \n", - "4 0 NaN NaN NaN NaN \n", - "\n", - " LateAircraftDelay \n", - "0 NaN \n", - "1 NaN \n", - "2 NaN \n", - "3 NaN \n", - "4 NaN \n", + "[mlrun] 2020-01-30 00:27:38,588 starting run user-task-parq-to-dask uid=56bcb9b52e1f4a1a81fc51f8d4e15d8e -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-30 00:27:38,693 Job is running in the background, pod: user-task-parq-to-dask-c9fmj\n", + "/opt/conda/lib/python3.7/site-packages/bokeh/themes/theme.py:94: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n", + " json = yaml.load(f)\n", + "[mlrun] 2020-01-30 00:27:46,724 starting new cluster...\n", + "[mlrun] 2020-01-30 00:27:48,599 \n", + "[mlrun] 2020-01-30 00:27:48,875 log artifact scheduler at /User/mlrun/models/scheduler.json, size: None, db: Y\n", + " Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay\n", + "0 2007 1 1 ... 0.0 0.0 67.0\n", + "1 2007 1 1 ... 0.0 0.0 17.0\n", + "2 2007 1 1 ... 0.0 0.0 0.0\n", + "3 2007 1 1 ... 0.0 0.0 0.0\n", + "4 2007 1 1 ... 0.0 0.0 0.0\n", "\n", "[5 rows x 23 columns]\n", "\n", - "[mlrun] 2020-01-29 01:47:10,961 run ended with state \n" + "[mlrun] 2020-01-30 00:27:49,134 run executed, status=completed\n", + "final state: succeeded\n" ] }, { @@ -406,26 +343,26 @@ " \n", " \n", " \n", - "
...21c2a1
\n", + "
...e15d8e
\n", " 0\n", - " Jan 29 01:44:51\n", + " Jan 30 00:27:46\n", " completed\n", - " user-task-parq-to-dask\n", - "
kind=dask
owner=admin
host=jupyter-1-6ccccd5fdf-mz2ld
\n", - " \n", - "
parquet_url=/User/mlrun/airlines/dataset/partitions
index_cols=['Year', 'Month']
shards=4
threads_per=4
persist=True
\n", + " function\n", + "
host=user-task-parq-to-dask-c9fmj
kind=job
owner=admin
\n", " \n", + "
dask_key=testdf1
index_cols=['Year', 'Month']
parquet_url=/User/mlrun/airlines/dataset-small/partitions
persist=True
shards=4
target_path=/User/mlrun/models
threads_per=4
\n", " \n", + "
scheduler
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -441,8 +378,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 7f501215960146a89447aa29cb21c2a1 , !mlrun logs 7f501215960146a89447aa29cb21c2a1 \n", - "[mlrun] 2020-01-29 01:47:11,004 run executed, status=completed\n" + "!mlrun get run 56bcb9b52e1f4a1a81fc51f8d4e15d8e , !mlrun logs 56bcb9b52e1f4a1a81fc51f8d4e15d8e \n", + "[mlrun] 2020-01-30 00:27:53,903 run executed, status=completed\n" ] } ], @@ -456,63 +393,191 @@ " 'index_cols' : PARTITION_COLS,\n", " 'shards' : DASK_SHARDS,\n", " 'threads_per': DASK_THREADS_PER,\n", - " 'persist' : True})\n", + " 'persist' : True,\n", + " 'dask_key' : 'testdf1',\n", + " 'target_path': '/User/mlrun/models'})\n", "# run\n", "rn = parq2dask.run(parq_to_dask_task)" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'/User/mlrun/models/scheduler.json'" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "rn.outputs['scheduler']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### What's the scheduler address?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{}" + "{'type': 'Scheduler',\n", + " 'id': 'Scheduler-0a0672a0-1532-42cb-a151-3aa3da211991',\n", + " 'address': 'tcp://127.0.0.1:38329',\n", + " 'services': {},\n", + " 'workers': {'tcp://127.0.0.1:32950': {'type': 'Worker',\n", + " 'id': 3,\n", + " 'host': '127.0.0.1',\n", + " 'resources': {},\n", + " 'local_directory': '/dask-worker-space/worker-db2wzpjl',\n", + " 'name': 3,\n", + " 'nthreads': 4,\n", + " 'memory_limit': 16612705280,\n", + " 'last_seen': 1580344068.5361557,\n", + " 'services': {},\n", + " 'metrics': {'cpu': 0.0,\n", + " 'memory': 154923008,\n", + " 'time': 1580344068.530411,\n", + " 'read_bytes': 0.0,\n", + " 'write_bytes': 0.0,\n", + " 'num_fds': 22,\n", + " 'executing': 0,\n", + " 'in_memory': 0,\n", + " 'ready': 0,\n", + " 'in_flight': 0,\n", + " 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}},\n", + " 'nanny': 'tcp://127.0.0.1:34755'},\n", + " 'tcp://127.0.0.1:33754': {'type': 'Worker',\n", + " 'id': 2,\n", + " 'host': '127.0.0.1',\n", + " 'resources': {},\n", + " 'local_directory': '/dask-worker-space/worker-3hgbwwcw',\n", + " 'name': 2,\n", + " 'nthreads': 4,\n", + " 'memory_limit': 16612705280,\n", + " 'last_seen': 1580344068.53231,\n", + " 'services': {},\n", + " 'metrics': {'cpu': 0.0,\n", + " 'memory': 154923008,\n", + " 'time': 1580344068.5266526,\n", + " 'read_bytes': 0.0,\n", + " 'write_bytes': 0.0,\n", + " 'num_fds': 22,\n", + " 'executing': 0,\n", + " 'in_memory': 0,\n", + " 'ready': 0,\n", + " 'in_flight': 0,\n", + " 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}},\n", + " 'nanny': 'tcp://127.0.0.1:36665'},\n", + " 'tcp://127.0.0.1:40727': {'type': 'Worker',\n", + " 'id': 1,\n", + " 'host': '127.0.0.1',\n", + " 'resources': {},\n", + " 'local_directory': '/dask-worker-space/worker-evnbey62',\n", + " 'name': 1,\n", + " 'nthreads': 4,\n", + " 'memory_limit': 16612705280,\n", + " 
'last_seen': 1580344068.5346513,\n", + " 'services': {},\n", + " 'metrics': {'cpu': 0.0,\n", + " 'memory': 154918912,\n", + " 'time': 1580344068.5266192,\n", + " 'read_bytes': 0.0,\n", + " 'write_bytes': 0.0,\n", + " 'num_fds': 22,\n", + " 'executing': 0,\n", + " 'in_memory': 0,\n", + " 'ready': 0,\n", + " 'in_flight': 0,\n", + " 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}},\n", + " 'nanny': 'tcp://127.0.0.1:42264'},\n", + " 'tcp://127.0.0.1:43293': {'type': 'Worker',\n", + " 'id': 0,\n", + " 'host': '127.0.0.1',\n", + " 'resources': {},\n", + " 'local_directory': '/dask-worker-space/worker-_przo4q2',\n", + " 'name': 0,\n", + " 'nthreads': 4,\n", + " 'memory_limit': 16612705280,\n", + " 'last_seen': 1580344068.5533519,\n", + " 'services': {},\n", + " 'metrics': {'cpu': 0.0,\n", + " 'memory': 154923008,\n", + " 'time': 1580344068.5480375,\n", + " 'read_bytes': 0.0,\n", + " 'write_bytes': 0.0,\n", + " 'num_fds': 22,\n", + " 'executing': 0,\n", + " 'in_memory': 0,\n", + " 'ready': 0,\n", + " 'in_flight': 0,\n", + " 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}},\n", + " 'nanny': 'tcp://127.0.0.1:45396'}}}" ] }, - "execution_count": 9, + "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "rn.outputs" + "import json\n", + "json.load(open(rn.outputs['scheduler']))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### create a component 'on the fly' to summarise the table\n", + "\n", + "The nice thing about having a dask cluster loaded with all your data is that you can write _quick and dirty_ jobs either in your notebook, a local file, or a GitHub repo." 
] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 43, "metadata": {}, "outputs": [], "source": [ - "summ = mlrun.new_function(command='/User/repos/functions/tests/describe.py', kind=JOB_KIND)" + "summ = mlrun.new_function(\n", + " command='/User/repos/functions/tests/describe.py', \n", + " kind='job')" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 44, "metadata": {}, "outputs": [], "source": [ - "summ.spec.remote = True\n", - "summ.spec.replicas = 4 \n", - "summ.spec.max_replicas = 4\n", - "summ.spec.service_type = 'NodePort'\n", "summ.spec.build.base_image = BASE_IMAGE" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-29 01:47:51,451 function spec saved to path: /User/repos/functions/tests/describe.yaml\n" + "[mlrun] 2020-01-30 00:35:45,128 function spec saved to path: /User/repos/functions/tests/describe.yaml\n" ] } ], @@ -522,23 +587,16 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 46, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-29 01:47:56,228 starting remote build, image: .mlrun/func-default-describe-latest\n" - ] - }, { "data": { "text/plain": [ - "True" + "'ready'" ] }, - "execution_count": 13, + "execution_count": 46, "metadata": {}, "output_type": "execute_result" } @@ -551,59 +609,56 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-29 01:47:58,445 starting run user-task-my-sum uid=f46e35077e114018836ac247b025e8ae -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-29 01:47:58,559 saving function: describe, tag: latest\n", - "[mlrun] 2020-01-29 01:48:07,794 using remote dask scheduler (mlrun-describe-82c5acd0-e) at: 3.133.8.252:30018\n", - "[mlrun] 
2020-01-29 01:48:07,795 remote dashboard (node) port: 3.133.8.252:32596\n", - "[mlrun] 2020-01-29 01:48:07,862 exec error - \"Dataset 'my_dask_dataframe' not found\"\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/conda/lib/python3.6/site-packages/distributed/client.py:1074: VersionMismatchWarning: Mismatched versions found\n", + "[mlrun] 2020-01-30 00:35:45,992 starting run user-task-my-sum uid=a90a033391c9471db638448f497fa9ef -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-30 00:35:46,062 Job is running in the background, pod: user-task-my-sum-fdq27\n", + "3.133.8.252:8786\n", + "[mlrun] 2020-01-30 00:36:04,279 Traceback (most recent call last):\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py\", line 218, in connect\n", + " _raise(error)\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py\", line 203, in _raise\n", + " raise IOError(msg)\n", + "OSError: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", "\n", - "blosc\n", - "+-----------+---------+\n", - "| | version |\n", - "+-----------+---------+\n", - "| client | None |\n", - "| scheduler | 1.7.0 |\n", - "+-----------+---------+\n", + "During handling of the above exception, another exception occurred:\n", "\n", - "lz4\n", - "+-----------+---------+\n", - "| | version |\n", - "+-----------+---------+\n", - "| client | None |\n", - "| scheduler | 2.2.1 |\n", - "+-----------+---------+\n", + "Traceback (most recent call last):\n", + " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 199, in exec_from_params\n", + " val = handler(*args_list)\n", + " File \"/User/repos/functions/tests/describe.py\", line 41, in table_summary\n", + " context.dask_client = Client(scheduler_file='/User/mlrun/models/scheduler.json')\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/client.py\", line 726, in __init__\n", + " 
self.start(timeout=timeout)\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/client.py\", line 891, in start\n", + " sync(self.loop, self._start, **kwargs)\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/utils.py\", line 345, in sync\n", + " raise exc.with_traceback(tb)\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/utils.py\", line 329, in f\n", + " result[0] = yield future\n", + " File \"/opt/conda/lib/python3.7/site-packages/tornado/gen.py\", line 735, in run\n", + " value = future.result()\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/client.py\", line 984, in _start\n", + " await self._ensure_connected(timeout=timeout)\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/client.py\", line 1041, in _ensure_connected\n", + " connection_args=self.connection_args,\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py\", line 227, in connect\n", + " _raise(error)\n", + " File \"/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py\", line 203, in _raise\n", + " raise IOError(msg)\n", + "OSError: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", "\n", - "msgpack\n", - "+-----------+---------+\n", - "| | version |\n", - "+-----------+---------+\n", - "| client | 0.6.2 |\n", - "| scheduler | 0.6.1 |\n", - "+-----------+---------+\n", "\n", - "tornado\n", - "+-----------+---------+\n", - "| | version |\n", - "+-----------+---------+\n", - "| client | 5.1.1 |\n", - "| scheduler | 6.0.3 |\n", - "+-----------+---------+\n", - " warnings.warn(version_module.VersionMismatchWarning(msg[0][\"warning\"]))\n", - "\"Dataset 'my_dask_dataframe' not found\"\n" + "[mlrun] 2020-01-30 00:36:04,293 exec error - Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 
'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", + "[mlrun] 2020-01-30 00:36:04,323 run executed, status=error\n", + "runtime error: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", + "Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", + "final state: failed\n" ] }, { @@ -775,26 +830,26 @@ " \n", " \n", " \n", - "
...25e8ae
\n", + "
...7fa9ef
\n", " 0\n", - " Jan 29 01:47:58\n", - "
error
\n", - " user-task-my-sum\n", - "
host=jupyter-1-6ccccd5fdf-mz2ld
kind=dask
owner=admin
\n", + " Jan 30 00:35:54\n", + "
error
\n", + " describe\n", + "
host=user-task-my-sum-fdq27
kind=job
owner=admin
\n", " \n", - "
data_key=my_dask_dataframe
key=table-summary
name=table-summary.csv
target_path=/User/mlrun/models
\n", + "
dask_client=3.133.8.252:8786
dask_key=testdf
key=table-summary
name=table-summary.csv
target_path=/User/mlrun/models
\n", " \n", " \n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -810,21 +865,22 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run f46e35077e114018836ac247b025e8ae , !mlrun logs f46e35077e114018836ac247b025e8ae \n", - "[mlrun] 2020-01-29 01:48:07,933 run executed, status=error\n" + "!mlrun get run a90a033391c9471db638448f497fa9ef , !mlrun logs a90a033391c9471db638448f497fa9ef \n", + "[mlrun] 2020-01-30 00:36:08,251 run executed, status=error\n", + "runtime error: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n" ] }, { "ename": "RunError", - "evalue": "\"Dataset 'my_dask_dataframe' not found\"", + "evalue": "Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 9\u001b[0m 'key' : 'table-summary'})\n\u001b[1;32m 10\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mrn2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msumm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msumm_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 302\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 303\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 304\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlast_err\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 305\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 306\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mdict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mRunObject\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 10\u001b[0m 'key' : 'table-summary'})\n\u001b[1;32m 11\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 12\u001b[0;31m \u001b[0mrn2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msumm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msumm_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in 
\u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mRunError\u001b[0m: \"Dataset 'my_dask_dataframe' not found\"" + "\u001b[0;31mRunError\u001b[0m: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused" ] } ], @@ -834,7 +890,8 @@ " 'user-task-my-sum', \n", " handler='table_summary', \n", " params={\n", - " 'data_key' : 'my_dask_dataframe',\n", + " 'dask_key' : 'testdf',\n", + " 'dask_client': '3.133.8.252:8786',\n", " 'target_path': '/User/mlrun/models',\n", " 'name' : 'table-summary.csv',\n", " 'key' : 'table-summary'})\n", @@ -851,19 +908,32 @@ "rn2.outputs" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## our cluster" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - 
"source": [] + "source": [ + "from dask.distributed import Client, LocalCluster\n", + "\n", + "client = Client(scheduler)" + ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "client.datasets['testdf']" + ] }, { "cell_type": "markdown", @@ -926,6 +996,22 @@ "source": [ "ddf.head()" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ddf.shape[0].compute()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { diff --git a/tests/train_valid_test_split.ipynb b/tests/train_valid_test_split.ipynb index b1d568b59..d83636882 100644 --- a/tests/train_valid_test_split.ipynb +++ b/tests/train_valid_test_split.ipynb @@ -30,11 +30,20 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ - "CODE_BASE = '/User/repos/functions' \n", + "FUNCTION = 'train_valid_test'\n", + "DESCRIPTION = 'split data into train, validation and test splits'\n", + "\n", + "BASE_IMAGE = 'yjbds/mlrun-intel:dev'\n", + "JOB_KIND = 'job'\n", + "TASK_NAME = 'user-task-data-splits'\n", + "\n", + "CODE_BASE = '/User/repos/functions/datagen/splitters'\n", + "PROJECT = 'splitters'\n", + "\n", "RNG = 1\n", "TARGET_DATA_PATH = '/User/mlrun/models'\n", "SRC_FILE = 'higgs.pqt'\n", @@ -50,53 +59,40 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": 3, "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-26 19:13:33,873 function spec saved to path: /User/repos/functions/datagen/splitters/train_valid_test.yaml\n" + "ename": "NameError", + "evalue": "name 'yaml_name' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most 
recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 3\u001b[0m filename=os.path.join(CODE_BASE, 'datagen/splitters', 'train_valid_test.py'))\n\u001b[1;32m 4\u001b[0m \u001b[0mtestfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuild_config\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbase_image\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'yjbds/mlrun-ds:latest'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcommands\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mtestfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexport\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0myaml_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 'yaml_name' is not defined" ] } ], "source": [ - "testfn = mlrun.code_to_function(\n", - " kind='job', \n", - " filename=os.path.join(CODE_BASE, 'datagen/splitters', 'train_valid_test.py'))\n", - "testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])\n", - "testfn.export(yaml_name)" + "testfn = mlrun.new_function(\n", + " command=os.path.join(CODE_BASE, FUNCTION, 'function.py'),\n", + " kind=JOB_KIND)\n", + "testfn.build_config(base_image=BASE_IMAGE, commands=[])\n", + "testfn.export(os.path.join(CODE_BASE, FUNCTION, 'function.yaml'))" ] }, { "cell_type": "code", - "execution_count": 54, + "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "splitter = mlrun.import_function(\n", - " os.path.join(CODE_BASE, 'datagen/splitters', 'train_valid_test.yaml')\n", - ").apply(mlrun.mount_v3io())" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'ready'" - ] - }, - "execution_count": 55, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ + " os.path.join(CODE_BASE, FUNCTION, 
'function.yaml'))\n", + "\n", + "splitter.apply(mlrun.mount_v3io())\n", + "\n", "splitter.deploy(skip_deployed=True, with_mlrun=False)" ] }, From 9c0c9dac018fb51e161711c80f8eda4876d562aa Mon Sep 17 00:00:00 2001 From: yasha Date: Thu, 30 Jan 2020 09:40:54 +0000 Subject: [PATCH 31/32] added table-summary artifact to dask workflow --- fileutils/arc_to_parquet/function.yaml | 2 +- fileutils/parquet_to_dask/function.py | 7 +- fileutils/parquet_to_dask/function.yaml | 13 +- tests/arc_to_parquet-airlines.ipynb | 446 +++++++++++++++++++-- tests/describe.py | 12 +- tests/parquet_to_dask.ipynb | 498 +++++++++++------------- 6 files changed, 655 insertions(+), 323 deletions(-) diff --git a/fileutils/arc_to_parquet/function.yaml b/fileutils/arc_to_parquet/function.yaml index e4d296a3c..d64cdfdfc 100644 --- a/fileutils/arc_to_parquet/function.yaml +++ b/fileutils/arc_to_parquet/function.yaml @@ -12,5 +12,5 @@ spec: env: [] description: retrieve archive table and save as parquet file build: - base_image: yjbds/mlrun_dev-files:latest + base_image: yjbds/mlrun-base:dev commands: [] diff --git a/fileutils/parquet_to_dask/function.py b/fileutils/parquet_to_dask/function.py index 02dca4d05..8807d49c5 100644 --- a/fileutils/parquet_to_dask/function.py +++ b/fileutils/parquet_to_dask/function.py @@ -33,6 +33,8 @@ def parquet_to_dask( index_cols: Optional[List[str]] = None, shards: int = 4, threads_per: int = 4, + processes: bool = False, + memory_limit: str = '2GB', persist: bool = True, dask_key: str = 'my_dask_dataframe', target_path: str = '' @@ -46,7 +48,10 @@ def parquet_to_dask( dask_client = context.dask_client else: context.logger.info('starting new cluster...') - cluster = LocalCluster(n_workers=shards, threads_per_worker=threads_per) + cluster = LocalCluster(n_workers=shards, + threads_per_worker=threads_per, + processes=processes, + memory_limit=memory_limit) dask_client = Client(cluster) context.logger.info(dask_client) diff --git a/fileutils/parquet_to_dask/function.yaml 
b/fileutils/parquet_to_dask/function.yaml index d05d008a2..1d3057db2 100644 --- a/fileutils/parquet_to_dask/function.yaml +++ b/fileutils/parquet_to_dask/function.yaml @@ -1,7 +1,7 @@ -kind: job +kind: dask metadata: name: function - hash: ba58f928478016117de9cefbce93054cad9ee263 + hash: 4ed6e4dfc23b35ca9a7a6029b1f08b9a1d786885 project: default spec: command: /User/repos/functions/fileutils/parquet_to_dask/function.py @@ -10,8 +10,13 @@ spec: volumes: [] volume_mounts: [] env: [] - description: '' - replicas: 4 build: base_image: yjbds/mlrun-daskboost:dev commands: [] + description: '' + replicas: 4 + remote: true + service_type: NodePort + nthreads: 1 + min_replicas: 0 + max_replicas: 4 diff --git a/tests/arc_to_parquet-airlines.ipynb b/tests/arc_to_parquet-airlines.ipynb index 152dd039a..6071710a7 100644 --- a/tests/arc_to_parquet-airlines.ipynb +++ b/tests/arc_to_parquet-airlines.ipynb @@ -11,7 +11,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -36,8 +36,8 @@ "FUNCTION = 'arc_to_parquet'\n", "DESCRIPTION = 'retrieve archive table and save as partitioned parquet dataset'\n", "\n", - "BASE_IMAGE = 'yjbds/mlrun_dev-files:latest'\n", - "JOB_KIND = 'job'\n", + "BASE_IMAGE = 'yjbds/mlrun-base:dev'\n", + "JOB_KIND = 'dask'\n", "TASK_NAME = 'user-task-arc-to-part-parq'\n", "\n", "CODE_BASE = '/User/repos/functions/fileutils'\n", @@ -130,12 +130,145 @@ { "cell_type": "code", "execution_count": 7, - "metadata": {}, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-30 01:19:26,578 starting remote build, image: .mlrun/func-default-function-latest\n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-base:dev to yjbds/mlrun-base:dev \n", + "\u001b[36mINFO\u001b[0m[0000] Resolved base name yjbds/mlrun-base:dev to yjbds/mlrun-base:dev \n", + 
"\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-base:dev \n", + "\u001b[36mINFO\u001b[0m[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:2bbe9095ff126252340957bde01f8d26d7742ee802d9b07a1490ad87e13ea3eb: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0000] Downloading base image yjbds/mlrun-base:dev \n", + "\u001b[36mINFO\u001b[0m[0001] Built cross stage deps: map[] \n", + "\u001b[36mINFO\u001b[0m[0001] Downloading base image yjbds/mlrun-base:dev \n", + "\u001b[36mINFO\u001b[0m[0001] Error while retrieving image from cache: getting file info: stat /cache/sha256:2bbe9095ff126252340957bde01f8d26d7742ee802d9b07a1490ad87e13ea3eb: no such file or directory \n", + "\u001b[36mINFO\u001b[0m[0001] Downloading base image yjbds/mlrun-base:dev \n", + "\u001b[36mINFO\u001b[0m[0001] Unpacking rootfs as cmd RUN pip install mlrun requires it. \n", + "\u001b[36mINFO\u001b[0m[0021] Taking snapshot of full filesystem... \n", + "\u001b[36mINFO\u001b[0m[0031] RUN pip install mlrun \n", + "\u001b[36mINFO\u001b[0m[0031] cmd: /bin/sh \n", + "\u001b[36mINFO\u001b[0m[0031] args: [-c pip install mlrun] \n", + "Requirement already satisfied: mlrun in /opt/conda/lib/python3.7/site-packages (0.4.4)\n", + "Requirement already satisfied: nuclio-sdk>=0.0.3 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.0.7)\n", + "Requirement already satisfied: sqlalchemy==1.3.11 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.3.11)\n", + "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.25.3)\n", + "Requirement already satisfied: aiohttp>=3.5.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.6.2)\n", + "Requirement already satisfied: requests>=2.20.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (2.22.0)\n", + "Requirement already satisfied: kfp>=0.1.29 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.2.0)\n", + "Requirement already satisfied: 
Flask>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.1.1)\n", + "Requirement already satisfied: gunicorn==19.9.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (19.9.0)\n", + "Requirement already satisfied: croniter==0.3.31 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.3.31)\n", + "Requirement already satisfied: nest-asyncio>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.2.3)\n", + "Requirement already satisfied: boto3>=1.9 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.11.9)\n", + "Requirement already satisfied: pyyaml>=5.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (5.3)\n", + "Requirement already satisfied: tabulate<=0.8.3,>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.3)\n", + "Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (7.0)\n", + "Requirement already satisfied: nuclio-jupyter>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (0.8.1)\n", + "Requirement already satisfied: GitPython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (3.0.5)\n", + "Requirement already satisfied: gevent==1.4.0 in /opt/conda/lib/python3.7/site-packages (from mlrun) (1.4.0)\n", + "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (1.18.1)\n", + "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2019.3)\n", + "Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->mlrun) (2.8.1)\n", + "Requirement already satisfied: multidict<5.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (4.7.4)\n", + "Requirement already satisfied: chardet<4.0,>=2.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.4)\n", + "Requirement already satisfied: async-timeout<4.0,>=3.0 in 
/opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (3.0.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (19.3.0)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp>=3.5.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (1.24.3)\n", + "Requirement already satisfied: idna<2.9,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2.8)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.1->mlrun) (2019.9.11)\n", + "Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.12.0)\n", + "Requirement already satisfied: argo-models==2.2.1a in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.2.1a0)\n", + "Requirement already satisfied: kfp-server-api<=0.1.40,>=0.1.18 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.1.40)\n", + "Requirement already satisfied: kubernetes<=10.0.0,>=8.0.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (10.0.0)\n", + "Requirement already satisfied: Deprecated in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.2.7)\n", + "Requirement already satisfied: google-cloud-storage>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.25.0)\n", + "Requirement already satisfied: jsonschema>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (3.2.0)\n", + "Requirement already satisfied: cloudpickle==1.1.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.1.1)\n", + "Requirement already satisfied: google-auth>=1.6.1 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.11.0)\n", + 
"Requirement already satisfied: cryptography>=2.4.2 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (2.7)\n", + "Requirement already satisfied: requests-toolbelt>=0.8.0 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (0.9.1)\n", + "Requirement already satisfied: PyJWT>=1.6.4 in /opt/conda/lib/python3.7/site-packages (from kfp>=0.1.29->mlrun) (1.7.1)\n", + "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (2.11.0)\n", + "Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (1.1.0)\n", + "Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask>=1.1.1->mlrun) (0.16.1)\n", + "Requirement already satisfied: botocore<1.15.0,>=1.14.9 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (1.14.9)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.9.4)\n", + "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.9->mlrun) (0.3.2)\n", + "Requirement already satisfied: nbconvert>=5.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (5.6.1)\n", + "Requirement already satisfied: jupyterlab>=0.35.4 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (1.2.6)\n", + "Requirement already satisfied: ipython>=7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (7.11.1)\n", + "Requirement already satisfied: notebook>=5.7.2 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", + "Requirement already satisfied: tornado>=5 in /opt/conda/lib/python3.7/site-packages (from nuclio-jupyter>=0.8.0->mlrun) (6.0.3)\n", + "Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from 
GitPython>=2.1.0->mlrun) (2.0.6)\n", + "Requirement already satisfied: greenlet>=0.4.14; platform_python_implementation == \"CPython\" in /opt/conda/lib/python3.7/site-packages (from gevent==1.4.0->mlrun) (0.4.15)\n", + "Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (1.3.0)\n", + "Requirement already satisfied: setuptools>=21.0.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (41.4.0)\n", + "Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (0.57.0)\n", + "Requirement already satisfied: wrapt<2,>=1.10 in /opt/conda/lib/python3.7/site-packages (from Deprecated->kfp>=0.1.29->mlrun) (1.11.2)\n", + "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (0.5.0)\n", + "Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.2.0)\n", + "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (0.15.7)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /opt/conda/lib/python3.7/site-packages (from jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (1.5.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.2.8)\n", + "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0.0)\n", + "Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from 
google-auth>=1.6.1->kfp>=0.1.29->mlrun) (4.0)\n", + "Requirement already satisfied: asn1crypto>=0.21.0 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.0.1)\n", + "Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.4.2->kfp>=0.1.29->mlrun) (1.12.3)\n", + "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->Flask>=1.1.1->mlrun) (1.1.1)\n", + "Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.7/site-packages (from botocore<1.15.0,>=1.14.9->boto3>=1.9->mlrun) (0.15.2)\n", + "Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (2.5.2)\n", + "Requirement already satisfied: nbformat>=4.4 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (5.0.4)\n", + "Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.3.3)\n", + "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.4)\n", + "Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (3.1.0)\n", + "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (1.4.2)\n", + "Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (4.6.1)\n", + "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from 
nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.3)\n", + "Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.4.4)\n", + "Requirement already satisfied: jupyterlab-server~=1.0.0 in /opt/conda/lib/python3.7/site-packages (from jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (1.0.6)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (3.0.3)\n", + "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.5)\n", + "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.16.0)\n", + "Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.4.1)\n", + "Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.0)\n", + "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.7/site-packages (from ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (4.8.0)\n", + "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.7.1)\n", + "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.1.4)\n", + "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.2.0)\n", + "Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (18.1.1)\n", + "Requirement already satisfied: Send2Trash in 
/opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (1.5.0)\n", + "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (0.8.3)\n", + "Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.7/site-packages (from notebook>=5.7.2->nuclio-jupyter>=0.8.0->mlrun) (5.3.4)\n", + "Requirement already satisfied: smmap2>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from gitdb2>=2.0.0->GitPython>=2.1.0->mlrun) (2.0.5)\n", + "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib->kubernetes<=10.0.0,>=8.0.0->kfp>=0.1.29->mlrun) (3.1.0)\n", + "Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.16.0)\n", + "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata; python_version < \"3.8\"->jsonschema>=3.0.1->kfp>=0.1.29->mlrun) (2.1.0)\n", + "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.6.1->kfp>=0.1.29->mlrun) (0.4.8)\n", + "Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.4.2->kfp>=0.1.29->mlrun) (2.19)\n", + "Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert>=5.4->nuclio-jupyter>=0.8.0->mlrun) (0.5.1)\n", + "Requirement already satisfied: json5 in /opt/conda/lib/python3.7/site-packages (from jupyterlab-server~=1.0.0->jupyterlab>=0.35.4->nuclio-jupyter>=0.8.0->mlrun) (0.8.5)\n", + "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from 
prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.1.8)\n", + "Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != \"win32\"->ipython>=7.2->nuclio-jupyter>=0.8.0->mlrun) (0.6.0)\n", + "Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (3.11.2)\n", + "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.7/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.13.0->kfp>=0.1.29->mlrun) (1.51.0)\n", + "\u001b[36mINFO\u001b[0m[0033] Taking snapshot of full filesystem... 
\n" + ] + }, { "data": { "text/plain": [ - "'ready'" + "True" ] }, "execution_count": 7, @@ -150,7 +283,7 @@ "\n", "arctoparq.apply(mlrun.mount_v3io())\n", "\n", - "arctoparq.deploy(skip_deployed=True, with_mlrun=False)" + "arctoparq.deploy() #skip_deployed=True, with_mlrun=False)" ] }, { @@ -162,13 +295,13 @@ "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-29 12:30:19,645 starting run user-task-arc-to-part-parq uid=963c75c5d76642da9bbae845f527e361 -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-29 12:30:19,808 Job is running in the background, pod: user-task-arc-to-part-parq-c8fjx\n", - "[mlrun] 2020-01-29 12:30:24,158 destination file does not exist, downloading\n", - "[mlrun] 2020-01-29 12:30:24,614 saved table to /User/mlrun/airlines/dataset-small/partitions\n", - "[mlrun] 2020-01-29 12:30:24,647 log artifact airlines at /User/mlrun/airlines/dataset-small/partitions, size: None, db: Y\n", + "[mlrun] 2020-01-30 01:20:30,226 starting run user-task-arc-to-part-parq uid=e98743f403fc4c1aabb5fd293ae16613 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-30 01:20:30,314 Job is running in the background, pod: user-task-arc-to-part-parq-km9tw\n", + "[mlrun] 2020-01-30 01:20:36,058 destination file does not exist, downloading\n", + "[mlrun] 2020-01-30 01:20:36,537 saved table to /User/mlrun/airlines/dataset-small/partitions\n", + "[mlrun] 2020-01-30 01:20:36,564 log artifact airlines at /User/mlrun/airlines/dataset-small/partitions, size: None, db: Y\n", "\n", - "[mlrun] 2020-01-29 12:30:24,667 run executed, status=completed\n", + "[mlrun] 2020-01-30 01:20:36,578 run executed, status=completed\n", "final state: succeeded\n" ] }, @@ -341,12 +474,12 @@ " \n", " \n", " \n", - "
...27e361
\n", + "
...e16613
\n", " 0\n", - " Jan 29 12:30:24\n", + " Jan 30 01:20:36\n", " completed\n", " function\n", - "
host=user-task-arc-to-part-parq-c8fjx
kind=job
owner=admin
\n", + "
host=user-task-arc-to-part-parq-km9tw
kind=job
owner=admin
\n", " \n", "
archive_url=https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv
dataset=partitions
dtype={'AirTime': 'float32', 'ArrTime': 'float32', 'CRSElapsedTime': 'float32', 'CarrierDelay': 'float32', 'DepTime': 'float32', 'Distance': 'float32', 'LateAircraftDelay': 'float32', 'NASDelay': 'float32', 'SecurityDelay': 'float32', 'TailNum': 'str', 'TaxiIn': 'float32', 'TaxiOut': 'float32', 'WeatherDelay': 'float32'}
encoding=latin-1
inc_cols=['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'CRSElapsedTime', 'AirTime', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
key=airlines
name=airlines.pqt
part_cols=['Year', 'Month']
target_path=/User/mlrun/airlines/dataset-small
\n", " \n", @@ -355,12 +488,12 @@ " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -376,8 +509,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 963c75c5d76642da9bbae845f527e361 , !mlrun logs 963c75c5d76642da9bbae845f527e361 \n", - "[mlrun] 2020-01-29 12:30:25,977 run executed, status=completed\n" + "!mlrun get run e98743f403fc4c1aabb5fd293ae16613 , !mlrun logs e98743f403fc4c1aabb5fd293ae16613 \n", + "[mlrun] 2020-01-30 01:20:39,512 run executed, status=completed\n" ] } ], @@ -400,13 +533,6 @@ "run = arctoparq.run(arc_to_parq_task)" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "___" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -423,7 +549,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -434,7 +560,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -444,7 +570,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -453,18 +579,268 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DayofMonthDayOfWeekDepTimeCRSDepTimeArrTimeCRSArrTimeUniqueCarrierFlightNumCRSElapsedTimeAirTime...DestDistanceTaxiInTaxiOutCancelledCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelay
YearMonth
198710143741.0730912.0849PS145179.0NaN...SFO447.0NaNNaN0NaNNaNNaNNaNNaN
10154729.0730903.0849PS145179.0NaN...SFO447.0NaNNaN0NaNNaNNaNNaNNaN
10176741.0730918.0849PS145179.0NaN...SFO447.0NaNNaN0NaNNaNNaNNaNNaN
10187729.0730847.0849PS145179.0NaN...SFO447.0NaNNaN0NaNNaNNaNNaNNaN
10191749.0730922.0849PS145179.0NaN...SFO447.0NaNNaN0NaNNaNNaNNaNNaN
\n", + "

5 rows × 21 columns

\n", + "
" + ], + "text/plain": [ + " DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime \\\n", + "Year Month \n", + "1987 10 14 3 741.0 730 912.0 849 \n", + " 10 15 4 729.0 730 903.0 849 \n", + " 10 17 6 741.0 730 918.0 849 \n", + " 10 18 7 729.0 730 847.0 849 \n", + " 10 19 1 749.0 730 922.0 849 \n", + "\n", + " UniqueCarrier FlightNum CRSElapsedTime AirTime ... Dest \\\n", + "Year Month ... \n", + "1987 10 PS 1451 79.0 NaN ... SFO \n", + " 10 PS 1451 79.0 NaN ... SFO \n", + " 10 PS 1451 79.0 NaN ... SFO \n", + " 10 PS 1451 79.0 NaN ... SFO \n", + " 10 PS 1451 79.0 NaN ... SFO \n", + "\n", + " Distance TaxiIn TaxiOut Cancelled CarrierDelay WeatherDelay \\\n", + "Year Month \n", + "1987 10 447.0 NaN NaN 0 NaN NaN \n", + " 10 447.0 NaN NaN 0 NaN NaN \n", + " 10 447.0 NaN NaN 0 NaN NaN \n", + " 10 447.0 NaN NaN 0 NaN NaN \n", + " 10 447.0 NaN NaN 0 NaN NaN \n", + "\n", + " NASDelay SecurityDelay LateAircraftDelay \n", + "Year Month \n", + "1987 10 NaN NaN NaN \n", + " 10 NaN NaN NaN \n", + " 10 NaN NaN NaN \n", + " 10 NaN NaN NaN \n", + " 10 NaN NaN NaN \n", + "\n", + "[5 rows x 21 columns]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "df.head()" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "AssertionError", + "evalue": "(87956, 21)", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m==\u001b[0m\u001b[0mFILE_SHAPE\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mUSE_ARCHIVE\u001b[0m 
\u001b[0;34m==\u001b[0m \u001b[0mARCHIVE_SMALL\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m==\u001b[0m\u001b[0mSMALL_FILE_SHAPE\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34mf\"{df.shape}\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mAssertionError\u001b[0m: (87956, 21)" + ] + } + ], "source": [ "if USE_ARCHIVE == ARCHIVE:\n", " assert df.shape==FILE_SHAPE\n", diff --git a/tests/describe.py b/tests/describe.py index 3b6de642e..cee522060 100644 --- a/tests/describe.py +++ b/tests/describe.py @@ -19,10 +19,11 @@ import dask import dask.dataframe as dd -from dask.distributed import Client, LocalCluster +from dask.distributed import Client from mlrun.execution import MLClientCtx from mlrun.datastore import DataItem +from mlrun.artifacts import ChartArtifact, TableArtifact, PlotArtifact from typing import IO, AnyStr, Union, List, Optional @@ -36,12 +37,11 @@ def table_summary( ) -> None: """Summarize a table """ - print(str(dask_client)) + context.dask_client = Client(scheduler_file=str(dask_client)) + df = context.dask_client.get_dataset('dask_key') + dscr = df.describe() - context.dask_client = Client(scheduler_file='/User/mlrun/models/scheduler.json') - dscr = context.dask_client.datasets[dask_key].describe() filepath = os.path.join(target_path, name) - dscr.to_csv(filepath) - print(dscr) + dd.to_csv(dscr, filepath, single_file=True, index=False) context.log_artifact(key, target_path=filepath) \ No newline at end of file diff --git a/tests/parquet_to_dask.ipynb b/tests/parquet_to_dask.ipynb index ff3af28c7..450025872 100644 --- a/tests/parquet_to_dask.ipynb +++ b/tests/parquet_to_dask.ipynb @@ -10,7 +10,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -29,7 +29,7 @@ }, { 
"cell_type": "code", - "execution_count": 23, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -37,7 +37,7 @@ "DESCRIPTION = 'load parquet dataset into a dask cluster'\n", "\n", "BASE_IMAGE = 'yjbds/mlrun-daskboost:dev'\n", - "JOB_KIND = 'job'\n", + "JOB_KIND = 'dask'\n", "TASK_NAME = 'user-task-parq-to-dask'\n", "\n", "CODE_BASE = '/User/repos/functions/fileutils'\n", @@ -59,7 +59,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -76,7 +76,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -92,14 +92,14 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-30 00:27:38,400 function spec saved to path: /User/repos/functions/fileutils/parquet_to_dask/function.yaml\n" + "[mlrun] 2020-01-30 09:38:16,699 function spec saved to path: /User/repos/functions/fileutils/parquet_to_dask/function.yaml\n" ] } ], @@ -116,14 +116,14 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-30 00:27:38,522 starting remote build, image: .mlrun/func-default-function-latest\n" + "[mlrun] 2020-01-30 09:38:18,145 starting remote build, image: .mlrun/func-default-function-latest\n" ] }, { @@ -132,7 +132,7 @@ "True" ] }, - "execution_count": 27, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -142,36 +142,94 @@ "\n", "parq2dask.apply(mlrun.mount_v3io())\n", "\n", - "parq2dask.deploy(skip_deployed=True, with_mlrun=False)" + "parq2dask.deploy() # skip_deployed=True, with_mlrun=False)" ] }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - 
"[mlrun] 2020-01-30 00:27:38,588 starting run user-task-parq-to-dask uid=56bcb9b52e1f4a1a81fc51f8d4e15d8e -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-30 00:27:38,693 Job is running in the background, pod: user-task-parq-to-dask-c9fmj\n", - "/opt/conda/lib/python3.7/site-packages/bokeh/themes/theme.py:94: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n", - " json = yaml.load(f)\n", - "[mlrun] 2020-01-30 00:27:46,724 starting new cluster...\n", - "[mlrun] 2020-01-30 00:27:48,599 \n", - "[mlrun] 2020-01-30 00:27:48,875 log artifact scheduler at /User/mlrun/models/scheduler.json, size: None, db: Y\n", - " Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay\n", - "0 2007 1 1 ... 0.0 0.0 67.0\n", - "1 2007 1 1 ... 0.0 0.0 17.0\n", - "2 2007 1 1 ... 0.0 0.0 0.0\n", - "3 2007 1 1 ... 0.0 0.0 0.0\n", - "4 2007 1 1 ... 0.0 0.0 0.0\n", + "[mlrun] 2020-01-30 09:38:20,399 starting run user-task-parq-to-dask uid=8d780f8755984477975fb16927110af1 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-30 09:38:21,436 saving function: function, tag: latest\n", + "[mlrun] 2020-01-30 09:38:27,297 using remote dask scheduler (mlrun-function-90bd99ce-2) at: 3.133.8.252:30417\n", + "[mlrun] 2020-01-30 09:38:27,298 remote dashboard (node) port: 3.133.8.252:30164\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/conda/lib/python3.6/site-packages/distributed/client.py:1074: VersionMismatchWarning: Mismatched versions found\n", + "\n", + "dask\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | 2.9.2 |\n", + "| scheduler | 2.10.0 |\n", + "+-----------+---------+\n", + "\n", + "distributed\n", + "+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | 2.9.3 |\n", + "| scheduler | 2.10.0 |\n", + "+-----------+---------+\n", + "\n", + "msgpack\n", + 
"+-----------+---------+\n", + "| | version |\n", + "+-----------+---------+\n", + "| client | 0.6.2 |\n", + "| scheduler | 0.6.1 |\n", + "+-----------+---------+\n", + " warnings.warn(version_module.VersionMismatchWarning(msg[0][\"warning\"]))\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[mlrun] 2020-01-30 09:38:27,301 found cluster...\n", + "[mlrun] 2020-01-30 09:38:27,301 \n", + "[mlrun] 2020-01-30 09:38:27,636 log artifact scheduler at /User/mlrun/models/scheduler.json, size: None, db: Y\n", + " Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime \\\n", + "0 1997 1 7 2 1020.0 1020 1123.0 \n", + "1 1997 1 8 3 1107.0 1020 1205.0 \n", + "2 1997 1 9 4 1020.0 1020 1130.0 \n", + "3 1997 1 10 5 1020.0 1020 1123.0 \n", + "4 1997 1 12 7 1020.0 1020 1134.0 \n", + "\n", + " CRSArrTime UniqueCarrier FlightNum ... Dest Distance TaxiIn TaxiOut \\\n", + "0 1130 WN 1293 ... PHX 328.0 2.0 5.0 \n", + "1 1130 WN 1293 ... PHX 328.0 3.0 9.0 \n", + "2 1130 WN 1293 ... PHX 328.0 3.0 8.0 \n", + "3 1130 WN 1293 ... PHX 328.0 2.0 5.0 \n", + "4 1130 WN 1293 ... PHX 328.0 2.0 7.0 \n", + "\n", + " Cancelled CarrierDelay WeatherDelay NASDelay SecurityDelay \\\n", + "0 0 NaN NaN NaN NaN \n", + "1 0 NaN NaN NaN NaN \n", + "2 0 NaN NaN NaN NaN \n", + "3 0 NaN NaN NaN NaN \n", + "4 0 NaN NaN NaN NaN \n", + "\n", + " LateAircraftDelay \n", + "0 NaN \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN \n", "\n", "[5 rows x 23 columns]\n", "\n", - "[mlrun] 2020-01-30 00:27:49,134 run executed, status=completed\n", - "final state: succeeded\n" + "[mlrun] 2020-01-30 09:38:35,050 run ended with state \n" ] }, { @@ -343,26 +401,26 @@ " \n", " \n", " \n", - "
...e15d8e
\n", + "
...110af1
\n", " 0\n", - " Jan 30 00:27:46\n", + " Jan 30 09:38:21\n", " completed\n", - " function\n", - "
host=user-task-parq-to-dask-c9fmj
kind=job
owner=admin
\n", + " user-task-parq-to-dask\n", + "
kind=dask
owner=admin
host=jupyter-1-6ccccd5fdf-mz2ld
\n", " \n", - "
dask_key=testdf1
index_cols=['Year', 'Month']
parquet_url=/User/mlrun/airlines/dataset-small/partitions
persist=True
shards=4
target_path=/User/mlrun/models
threads_per=4
\n", + "
parquet_url=/User/mlrun/airlines/dataset-small/partitions
index_cols=['Year', 'Month']
shards=4
threads_per=4
persist=True
dask_key=testdf1
target_path=/User/mlrun/models
\n", " \n", - "
scheduler
\n", + "
scheduler
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -378,8 +436,8 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run 56bcb9b52e1f4a1a81fc51f8d4e15d8e , !mlrun logs 56bcb9b52e1f4a1a81fc51f8d4e15d8e \n", - "[mlrun] 2020-01-30 00:27:53,903 run executed, status=completed\n" + "!mlrun get run 8d780f8755984477975fb16927110af1 , !mlrun logs 8d780f8755984477975fb16927110af1 \n", + "[mlrun] 2020-01-30 09:38:35,085 run executed, status=completed\n" ] } ], @@ -402,7 +460,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -411,7 +469,7 @@ "'/User/mlrun/models/scheduler.json'" ] }, - "execution_count": 29, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -429,107 +487,20 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'type': 'Scheduler',\n", - " 'id': 'Scheduler-0a0672a0-1532-42cb-a151-3aa3da211991',\n", - " 'address': 'tcp://127.0.0.1:38329',\n", + " 'id': 'Scheduler-e216939d-7eaf-4946-98dc-29a0b571b1e2',\n", + " 'address': 'tcp://10.233.64.55:8786',\n", " 'services': {},\n", - " 'workers': {'tcp://127.0.0.1:32950': {'type': 'Worker',\n", - " 'id': 3,\n", - " 'host': '127.0.0.1',\n", - " 'resources': {},\n", - " 'local_directory': '/dask-worker-space/worker-db2wzpjl',\n", - " 'name': 3,\n", - " 'nthreads': 4,\n", - " 'memory_limit': 16612705280,\n", - " 'last_seen': 1580344068.5361557,\n", - " 'services': {},\n", - " 'metrics': {'cpu': 0.0,\n", - " 'memory': 154923008,\n", - " 'time': 1580344068.530411,\n", - " 'read_bytes': 0.0,\n", - " 'write_bytes': 0.0,\n", - " 'num_fds': 22,\n", - " 'executing': 0,\n", - " 'in_memory': 0,\n", - " 'ready': 0,\n", - " 'in_flight': 0,\n", - " 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}},\n", - " 'nanny': 'tcp://127.0.0.1:34755'},\n", - " 'tcp://127.0.0.1:33754': {'type': 'Worker',\n", - " 'id': 2,\n", - " 'host': 
'127.0.0.1',\n", - " 'resources': {},\n", - " 'local_directory': '/dask-worker-space/worker-3hgbwwcw',\n", - " 'name': 2,\n", - " 'nthreads': 4,\n", - " 'memory_limit': 16612705280,\n", - " 'last_seen': 1580344068.53231,\n", - " 'services': {},\n", - " 'metrics': {'cpu': 0.0,\n", - " 'memory': 154923008,\n", - " 'time': 1580344068.5266526,\n", - " 'read_bytes': 0.0,\n", - " 'write_bytes': 0.0,\n", - " 'num_fds': 22,\n", - " 'executing': 0,\n", - " 'in_memory': 0,\n", - " 'ready': 0,\n", - " 'in_flight': 0,\n", - " 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}},\n", - " 'nanny': 'tcp://127.0.0.1:36665'},\n", - " 'tcp://127.0.0.1:40727': {'type': 'Worker',\n", - " 'id': 1,\n", - " 'host': '127.0.0.1',\n", - " 'resources': {},\n", - " 'local_directory': '/dask-worker-space/worker-evnbey62',\n", - " 'name': 1,\n", - " 'nthreads': 4,\n", - " 'memory_limit': 16612705280,\n", - " 'last_seen': 1580344068.5346513,\n", - " 'services': {},\n", - " 'metrics': {'cpu': 0.0,\n", - " 'memory': 154918912,\n", - " 'time': 1580344068.5266192,\n", - " 'read_bytes': 0.0,\n", - " 'write_bytes': 0.0,\n", - " 'num_fds': 22,\n", - " 'executing': 0,\n", - " 'in_memory': 0,\n", - " 'ready': 0,\n", - " 'in_flight': 0,\n", - " 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}},\n", - " 'nanny': 'tcp://127.0.0.1:42264'},\n", - " 'tcp://127.0.0.1:43293': {'type': 'Worker',\n", - " 'id': 0,\n", - " 'host': '127.0.0.1',\n", - " 'resources': {},\n", - " 'local_directory': '/dask-worker-space/worker-_przo4q2',\n", - " 'name': 0,\n", - " 'nthreads': 4,\n", - " 'memory_limit': 16612705280,\n", - " 'last_seen': 1580344068.5533519,\n", - " 'services': {},\n", - " 'metrics': {'cpu': 0.0,\n", - " 'memory': 154923008,\n", - " 'time': 1580344068.5480375,\n", - " 'read_bytes': 0.0,\n", - " 'write_bytes': 0.0,\n", - " 'num_fds': 22,\n", - " 'executing': 0,\n", - " 'in_memory': 0,\n", - " 'ready': 0,\n", - " 'in_flight': 0,\n", - " 'bandwidth': {'total': 100000000, 'workers': {}, 
'types': {}}},\n", - " 'nanny': 'tcp://127.0.0.1:45396'}}}" + " 'workers': {}}" ] }, - "execution_count": 30, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -550,115 +521,20 @@ }, { "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [], - "source": [ - "summ = mlrun.new_function(\n", - " command='/User/repos/functions/tests/describe.py', \n", - " kind='job')" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": {}, - "outputs": [], - "source": [ - "summ.spec.build.base_image = BASE_IMAGE" - ] - }, - { - "cell_type": "code", - "execution_count": 45, + "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[mlrun] 2020-01-30 00:35:45,128 function spec saved to path: /User/repos/functions/tests/describe.yaml\n" - ] - } - ], - "source": [ - "summ.export('/User/repos/functions/tests/describe.yaml')" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'ready'" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "summ.apply(mlrun.mount_v3io())\n", - "\n", - "summ.deploy(skip_deployed=True, with_mlrun=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-30 00:35:45,992 starting run user-task-my-sum uid=a90a033391c9471db638448f497fa9ef -> http://mlrun-api:8080\n", - "[mlrun] 2020-01-30 00:35:46,062 Job is running in the background, pod: user-task-my-sum-fdq27\n", - "3.133.8.252:8786\n", - "[mlrun] 2020-01-30 00:36:04,279 Traceback (most recent call last):\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py\", line 218, in connect\n", - " _raise(error)\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py\", line 203, in 
_raise\n", - " raise IOError(msg)\n", - "OSError: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", - "\n", - "During handling of the above exception, another exception occurred:\n", - "\n", - "Traceback (most recent call last):\n", - " File \"/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py\", line 199, in exec_from_params\n", - " val = handler(*args_list)\n", - " File \"/User/repos/functions/tests/describe.py\", line 41, in table_summary\n", - " context.dask_client = Client(scheduler_file='/User/mlrun/models/scheduler.json')\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/client.py\", line 726, in __init__\n", - " self.start(timeout=timeout)\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/client.py\", line 891, in start\n", - " sync(self.loop, self._start, **kwargs)\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/utils.py\", line 345, in sync\n", - " raise exc.with_traceback(tb)\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/utils.py\", line 329, in f\n", - " result[0] = yield future\n", - " File \"/opt/conda/lib/python3.7/site-packages/tornado/gen.py\", line 735, in run\n", - " value = future.result()\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/client.py\", line 984, in _start\n", - " await self._ensure_connected(timeout=timeout)\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/client.py\", line 1041, in _ensure_connected\n", - " connection_args=self.connection_args,\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py\", line 227, in connect\n", - " _raise(error)\n", - " File \"/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py\", line 203, in _raise\n", - " raise IOError(msg)\n", - "OSError: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 
10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", + "[mlrun] 2020-01-30 09:38:53,769 function spec saved to path: /User/repos/functions/tests/describe.yaml\n", + "[mlrun] 2020-01-30 09:38:53,822 starting run user-task-my-sum uid=5a52e1a6009647848d71dd211b741ee8 -> http://mlrun-api:8080\n", + "[mlrun] 2020-01-30 09:38:53,905 Job is running in the background, pod: user-task-my-sum-k5nsk\n", + "[mlrun] 2020-01-30 09:39:04,332 log artifact table-summary at /User/mlrun/models/table-summary.csv, size: None, db: Y\n", "\n", - "\n", - "[mlrun] 2020-01-30 00:36:04,293 exec error - Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", - "[mlrun] 2020-01-30 00:36:04,323 run executed, status=error\n", - "runtime error: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", - "Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n", - "final state: failed\n" + "[mlrun] 2020-01-30 09:39:04,347 run executed, status=completed\n", + "final state: succeeded\n" ] }, { @@ -830,26 +706,26 @@ " \n", " \n", " \n", - "
...7fa9ef
\n", + "
...741ee8
\n", " 0\n", - " Jan 30 00:35:54\n", - "
: ConnectionRefusedError: [Errno 111] Connection refused\">error
\n", + " Jan 30 09:39:02\n", + " completed\n", " describe\n", - "
host=user-task-my-sum-fdq27
kind=job
owner=admin
\n", - " \n", - "
dask_client=3.133.8.252:8786
dask_key=testdf
key=table-summary
name=table-summary.csv
target_path=/User/mlrun/models
\n", + "
host=user-task-my-sum-k5nsk
kind=job
owner=admin
\n", " \n", + "
dask_client=/User/mlrun/models/scheduler.json
dask_key=testdf1
key=table-summary
name=table-summary.csv
target_path=/User/mlrun/models
\n", " \n", + "
table-summary
\n", " \n", " \n", "\n", "\n", - "
\n", + "
\n", "
\n", - " Title\n", - " ×\n", + " Title\n", + " ×\n", "
\n", - " \n", + " \n", "
\n", "
\n" ], @@ -865,46 +741,51 @@ "output_type": "stream", "text": [ "to track results use .show() or .logs() or in CLI: \n", - "!mlrun get run a90a033391c9471db638448f497fa9ef , !mlrun logs a90a033391c9471db638448f497fa9ef \n", - "[mlrun] 2020-01-30 00:36:08,251 run executed, status=error\n", - "runtime error: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused\n" + "!mlrun get run 5a52e1a6009647848d71dd211b741ee8 , !mlrun logs 5a52e1a6009647848d71dd211b741ee8 \n", + "[mlrun] 2020-01-30 09:39:13,120 run executed, status=completed\n" ] }, { - "ename": "RunError", - "evalue": "Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mRunError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 10\u001b[0m 'key' : 'table-summary'})\n\u001b[1;32m 11\u001b[0m \u001b[0;31m# run\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 12\u001b[0;31m \u001b[0mrn2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msumm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msumm_task\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, runspec, handler, name, project, params, inputs, out_path, workdir, watch, schedule)\u001b[0m\n\u001b[1;32m 266\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_post_run\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 267\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 268\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_wrap_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrunspec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 269\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 270\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_api_server\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkfp\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/.pythonlibs/jupyter-1/lib/python3.6/site-packages/mlrun/runtimes/base.py\u001b[0m in \u001b[0;36m_wrap_result\u001b[0;34m(self, result, runspec, err)\u001b[0m\n\u001b[1;32m 334\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_is_remote\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_child\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 335\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'runtime error: {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 336\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRunError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 337\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrun\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mRunError\u001b[0m: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:38329' after 10 s: in : ConnectionRefusedError: [Errno 111] Connection refused" - ] + "data": { + "text/plain": [ + "{'table-summary': '/User/mlrun/models/table-summary.csv'}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "# create and run the task\n", + "# write up function in local directory\n", + "summ = mlrun.new_function(command='/User/repos/functions/tests/describe.py', \n", + " kind='job')\n", + "# specify a base image\n", + "summ.spec.build.base_image = BASE_IMAGE\n", + "\n", + "# (optional) export it as yaml\n", + "summ.export('/User/repos/functions/tests/describe.yaml')\n", + "\n", + "# mount it on iguazio data fabric\n", + "summ.apply(mlrun.mount_v3io())\n", + "\n", + "# deploy the function\n", + "summ.deploy(skip_deployed=True, 
with_mlrun=False)\n", + "\n", + "# create the task\n", "summ_task = mlrun.NewTask(\n", " 'user-task-my-sum', \n", " handler='table_summary', \n", " params={\n", - " 'dask_key' : 'testdf',\n", - " 'dask_client': '3.133.8.252:8786',\n", + " 'dask_key' : 'testdf1',\n", + " 'dask_client': rn.outputs['scheduler'],\n", " 'target_path': '/User/mlrun/models',\n", " 'name' : 'table-summary.csv',\n", " 'key' : 'table-summary'})\n", + "\n", "# run\n", - "rn2 = summ.run(summ_task)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ + "rn2 = summ.run(summ_task)\n", + "\n", "rn2.outputs" ] }, @@ -917,22 +798,87 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/conda/lib/python3.6/site-packages/distributed/client.py:1074: VersionMismatchWarning: Mismatched versions found\n", + "\n", + "dask\n", + "+--------------------------+---------+\n", + "| | version |\n", + "+--------------------------+---------+\n", + "| client | 2.9.2 |\n", + "| scheduler | 2.10.0 |\n", + "| tcp://10.233.64.56:38718 | 2.10.0 |\n", + "| tcp://10.233.64.57:36325 | 2.10.0 |\n", + "| tcp://10.233.64.58:38383 | 2.10.0 |\n", + "| tcp://10.233.64.59:44139 | 2.10.0 |\n", + "+--------------------------+---------+\n", + "\n", + "distributed\n", + "+--------------------------+---------+\n", + "| | version |\n", + "+--------------------------+---------+\n", + "| client | 2.9.3 |\n", + "| scheduler | 2.10.0 |\n", + "| tcp://10.233.64.56:38718 | 2.10.0 |\n", + "| tcp://10.233.64.57:36325 | 2.10.0 |\n", + "| tcp://10.233.64.58:38383 | 2.10.0 |\n", + "| tcp://10.233.64.59:44139 | 2.10.0 |\n", + "+--------------------------+---------+\n", + "\n", + "msgpack\n", + "+--------------------------+---------+\n", + "| | version |\n", + "+--------------------------+---------+\n", + "| client | 0.6.2 |\n", + "| scheduler | 0.6.1 
|\n", + "| tcp://10.233.64.56:38718 | 0.6.1 |\n", + "| tcp://10.233.64.57:36325 | 0.6.1 |\n", + "| tcp://10.233.64.58:38383 | 0.6.1 |\n", + "| tcp://10.233.64.59:44139 | 0.6.1 |\n", + "+--------------------------+---------+\n", + " warnings.warn(version_module.VersionMismatchWarning(msg[0][\"warning\"]))\n" + ] + } + ], "source": [ "from dask.distributed import Client, LocalCluster\n", "\n", - "client = Client(scheduler)" + "client = Client(scheduler_file='/User/mlrun/models/scheduler.json') # Client(scheduler_file=rn.outputs['scheduler'])" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ - "client.datasets['testdf']" + "df = client.get_dataset('dask_key')" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "175912" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.shape[0].compute()" ] }, { From f781512d6b1d290b87cb1e10829b89ac86d29ac9 Mon Sep 17 00:00:00 2001 From: yasha Date: Thu, 30 Jan 2020 15:21:02 +0000 Subject: [PATCH 32/32] temp backup --- datagen/binary/{binary.py => function.py} | 0 datagen/binary/{binary.yaml => function.yaml} | 2 +- datagen/train_valid_test/function.yaml | 2 +- evaluation/test-classifier.yaml | 4 +- fileutils/arc_to_parquet/function.yaml | 4 +- fileutils/open_archive/function.yaml | 2 +- fileutils/parquet_to_dask/function.py | 13 +- fileutils/parquet_to_dask/function.yaml | 2 +- tests/arc_to_parquet-airlines.ipynb | 4 +- tests/arc_to_parquet.ipynb | 2 +- tests/create_binary_data.ipynb | 29 +- tests/describe.py | 8 + tests/describe.yaml | 3 +- tests/parquet_to_dask.ipynb | 280 ++++++++++++++---- tests/test_classifier.ipynb | 2 +- tests/train_classifier.ipynb | 2 +- tests/train_valid_test_split.ipynb | 27 +- train/sklearn-classifier.yaml | 2 +- 18 files changed, 267 insertions(+), 121 deletions(-) rename 
datagen/binary/{binary.py => function.py} (100%) rename datagen/binary/{binary.yaml => function.yaml} (99%) diff --git a/datagen/binary/binary.py b/datagen/binary/function.py similarity index 100% rename from datagen/binary/binary.py rename to datagen/binary/function.py diff --git a/datagen/binary/binary.yaml b/datagen/binary/function.yaml similarity index 99% rename from datagen/binary/binary.yaml rename to datagen/binary/function.yaml index ae6a759cf..4f9721e1a 100644 --- a/datagen/binary/binary.yaml +++ b/datagen/binary/function.yaml @@ -13,6 +13,6 @@ spec: description: '' build: functionSourceCode: IyBDb3B5cmlnaHQgMjAxOSBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKaW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgcHlhcnJvdyBhcyBwYQppbXBvcnQgcHlhcnJvdy5wYXJxdWV0IGFzIHBxCmZyb20gdHlwaW5nIGltcG9ydCBPcHRpb25hbCwgTGlzdCwgQW55CmZyb20gc2tsZWFybi5kYXRhc2V0cyBpbXBvcnQgbWFrZV9jbGFzc2lmaWNhdGlvbgoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CgoKZGVmIGNyZWF0ZV9iaW5hcnlfY2xhc3NpZmljYXRpb24oCiAgICBjb250ZXh0IDogTUxDbGllbnRDdHggPSBOb25lLAogICAgbl9zYW1wbGVzIDogaW50ID0gMTAwXzAwMCwKICAgIG1fZmVhdHVyZXMgOiBpbnQgPSAyMCwKICAgIGZlYXR1cmVzX2hkciA6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lLAogICAgd2VpZ2h0IDogZmxvYXQgPSAwLjUwLAogICAgcmFuZG9tX3N0YXRlIDogaW50ID0xLAogICAgZmlsZW5hbWUgOiBPcHRpb25hbFtzdHJdID0gTm9uZS
wKICAgIHRhcmdldF9wYXRoIDogc3RyID0gIiIsCiAgICBrZXkgOiBzdHIgPSAiIgopOgogICAgIiIiQ3JlYXRlIGEgYmluYXJ5IGNsYXNzaWZpY2F0aW9uIHNhbXBsZSBkYXRhc2V0IGFuZCBzYXZlLgogICAgSWYgbm8gZmlsZW5hbWUgaXMgZ2l2ZW4gaXQgd2lsbCBkZWZhdWx0IHRvOgogICAgJ3NpbWRhdGEte25fc2FtcGxlc31Ye21fZmVhdHVyZXN9LnBhcnF1ZXQnLgogICAgQWxsIG9mIHRoZSBzY2lraXQtbGVhcm4gcGFyYW1ldGVycyBjYW4gYmUgc2V0IHVzaW5nICoqc2tfcGFyYW1zCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbl9zYW1wbGVzOiAgICAgbnVtYmVyIG9mIHJvd3Mvc2FtcGxlcwogICAgOnBhcmFtIG1fZmVhdHVyZXM6ICAgIG51bWJlciBvZiBjb2xzL2ZlYXR1cmVzCiAgICA6cGFyYW0gZmVhdHVyZXNfaGRyOiAgaGVhZGVyIGZvciBmZWF0dXJlcyBhcnJheQogICAgOnBhcmFtIHdlaWdodDogICAgICAgIGZyYWN0aW9uIG9mIHNhbXBsZSAobmVnKQogICAgOnBhcmFtIHJhbmRvbV9zdGF0ZTogIHJuZyBzZWVkIChzZWUgaHR0cHM6Ly9zY2lraXQtbGVhcm4ub3JnL3N0YWJsZS9nbG9zc2FyeS5odG1sI3Rlcm0tcmFuZG9tLXN0YXRlKQogICAgOnBhcmFtIGZpbGVuYW1lOiAgICAgIG9wdGlvbmFsIG5hbWUgZm9yIHN0b3JlZCBkYXRhIGZpbGUKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICBkZXN0aW1hdGlvbiBmb3IgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgIGtleSBvZiBkYXRhIGluIGFydGlmYWN0IHN0b3JlCiAgICBSZXR1cm5zIGZpbGVuYW1lIG9mIGNyZWF0ZWQgZGF0YSAoaW5jbHVkZXMgcGF0aCkuCiAgICAiIiIKICAgICMgY2hlY2sgZGlyZWN0b3JpZXMgZXhpc3QgYW5kIGNyZWF0ZSBmaWxlbmFtZSBpZiBOb25lOgogICAgb3MubWFrZWRpcnModGFyZ2V0X3BhdGgsIGV4aXN0X29rPVRydWUpCiAgICBpZiBub3QgZmlsZW5hbWU6CiAgICAgICAgbmFtZSA9IGYic2ltZGF0YS17bl9zYW1wbGVzOjAuMGV9WHttX2ZlYXR1cmVzfS5wYXJxdWV0Ii5yZXBsYWNlKCIrIiwgIiIpCiAgICAgICAgZmlsZW5hbWUgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBlbHNlOgogICAgICAgIGZpbGVuYW1lID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBmaWxlbmFtZSkKICAgIAogICAgZmVhdHVyZXMsIGxhYmVscyA9IG1ha2VfY2xhc3NpZmljYXRpb24oCiAgICAgICAgbl9zYW1wbGVzPW5fc2FtcGxlcywKICAgICAgICBuX2ZlYXR1cmVzPW1fZmVhdHVyZXMsCiAgICAgICAgd2VpZ2h0cz1bd2VpZ2h0XSwgICMgRmFsc2UKICAgICAgICBuX2NsYXNzZXM9MiwKICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKQoKICAgICMgbWFrZSBkYXRhZnJhbWVzLCBhZGQgY29sdW1uIG5hbWVzLCBjb25jYXRlbmF0ZSAoWCwgeSkKICAgIFggPSBwZC5EYXRhRnJhbWUoZmVhdHVyZXMpCiAgICBpZiBub3QgZmVhdHVyZXNfaGRyOgogIC
AgICAgIFguY29sdW1ucyA9IFsiZmVhdF8iICsgc3RyKHgpIGZvciB4IGluIHJhbmdlKG1fZmVhdHVyZXMpXQogICAgZWxzZToKICAgICAgICBYLmNvbHVtbnMgPSBmZWF0dXJlc19oZHIKCiAgICB5ID0gcGQuRGF0YUZyYW1lKGxhYmVscywgY29sdW1ucz1bImxhYmVscyJdKQogICAgZGF0YSA9IHBkLmNvbmNhdChbWCwgeV0sIGF4aXM9MSkKCiAgICBwcS53cml0ZV90YWJsZShwYS5UYWJsZS5mcm9tX3BhbmRhcyhkYXRhKSwgZmlsZW5hbWUpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWZpbGVuYW1lKQo= - base_image: yjbds/mlrun-ds:latest + base_image: yjbds/mlrun-intel:dev commands: [] code_origin: https://github.com/yjb-ds/functions.git#e4d74d784d42fb25cc75cbcab6d817bb1d2b150c:/User/repos/functions/datagen/classification/binary.py diff --git a/datagen/train_valid_test/function.yaml b/datagen/train_valid_test/function.yaml index 1c1603e3c..f413e1840 100644 --- a/datagen/train_valid_test/function.yaml +++ b/datagen/train_valid_test/function.yaml @@ -13,6 +13,6 @@ spec: description: '' build: functionSourceCode: aW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgZHVtcAoKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQoKZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdApmcm9tIHR5cGluZyBpbXBvcnQgT3B0aW9uYWwsIFVuaW9uCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KCmltcG9ydCB3YXJuaW5ncwp3YXJuaW5ncy5zaW1wbGVmaWx0ZXIoYWN0aW9uPSdpZ25vcmUnLCBjYXRlZ29yeT1GdXR1cmVXYXJuaW5nKQoKZGVmIHRyYWluX3ZhbGlkX3Rlc3Rfc3BsaXR0ZXIoCiAgICBjb250ZXh0OiBPcHRpb25hbFtNTENsaWVudEN0eF0gPSBOb25lLAogICAgc3JjX2ZpbGU6IFVuaW9uW0RhdGFJdGVtLCBzdHJdID0gJycsCiAgICBoZWFkZXI6IFVuaW9uW0RhdGFJdGVtLCBzdHIsIGxpc3RdID0gJycsCiAgICBzYW1wbGU6IGludCA9IC0xLAogICAgbGFiZWxfY29sdW1uOiBzdHIgPSAnbGFiZWxzJywKICAgIHRlc3Rfc2l6ZTogZmxvYXQgPSAwLjEsCiAgICB0cmFpbl92YWxfc3BsaXQ6IGZsb2F0ID0gMC43NSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG5hbWU6IHN0ciA9ICcnLAogICAga2V5OiBzdHIgPSAnJywKICAgIHJhbmRvbV9zdGF0ZSA9IDEKKSAtPiBOb25lOgogICA
gIiIiU3BsaXQgcmF3IGRhdGEgaW5wdXQgaW50byB0cmFpbiwgdmFsaWRhdGlvbiBhbmQgdGVzdCBzZXRzLgoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIHRoZSBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gc3JjX2ZpbGU6ICAgICAgICAoJ3JhdycpIG5hbWUgb2YgcmF3IGRhdGEgZmlsZQogICAgOnBhcmFtIGhlYWRlcjogICAgICAgICAgKE5vbmUpIGhlYWRlciBhcnRpZmFjdCBvciBsaXN0IG9mIGNvbHVtbiBuYW1lcy4KICAgIDpwYXJhbSBzYW1wbGU6ICAgICAgICAgICgtMSkuIFNlbGVjdHMgdGhlIGZpcnN0IG4gcm93cywgb3Igc2VsZWN0IGEgc2FtcGxlIHN0YXJ0aW5nCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBmcm9tIHRoZSBmaXJzdC4gSWYgbmVnYXRpdmUgPC0xLCBzZWxlY3QgYSByYW5kb20gc2FtcGxlIGZyb20gCiAgICAgICAgICAgICAgICAgICAgICAgICAgICB0aGUgZW50aXJlIGZpbGUKICAgIDpwYXJhbSBsYWJlbF9jb2x1bW46ICAgIGdyb3VuZC10cnV0aCAoeSkgbGFiZWxzCiAgICA6cGFyYW0gdGVzdF9zaXplOiAgICAgICAoMC4xKSB0ZXN0IHNldCBzaXplCiAgICA6cGFyYW0gdHJhaW5fdmFsX3NwbGl0OiAoMC43NSkgT25jZSB0aGUgdGVzdCBzZXQgaGFzIGJlZW4gcmVtb3ZlZCB0aGUgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICB0cmFpbmluZyBzZXQgZ2V0cyB0aGlzIHByb3BvcnRpb24uCiAgICA6cGFyYW0gdGFyZ2V0X3BhdGg6ICAgICBmb2xkZXIgbG9jYXRpb24gb2YgZmlsZXMKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgICAgIGRlc3RpbmF0aW9uIHByZWZpeCBuYW1lIGZvciBtb2RlbCBmaWxlcwogICAgOnBhcmFtIGtleTogICAgICAgICAgICAga2V5IGZvciBtb2RlbCBhcnRpZmFjdAogICAgOnBhcmFtIHJhbmRvbV9zdGF0ZTogICAgKDEpIHNrbGVhcm4gcm5nIHNlZWQKICAgICIiIgogICAgc3JjZmlsZXBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIHN0cihzcmNfZmlsZSkpCgogICAgaWYgKHNhbXBsZSA9PSAtMSkgb3IgKHNhbXBsZSA+PSAxKToKICAgICAgICAjIGdldCBhbGwgcm93cywgb3IgY29udGlndW91cyBzYW1wbGUgc3RhcnRpbmcgYXQgcm93IDEuCiAgICAgICAgcmF3ID0gcHEucmVhZF90YWJsZShzcmNmaWxlcGF0aCkudG9fcGFuZGFzKCkKICAgICAgICBsYWJlbHMgPSByYXcucG9wKGxhYmVsX2NvbHVtbikKICAgICAgICByYXcgPSByYXcuaWxvY1s6c2FtcGxlLCA6XQogICAgICAgIGxhYmVscyA9IGxhYmVscy5pbG9jWzpzYW1wbGVdCiAgICBlbHNlOgogICAgICAgICMgZ3JhYiBhIHJhbmRvbSBzYW1wbGUKICAgICAgICAjcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKS5zYW1wbGUoc2FtcGxlKi0xKQogICAgICAgIHJhdyA9IHBxLnJlYWRfdGFibGUoc3JjZmlsZXBhdGgpLnRvX3BhbmRhcygpLnNhbXBsZShzYW1wbGUqLTEpCiAgICAgICAgbGFiZWxzID0gcmF3LnBvcChsYWJlbF9jb2x1bW4pCiAgICA
KICAgICMgZG91YmxlIHNwbGl0IHRwIGdlbmVyYXRlIDMgZGF0YSBzZXRzOiB0cmFpbiwgdmFsaWRhdGlvbiBhbmQgdGVzdAogICAgeCwgeHRlc3QsIHksIHl0ZXN0ID0gdHJhaW5fdGVzdF9zcGxpdChyYXcsIGxhYmVscywgdGVzdF9zaXplPXRlc3Rfc2l6ZSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHJhbmRvbV9zdGF0ZT1yYW5kb21fc3RhdGUpCiAgIAogICAgeHRyYWluLCB4dmFsaWQsIHl0cmFpbiwgeXZhbGlkID0gdHJhaW5fdGVzdF9zcGxpdCh4LCB5LCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgdHJhaW5fc2l6ZT10cmFpbl92YWxfc3BsaXQsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKSAgICAgICAgCgogICAgaWYgbmFtZToKICAgICAgICBuYW1lID0gJy0nICsgbmFtZQogICAgCiAgICAjIHNhdmUgaGVhZGVyCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ2hlYWRlci5wa2wnKQogICAgZHVtcChyYXcuY29sdW1ucy52YWx1ZXMsIG9wZW4oZiwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgIyBzYXZlIGRhdGEgc2V0cwogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dHJhaW4ucHF0JykKICAgIHh0cmFpbi50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneHRyYWluJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dmFsaWQucHF0JykKICAgIHh2YWxpZC50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneHZhbGlkJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dGVzdC5wcXQnKQogICAgeHRlc3QudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h0ZXN0JywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd5dHJhaW4ucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl0cmFpbn0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dHJhaW4nLCB0YXJnZXRfcGF0aD1mKQogICAgCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ3l2YWxpZC5wcXQnKQogICAgcGQuRGF0YUZyYW1lKHsnbGFiZWxzJzogeXZhbGlkfSkudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3l2YWxpZCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneXRlc3QucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl
0ZXN0fSkudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3l0ZXN0JywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgY29udGV4dC5sb2dnZXIuaW5mbygnbnVtcHknLCBucC5fX3ZlcnNpb25fXykKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ3BhbmRhcyAnLCBwZC5fX3ZlcnNpb25fXykKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ3B5YXJyb3cnLCBwYS5fX3ZlcnNpb25fXyk= - base_image: yjbds/mlrun-ds:latest + base_image: yjbds/mlrun-intel:dev commands: [] code_origin: https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/datagen/splitters/train_valid_test.py diff --git a/evaluation/test-classifier.yaml b/evaluation/test-classifier.yaml index 893ec30e4..73f7cd29a 100644 --- a/evaluation/test-classifier.yaml +++ b/evaluation/test-classifier.yaml @@ -7,13 +7,13 @@ metadata: spec: command: '' args: [] - image: yjbds/mlrun-ds:latest + image: yjbds/mlrun-daskboost:latest volumes: [] volume_mounts: [] env: [] description: '' build: functionSourceCode: aW1wb3J0IG9zCmltcG9ydCBpbXBvcnRsaWIKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgbG9hZAoKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKaW1wb3J0IGxpZ2h0Z2JtIGFzIGxnYgoKZnJvbSBza2xlYXJuLm1ldHJpY3MgaW1wb3J0IChyb2NfY3VydmUsIGNvbmZ1c2lvbl9tYXRyaXgpCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQKCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQKZnJvbSBtYXRwbG90bGliLmZpZ3VyZSBpbXBvcnQgRmlndXJlCmltcG9ydCBzZWFib3JuIGFzIHNucwoKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbiwgTGlzdAoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CmZyb20gbWxydW4uZGF0YXN0b3JlIGltcG9ydCBEYXRhSXRlbQpmcm9tIG1scnVuLmFydGlmYWN0cyBpbXBvcnQgVGFibGVBcnRpZmFjdCwgUGxvdEFydGlmYWN0CgppbXBvcnQgd2FybmluZ3MKd2FybmluZ3Muc2ltcGxlZmlsdGVyKGFjdGlvbj0naWdub3JlJywgY2F0ZWdvcnk9RnV0dXJlV2FybmluZykKCmRlZiB0ZXN0X21vZGVsKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdLAogICAgbW9kZWw6IFVuaW9uW0RhdGFJdGVtLCBzdHJdLAogICAgeHRlc3QsIAogICAgeXRlc3QsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAnJywKICAgIGtleTogc3RyID0gJycsCiAgICByYW5kb21fc3RhdGUgPSAxCikgLT4gTm9uZToKICAgICIiIlRlc3QgY
SBjbGFzc2lmaWVyIG1vZGVsCiAgICAKICAgIFVzaW5nIGhlbGQtb3V0IHRlc3QgZmVhdHVyZXMsIGNhbGxzIGBtb2RlbC5wcmVkaWN0KHh0ZXN0KWAgYW5kIGV2YWx1YXRlcyB0aGUgYWNjdXJhY3kgb2YgdGhlIAogICAgZXN0aW1hdGVkIG1vZGVsLgogICAgCiAgICBDYW4gYmUgcGFydCBvZiBhIGt1YmVmbG93IHBpcGVsaW5lIGFzIGEgdGVzdCBzdGVwIG9yIGNhbGxlZAogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIG1vZGVsOiAgICAgICAgICAgZXN0aW1hdGVkIG1vZGVsIGZpbGUgbmFtZSBhcyBhcnRpZmFjdCBzdG9yZSBpdGVtCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBvciBwaWNrbGUgZmlsZSBuYW1lCiAgICA6cGFyYW0geHRlc3Q6ICAgICAgICAgICB0ZXN0IGZlYXR1cmVzIGZpbGUgbmFtZSBhcyBhcnRpZmFjdCBzdG9yZSBpdGVtCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBvciBwaWNrbGUgZmlsZSBuYW1lCiAgICA6cGFyYW0gaGVhZGVyOiAgICAgICAgICAoT3B0aW9uYWwpIHVzZSBpZiB4dGVzdCBkb2VzIG5vdCBoYXZlIGEgaGVhZGVyCiAgICA6cGFyYW0geXRlc3Q6ICAgICAgICAgICB0ZXN0IGxhYmVscyBmaWxlIG5hbWUgYXMgYXJ0aWZhY3Qgc3RvcmUgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBpdGVtIG9yIHBpY2tsZSBmaWxlIG5hbWUKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGZvbGRlciBsb2NhdGlvbiBvZiBmaWxlcwogICAgOnBhcmFtIG5hbWU6ICAgICAgICAgICAgZGVzdGluYXRpb24gbmFtZSBmb3IgdGVzdCByZXN1bHRzCiAgICA6cGFyYW0ga2V5OiAgICAgICAgICAgICBrZXkgZm9yIG1vZGVsIGFydGlmYWN0CiAgICAiIiIKICAgICMgbG9hZCBtb2RlbCBhbmQgZGF0YQogICAgaWYgaXNpbnN0YW5jZShtb2RlbCwgRGF0YUl0ZW0pOgogICAgICAgIGNsZiA9IGxvYWQob3BlbihzdHIobW9kZWwpLCAncmInKSkKICAgIGVsc2U6CiAgICAgICAgY2xmID0gbG9hZChvcGVuKG1vZGVsLCAncmInKSkKCiAgICBpZiBpc2luc3RhbmNlKHh0ZXN0LCBEYXRhSXRlbSk6CiAgICAgICAgeHRlc3QgPSBwZC5yZWFkX3BhcnF1ZXQoc3RyKHh0ZXN0KSkKICAgICAgICB5dGVzdCA9IHBkLnJlYWRfcGFycXVldChzdHIoeXRlc3QpKQogICAgZWxzZToKICAgICAgICB4dGVzdCA9IHBkLnJlYWRfcGFycXVldCh4dGVzdCkKICAgICAgICB5dGVzdCA9IHBkLnJlYWRfcGFycXVldCh5dGVzdCkKICAgIAogICAgaWYgY2FsbGFibGUoZ2V0YXR0cihjbGYsICdwcmVkaWN0X3Byb2JhJykpOgogICAgICAgIHlwcmVkX3Byb2JzID0gY2xmLnByZWRpY3RfcHJvYmEoeHRlc3QpWzosIDFdCiAgICAgICAgeXByZWQgPSBucC53aGVyZSh5cHJlZF9wcm9icyA+PSAwLjUsIDEsIDApCiAgICAgICAgcGxvdF9yb2MoY29udGV4dCwgeXRlc3QsIHlwcmVkX3Byb2JzLCB0YXJnZXRfcGF0aCkKICAgIGVsc2U6CiAgICAgICAgeXByZWQgPSBjbGYucHJlZ
GljdCh4dGVzdCkKICAgICAgICB5cHJlZF9wcm9icyA9IE5vbmUKICAgIAogICAgcGxvdF9jb25mdXNpb25fbWF0cml4KGNvbnRleHQsIHl0ZXN0LCB5cHJlZCwgdGFyZ2V0X3BhdGgpCgogICAgaWYgaGFzYXR0cihjbGYsICdmZWF0dXJlX2ltcG9ydGFuY2VzXycpOgogICAgICAgIHBsb3RfaW1wb3J0YW5jZShjb250ZXh0LCBjbGYsIHh0ZXN0LmNvbHVtbnMudmFsdWVzLCB0YXJnZXRfcGF0aCkKCmRlZiBfZ2NmX2NsZWFyKHBsdCk6CiAgICBwbHQuY2xhKCkKICAgIHBsdC5jbGYoKQogICAgcGx0LmNsb3NlKCkgICAgICAgIAoKZGVmIHBsb3Rfcm9jKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgeV9sYWJlbHMsCiAgICB5X3Byb2JzLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICcnLAogICAgbmFtZT0ncm9jLnBuZycsCiAgICBrZXk9J3JvYycsCiAgICBmbXQ9J3BuZycKKToKICAgICIiIlBsb3QgYW4gUk9DIGN1cnZlIGZyb20gdGVzdCBkYXRhIHNhdmVkIGluIGFuIGFydGlmYWN0IHN0b3JlLgogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0geV9sYWJlbHM6ICAgICAgICB0ZXN0IGRhdGEgbGFiZWxzCiAgICA6cGFyYW0geV9wcm9iczogICAgICAgICB0ZXN0IGRhdGEgCiAgICAiIiIKICAgIGZwcl94ZywgdHByX3hnLCBfID0gcm9jX2N1cnZlKHlfbGFiZWxzLCB5X3Byb2JzKQogICAgcGx0LnBsb3QoWzAsIDFdLCBbMCwgMV0sICJrLS0iKQogICAgcGx0LnBsb3QoZnByX3hnLCB0cHJfeGcsIGxhYmVsPSJyb2MiKQogICAgcGx0LnhsYWJlbCgiZmFsc2UgcG9zaXRpdmUgcmF0ZSIpCiAgICBwbHQueWxhYmVsKCJ0cnVlIHBvc2l0aXZlIHJhdGUiKQogICAgcGx0LnRpdGxlKCJyb2MgY3VydmUiKQogICAgcGx0LmxlZ2VuZChsb2M9ImJlc3QiKQogICAgZmlnID0gcGx0LmdjZigpCgogICAgcGxvdHBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBmaWcuc2F2ZWZpZyhwbG90cGF0aCwgZm9ybWF0PWZtdCkKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KFBsb3RBcnRpZmFjdChrZXksIGJvZHk9ZmlnKSkKCiAgICBfZ2NmX2NsZWFyKHBsdCkKCmRlZiBwbG90X2NvbmZ1c2lvbl9tYXRyaXgoCiAgICBjb250ZXh0OiBNTENsaWVudEN0eCwgCiAgICBsYWJlbHMsIAogICAgcHJlZGljdGlvbnMsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsIAogICAgbmFtZTogc3RyID0iY29uZnVzaW9uLnBuZyIsIAogICAga2V5OiBzdHIgPSdjb25mdXNpb25fbWF0cml4JywKICAgIGZtdDogc3RyID0gJ3BuZycKKToKICAgICIiIkNyZWF0ZSBhIGNvbmZ1c2lvbiBtYXRyaXguCiAgICBQbG90IGFuZCBzYXZlIGEgY29uZnVzaW9uIG1hdHJpeCB1c2luZyB0ZXN0IGRhdGEgZnJvbSBhCiAgICBwaXBlbGluZSBzdGVwLgoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSBsYWJlbHM6ICAgICAgICAgIHRlc3QgZGF0YSBsY
WJlbHMKICAgIDpwYXJhbSBwcmVkaWN0aW9uczogICAgIHRlc3QgZGF0YSBwcmVkaWN0aW9ucwogICAgIiIiCiAgICBjbSA9IGNvbmZ1c2lvbl9tYXRyaXgobGFiZWxzLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgcHJlZGljdGlvbnMsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBzYW1wbGVfd2VpZ2h0PU5vbmUsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBub3JtYWxpemU9J2FsbCcpCiAgICBzbnMuaGVhdG1hcChjbSwgYW5ub3Q9VHJ1ZSwgY21hcD0iQmx1ZXMiKQogICAgcGxvdHBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBmaWcgPSBwbHQuZ2NmKCkKICAgIGZpZy5zYXZlZmlnKHBsb3RwYXRoLCBmb3JtYXQ9Zm10KQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoUGxvdEFydGlmYWN0KGtleSwgYm9keT1maWcpKQoKICAgIF9nY2ZfY2xlYXIocGx0KQoKZGVmIHBsb3RfaW1wb3J0YW5jZSgKICAgIGNvbnRleHQsCiAgICBtb2RlbCwKICAgIGhlYWRlcjogTGlzdCA9IFtdLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICcnLAogICAgbmFtZTogc3RyID0gJ2ZlYXR1cmUtaW1wb3J0YW5jZXMucG5nJywKICAgIGtleTogc3RyID0gJ2ZlYXR1cmUtaW1wb3J0YW5jZXMnLAogICAgZm10ID0gJ3BuZycKKToKICAgICIiIkRpc3BsYXkgZXN0aW1hdGVkIGZlYXR1cmUgaW1wb3J0YW5jZXMuCgogICAgOnBhcmFtIGNvbnRleHQ6ICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbW9kZWw6ICAgICAgIGZpdHRlZCBsaWdodGdibSBtb2RlbAogICAgOnBhcmFtIGhlYWRlcjogICAgICBsaXN0IG9mIGZlYXR1cmUgbmFtZXMKICAgICIiIgogICAgIyBjcmVhdGUgYSBmZWF0dXJlIGltcG9ydGFuY2UgdGFibGUgd2l0aCBkZXNpcmVkIGxhYmVscwogICAgemlwcGVkID0gemlwKG1vZGVsLmZlYXR1cmVfaW1wb3J0YW5jZXNfLCBoZWFkZXIpCgogICAgZmVhdHVyZV9pbXAgPSBwZC5EYXRhRnJhbWUoc29ydGVkKHppcHBlZCksIGNvbHVtbnM9WydmcmVxJywnZmVhdHVyZSddCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgKS5zb3J0X3ZhbHVlcyhieT0iZnJlcSIsIGFzY2VuZGluZz1GYWxzZSkKCiAgICBwbHQuZmlndXJlKGZpZ3NpemU9KDIwLCAxMCkpCiAgICBzbnMuYmFycGxvdCh4PSJmcmVxIiwgeT0iZmVhdHVyZSIsIGRhdGE9ZmVhdHVyZV9pbXApCiAgICBwbHQudGl0bGUoJ0xpZ2h0R0JNIEZlYXR1cmVzJykKICAgIHBsdC50aWdodF9sYXlvdXQoKQogICAgZmlnID0gcGx0LmdjZigpCiAgICBwbG90cGF0aCA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSkKICAgIGZpZy5zYXZlZmlnKHBsb3RwYXRoLCBmb3JtYXQ9J3BuZycpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3Qoa2V5ICsgJy1wbG90JywgYm9keT1maWcpKQoKICAgICMgZmVhdHVyZSBpbXBvcnRhbmNlcyBhcmUgYWxzbyBzYXZlZCBhcyBhIHRhYmxlOgogICAgdGFibGVwYXRoID0gb3MucGF0a
C5qb2luKHRhcmdldF9wYXRoLCBrZXkgKyAnLXRhYmxlLmNzdicpCiAgICBmZWF0dXJlX2ltcC50b19jc3YodGFibGVwYXRoKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoVGFibGVBcnRpZmFjdChrZXkgKyAnLXRhYmxlJywgdGFyZ2V0X3BhdGg9dGFibGVwYXRoKSkKCiAgICAjIHRvIGVuc3VyZSB3ZSBkb24ndCBvdmVyd3JpdGUgdGhpcyBmaWd1cmUgd2hlbiBjcmVhdGluZyB0aGUgbmV4dDoKICAgIF9nY2ZfY2xlYXIocGx0KQo= - base_image: yjbds/mlrun-ds:latest + base_image: yjbds/mlrun-daskboost:dev commands: [] code_origin: https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/evaluation/test-classifier.py diff --git a/fileutils/arc_to_parquet/function.yaml b/fileutils/arc_to_parquet/function.yaml index d64cdfdfc..73628d357 100644 --- a/fileutils/arc_to_parquet/function.yaml +++ b/fileutils/arc_to_parquet/function.yaml @@ -4,7 +4,7 @@ metadata: hash: 0a17345fa693f3b0fd5671a8f94e09f97676ded2 project: default spec: - command: /User/repos/functions/fileutils/arc_to_parquet/function.py + command: https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils/parquet_to_dask/function.py args: [] image: '' volumes: [] @@ -12,5 +12,5 @@ spec: env: [] description: retrieve archive table and save as parquet file build: - base_image: yjbds/mlrun-base:dev + base_image: yjbds/mlrun-daskboost:dev commands: [] diff --git a/fileutils/open_archive/function.yaml b/fileutils/open_archive/function.yaml index 0fb7276fa..1c80bf091 100644 --- a/fileutils/open_archive/function.yaml +++ b/fileutils/open_archive/function.yaml @@ -5,5 +5,5 @@ spec: description: 'retrieve archive and extract all' build: functionSourceCode: 
IyBHZW5lcmF0ZWQgYnkgbnVjbGlvLmV4cG9ydC5OdWNsaW9FeHBvcnRlciBvbiAyMDIwLTAxLTIxIDA5OjQ3CgppbXBvcnQgbWxydW4KbWxydW4ubWxjb25mLmRicGF0aCA9ICdodHRwOi8vbWxydW4tYXBpOjgwODAnCgppbXBvcnQgdXJsbGliLnJlcXVlc3QKCmltcG9ydCBvcwppbXBvcnQgemlwZmlsZQppbXBvcnQgdXJsbGliCmltcG9ydCB0YXJmaWxlCmltcG9ydCBqc29uCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKCmRlZiBvcGVuX2FyY2hpdmUoY29udGV4dDogTUxDbGllbnRDdHgsIAogICAgICAgICAgICAgICAgIHRhcmdldF9kaXI6IHN0ciA9ICdjb250ZW50JywKICAgICAgICAgICAgICAgICBhcmNoaXZlX3VybDogc3RyID0gJycpOgogICAgIiIiT3BlbiBhIGZpbGUvb2JqZWN0IGFyY2hpdmUgaW50byBhIHRhcmdldCBkaXJlY3RvcnkKICAgIAogICAgQ3VycmVudGx5IHN1cHBvcnRzIHppcCBhbmQgdGFyLmd6CiAgICAiIiIKICAgIG9zLm1ha2VkaXJzKHRhcmdldF9kaXIsIGV4aXN0X29rPVRydWUpCiAgICBjb250ZXh0LmxvZ2dlci5pbmZvKCdWZXJpZmllZCBkaXJlY3RvcmllcycpCiAgICBwcmludChhcmNoaXZlX3VybCkKICAgIHNwbGl0cyA9IGFyY2hpdmVfdXJsLnNwbGl0KCcuJykKICAgIHByaW50KHNwbGl0cykKICAgIGlmIChzcGxpdHNbLTFdID09ICdneicpOgogICAgICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ29wZW5pbmcgdGFyX2d6JykKICAgICAgICByZWYgPSB0YXJmaWxlLm9wZW4oZmlsZW9iaj11cmxsaWIucmVxdWVzdC51cmxvcGVuKGFyY2hpdmVfdXJsKSwgbW9kZT0ncnxneicpCiAgICBlbGlmIHNwbGl0c1stMV0gPT0gJ3ppcCc6CiAgICAgICAgY29udGV4dC5sb2dnZXIuaW5mbygnb3BlbmluZyB6aXAnKQogICAgICAgIHJlZiA9IHppcGZpbGUuWmlwRmlsZShhcmNoaXZlX3VybCwgJ3InKQoKICAgIHJlZi5leHRyYWN0YWxsKHRhcmdldF9kaXIpCiAgICByZWYuY2xvc2UoKQoKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCdjb250ZW50JywgdGFyZ2V0X3BhdGg9dGFyZ2V0X2RpcikKCg== - build_image: yjbds/mlrun-files:latest + build_image: yjbds/mlrun-base:dev commands: [] diff --git a/fileutils/parquet_to_dask/function.py b/fileutils/parquet_to_dask/function.py index 8807d49c5..f5530d1dd 100644 --- a/fileutils/parquet_to_dask/function.py +++ b/fileutils/parquet_to_dask/function.py @@ -41,7 +41,18 @@ def parquet_to_dask( ) -> None: """Load parquet dataset into dask cluster - If no cluster is found loads a new one and persist the data to it + If no cluster is found loads a new one and persist the data to it. 
It + should not be necessary to create a new cluster when the function + is run as a 'dask' job. + + :param context: the function context + :param parquet_url: url of the parquet file or partitioned dataset as either + artifact DataItem, string, or path object (see pandas read_csv) + :param inc_cols: include only these columns (very fast) + :param index_cols: list of index column names (can be a long-running process) + :param shards: number of workers to launch + :param threads_per: number of threads per worker + :param processes: """ if hasattr(context, 'dask_client'): context.logger.info('found cluster...') diff --git a/fileutils/parquet_to_dask/function.yaml b/fileutils/parquet_to_dask/function.yaml index 1d3057db2..c40e87dcd 100644 --- a/fileutils/parquet_to_dask/function.yaml +++ b/fileutils/parquet_to_dask/function.yaml @@ -4,7 +4,7 @@ metadata: hash: 4ed6e4dfc23b35ca9a7a6029b1f08b9a1d786885 project: default spec: - command: /User/repos/functions/fileutils/parquet_to_dask/function.py + command: https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils/parquet_to_dask/function.py args: [] image: '' volumes: [] diff --git a/tests/arc_to_parquet-airlines.ipynb b/tests/arc_to_parquet-airlines.ipynb index 6071710a7..bf8ab37f7 100644 --- a/tests/arc_to_parquet-airlines.ipynb +++ b/tests/arc_to_parquet-airlines.ipynb @@ -39,8 +39,8 @@ "BASE_IMAGE = 'yjbds/mlrun-base:dev'\n", "JOB_KIND = 'dask'\n", "TASK_NAME = 'user-task-arc-to-part-parq'\n", - "\n", - "CODE_BASE = '/User/repos/functions/fileutils'\n", + "# https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/tests/describe.py\n", + "CODE_BASE = 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils'\n", "\n", "ARCHIVE_BIG = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears_10.csv\"\n", "ARCHIVE = \"https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv\"\n", diff --git a/tests/arc_to_parquet.ipynb b/tests/arc_to_parquet.ipynb index 0c87165bf..ddd165cac
100644 --- a/tests/arc_to_parquet.ipynb +++ b/tests/arc_to_parquet.ipynb @@ -40,7 +40,7 @@ "JOB_KIND = 'job'\n", "TASK_NAME = 'user-task-arc-to-parq'\n", "\n", - "CODE_BASE = '/User/repos/functions/fileutils'\n", + "CODE_BASE = 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils'\n", "\n", "TARGET_PATH = '/User/mlrun/models'\n", "\n", diff --git a/tests/create_binary_data.ipynb b/tests/create_binary_data.ipynb index c2f5fcdd6..54bfa384c 100644 --- a/tests/create_binary_data.ipynb +++ b/tests/create_binary_data.ipynb @@ -18,7 +18,8 @@ "metadata": {}, "outputs": [], "source": [ - "TARGET_CODE_PATH = '/User/repos/functions/datagen/classification'\n", + "CODE_BASE = 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/datagen/classification'\n", + "FUNCTION = ''\n", "N_SAMPLES = 100_000\n", "M_FEATURES = 28\n", "NEG_WEIGHT = 0.5\n", @@ -26,36 +27,14 @@ "KEY = 'simdata'" ] }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[mlrun] 2020-01-26 14:32:06,367 function spec saved to path: /User/repos/functions/datagen/classification/binary.yaml\n" - ] - } - ], - "source": [ - "testfn = mlrun.code_to_function(\n", - " filename=os.path.join(TARGET_CODE_PATH, 'binary.py'), \n", - " kind='job')\n", - "testfn.build_config(base_image='yjbds/mlrun-ds:latest')\n", - "testfn.export(os.path.join(TARGET_CODE_PATH, 'binary.yaml'))" - ] - }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ - "binarydatagen = mlrun.import_function(\n", - " os.path.join(TARGET_CODE_PATH, 'binary.yaml')\n", - ").apply(mlrun.mount_v3io())" + "binarydatagen = mlrun.import_function(os.path.join(CODE_BASE, 'function.yaml'))\n", + "binarydatagen.apply(mlrun.mount_v3io())" ] }, { diff --git a/tests/describe.py b/tests/describe.py index cee522060..d680b6651 100644 --- a/tests/describe.py +++ b/tests/describe.py @@ -36,6 +36,14 @@ def 
table_summary( key: str = 'table_summary' ) -> None: """Summarize a table + + :param context: the function context + :param dask_client: path to the dask client scheduler json file, as + string or artifact + :param dask_key: key of dataframe in dask client 'datasets' attribute + :param target_path: destination folder for table summary file + :param name: name of table summary file (with extension like .csv) + :param key: key of table summary in artifact store """ context.dask_client = Client(scheduler_file=str(dask_client)) df = context.dask_client.get_dataset('dask_key') diff --git a/tests/describe.yaml b/tests/describe.yaml index 49f602d06..7095b5597 100644 --- a/tests/describe.yaml +++ b/tests/describe.yaml @@ -1,10 +1,9 @@ kind: job metadata: name: describe - hash: 3f3a5547127800b6351fc4bb15198bd7022e7e99 project: default spec: - command: /User/repos/functions/tests/describe.py + command: https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/tests/describe.py args: [] image: '' volumes: [] diff --git a/tests/parquet_to_dask.ipynb b/tests/parquet_to_dask.ipynb index 450025872..69c667816 100644 --- a/tests/parquet_to_dask.ipynb +++ b/tests/parquet_to_dask.ipynb @@ -40,7 +40,7 @@ "JOB_KIND = 'dask'\n", "TASK_NAME = 'user-task-parq-to-dask'\n", "\n", - "CODE_BASE = '/User/repos/functions/fileutils'\n", + "CODE_BASE = 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils'\n", "\n", "SRC_PATH = '/User/mlrun/airlines/dataset-small/partitions'\n", "\n", @@ -758,7 +758,7 @@ ], "source": [ "# write up function in local directory\n", - "summ = mlrun.new_function(command='/User/repos/functions/tests/describe.py', \n", + "summ = mlrun.new_function(command='https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/tests/describe.py', \n", " kind='job')\n", "# specify a base image\n", "summ.spec.build.base_image = BASE_IMAGE\n", @@ -798,54 +798,9 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 17, "metadata": {}, 
- "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/conda/lib/python3.6/site-packages/distributed/client.py:1074: VersionMismatchWarning: Mismatched versions found\n", - "\n", - "dask\n", - "+--------------------------+---------+\n", - "| | version |\n", - "+--------------------------+---------+\n", - "| client | 2.9.2 |\n", - "| scheduler | 2.10.0 |\n", - "| tcp://10.233.64.56:38718 | 2.10.0 |\n", - "| tcp://10.233.64.57:36325 | 2.10.0 |\n", - "| tcp://10.233.64.58:38383 | 2.10.0 |\n", - "| tcp://10.233.64.59:44139 | 2.10.0 |\n", - "+--------------------------+---------+\n", - "\n", - "distributed\n", - "+--------------------------+---------+\n", - "| | version |\n", - "+--------------------------+---------+\n", - "| client | 2.9.3 |\n", - "| scheduler | 2.10.0 |\n", - "| tcp://10.233.64.56:38718 | 2.10.0 |\n", - "| tcp://10.233.64.57:36325 | 2.10.0 |\n", - "| tcp://10.233.64.58:38383 | 2.10.0 |\n", - "| tcp://10.233.64.59:44139 | 2.10.0 |\n", - "+--------------------------+---------+\n", - "\n", - "msgpack\n", - "+--------------------------+---------+\n", - "| | version |\n", - "+--------------------------+---------+\n", - "| client | 0.6.2 |\n", - "| scheduler | 0.6.1 |\n", - "| tcp://10.233.64.56:38718 | 0.6.1 |\n", - "| tcp://10.233.64.57:36325 | 0.6.1 |\n", - "| tcp://10.233.64.58:38383 | 0.6.1 |\n", - "| tcp://10.233.64.59:44139 | 0.6.1 |\n", - "+--------------------------+---------+\n", - " warnings.warn(version_module.VersionMismatchWarning(msg[0][\"warning\"]))\n" - ] - } - ], + "outputs": [], "source": [ "from dask.distributed import Client, LocalCluster\n", "\n", @@ -854,7 +809,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ @@ -863,7 +818,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 19, "metadata": {}, "outputs": [ { @@ -872,7 +827,7 @@ "175912" ] }, - "execution_count": 16, + "execution_count": 19, "metadata": {}, 
"output_type": "execute_result" } @@ -881,6 +836,225 @@ "df.shape[0].compute()" ] }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 7)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 6)\": 1159726,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 19)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 14)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 16)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 8)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 9)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 11)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 3)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 5)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 15)\": 1159726,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 0)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 1)\": 1159726,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 12)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 4)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 18)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 10)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 2)\": 2912128,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 17)\": 1159726,\n", + " \"('read-parquet-d8c9ad5a8e529c3979516dc1ad71970c', 13)\": 2912128}" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.nbytes(summary=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + 
"{'tcp://10.233.64.56:38718': 1,\n", + " 'tcp://10.233.64.57:36325': 1,\n", + " 'tcp://10.233.64.58:38383': 1,\n", + " 'tcp://10.233.64.59:44139': 1}" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.ncores()" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'tcp://10.233.64.56:38718': 1,\n", + " 'tcp://10.233.64.57:36325': 1,\n", + " 'tcp://10.233.64.58:38383': 1,\n", + " 'tcp://10.233.64.59:44139': 1}" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.nthreads()" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'tcp://10.233.64.56:38718': (),\n", + " 'tcp://10.233.64.57:36325': (),\n", + " 'tcp://10.233.64.58:38383': (),\n", + " 'tcp://10.233.64.59:44139': ()}" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "distributed.client - ERROR - Failed to reconnect to scheduler after 3.00 seconds, closing client\n", + "distributed.client - ERROR - Failed to reconnect to scheduler after 3.00 seconds, closing client\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 
662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1025, in _reconnect\n", + " await self._close()\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1025, in _reconnect\n", + " await self._close()\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client\n", + "distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client\n", + "distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call 
last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1025, in _reconnect\n", + " await self._close()\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1025, in _reconnect\n", + " await self._close()\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await 
gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n", + "distributed.utils - ERROR - \n", + "Traceback (most recent call last):\n", + " File \"/conda/lib/python3.6/site-packages/distributed/utils.py\", line 662, in log_errors\n", + " yield\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1025, in _reconnect\n", + " await self._close()\n", + " File \"/conda/lib/python3.6/site-packages/distributed/client.py\", line 1311, in _close\n", + " await gen.with_timeout(timedelta(seconds=2), list(coroutines))\n", + " File \"/conda/lib/python3.6/asyncio/tasks.py\", line 250, in _wakeup\n", + " future.result()\n", + "concurrent.futures._base.CancelledError\n" + ] + } + ], + "source": [ + "client.processing()" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -892,7 +1066,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ diff --git a/tests/test_classifier.ipynb b/tests/test_classifier.ipynb index f91df44fc..2bba2b6c0 100644 --- a/tests/test_classifier.ipynb +++ b/tests/test_classifier.ipynb @@ -34,7 +34,7 @@ "metadata": {}, "outputs": [], "source": [ - "CODE_BASE = '/User/repos/functions'\n", + "CODE_BASE = 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving'\n", "\n", "MODEL_FILE = 
'/User/mlrun/models/lgb-classifier.pkl'\n", "\n", diff --git a/tests/train_classifier.ipynb b/tests/train_classifier.ipynb index b552e71f6..ee5e62e20 100644 --- a/tests/train_classifier.ipynb +++ b/tests/train_classifier.ipynb @@ -47,7 +47,7 @@ "metadata": {}, "outputs": [], "source": [ - "CODE_BASE = '/User/repos/functions/' \n", + "CODE_BASE = 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/' \n", "TARGET_DATA_PATH = '/User/mlrun/models'\n", "\n", "SKLEARN_CLASSIFIER = 'lightgbm.sklearn.LGBMClassifier'\n", diff --git a/tests/train_valid_test_split.ipynb b/tests/train_valid_test_split.ipynb index d83636882..891423a4e 100644 --- a/tests/train_valid_test_split.ipynb +++ b/tests/train_valid_test_split.ipynb @@ -41,7 +41,7 @@ "JOB_KIND = 'job'\n", "TASK_NAME = 'user-task-data-splits'\n", "\n", - "CODE_BASE = '/User/repos/functions/datagen/splitters'\n", + "CODE_BASE = 'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/datagen/splitters'\n", "PROJECT = 'splitters'\n", "\n", "RNG = 1\n", @@ -57,31 +57,6 @@ "## split the data" ] }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "ename": "NameError", - "evalue": "name 'yaml_name' is not defined", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 3\u001b[0m filename=os.path.join(CODE_BASE, 'datagen/splitters', 'train_valid_test.py'))\n\u001b[1;32m 4\u001b[0m \u001b[0mtestfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuild_config\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbase_image\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'yjbds/mlrun-ds:latest'\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mcommands\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mtestfn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexport\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0myaml_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;31mNameError\u001b[0m: name 'yaml_name' is not defined" - ] - } - ], - "source": [ - "testfn = mlrun.new_function(\n", - " command=os.path.join(CODE_BASE, FUNCTION, 'function.py'),\n", - " kind=JOB_KIND)\n", - "testfn.build_config(base_image=BASE_IMAGE, commands=[])\n", - "testfn.export(os.path.join(CODE_BASE, FUNCTION, 'function.yaml'))" - ] - }, { "cell_type": "code", "execution_count": 55, diff --git a/train/sklearn-classifier.yaml b/train/sklearn-classifier.yaml index 668f0aac0..bbe34c8c3 100644 --- a/train/sklearn-classifier.yaml +++ b/train/sklearn-classifier.yaml @@ -13,6 +13,6 @@ spec: description: '' build: functionSourceCode: 
aW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBwYW5kYXMgYXMgcGQKCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQKZnJvbSBtYXRwbG90bGliLmZpZ3VyZSBpbXBvcnQgRmlndXJlCmltcG9ydCBzZWFib3JuIGFzIHNucwoKZnJvbSB0eXBpbmcgaW1wb3J0IE9wdGlvbmFsLCBVbmlvbgppbXBvcnQgb3MKaW1wb3J0IGltcG9ydGxpYgpmcm9tIGNsb3VkcGlja2xlIGltcG9ydCBkdW1wCgpmcm9tIG1scnVuLmV4ZWN1dGlvbiBpbXBvcnQgTUxDbGllbnRDdHgKZnJvbSBtbHJ1bi5kYXRhc3RvcmUgaW1wb3J0IERhdGFJdGVtCmZyb20gbWxydW4uYXJ0aWZhY3RzIGltcG9ydCBUYWJsZUFydGlmYWN0LCBQbG90QXJ0aWZhY3QKCmltcG9ydCB3YXJuaW5ncwp3YXJuaW5ncy5zaW1wbGVmaWx0ZXIoYWN0aW9uPSdpZ25vcmUnLCBjYXRlZ29yeT1GdXR1cmVXYXJuaW5nKQoKZGVmIHRyYWluKAogICAgY29udGV4dDogT3B0aW9uYWxbTUxDbGllbnRDdHhdID0gTm9uZSwKICAgIFNLQ2xhc3NpZmllcjogc3RyICA9ICcnLAogICAgY2FsbGJhY2tzICA9IFtdLAogICAgeHRyYWluOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeXRyYWluOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeHZhbGlkOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgeXZhbGlkOiBVbmlvbltEYXRhSXRlbSwgc3RyXSA9ICcnLAogICAgdGFyZ2V0X3BhdGg6IHN0ciA9ICcnLAogICAgbmFtZTogc3RyID0gJycsCiAgICBrZXk6IHN0ciA9ICcnLAogICAgdmVyYm9zZTogYm9vbCA9IEZhbHNlLAogICAgcmFuZG9tX3N0YXRlID0gMQopIC0+IE5vbmU6CiAgICAiIiJUcmFpbiBhbmQgc2F2ZSBhbiBTY2lraXRsZWFybiBtb2RlbC4KICAgIAogICAgVGhlIGRhdGEgc291cmNlIGNhbiBlaXRoZXIgYmUgYSBzdHJpbmcgZmlsZSBuYW1lIG9yIGFuIGFydGlmYWN0IGl0ZW0uCiAgICAKICAgIFRoZSBoZWFkZXIgaXMgZWl0aCBhIGxpc3Qgb2YgY29sdW1uIG5hbWVzLCBhbiBhcnRpZmFjdCBoZWFkZXIgaXRlbSwgb3IgTm9uZS4KICAgIAogICAgCiAgICA6cGFyYW0gY29udGV4dDogICAgICAgICB0aGUgZnVuY3Rpb24gY29udGV4dAogICAgOnBhcmFtIFNLQ2xhc3NpZmllcjogICAgc3RyaW5nIG1vZHVsZSBhbmQgY2xhc3NuYW1lIG9mIGNsYXNzaWZpZXIKICAgIDpwYXJhbSBjYWxsYmFja3M6ICAgICAgIHNrbGVhcm4gY2xhc3NpZmllciBmaXQgZnVuY3Rpb24gY2FsbGJhY2tzCiAgICA6cGFyYW0geHRyYWluOiAgICAgICAgICAKICAgIDpwYXJhbSB5dHJhaW46CiAgICA6cGFyYW0geHZhbGlkOgogICAgOnBhcmFtIHl2YWxpZDoKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGZvbGRlciBsb2NhdGlvbiBvZiBmaWxlcwogICAgOnBhcmFtIG5hbWU6ICAgICAgICAgICAgZGVzdGluYXRpb24gbmFtZSBmb3IgbW9kZWwgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgICAga2V5IGZvciBtb2RlbCBhcnRpZmFjdAogICAgOnBhcmFtIHZlcmJvc2Ug
OiAgICAgICAgKEZhbHNlKSBzaG93IG1ldHJpY3MgZm9yIHRyYWluaW5nL3ZhbGlkYXRpb24gc3RlcHMuCiAgICA6cGFyYW0gcmFuZG9tX3N0YXRlOiAgICAoMSkgc2tsZWFybiBybmcgc2VlZAogICAgCiAgICBleGFtcGxlIGNhbGxiYWNrczoKICAgIGBgYAogICAgZnJvbSBsaWdodGdibSBpbXBvcnQgcmVjb3JkX2V2YWx1YXRpb24KICAgIGV2YWxfcmVzdWx0cyA9IGRpY3QoKQogICAgY2FsbGJhY2tzID0gW3JlY29yZF9ldmFsdWF0aW9uKGV2YWxfcmVzdWx0cyldCiAgICBgYGAKICAgICIiIgogICAgIyBsb2FkIGRhdGEKICAgIHh0cmFpbiA9IHBkLnJlYWRfcGFycXVldChzdHIoeHRyYWluKSwgZW5naW5lPSdweWFycm93JykKICAgIHl0cmFpbiA9IHBkLnJlYWRfcGFycXVldChzdHIoeXRyYWluKSwgZW5naW5lPSdweWFycm93JykKICAgIHh2YWxpZCA9IHBkLnJlYWRfcGFycXVldChzdHIoeHZhbGlkKSwgZW5naW5lPSdweWFycm93JykKICAgIHl2YWxpZCA9IHBkLnJlYWRfcGFycXVldChzdHIoeXZhbGlkKSwgZW5naW5lPSdweWFycm93JykKCiAgICAjIGNyZWF0ZSBjbGFzc2lmaWVyIGNsYXNzIGZyb20gc3RyaW5nIGFuZCBpbnN0YW50aWF0ZQogICAgc3BsaXRzID0gU0tDbGFzc2lmaWVyLnNwbGl0KCIuIikKICAgIGNsZmNsYXNzID0gZ2V0YXR0cihpbXBvcnRsaWIuaW1wb3J0X21vZHVsZSgiLiIuam9pbihzcGxpdHNbOi0xXSkpLCBzcGxpdHNbLTFdKQogICAgbW9kZWwgPSBjbGZjbGFzcyhyYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlLCB2ZXJib3NlPWludCh2ZXJib3NlID09IFRydWUpKQoKICAgIG1vZGVsLmZpdCh4dHJhaW4sIAogICAgICAgICAgICAgIHl0cmFpbiwKICAgICAgICAgICAgICBldmFsX3NldD1bKHh2YWxpZCwgeXZhbGlkKSwgKHh0cmFpbiwgeXRyYWluKV0sCiAgICAgICAgICAgICAgZXZhbF9uYW1lcz1bJ3ZhbGlkJywgJ3RyYWluJ10sCiAgICAgICAgICAgICAgY2FsbGJhY2tzPWNhbGxiYWNrcywKICAgICAgICAgICAgICB2ZXJib3NlPXZlcmJvc2UpCiAgICAgCiAgICBjb250ZXh0LmxvZ19yZXN1bHQoInRyYWluX2FjY3VyYWN5IiwgZmxvYXQobW9kZWwuc2NvcmUoeHRyYWluLCB5dHJhaW4pKSkKICAgIAogICAgIyBwbG90IHRyYWluIGFuZCB2YWxpZGF0aW9uIGhpc3RvcnksIHNhdmUgYW5kIGxvZwogICAgbG9zcyA9IG5wLmFzYXJyYXkobW9kZWwuZXZhbHNfcmVzdWx0X1sndHJhaW4nXVsnYmluYXJ5X2xvZ2xvc3MnXSwgZHR5cGU9bnAuZmxvYXQpCiAgICB2YWxfbG9zcyA9IG5wLmFzYXJyYXkobW9kZWwuZXZhbHNfcmVzdWx0X1sndmFsaWQnXVsnYmluYXJ5X2xvZ2xvc3MnXSwgZHR5cGU9bnAuZmxvYXQpCiAgICBwbG90X3ZhbGlkYXRpb24oY29udGV4dCwgbG9zcywgdmFsX2xvc3MsIHRhcmdldF9wYXRoKQogICAgCiAgICAjIHNhdmUgbW9kZWwKICAgIGZpbGVwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgZHVtcChtb2RlbCwgb3BlbihmaWxlcGF0aCwgJ3diJykpCiAgICBj
b250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWZpbGVwYXRoKQogICAgICAgIApkZWYgcGxvdF92YWxpZGF0aW9uKAogICAgY29udGV4dDogTUxDbGllbnRDdHgsCiAgICB0cmFpbl9tZXRyaWMsCiAgICB2YWxpZF9tZXRyaWMsCiAgICB0YXJnZXRfcGF0aDogc3RyID0gJycsCiAgICBuYW1lOiBzdHIgPSAiaGlzdG9yeS5wbmciLAogICAga2V5OiBzdHIgPSAndHJhaW5pbmctdmFsaWRhdGlvbi1wbG90JwopOgogICAgIiIiUGxvdCB0cmFpbiBhbmQgdmFsaWRhdGlvbiBsb3NzIGN1cnZlcwoKICAgIFRoZXNlIGN1cnZlcyByZXByZXNlbnQgdGhlIHRyYWluaW5nIHJvdW5kIGxvc3NlcyBmcm9tIHRoZSB0cmFpbmluZwogICAgYW5kIHZhbGlkYXRpb24gc2V0cy4KICAgIAogICAgOnBhcmFtIGNvbnRleHQ6ICAgICAgICAgdGhlIGZ1bmN0aW9uIGNvbnRleHQKICAgIDpwYXJhbSB0cmFpbl9tZXRyaWM6ICAgIHRyYWluIG1ldHJpYwogICAgOnBhcmFtIHZhbGlkX21ldHJpYzogICAgdmFsaWRhdGlvbiBtZXRyaWMKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICAgIGRlc3RpbmF0aW4gcGF0aCBmb3IgdHJhaW4vdm9saWRhdGlvbiBoaXN0b3J5IHBsb3QgYXJ0aWZhY3QKICAgICIiIgogICAgIyBnZW5lcmF0ZSBwbG90CiAgICBwbHQucGxvdCh0cmFpbl9tZXRyaWMpCiAgICBwbHQucGxvdCh2YWxpZF9tZXRyaWMpCiAgICBwbHQudGl0bGUoInRyYWluaW5nIHZhbGlkYXRpb24gcmVzdWx0cyIpCiAgICBwbHQueGxhYmVsKCJlcG9jaCIpCiAgICBwbHQueWxhYmVsKCIiKQogICAgcGx0LmxlZ2VuZChbInRyYWluIiwgInZhbGlkIl0pCiAgICBmaWcgPSBwbHQuZ2NmKCkKCiAgICAjIHNhdmUgZmlndXJlIGFuZCBsb2cgYXJ0aWZhY3QKICAgIHBsb3RwYXRoID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lKQogICAgcGx0LnNhdmVmaWcocGxvdHBhdGgpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChQbG90QXJ0aWZhY3Qoa2V5LCBib2R5PWZpZykpCgogICAgIyBwbG90IGNsZWFudXAKICAgIHBsdC5jbGEoKQogICAgcGx0LmNsZigpCiAgICBwbHQuY2xvc2UoKSAgICAgICAgCg== - base_image: yjbds/mlrun-ds:latest + base_image: yjbds/mlrun-daskboost:dev commands: [] code_origin: https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/train/sklearn-classifier.py