Merged
32 commits
80b3a37
added lgbm server
Jan 20, 2020
1c51ea0
removed vscode
Jan 20, 2020
db8da20
add log header too to arc_2parq
Jan 20, 2020
874e777
add header to artifacts in arc_to_parquet
Jan 21, 2020
5a28adb
mf
Jan 21, 2020
0aba6b3
buggy open_archive, implicit download and name mangling issue
Jan 21, 2020
6deeed1
buggy open_archive, implicit download and name mangling issue
Jan 21, 2020
5cd1322
name mangling restricted to 'inputs' paramater, running
Jan 21, 2020
208ab7f
adjusted yaml link
Jan 21, 2020
94a4aab
arc-to-parq fixes
Jan 21, 2020
464551f
refactor, incomplete
Jan 21, 2020
3f7e0c7
eod, backup, refactor incomplete
Jan 21, 2020
8814f73
lightgbm/sklearn classifier running, load yaml only
Jan 22, 2020
d66ee33
minor fixes, debugged & running
Jan 22, 2020
8f99ee9
added simple splitter function
Jan 23, 2020
767dea3
acquire-train-test completed
Jan 26, 2020
25e611e
rename file
Jan 26, 2020
e4d74d7
minor fixes
Jan 26, 2020
e613e55
tests: all output to models folder
Jan 26, 2020
98867b1
resolved pyarrow versions issue
Jan 26, 2020
9f080fe
gitignore issue
Jan 26, 2020
f6fd45c
fix image source arc-to-parquet
Jan 27, 2020
eb009da
fix image source arc-to-parquet
Jan 27, 2020
e67bbbb
add partitioning to parquet save, arc-to-parq
Jan 27, 2020
5a80835
add dtype param for partitioning
Jan 27, 2020
dba3bc1
parquet partitioning passes test
Jan 27, 2020
2561059
added fileutils/parquet-to-dask function
Jan 28, 2020
c16b089
added fileutils/parquet-to-dask function
Jan 28, 2020
aa23d8d
update mlrunapi, parq-to-dask dask job running
Jan 29, 2020
dcb6bcb
eod, backup
Jan 30, 2020
9c0c9da
added table-summary artifact to dask workflow
Jan 30, 2020
f781512
temp backup
Jan 30, 2020
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,3 +2,6 @@ models/
.ipynb_checkpoints
*.gz
*.csv
*.pyc
*.swp
dask-worker-space
3 changes: 0 additions & 3 deletions .vscode/settings.json

This file was deleted.

9 changes: 9 additions & 0 deletions datagen/README.md
@@ -0,0 +1,9 @@
# data generators

## classification

**`binary`** generate binary classification data

## splitters

**`train_valid_test`** given a raw dataset, create 3 splits and save the results
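
As a rough illustration of the three-way split described above, the arithmetic below shows what fraction of the raw rows lands in each set. It assumes the defaults `test_size=0.1` and `train_val_split=0.75` that appear in `datagen/train_valid_test/function.py` in this PR. The helper name `split_fractions` is hypothetical and not part of the repository, and the real row counts will differ slightly because scikit-learn rounds each split to whole rows.

```python
def split_fractions(test_size: float = 0.1, train_val_split: float = 0.75):
    """Return (train, valid, test) fractions of the raw data.

    Hypothetical helper: mirrors a double split in which the test set is
    removed first and the remainder is divided into train and validation.
    """
    remainder = 1.0 - test_size           # what is left after the test split
    train = remainder * train_val_split   # share of the remainder kept for training
    valid = remainder - train             # the rest becomes the validation set
    return train, valid, test_size


train, valid, test = split_fractions()
print(f"train={train:.3f} valid={valid:.3f} test={test:.3f}")
```

With the defaults this works out to roughly 67.5% train, 22.5% validation and 10% test.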
31 changes: 15 additions & 16 deletions datagen/binary_classes/binary.py → datagen/binary/function.py
@@ -1,4 +1,4 @@
n_samp# Copyright 2019 Iguazio
# Copyright 2019 Iguazio
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -22,21 +22,21 @@


def create_binary_classification(
context: MLClientCtx = None,
n_samples: int = 100_000,
m_features: int = 20,
features_hdr: Optional[List[str]] = None,
weight: float = 0.50,
random_state=1,
filename: Optional[str] = None,
target_path: str = "",
key: str = "",
**sk_params,
context : MLClientCtx = None,
n_samples : int = 100_000,
m_features : int = 20,
features_hdr : Optional[List[str]] = None,
weight : float = 0.50,
random_state : int =1,
filename : Optional[str] = None,
target_path : str = "",
key : str = ""
):
"""Create a binary classification sample dataset and save.
If no filename is given it will default to:
'simdata-{n_samples}X{m_features}.parquet'.
All of the scikit-learn parameters can be set using **sk_params

:param context: function context
:param n_samples: number of rows/samples
:param m_features: number of cols/features
@@ -46,23 +46,22 @@ def create_binary_classification(
:param filename: optional name for stored data file
:param target_path: destination for file
:param key: key of data in artifact store
:param sk_params: keyword arguments for scikit-learn's 'make_classification'
Returns filename of created data (includes path).
"""
# check directories exist and create filename if None:
os.makedirs(target_path, exist_ok=True)
if not filename:
name = f"simdata-{n_samples:0.0e}X{m_features}.parquet".replace("+", "")
filename = os.path.join(target_path, name)

else:
filename = os.path.join(target_path, filename)

features, labels = make_classification(
n_samples=n_samples,
n_features=m_features,
weights=[weight], # False
n_classes=2,
random_state=random_state,
**sk_params,
)
random_state=random_state)

# make dataframes, add column names, concatenate (X, y)
X = pd.DataFrame(features)
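
The default file name in the hunk above is built with a scientific-notation format spec, so it does not literally contain `n_samples` as the docstring suggests. A minimal sketch of that expression in isolation follows; `default_filename` is a hypothetical wrapper added for illustration, while the f-string itself is copied from the diff.

```python
def default_filename(n_samples: int = 100_000, m_features: int = 20) -> str:
    """Hypothetical wrapper around the default-name expression from the diff."""
    # "0.0e" renders the sample count in scientific notation with no decimals,
    # e.g. 100000 -> "1e+05"; the "+" is then stripped from the name.
    return f"simdata-{n_samples:0.0e}X{m_features}.parquet".replace("+", "")


print(default_filename())             # simdata-1e05X20.parquet
print(default_filename(120_000, 20))  # simdata-1e05X20.parquet (same name)
```

Because the mantissa is rounded to one digit, different sample counts can map to the same file name, so passing an explicit `filename` may be safer when several datasets share a `target_path`.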
18 changes: 18 additions & 0 deletions datagen/binary/function.yaml
@@ -0,0 +1,18 @@
kind: job
metadata:
name: binary
tag: ''
hash: 0527f27939f7f6b39d435d9e62d484c0bab308c8
project: ''
spec:
command: ''
args: []
volumes: []
volume_mounts: []
env: []
description: ''
build:
functionSourceCode: IyBDb3B5cmlnaHQgMjAxOSBJZ3VhemlvCiMKIyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKIyB5b3UgbWF5IG5vdCB1c2UgdGhpcyBmaWxlIGV4Y2VwdCBpbiBjb21wbGlhbmNlIHdpdGggdGhlIExpY2Vuc2UuCiMgWW91IG1heSBvYnRhaW4gYSBjb3B5IG9mIHRoZSBMaWNlbnNlIGF0CiMKIyAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNFTlNFLTIuMAojCiMgVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQojIGRpc3RyaWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFuICJBUyBJUyIgQkFTSVMsCiMgV0lUSE9VVCBXQVJSQU5USUVTIE9SIENPTkRJVElPTlMgT0YgQU5ZIEtJTkQsIGVpdGhlciBleHByZXNzIG9yIGltcGxpZWQuCiMgU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAojIGxpbWl0YXRpb25zIHVuZGVyIHRoZSBMaWNlbnNlLgppbXBvcnQgb3MKaW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgcHlhcnJvdyBhcyBwYQppbXBvcnQgcHlhcnJvdy5wYXJxdWV0IGFzIHBxCmZyb20gdHlwaW5nIGltcG9ydCBPcHRpb25hbCwgTGlzdCwgQW55CmZyb20gc2tsZWFybi5kYXRhc2V0cyBpbXBvcnQgbWFrZV9jbGFzc2lmaWNhdGlvbgoKZnJvbSBtbHJ1bi5leGVjdXRpb24gaW1wb3J0IE1MQ2xpZW50Q3R4CgoKZGVmIGNyZWF0ZV9iaW5hcnlfY2xhc3NpZmljYXRpb24oCiAgICBjb250ZXh0IDogTUxDbGllbnRDdHggPSBOb25lLAogICAgbl9zYW1wbGVzIDogaW50ID0gMTAwXzAwMCwKICAgIG1fZmVhdHVyZXMgOiBpbnQgPSAyMCwKICAgIGZlYXR1cmVzX2hkciA6IE9wdGlvbmFsW0xpc3Rbc3RyXV0gPSBOb25lLAogICAgd2VpZ2h0IDogZmxvYXQgPSAwLjUwLAogICAgcmFuZG9tX3N0YXRlIDogaW50ID0xLAogICAgZmlsZW5hbWUgOiBPcHRpb25hbFtzdHJdID0gTm9uZSwKICAgIHRhcmdldF9wYXRoIDogc3RyID0gIiIsCiAgICBrZXkgOiBzdHIgPSAiIgopOgogICAgIiIiQ3JlYXRlIGEgYmluYXJ5IGNsYXNzaWZpY2F0aW9uIHNhbXBsZSBkYXRhc2V0IGFuZCBzYXZlLgogICAgSWYgbm8gZmlsZW5hbWUgaXMgZ2l2ZW4gaXQgd2lsbCBkZWZhdWx0IHRvOgogICAgJ3NpbWRhdGEte25fc2FtcGxlc31Ye21fZmVhdHVyZXN9LnBhcnF1ZXQnLgogICAgQWxsIG9mIHRoZSBzY2lraXQtbGVhcm4gcGFyYW1ldGVycyBjYW4gYmUgc2V0IHVzaW5nICoqc2tfcGFyYW1zCiAgICAKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gbl9zYW1wbGVzOiAgICAgbnVtYmVyIG9mIHJvd3Mvc2FtcGxlcwogICAgOnBhcmFtIG1fZmVhdHVyZXM6ICAgIG51bWJlciBvZiBjb2xzL2ZlYXR1cmVzCiAgICA6cGFyYW0gZmVhdHVy
ZXNfaGRyOiAgaGVhZGVyIGZvciBmZWF0dXJlcyBhcnJheQogICAgOnBhcmFtIHdlaWdodDogICAgICAgIGZyYWN0aW9uIG9mIHNhbXBsZSAobmVnKQogICAgOnBhcmFtIHJhbmRvbV9zdGF0ZTogIHJuZyBzZWVkIChzZWUgaHR0cHM6Ly9zY2lraXQtbGVhcm4ub3JnL3N0YWJsZS9nbG9zc2FyeS5odG1sI3Rlcm0tcmFuZG9tLXN0YXRlKQogICAgOnBhcmFtIGZpbGVuYW1lOiAgICAgIG9wdGlvbmFsIG5hbWUgZm9yIHN0b3JlZCBkYXRhIGZpbGUKICAgIDpwYXJhbSB0YXJnZXRfcGF0aDogICBkZXN0aW1hdGlvbiBmb3IgZmlsZQogICAgOnBhcmFtIGtleTogICAgICAgICAgIGtleSBvZiBkYXRhIGluIGFydGlmYWN0IHN0b3JlCiAgICBSZXR1cm5zIGZpbGVuYW1lIG9mIGNyZWF0ZWQgZGF0YSAoaW5jbHVkZXMgcGF0aCkuCiAgICAiIiIKICAgICMgY2hlY2sgZGlyZWN0b3JpZXMgZXhpc3QgYW5kIGNyZWF0ZSBmaWxlbmFtZSBpZiBOb25lOgogICAgb3MubWFrZWRpcnModGFyZ2V0X3BhdGgsIGV4aXN0X29rPVRydWUpCiAgICBpZiBub3QgZmlsZW5hbWU6CiAgICAgICAgbmFtZSA9IGYic2ltZGF0YS17bl9zYW1wbGVzOjAuMGV9WHttX2ZlYXR1cmVzfS5wYXJxdWV0Ii5yZXBsYWNlKCIrIiwgIiIpCiAgICAgICAgZmlsZW5hbWUgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUpCiAgICBlbHNlOgogICAgICAgIGZpbGVuYW1lID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBmaWxlbmFtZSkKICAgIAogICAgZmVhdHVyZXMsIGxhYmVscyA9IG1ha2VfY2xhc3NpZmljYXRpb24oCiAgICAgICAgbl9zYW1wbGVzPW5fc2FtcGxlcywKICAgICAgICBuX2ZlYXR1cmVzPW1fZmVhdHVyZXMsCiAgICAgICAgd2VpZ2h0cz1bd2VpZ2h0XSwgICMgRmFsc2UKICAgICAgICBuX2NsYXNzZXM9MiwKICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKQoKICAgICMgbWFrZSBkYXRhZnJhbWVzLCBhZGQgY29sdW1uIG5hbWVzLCBjb25jYXRlbmF0ZSAoWCwgeSkKICAgIFggPSBwZC5EYXRhRnJhbWUoZmVhdHVyZXMpCiAgICBpZiBub3QgZmVhdHVyZXNfaGRyOgogICAgICAgIFguY29sdW1ucyA9IFsiZmVhdF8iICsgc3RyKHgpIGZvciB4IGluIHJhbmdlKG1fZmVhdHVyZXMpXQogICAgZWxzZToKICAgICAgICBYLmNvbHVtbnMgPSBmZWF0dXJlc19oZHIKCiAgICB5ID0gcGQuRGF0YUZyYW1lKGxhYmVscywgY29sdW1ucz1bImxhYmVscyJdKQogICAgZGF0YSA9IHBkLmNvbmNhdChbWCwgeV0sIGF4aXM9MSkKCiAgICBwcS53cml0ZV90YWJsZShwYS5UYWJsZS5mcm9tX3BhbmRhcyhkYXRhKSwgZmlsZW5hbWUpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdChrZXksIHRhcmdldF9wYXRoPWZpbGVuYW1lKQo=
base_image: yjbds/mlrun-intel:dev
commands: []
code_origin: https://github.com/yjb-ds/functions.git#e4d74d784d42fb25cc75cbcab6d817bb1d2b150c:/User/repos/functions/datagen/classification/binary.py
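
The `functionSourceCode` field above appears to hold the function's Python source as a base64 string: the start of the blob decodes to the license header. A minimal sketch of decoding such a field is shown below, using only the first 32 characters of the blob as input. `decode_function_source` is a hypothetical helper written for this illustration, not an mlrun API.

```python
import base64


def decode_function_source(encoded: str) -> str:
    """Decode a base64-encoded source string into text (hypothetical helper)."""
    return base64.b64decode(encoded).decode("utf-8")


# First 32 characters of the blob in datagen/binary/function.yaml
prefix = "IyBDb3B5cmlnaHQgMjAxOSBJZ3Vhemlv"
print(decode_function_source(prefix))  # "# Copyright 2019 Iguazio"
```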
102 changes: 102 additions & 0 deletions datagen/train_valid_test/function.py
@@ -0,0 +1,102 @@
import pandas as pd
import os
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa
from cloudpickle import dump

from sklearn.model_selection import train_test_split
from typing import Optional, Union
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def train_valid_test_splitter(
    context: Optional[MLClientCtx] = None,
    src_file: Union[DataItem, str] = '',
    header: Union[DataItem, str, list] = '',
    sample: int = -1,
    label_column: str = 'labels',
    test_size: float = 0.1,
    train_val_split: float = 0.75,
    target_path: str = '',
    name: str = '',
    key: str = '',
    random_state: int = 1
) -> None:
    """Split raw data input into train, validation and test sets.

    :param context:         the function context
    :param src_file:        ('') name of the raw data file, joined onto
                            target_path
    :param header:          ('') header artifact or list of column names
    :param sample:          (-1) -1 uses all rows; n >= 1 takes the first
                            n rows; n < -1 draws a random sample of |n| rows
                            from the entire file
    :param label_column:    ground-truth (y) labels
    :param test_size:       (0.1) test set size
    :param train_val_split: (0.75) once the test set has been removed, the
                            training set gets this proportion of the rest
    :param target_path:     folder location of files
    :param name:            destination prefix name for output files
    :param key:             key for model artifact
    :param random_state:    (1) sklearn rng seed
    """
    srcfilepath = os.path.join(target_path, str(src_file))

    if (sample == -1) or (sample >= 1):
        # get all rows, or a contiguous sample starting at the first row
        raw = pq.read_table(srcfilepath).to_pandas()
        labels = raw.pop(label_column)
        if sample >= 1:
            # slice only for n >= 1: iloc[:-1] would drop the last row
            raw = raw.iloc[:sample, :]
            labels = labels.iloc[:sample]
    else:
        # grab a random sample of |sample| rows
        #raw = pd.read_parquet(srcfilepath, engine='pyarrow').sample(sample*-1)
        raw = pq.read_table(srcfilepath).to_pandas().sample(sample*-1)
        labels = raw.pop(label_column)

    # double split to generate 3 data sets: train, validation and test
    x, xtest, y, ytest = train_test_split(raw, labels, test_size=test_size,
                                          random_state=random_state)

    xtrain, xvalid, ytrain, yvalid = train_test_split(x, y,
                                                      train_size=train_val_split,
                                                      random_state=random_state)

    if name:
        name = '-' + name

    # save header (the file handle is closed once the dump completes)
    f = os.path.join(target_path, name + 'header.pkl')
    with open(f, 'wb') as fh:
        dump(raw.columns.values, fh)
    context.log_artifact('header', target_path=f)

    # save data sets
    f = os.path.join(target_path, name + 'xtrain.pqt')
    xtrain.to_parquet(f)
    context.log_artifact('xtrain', target_path=f)

    f = os.path.join(target_path, name + 'xvalid.pqt')
    xvalid.to_parquet(f)
    context.log_artifact('xvalid', target_path=f)

    f = os.path.join(target_path, name + 'xtest.pqt')
    xtest.to_parquet(f)
    context.log_artifact('xtest', target_path=f)

    f = os.path.join(target_path, name + 'ytrain.pqt')
    pd.DataFrame({'labels': ytrain}).to_parquet(f)
    context.log_artifact('ytrain', target_path=f)

    f = os.path.join(target_path, name + 'yvalid.pqt')
    pd.DataFrame({'labels': yvalid}).to_parquet(f)
    context.log_artifact('yvalid', target_path=f)

    f = os.path.join(target_path, name + 'ytest.pqt')
    pd.DataFrame({'labels': ytest}).to_parquet(f)
    context.log_artifact('ytest', target_path=f)
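
The three modes of the `sample` parameter documented in the function above can be sketched on a plain Python sequence, without pandas or parquet. This is only an illustration of the intended selection rules as the docstring describes them. `select_rows` and its `seed` argument are hypothetical, and raising on `sample == 0` is an assumption, since the original does not document that case.

```python
import random
from typing import List, Sequence


def select_rows(rows: Sequence[int], sample: int = -1, seed: int = 1) -> List[int]:
    """Hypothetical stand-in for the row selection in train_valid_test_splitter."""
    if sample == -1:
        return list(rows)            # keep every row
    if sample >= 1:
        return list(rows[:sample])   # first n rows, in order
    if sample < -1:
        # random sample of |n| rows, drawn without replacement
        return random.Random(seed).sample(list(rows), -sample)
    # sample == 0 is not covered by the docstring; rejecting it is an assumption
    raise ValueError("sample must be -1, >= 1, or < -1")


rows = range(10)
print(select_rows(rows))           # all ten rows
print(select_rows(rows, 3))        # [0, 1, 2]
print(len(select_rows(rows, -4)))  # 4
```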
18 changes: 18 additions & 0 deletions datagen/train_valid_test/function.yaml
@@ -0,0 +1,18 @@
kind: job
metadata:
name: train-valid-test
tag: ''
hash: a20a8322b51297f4491727c3a2beb3b3ec505999
project: ''
spec:
command: ''
args: []
volumes: []
volume_mounts: []
env: []
description: ''
build:
functionSourceCode: aW1wb3J0IHBhbmRhcyBhcyBwZAppbXBvcnQgb3MKaW1wb3J0IG51bXB5IGFzIG5wCmltcG9ydCBweWFycm93LnBhcnF1ZXQgYXMgcHEKaW1wb3J0IHB5YXJyb3cgYXMgcGEKZnJvbSBjbG91ZHBpY2tsZSBpbXBvcnQgZHVtcAoKaW1wb3J0IHB5YXJyb3cucGFycXVldCBhcyBwcQppbXBvcnQgcHlhcnJvdyBhcyBwYQoKZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdApmcm9tIHR5cGluZyBpbXBvcnQgT3B0aW9uYWwsIFVuaW9uCmZyb20gbWxydW4uZXhlY3V0aW9uIGltcG9ydCBNTENsaWVudEN0eApmcm9tIG1scnVuLmRhdGFzdG9yZSBpbXBvcnQgRGF0YUl0ZW0KCmltcG9ydCB3YXJuaW5ncwp3YXJuaW5ncy5zaW1wbGVmaWx0ZXIoYWN0aW9uPSdpZ25vcmUnLCBjYXRlZ29yeT1GdXR1cmVXYXJuaW5nKQoKZGVmIHRyYWluX3ZhbGlkX3Rlc3Rfc3BsaXR0ZXIoCiAgICBjb250ZXh0OiBPcHRpb25hbFtNTENsaWVudEN0eF0gPSBOb25lLAogICAgc3JjX2ZpbGU6IFVuaW9uW0RhdGFJdGVtLCBzdHJdID0gJycsCiAgICBoZWFkZXI6IFVuaW9uW0RhdGFJdGVtLCBzdHIsIGxpc3RdID0gJycsCiAgICBzYW1wbGU6IGludCA9IC0xLAogICAgbGFiZWxfY29sdW1uOiBzdHIgPSAnbGFiZWxzJywKICAgIHRlc3Rfc2l6ZTogZmxvYXQgPSAwLjEsCiAgICB0cmFpbl92YWxfc3BsaXQ6IGZsb2F0ID0gMC43NSwKICAgIHRhcmdldF9wYXRoOiBzdHIgPSAnJywKICAgIG5hbWU6IHN0ciA9ICcnLAogICAga2V5OiBzdHIgPSAnJywKICAgIHJhbmRvbV9zdGF0ZSA9IDEKKSAtPiBOb25lOgogICAgIiIiU3BsaXQgcmF3IGRhdGEgaW5wdXQgaW50byB0cmFpbiwgdmFsaWRhdGlvbiBhbmQgdGVzdCBzZXRzLgoKICAgIDpwYXJhbSBjb250ZXh0OiAgICAgICAgIHRoZSBmdW5jdGlvbiBjb250ZXh0CiAgICA6cGFyYW0gc3JjX2ZpbGU6ICAgICAgICAoJ3JhdycpIG5hbWUgb2YgcmF3IGRhdGEgZmlsZQogICAgOnBhcmFtIGhlYWRlcjogICAgICAgICAgKE5vbmUpIGhlYWRlciBhcnRpZmFjdCBvciBsaXN0IG9mIGNvbHVtbiBuYW1lcy4KICAgIDpwYXJhbSBzYW1wbGU6ICAgICAgICAgICgtMSkuIFNlbGVjdHMgdGhlIGZpcnN0IG4gcm93cywgb3Igc2VsZWN0IGEgc2FtcGxlIHN0YXJ0aW5nCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBmcm9tIHRoZSBmaXJzdC4gSWYgbmVnYXRpdmUgPC0xLCBzZWxlY3QgYSByYW5kb20gc2FtcGxlIGZyb20gCiAgICAgICAgICAgICAgICAgICAgICAgICAgICB0aGUgZW50aXJlIGZpbGUKICAgIDpwYXJhbSBsYWJlbF9jb2x1bW46ICAgIGdyb3VuZC10cnV0aCAoeSkgbGFiZWxzCiAgICA6cGFyYW0gdGVzdF9zaXplOiAgICAgICAoMC4xKSB0ZXN0IHNldCBzaXplCiAgICA6cGFyYW0gdHJhaW5fdmFsX3NwbGl0OiAoMC43NSkgT25jZSB0aGUgdGVzdCBzZXQgaGFzIGJlZW4gcmVtb3ZlZCB0aGUgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICB0cmFpbmlu
ZyBzZXQgZ2V0cyB0aGlzIHByb3BvcnRpb24uCiAgICA6cGFyYW0gdGFyZ2V0X3BhdGg6ICAgICBmb2xkZXIgbG9jYXRpb24gb2YgZmlsZXMKICAgIDpwYXJhbSBuYW1lOiAgICAgICAgICAgIGRlc3RpbmF0aW9uIHByZWZpeCBuYW1lIGZvciBtb2RlbCBmaWxlcwogICAgOnBhcmFtIGtleTogICAgICAgICAgICAga2V5IGZvciBtb2RlbCBhcnRpZmFjdAogICAgOnBhcmFtIHJhbmRvbV9zdGF0ZTogICAgKDEpIHNrbGVhcm4gcm5nIHNlZWQKICAgICIiIgogICAgc3JjZmlsZXBhdGggPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIHN0cihzcmNfZmlsZSkpCgogICAgaWYgKHNhbXBsZSA9PSAtMSkgb3IgKHNhbXBsZSA+PSAxKToKICAgICAgICAjIGdldCBhbGwgcm93cywgb3IgY29udGlndW91cyBzYW1wbGUgc3RhcnRpbmcgYXQgcm93IDEuCiAgICAgICAgcmF3ID0gcHEucmVhZF90YWJsZShzcmNmaWxlcGF0aCkudG9fcGFuZGFzKCkKICAgICAgICBsYWJlbHMgPSByYXcucG9wKGxhYmVsX2NvbHVtbikKICAgICAgICByYXcgPSByYXcuaWxvY1s6c2FtcGxlLCA6XQogICAgICAgIGxhYmVscyA9IGxhYmVscy5pbG9jWzpzYW1wbGVdCiAgICBlbHNlOgogICAgICAgICMgZ3JhYiBhIHJhbmRvbSBzYW1wbGUKICAgICAgICAjcmF3ID0gcGQucmVhZF9wYXJxdWV0KHNyY2ZpbGVwYXRoLCBlbmdpbmU9J3B5YXJyb3cnKS5zYW1wbGUoc2FtcGxlKi0xKQogICAgICAgIHJhdyA9IHBxLnJlYWRfdGFibGUoc3JjZmlsZXBhdGgpLnRvX3BhbmRhcygpLnNhbXBsZShzYW1wbGUqLTEpCiAgICAgICAgbGFiZWxzID0gcmF3LnBvcChsYWJlbF9jb2x1bW4pCiAgICAKICAgICMgZG91YmxlIHNwbGl0IHRwIGdlbmVyYXRlIDMgZGF0YSBzZXRzOiB0cmFpbiwgdmFsaWRhdGlvbiBhbmQgdGVzdAogICAgeCwgeHRlc3QsIHksIHl0ZXN0ID0gdHJhaW5fdGVzdF9zcGxpdChyYXcsIGxhYmVscywgdGVzdF9zaXplPXRlc3Rfc2l6ZSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHJhbmRvbV9zdGF0ZT1yYW5kb21fc3RhdGUpCiAgIAogICAgeHRyYWluLCB4dmFsaWQsIHl0cmFpbiwgeXZhbGlkID0gdHJhaW5fdGVzdF9zcGxpdCh4LCB5LCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgdHJhaW5fc2l6ZT10cmFpbl92YWxfc3BsaXQsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByYW5kb21fc3RhdGU9cmFuZG9tX3N0YXRlKSAgICAgICAgCgogICAgaWYgbmFtZToKICAgICAgICBuYW1lID0gJy0nICsgbmFtZQogICAgCiAgICAjIHNhdmUgaGVhZGVyCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ2hlYWRlci5wa2wnKQogICAgZHVtcChyYXcuY29sdW1ucy52YWx1ZXMsIG9wZW4oZiwgJ3diJykpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgnaGVhZGVyJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgIyBz
YXZlIGRhdGEgc2V0cwogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dHJhaW4ucHF0JykKICAgIHh0cmFpbi50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneHRyYWluJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dmFsaWQucHF0JykKICAgIHh2YWxpZC50b19wYXJxdWV0KGYpCiAgICBjb250ZXh0LmxvZ19hcnRpZmFjdCgneHZhbGlkJywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd4dGVzdC5wcXQnKQogICAgeHRlc3QudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3h0ZXN0JywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgZiA9IG9zLnBhdGguam9pbih0YXJnZXRfcGF0aCwgbmFtZSArICd5dHJhaW4ucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl0cmFpbn0pLnRvX3BhcnF1ZXQoZikKICAgIGNvbnRleHQubG9nX2FydGlmYWN0KCd5dHJhaW4nLCB0YXJnZXRfcGF0aD1mKQogICAgCiAgICBmID0gb3MucGF0aC5qb2luKHRhcmdldF9wYXRoLCBuYW1lICsgJ3l2YWxpZC5wcXQnKQogICAgcGQuRGF0YUZyYW1lKHsnbGFiZWxzJzogeXZhbGlkfSkudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3l2YWxpZCcsIHRhcmdldF9wYXRoPWYpCiAgICAKICAgIGYgPSBvcy5wYXRoLmpvaW4odGFyZ2V0X3BhdGgsIG5hbWUgKyAneXRlc3QucHF0JykKICAgIHBkLkRhdGFGcmFtZSh7J2xhYmVscyc6IHl0ZXN0fSkudG9fcGFycXVldChmKQogICAgY29udGV4dC5sb2dfYXJ0aWZhY3QoJ3l0ZXN0JywgdGFyZ2V0X3BhdGg9ZikKICAgIAogICAgY29udGV4dC5sb2dnZXIuaW5mbygnbnVtcHknLCBucC5fX3ZlcnNpb25fXykKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ3BhbmRhcyAnLCBwZC5fX3ZlcnNpb25fXykKICAgIGNvbnRleHQubG9nZ2VyLmluZm8oJ3B5YXJyb3cnLCBwYS5fX3ZlcnNpb25fXyk=
base_image: yjbds/mlrun-intel:dev
commands: []
code_origin: https://github.com/yjb-ds/functions.git#e613e55761fd1ed325ad88155877924aa5b49ccc:/User/repos/functions/datagen/splitters/train_valid_test.py