Subset Demo Generator

This package facilitates high-quality synthetic data ETLs. You declare each table as a dataclass, which can then be serialized to a CSV file, pushed to GCS, and loaded into multiple databases. It offers methods to generate values that stay unique run over run, as well as generation based on Poisson and other distributions.
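To make the idea concrete, here is a minimal standard-library sketch of the two pieces described above: drawing row counts from a Poisson distribution and writing dataclass instances out as CSV. It is not the package's implementation. The `Order` table and the `poisson_sample` and `to_csv` helpers are hypothetical names used only for illustration.

```python
import csv
import io
import itertools
import math
import random
from dataclasses import asdict, dataclass, field, fields


@dataclass
class Order:
    # Hypothetical table used only for this illustration.
    id: int = field(default_factory=itertools.count(1).__next__)
    customer_id: int = 0


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def to_csv(cls, rows, out) -> None:
    """Write dataclass instances as CSV, one column per dataclass field."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(cls)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


rng = random.Random(42)
orders = []
for customer_id in range(1, 4):
    # Each customer gets a Poisson-distributed number of orders (mean 2).
    for _ in range(poisson_sample(2.0, rng)):
        orders.append(Order(customer_id=customer_id))

buffer = io.StringIO()
to_csv(Order, orders, buffer)
print(buffer.getvalue())
```

The column names come straight from the dataclass fields, so adding a field to the class adds a column to the file without touching the writer.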

To create a new table:

import itertools
import random
from dataclasses import InitVar, dataclass, field

# Table is the metaclass provided by this package

@dataclass
class Person(metaclass=Table):
    # results in a file named person.csv and a table named person
    # dataclass fields become CSV columns and table columns with the correct types
    # this column increments smoothly run over run
    id: int = field(default_factory=itertools.count(1).__next__)
    # if more logic is needed than a simple default factory function, handle it in __post_init__
    gender: str = field(init=False)
    # if you have a parent object, you can pass it in for conditional logic
    parent_object: InitVar[type] = None

    def __post_init__(self, parent_object: type):
        self.gender = random.choice(['M', 'F'])
        if self.gender == 'M':
            ...
        else:
            ...

        # conditional logic based on the parent; skipped when no parent is passed
        if parent_object is not None and parent_object.foo > 10:
            self.gender = 'N/A'

# the same default factory keeps ids incrementing smoothly even when rows are added

    id:int = field(default_factory=itertools.count(1).__next__)
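The following runnable sketch shows how the pieces of the `Person` example above behave in plain Python: a shared `itertools.count` hands out consecutive ids, and an `InitVar` is passed to `__post_init__` without becoming a column. The package's `Table` metaclass is deliberately left out so the snippet runs standalone, and `Household` is a hypothetical parent class whose `foo` attribute mirrors the one used above.

```python
import itertools
import random
from dataclasses import InitVar, dataclass, field, fields
from typing import Optional


@dataclass
class Household:
    # Hypothetical parent table; `foo` mirrors the attribute used above.
    foo: int = 0


@dataclass
class Person:
    # One counter is shared by every instance, so ids are consecutive.
    id: int = field(default_factory=itertools.count(1).__next__)
    gender: str = field(init=False)
    # InitVar values reach __post_init__ but are not stored as fields.
    parent_object: InitVar[Optional[Household]] = None

    def __post_init__(self, parent_object: Optional[Household]):
        self.gender = random.choice(["M", "F"])
        if parent_object is not None and parent_object.foo > 10:
            self.gender = "N/A"


people = [Person(), Person(parent_object=Household(foo=11)), Person()]
print([(p.id, p.gender) for p in people])
print([f.name for f in fields(Person)])
```

Because `parent_object` is an `InitVar`, it does not appear in `fields(Person)`, which is why it would not show up as a column.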

Snowflake Staging Setup

First, in Snowflake, switch to the correct destination schema and create the stage from GCS:

create stage my_gcs_stage
  url = 'gcs://name_of_gcs_bucket'
  storage_integration = gcp_int;

Snowflake generates a GCP service account, which needs read privileges on the bucket. You can see the service account by running the following command:

DESC STORAGE INTEGRATION GCP_INT;

The service account that needs GCP storage reader privileges is listed in the STORAGE_GCP_SERVICE_ACCOUNT property.

The name of this stage must also be set in your .env environment variables file:

SF_STAGE_NAME=my_gcs_stage
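As a rough illustration of how the stage name might be consumed, the sketch below reads `SF_STAGE_NAME` from the environment and builds a Snowflake `COPY INTO` statement for one table. This is an assumption about usage, not the package's actual loading code. `build_copy_statement` is a hypothetical helper, and the file format options are placeholders that assume a CSV with a single header row.

```python
import os


def build_copy_statement(table: str, stage: str) -> str:
    """Build a COPY INTO statement that loads <table>.csv from the stage.

    Hypothetical helper: assumes the file is named after the table and
    has one header row.
    """
    return (
        f"COPY INTO {table} "
        f"FROM @{stage}/{table}.csv "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )


# Falls back to the example stage name if the variable is not set.
stage_name = os.environ.get("SF_STAGE_NAME", "my_gcs_stage")
print(build_copy_statement("person", stage_name))
```

The statement is only built as a string here; running it requires a Snowflake connection, which is outside the scope of this sketch.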

Licensing Considerations

Considered the licensable data from https://www.themoviedb.org/.

Using RapidAPI instead, which gives clips from IMDb videos.

Also uses only the population column from https://simplemaps.com/data/us-zips to generate realistic US addresses weighted by population.
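Population-weighted generation can be sketched with `random.choices`, which picks items in proportion to their weights. The rows below are made-up placeholders, not real data, and the `zip` and `population` column names are an assumption about the downloaded file's layout. `load_weights` is a hypothetical helper.

```python
import csv
import io
import random

# Placeholder rows for illustration only; not real ZIP codes or populations.
SAMPLE_CSV = """zip,population
00001,100
00002,300
00003,0
"""


def load_weights(csv_file):
    """Return parallel lists of ZIP codes and their populations."""
    zips, weights = [], []
    for row in csv.DictReader(csv_file):
        zips.append(row["zip"])
        weights.append(int(row["population"]))
    return zips, weights


zips, weights = load_weights(io.StringIO(SAMPLE_CSV))
rng = random.Random(0)
# ZIP codes with larger populations are drawn proportionally more often.
sampled = rng.choices(zips, weights=weights, k=1000)
print({z: sampled.count(z) for z in zips})
```

A ZIP code with zero population is never drawn, and one with three times the population of another is drawn roughly three times as often.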
