heriec/CEP

Forgetting by Pruning: Data Deletion in Join Cardinality Estimation

This repository contains the code for experiments conducted on the following datasets:

  • JOB-light
  • TPC-H

The models used in the experiments are:

  • NeuroCard
  • FACE

Installation

First, create a conda environment with Python 3.7, then install the required packages with pip:

conda create -n cep python=3.7
conda activate cep
pip install -r requirements.txt

FACE additionally requires installing the bundled torchquadMy package from source:

cd torchquadMy
pip install .

Dataset Download

Download datasets into the datasets directory:

  • IMDb dataset: see the scripts/ directory for the download script; after downloading, run prepend_imdb_header.py to add the header to the downloaded files.

  • TPC-H dataset: Run the following command to generate 10GB of data:

bash scripts/tpch.sh

Running Experiments

Step 1: Generate Models

Before running unlearning tasks, you need to generate the initial models. Use the following commands:

python run.py --run job-light --model neurocard
python run.py --run job-light --model face
python run.py --run tpch --model neurocard
python run.py --run tpch --model face

Step 2: Unlearning Tasks

Run unlearning tasks and optionally evaluate the results by adding --eval:

python run_unlearning.py --run job-light --filter imdb-A2-1-0.5 imdb-A6-1-1 --ul-method stale retrain fine-tune cep
  • --run: Specifies the workload (e.g., job-light, tpch).
  • --filter: Specifies one or more deletion filters, i.e., which data to forget (e.g., R-1-0.1, R-1-0.3).
  • --model: Specifies the model to use (e.g., neurocard, face).
  • --ul-method: Specifies the unlearning methods, including baselines (stale, retrain, fine-tune) and our method (cep).
  • --eval: Enables evaluation during unlearning tasks.
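Combining the flags above, a full invocation for the TPC-H workload might look like the following (the filter name is illustrative; substitute one matching your deletion setup):

```shell
# Run all four unlearning methods on TPC-H with the FACE model,
# evaluating during the unlearning tasks.
# R-1-0.1 is one of the example filter names listed above.
python run_unlearning.py --run tpch --model face \
    --filter R-1-0.1 \
    --ul-method stale retrain fine-tune cep \
    --eval
```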

Evaluation

To modify the evaluation checkpoints, update the configuration files located in config/eval/<method>/checkpoint_to_load.

Configuration

Configuration files for the experiments are located in the config directory. You can modify these files to adjust parameters.

Results

All results are saved to the cache directory. Check your configured cache_dir to locate the corresponding results and cache files; by default, results are stored in the cache folder within the current directory. Each task writes to its own subfolder for isolation.

References

This repository builds on the NeuroCard and FACE projects.

About

[AAAI26] "Forgetting by Pruning: Data Deletion in Join Cardinality Estimation" by Chaowei He, Yuanjun Liu, Qingzhi Ma, Shenyuan Ren, Xizhao Luo, Lei Zhao, An Liu
