heriec/CEP

Forgetting by Pruning: Data Deletion in Join Cardinality Estimation

This repository contains the code for experiments conducted on the following datasets:

  • JOB-light
  • TPC-H

The models used in the experiments are:

  • NeuroCard
  • FACE

Installation

First, create a conda environment with Python 3.7, then install the required packages with pip:

conda create -n cep python=3.7
conda activate cep
pip install -r requirements.txt

FACE additionally requires installing the bundled torchquadMy package from source:

cd torchquadMy
pip install .

Dataset Download

Download datasets into the datasets directory:

  • IMDb dataset: see the scripts/ directory for the download script; after downloading, run prepend_imdb_header.py to add the header to the downloaded files.

  • TPC-H dataset: Run the following command to generate 10GB of data:

bash scripts/tpch.sh

Running Experiments

Step 1: Generate Models

Before running unlearning tasks, you need to generate the initial models. Use the following commands:

python run.py --run job-light --model neurocard
python run.py --run job-light --model face
python run.py --run tpch --model neurocard
python run.py --run tpch --model face

Step 2: Unlearning Tasks

Run unlearning tasks and optionally evaluate the results by adding --eval:

python run_unlearning.py --run job-light --filter imdb-A2-1-0.5 imdb-A6-1-1 --ul-method stale retrain fine-tune cep
  • --run: Specifies the workload (e.g., job-light, tpch).
  • --filter: Specifies one or more deletion filters, i.e., which data to forget (e.g., R-1-0.1, R-1-0.3).
  • --model: Specifies the model to use (e.g., neurocard, face).
  • --ul-method: Specifies the unlearning methods, including baselines (stale, retrain, fine-tune) and our method (cep).
  • --eval: Enables evaluation during unlearning tasks.
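Combining the flags above, a full invocation for the TPC-H workload might look like the following (the filter name is illustrative; substitute one matching your deletion setup):

```shell
# Run all four unlearning methods on TPC-H with the FACE model,
# evaluating during the unlearning tasks.
# R-1-0.1 is one of the example filter names listed above.
python run_unlearning.py --run tpch --model face \
    --filter R-1-0.1 \
    --ul-method stale retrain fine-tune cep \
    --eval
```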

Evaluation

To modify the evaluation checkpoints, update the configuration files located in config/eval/<method>/checkpoint_to_load.

Configuration

Configuration files for the experiments are located in the config directory. You can modify these files to adjust parameters.

Results

All results are saved to the cache directory. Check your configured cache_dir to locate the corresponding results and cache files; by default, results are stored in the cache folder within the current directory. Each task writes to its own subfolder for isolation.

References

This repository builds on the NeuroCard and FACE projects.

About

[AAAI26] "Forgetting by Pruning: Data Deletion in Join Cardinality Estimation" by Chaowei He, Yuanjun Liu, Qingzhi Ma, Shenyuan Ren, Xizhao Luo, Lei Zhao, An Liu
