This is the replication package of the paper "NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations". In this repository, we present our tool NLPerturbator, the evaluation process, and our survey.
- README.md
- NLPerturbator: Code of NLPerturbator
- Dataset: Manually verified datasets: HumanEval-R and MBPP-R
- bigcode-evaluation-harness: The mirror of the evaluation tool we used
- Results_RQ3: Complete results of RQ3
- Survey: Questionnaire template and stats results
- Others: The literature review (list of collected papers and initial categories) and the appendix (implementation details and case studies)
We use Ubuntu 20.04.1 LTS and Python 3.10.12 to run this project. Use the following command to install the required packages:

```shell
pip install -r requirements.txt
```

We use two datasets in our study:
- mbpp: The mbpp dataset consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers. Note that we use the sanitized subset of this dataset.
- HumanEval: The HumanEval dataset includes 164 programming problems with a function signature, docstring, body, and several unit tests.
Use the following command to generate data perturbed by specific perturbators:
```shell
python main.py --dataset $DATASET --perturbator $PERTURBATOR
```

`DATASET` should be `mbpp` or `humaneval`, and `PERTURBATOR` should be one of the perturbators in the `perturbator/` directory. The output file will be saved as `output/$DATASET_$PERTURBATOR.csv`.
You can design your own perturbator and add it to our framework by following these steps:

1. Create a Python file `new_perturbator.py` in `perturbator/` and implement the function `perturbate()`. The `perturbate()` function accepts the NL description of the prompt plus any perturbator-specific arguments.
2. Add the arguments other than `prompt` to the `perturbator` dictionary in `config.yaml`. That is to say, if you add a `prob` argument in the YAML file, the `perturbate()` function in `perturbator/new_perturbator.py` accepts two arguments: `prompt` and `prob`.
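As a concrete illustration, a hypothetical `perturbator/new_perturbator.py` might look like the following sketch. The word-swap logic and the `prob` argument are our own example, not one of the shipped perturbators:

```python
import random

def perturbate(prompt, prob=0.1):
    """Example perturbator: randomly swap adjacent words in the NL prompt.

    `prompt` is the NL description passed in by the framework; `prob` is an
    extra argument that would be declared under `perturbator` in config.yaml.
    """
    words = prompt.split()
    i = 0
    while i < len(words) - 1:
        if random.random() < prob:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return " ".join(words)
```

With `prob: 0.1` added under the `perturbator` entry in `config.yaml`, the framework would call `perturbate(prompt, prob=0.1)` for each NL description.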
We share the datasets under the directory `Dataset/`, including:

- `HumanEval-R` and `MBPP-R`: manually verified perturbation datasets with default frequency
Note: In a few cases, there is no available element in the prompt to perform the perturbation; in such cases, the perturbed prompt is identical to the original prompt.
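Such unchanged cases can be counted by comparing the two prompt columns of a perturbed dataset. A minimal sketch using only the standard library — the inline CSV text and the `original`/`perturbed` column names are our own assumptions, not a guaranteed schema:

```python
import csv
import io

# Illustrative CSV; in practice, open the dataset file instead.
csv_text = (
    "original,perturbed\n"
    '"Write a function to add two numbers.","Write a function to sum two numbers."\n'
    '"Reverse a string.","Reverse a string."\n'
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
unchanged = sum(r["original"] == r["perturbed"] for r in rows)
print(f"{unchanged}/{len(rows)} prompts unchanged")
```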
In our paper, we use bigcode-evaluation-harness to run the code generation experiments with seven code LLMs. Here are the 🤗 Hugging Face links to these models: StarCoder, WizardCoder, InCoder, CodeGeeX2, CodeLlama, and CodeGen2. For GPT-3.5-Turbo, we invoke the official OpenAI API.
Compared to the original version of bigcode-evaluation-harness, we mainly make the following modifications to implement our experiments:

- Add statements used in the models' usage examples (e.g., `model.config.pad_token_id = tokenizer.pad_token_id`, used in the WizardCoder repository).
- Replace the official (original) datasets with our perturbed datasets.
- Output the details of the tests.
- Add simple support for the OpenAI API.
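For the last point, a request to GPT-3.5-Turbo might be assembled as in the sketch below. The helper function and its defaults are illustrative (mirroring the sampling settings we use elsewhere), not the harness's actual integration code:

```python
# Sketch: assemble chat-completion parameters for GPT-3.5-Turbo.
# The helper name and defaults are our own illustration.
def build_request(prompt, n_samples=15, temperature=0.2):
    """Build the keyword arguments for a chat-completions call."""
    return {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "n": n_samples,
    }

# An actual call would pass these to the OpenAI chat-completions endpoint,
# e.g. client.chat.completions.create(**build_request(prompt)).
```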
This is our example bash script; you can modify it to meet your requirements:
```shell
accelerate launch --config_file <your_conf.yaml> main.py \
  --model <your_model_path> \
  --max_length_generation 512 \
  --tasks <humaneval/mbpp> \
  --temperature 0.2 \
  --n_samples 15 \
  --batch_size 15 \
  --precision <fp16/bf16> \
  --allow_code_execution \
  --trust_remote_code \
  --save_generations \
  --save_generations_path <your_path.json> \
  --save_references \
  --save_references_path <your_path.json> \
  --metric_output_path <your_path.json> \
  --local_dataset <your_dataset.csv>
```

We use two NVIDIA RTX 3090 GPUs to run the experiments. Note that for WizardCoder, we use bf16 precision (we cannot run it with fp16 due to a bug in the model); for the other five models, we use fp16 precision.
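For reference, `<your_conf.yaml>` is an Accelerate launch configuration. A minimal two-GPU example might look like the following — the values are illustrative, and you should generate your own with `accelerate config`:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
machine_rank: 0
num_processes: 2    # one process per GPU
mixed_precision: fp16
use_cpu: false
```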