This is the replication package of the paper "NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations". In this repository, we present our tool NLPerturbator, the evaluation process, and our survey.
- README.md
- NLPerturbator: Code of NLPerturbator
- Dataset: Manually verified datasets: HumanEval-R and MBPP-R
- bigcode-evaluation-harness: The mirror of the evaluation tool we used
- Results_RQ3: Complete results of RQ3
- Survey: Questionnaire template and stats results
- Others: The literature review (list of collected papers and initial categories) and the appendix (implementation details and case studies)
We use Ubuntu 20.04.1 LTS and Python 3.10.12 to run this project. Use the following command to install the required packages:

```shell
pip install -r requirements.txt
```

We use two datasets in our study:
- mbpp: The mbpp dataset consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers. Note that we use the sanitized subset of this dataset.
- HumanEval: The HumanEval dataset includes 164 programming problems with a function signature, docstring, body, and several unit tests.
Use the following command to generate data perturbed by specific perturbators:
```shell
python main.py --dataset $DATASET --perturbator $PERTURBATOR
```

`DATASET` should be `mbpp` or `humaneval`, and `PERTURBATOR` should be one of the perturbators in the `perturbator/` directory. The output file will be saved as `output/$DATASET_$PERTURBATOR.csv`.
You can design your own perturbator and add it to our framework by following these steps:

1. Create a Python file `new_perturbator.py` in `perturbator/` and implement the function `perturbate()`. The `perturbate()` function accepts the NL description of the prompt plus any perturbator-specific arguments.
2. Add the arguments other than `prompt` to the `perturbator` dictionary in `config.yaml`. That is to say, if you add a `prob` argument in the YAML file, the `perturbate()` function in `perturbator/new_perturbator.py` accepts two arguments: `prompt` and `prob`.
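As a concrete illustration, a hypothetical `perturbator/new_perturbator.py` might look like the following sketch. The word-swap logic and the `prob` argument are our own example, not one of the shipped perturbators:

```python
import random

def perturbate(prompt, prob=0.1):
    """Example perturbator: randomly swap adjacent words in the NL prompt.

    `prompt` is the NL description passed in by the framework; `prob` is an
    extra argument that would be declared under `perturbator` in config.yaml.
    """
    words = prompt.split()
    i = 0
    while i < len(words) - 1:
        if random.random() < prob:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return " ".join(words)
```

With `prob: 0.1` added under the `perturbator` entry in `config.yaml`, the framework would call `perturbate(prompt, prob=0.1)` for each NL description.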
We share the datasets under the directory `Dataset/`, including:

- `HumanEval-R` and `MBPP-R`: manually verified perturbation datasets with default frequency
Note: In a few cases, there is no available element in the prompt to perform the perturbation; in such cases, the perturbed prompt is identical to the original prompt.
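Such unchanged cases can be counted by comparing the two prompt columns of a perturbed dataset. A minimal sketch using only the standard library — the inline CSV text and the `original`/`perturbed` column names are our own assumptions, not a guaranteed schema:

```python
import csv
import io

# Illustrative CSV; in practice, open the dataset file instead.
csv_text = (
    "original,perturbed\n"
    '"Write a function to add two numbers.","Write a function to sum two numbers."\n'
    '"Reverse a string.","Reverse a string."\n'
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
unchanged = sum(r["original"] == r["perturbed"] for r in rows)
print(f"{unchanged}/{len(rows)} prompts unchanged")
```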
In our paper, we use bigcode-evaluation-harness to run the code generation experiments with seven code LLMs. Here are the 🤗 Hugging Face links to these models: StarCoder, WizardCoder, InCoder, CodeGeeX2, CodeLlama, and CodeGen2. For GPT-3.5-Turbo, we invoke the official OpenAI API.
Compared to the original version of bigcode-evaluation-harness, we mainly make the following modifications to implement our experiments:

- Add statements used in the models' usage examples (e.g., `model.config.pad_token_id = tokenizer.pad_token_id`, used in the WizardCoder repository).
- Replace the official (original) datasets with our perturbed datasets.
- Output the details of the tests.
- Add simple support for the OpenAI API.
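For the last point, a request to GPT-3.5-Turbo might be assembled as in the sketch below. The helper function and its defaults are illustrative (mirroring the sampling settings we use elsewhere), not the harness's actual integration code:

```python
# Sketch: assemble chat-completion parameters for GPT-3.5-Turbo.
# The helper name and defaults are our own illustration.
def build_request(prompt, n_samples=15, temperature=0.2):
    """Build the keyword arguments for a chat-completions call."""
    return {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "n": n_samples,
    }

# An actual call would pass these to the OpenAI chat-completions endpoint,
# e.g. client.chat.completions.create(**build_request(prompt)).
```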
This is our example bash script; you can modify it to meet your requirements:
```shell
accelerate launch --config_file <your_conf.yaml> main.py \
  --model <your_model_path> \
  --max_length_generation 512 \
  --tasks <humaneval/mbpp> \
  --temperature 0.2 \
  --n_samples 15 \
  --batch_size 15 \
  --precision <fp16/bf16> \
  --allow_code_execution \
  --trust_remote_code \
  --save_generations \
  --save_generations_path <your_path.json> \
  --save_references \
  --save_references_path <your_path.json> \
  --metric_output_path <your_path.json> \
  --local_dataset <your_dataset.csv>
```

We use two NVIDIA RTX 3090 GPUs to run the experiments. Note that for WizardCoder, we use bf16 precision (we cannot run it with fp16 due to a bug in the model); for the other five models, we use fp16 precision.
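For reference, `<your_conf.yaml>` is an Accelerate launch configuration. A minimal two-GPU example might look like the following — the values are illustrative, and you should generate your own with `accelerate config`:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
machine_rank: 0
num_processes: 2    # one process per GPU
mixed_precision: fp16
use_cpu: false
```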