This repository contains the data and scripts used to train and validate the experiments described in the paper "ParserHunter: Identify Parsing Functions in Binary Code".
This project requires three Conda environments for different functionalities. Below are the setup instructions:
- Function Extraction Environment (test-3.9-env):
  - Python Version: 3.9.15
  - Key Packages: angr, r2pipe, timeout-decorator
  - Setup:

        conda create -n test-3.9-env python=3.9.15
        conda activate test-3.9-env
        conda install angr r2pipe timeout-decorator

- Geometric Data Creation and GNN Model Inference Environment (test-3.10.0-env):
  - Python Version: 3.10.0
  - Key Packages: torch, torch-geometric, pandas, matplotlib, flask
  - Setup:

        conda create -n test-3.10.0-env python=3.10.0
        conda activate test-3.10.0-env
        conda install pytorch -c pytorch
        conda install torch-geometric pandas matplotlib flask

- Asm2Vec Model Inference Environment (asm2vec):
  - Python Version: 3.8.19
  - Key Packages: gensim, asm2vec, numpy, scipy
  - Setup:

        conda create -n asm2vec python=3.8.19
        conda activate asm2vec
        conda install gensim numpy scipy

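Because each stage of the pipeline expects a different interpreter, it can help to confirm that the active environment matches before running a script. The following is a minimal sketch and is not part of the repository: the `check_environment` helper and the `EXPECTED_PYTHON` table are hypothetical names, and the versions are copied from the list above. It relies only on `CONDA_DEFAULT_ENV`, which conda sets to the name of the active environment.

```python
import os
import sys

# Environment name -> expected (major, minor) Python version, taken from the list above.
EXPECTED_PYTHON = {
    "test-3.9-env": (3, 9),
    "test-3.10.0-env": (3, 10),
    "asm2vec": (3, 8),
}


def check_environment(env_name, version_info=None):
    """Return True if the interpreter version matches what env_name expects."""
    if env_name not in EXPECTED_PYTHON:
        raise KeyError(f"unknown environment: {env_name!r}")
    # Fall back to the running interpreter when no version is passed in.
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) == EXPECTED_PYTHON[env_name]


if __name__ == "__main__":
    # conda sets CONDA_DEFAULT_ENV to the name of the active environment.
    active = os.environ.get("CONDA_DEFAULT_ENV", "")
    if active in EXPECTED_PYTHON:
        status = "OK" if check_environment(active) else "MISMATCH"
        print(f"{active}: Python {sys.version.split()[0]} -> {status}")
    else:
        print(f"Active environment {active!r} is not one of {sorted(EXPECTED_PYTHON)}")
```

Running the file inside an activated environment prints whether the interpreter matches; the check compares only the major and minor version, so a different patch release is still accepted.
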
- Prepare Function List
  Extract a list of functions from the binary executable files and save it in Asm2Vec/Dictionaries_list_of_functions as .pt files.
  - Use: Binaries/extract_list_functions.py
- Extract Assembly Instructions
  Extract all assembly instructions from the basic blocks of the Control Flow Graphs (CFGs) and save them as a single file (Asm2Vec/assembly_codes.txt).
  - Use: Asm2Vec/asm2Vec_extract_assembly_instructions.py
- Train Asm2Vec Model
  Clean the assembly instructions and train the Asm2Vec model, saving it in Asm2Vec/asm2vec_model.
  - Use: Asm2Vec/asm2Vec_training.py
- Infer Vectors
  Use the trained Asm2Vec model to infer vector embeddings for the assembly instructions of functions.
  - Use: Asm2Vec/asm2vec_inference.py
  - Input: Function assembly instructions
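The training step above mentions cleaning the assembly instructions, but this README does not say what that cleaning involves. As an illustration only, the sketch below shows one common kind of normalization for assembly text before embedding training: lowercasing, collapsing whitespace, and replacing numeric immediates with a placeholder token. The function names and the `CONST` token are assumptions made for this example and may differ from what Asm2Vec/asm2Vec_training.py actually does.

```python
import re

# Matches hexadecimal (0x...) or decimal literals that stand alone as a token.
# Register names such as r8 are left untouched because the digit is not at a word boundary.
HEX_OR_INT = re.compile(r"\b(?:0x[0-9a-f]+|\d+)\b")


def clean_instruction(line):
    """Normalize one assembly instruction (hypothetical cleaning rules)."""
    line = line.strip().lower()
    line = re.sub(r"\s+", " ", line)       # collapse runs of whitespace
    line = re.sub(r"\s*,\s*", ", ", line)  # uniform operand separator
    return HEX_OR_INT.sub("CONST", line)   # hide concrete immediates and offsets


def clean_function(instructions):
    """Clean every instruction of one function, dropping empty lines."""
    cleaned = (clean_instruction(i) for i in instructions)
    return [c for c in cleaned if c]
```

For example, `clean_function(["push rbp", "", "sub rsp,0x20"])` yields `["push rbp", "sub rsp, CONST"]`. Replacing immediates keeps the vocabulary small, at the cost of discarding constants that might be informative.
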
Replicating the GNN Training and Validation Process using an embedding model (Asm2Vec or SafeTorch):
- Data Preparation
  Extract and manually label function data, saving the labeled datasets in Dictionaries_Labeled_Datas/.
  - Use: Binaries/manual_labelling.py
- Create Geometric Data
  Create PyTorch Geometric data objects from the list of labeled data and save them in Saved_Geometric_Datas/.
  - Steps:
    a. Extract CFGs for each function.
    b. Enrich CFG nodes with features using the trained Asm2Vec or SafeTorch embedding model.
    c. Convert to PyTorch Geometric Data objects.
    d. Save as .pt files.
  - Use: asm2vec_experiments/create_geometric_datas.py for the Asm2Vec embedding model, or safetorch_experiments/create_geometric_datas.py for the SafeTorch embedding model.
- Train GNN Model
  Train a GNN model using a grid search to find the best hyperparameters. Save the following results:
  - Best model: Results/embedding_model_name/model_name/best_model.pth
  - Best hyperparameters: Results/embedding_model_name/model_name/best_params.json
  - Grid search results: Results/embedding_model_name/model_name/result_gridsearch.csv
  - Use: asm2vec_experiments/gnn_training_and_test.py for the Asm2Vec embedding model, or safetorch_experiments/gnn_training_and_test.py for the SafeTorch embedding model.
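The grid search in the Train GNN Model step can be pictured with the sketch below. It is an illustration using only the standard library, not the code in gnn_training_and_test.py: `grid_search`, `train_and_score` and `toy_score` are hypothetical names, the "higher score is better" convention is an assumption, and saving the model weights (best_model.pth) is left out because it needs PyTorch. The two output file names follow the list above.

```python
import csv
import itertools
import json
from pathlib import Path


def grid_search(param_grid, train_and_score, out_dir):
    """Try every combination in param_grid and persist the results.

    train_and_score is a caller-supplied function (hypothetical here) that
    trains a model with the given parameters and returns a validation score
    where higher is better.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    names = sorted(param_grid)
    rows, best_score, best_params = [], float("-inf"), None
    # Cartesian product over the candidate values of every hyperparameter.
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        rows.append({**params, "score": score})
        if score > best_score:
            best_score, best_params = score, params

    # One CSV row per combination, plus the winning combination as JSON.
    with open(out_dir / "result_gridsearch.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=names + ["score"])
        writer.writeheader()
        writer.writerows(rows)
    with open(out_dir / "best_params.json", "w") as fh:
        json.dump(best_params, fh, indent=2)
    return best_params, best_score


def toy_score(params):
    """Stand-in for real training: the score peaks at hidden_dim=64, lr=0.01."""
    return -abs(params["hidden_dim"] - 64) - 100 * abs(params["lr"] - 0.01)
```

Calling `grid_search({"hidden_dim": [32, 64, 128], "lr": [0.1, 0.01]}, toy_score, "Results/demo")` evaluates six combinations and writes both files into the given directory (the `Results/demo` path is only an example).
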
The following scripts and notebooks can be used to replicate the plots and tables presented in the paper:
- Table 2: Distribution of functions per parser library and other training data statistics.
  - Run: Plots/analyze_training_data.py
- Tables 4, 5 and 9: Performance comparison of GCN, GAT, GraphSAGE and R-GCN across different metrics and embedding models.
  - Run: Plots/analyze_validation_data.py and Plots/analyze_grid_search_best_model.py to produce the CSV table validation.csv inside each model folder, then filter its rows by the best parameters (found in the file best_params.json).
- Table 6: Recall of 3 PIE-based baselines.
  - Run: Plots/static_code_exploration.ipynb
- Table 7 and Table 8: Recall of SAFE embeddings using KNN and performance comparison of XGBoost and a Neural Network Classifier.
  - Run: Plots/safetorch_data_exploration.ipynb
- Figure 6: Reverse cumulative distribution of the prediction consistency across all compilation settings.
  - Run: Plots/robustness.ipynb
- Figures 7 and 8: Recall results for different optimization levels for the x86-32 and x86-64 architectures.
  - Run: Plots/compilers_recall_validation_results.ipynb
- Table 10: Mean recall results for different LLMs and input code representations.
  - Run: Plots/analyzed_llm_classifications.ipynb
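For the entry covering Tables 4, 5 and 9, the final filtering of validation.csv by the contents of best_params.json might look like the sketch below. It assumes, without confirmation from the repository, that validation.csv has one column per hyperparameter with the same names as the keys in best_params.json; `filter_by_best_params` is a hypothetical helper, not one of the provided scripts.

```python
import csv
import json


def filter_by_best_params(csv_path, params_path):
    """Return the rows of csv_path whose columns match every best parameter."""
    with open(params_path) as fh:
        best = json.load(fh)
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    # CSV values are strings, so compare against the string form of each parameter.
    return [
        row for row in rows
        if all(row.get(key) == str(value) for key, value in best.items())
    ]
```

The string comparison only matches when numbers are written identically in both files (for example `0.01` in both, not `0.01` and `1e-2`); a more tolerant version would parse the numeric columns before comparing.
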
- Binaries/: Executables and tools for function extraction.
- Asm2Vec/: Asm2Vec training and inference files.
- Asm2Vec/Dictionaries_list_of_functions/: Function lists extracted from binaries.
- safetorch/: SafeTorch model files.
- safetorch/outputs/: SafeTorch model outputs for baselines (comparison with SAFE paper).
- code_counting_features/: Code and saved results for manual code feature experiments (comparison with PIE paper).
- asm2vec_experiments/: Create geometric data and GNN training using Asm2Vec embeddings.
- safetorch_experiments/: Create geometric data and GNN training using SafeTorch embeddings.
- Case_Studies/: Geometric data and function lists for the case studies, plus tools for creating the case-study geometric data.
- Dictionaries_Labeled_Datas/: Labeled function lists.
- GNNs_Models/: GNN model definitions.
- Plots/: Analysis and plot generation scripts.
- Results/: Results from GNN training and testing.
- Saved_Geometric_Datas/: Saved PyTorch Geometric data objects used to train and test the GNN models.
- create_geometric_datas.py: Create PyTorch Geometric data from labeled datasets.
- from_CFG_to_DataGeometric.py: Support script for data creation.
- gnn_training_and_test.py: Train/validate GNN model.
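To make the data creation scripts more concrete, the sketch below shows the general shape of turning a CFG into the two arrays a graph neural network library typically expects: a node feature matrix and an edge index in COO layout. It uses plain Python lists so it runs without PyTorch, and it is an illustration under assumptions rather than the logic of from_CFG_to_DataGeometric.py; `cfg_to_graph_arrays` and its input format are hypothetical.

```python
def cfg_to_graph_arrays(cfg_edges, block_embeddings):
    """Turn a CFG into node features and a COO edge index.

    cfg_edges: iterable of (source_block, target_block) pairs.
    block_embeddings: dict mapping each basic block (e.g. its address)
        to the embedding vector produced for that block.
    """
    # Assign each basic block a consecutive node index, ordered by key.
    blocks = sorted(block_embeddings)
    index = {block: i for i, block in enumerate(blocks)}

    # Row i of x holds the feature vector of node i.
    x = [list(block_embeddings[b]) for b in blocks]

    sources, targets = [], []
    for src, dst in cfg_edges:
        sources.append(index[src])
        targets.append(index[dst])
    edge_index = [sources, targets]  # shape [2, num_edges]
    return x, edge_index


# Toy CFG with three basic blocks and two-dimensional embeddings.
example_embeddings = {0x1000: [0.1, 0.2], 0x1010: [0.3, 0.4], 0x1020: [0.5, 0.6]}
example_edges = [(0x1000, 0x1010), (0x1000, 0x1020), (0x1010, 0x1020)]
example_x, example_edge_index = cfg_to_graph_arrays(example_edges, example_embeddings)
```

With PyTorch Geometric installed, these lists could then be wrapped as `Data(x=torch.tensor(x, dtype=torch.float), edge_index=torch.tensor(edge_index, dtype=torch.long))` and stored with `torch.save`.
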
This project is licensed under the MIT License - see the LICENSE file for details.