Accepted to AAAI 2026 Main Technical Track
This repository contains the code and analysis notebooks for SPIRAL, our framework that embeds a tri-agent cognitive architecture into an MCTS loop to enable more robust, grounded, and reflective planning with large language models.
📄 Paper: SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search
📄 Technical Appendix: Included in the arXiv paper
Note: The experiments in the paper were conducted on IBM's internal infrastructure (WatsonX/RITS). For public use, we provide a Hugging Face-based implementation in `utils/generic_client.py` that allows you to run SPIRAL with open-source models. Results may vary from those reported in the paper due to differences in model versions and inference infrastructure.
- Repository Structure
- Getting Started
- Dependencies
- Data
- Running Experiments
- Scripts & Agents
- Analysis Notebooks
- Configuration & Hyperparameters
- License
Note: All paths below are relative to the `SPIRAL/` folder.
```
├── analysis/
│   ├── ablation/
│   │   └── analysis_ablations.ipynb
│   ├── baseline/
│   │   ├── cot_k1/
│   │   ├── cot_k3/
│   │   ├── cot_k5/
│   │   ├── spiral/
│   │   ├── analysis_baseline_performance.ipynb
│   │   ├── analysis_cost_benefit.ipynb
│   │   ├── cost_comparison_api_calls.pdf
│   │   └── cost_comparison_tokens.pdf
│   └── sota/
│       ├── sota_performance/
│       └── tot_hyper_params_performance/
│
├── scripts/
│   ├── taskbench_ablation.py
│   ├── taskbench_cot_baseline.py
│   ├── taskbench_lats_baseline.py
│   ├── taskbench_rafa_baseline.py
│   ├── taskbench_react_baseline.py
│   ├── taskbench_react_rafa_baseline.py
│   ├── taskbench_spiral.py
│   └── taskbench_tot_baseline.py
│
├── Taskbench/
│   ├── data_dailylifeapis/
│   └── data_huggingface/
│
├── utils/
│   └── generic_client.py
│
├── environment.yml
├── run_all_baseline_experiments.sh
├── run_all_ablation_experiments.sh
├── run_all_sota_experiments.sh
└── LICENSE
```
- Python 3.10 or 3.11
- CUDA-compatible GPU (recommended for faster inference)
- Conda or virtualenv
1. Clone the repository

   ```shell
   git clone https://github.com/IBM/SPIRAL.git
   cd SPIRAL
   ```

2. Create and activate the environment

   Using Conda (recommended):

   ```shell
   conda create -n spiral python=3.11
   conda activate spiral
   pip install -r requirements.txt
   ```

   Or using virtualenv:

   ```shell
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Download TaskBench datasets

   ```shell
   cd scripts/
   huggingface-cli download microsoft/Taskbench --local-dir Taskbench --repo-type dataset
   ```

   This will place the `dailylifeapis` and `huggingface` benchmark data under `scripts/Taskbench/`.

4. Set up environment variables (optional, for specific model APIs)

   ```shell
   cp .env.example .env
   # Edit .env with your API keys if using external model providers
   ```
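If you prefer not to add a dotenv dependency, a `.env` file of simple `KEY=VALUE` lines can be loaded with a few lines of standard-library Python. This is a minimal sketch; the actual contents of `.env.example` are not shown here, so the `HF_TOKEN` key below is purely a hypothetical example:

```python
import os
import tempfile

def load_dotenv(path):
    """Parse simple KEY=VALUE lines from a .env file into os.environ."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

# Demo with a temporary .env file (HF_TOKEN is a hypothetical key name)
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as fh:
    fh.write("# example provider credentials\nHF_TOKEN=hf_example_token\n")
    env_path = fh.name

load_dotenv(env_path)
print(os.environ["HF_TOKEN"])  # hf_example_token
```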
All required packages are listed in `requirements.txt`. Key dependencies include:

| Package | Version | Purpose |
|---|---|---|
| `torch` | ≥2.0.0 | Deep learning framework |
| `transformers` | ≥4.35.0 | Hugging Face model implementations |
| `datasets` | ≥2.14.0 | Dataset loading and processing |
| `langchain` | ≥0.1.0 | LLM orchestration framework |
| `litellm` | ≥1.0.0 | Unified LLM API interface |
| `numpy` | ≥1.24.0 | Numerical operations |
| `tqdm` | ≥4.65.0 | Progress bars |

For analysis notebooks: `pandas`, `matplotlib`, `seaborn`, `jupyter`
We evaluate on two TaskBench tool-use benchmarks:

- DailyLifeAPIs (`Taskbench/data_dailylifeapis/`)
- HuggingFace (`Taskbench/data_huggingface/`)

Each dataset should be organized as in the original TaskBench release:

```
Taskbench/data_dailylifeapis/
└── problems.jsonl

Taskbench/data_huggingface/
└── problems.jsonl
```
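A `problems.jsonl` file holds one JSON object per line, so it can be parsed with the standard `json` module. The sketch below shows the pattern; the field names (`id`, `instruction`, `tool_steps`) are illustrative assumptions, not the actual TaskBench schema:

```python
import io
import json

# Hypothetical problem record; consult the TaskBench release for real fields.
sample_jsonl = (
    '{"id": "p-001", "instruction": "Set an alarm and check the weather", '
    '"tool_steps": ["set_alarm", "get_weather"]}\n'
)

def load_problems(stream):
    """Yield one problem dict per JSONL line, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

problems = list(load_problems(io.StringIO(sample_jsonl)))
print(problems[0]["id"])  # p-001
```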
```shell
cd scripts/
python taskbench_spiral_method_final.py \
  --run_name my_experiment \
  --api_family dailylifeapis \
  --num_problems 10 \
  --seed 42 \
  --model_name mistral \
  --debug_llm_output
```

NOTE: these commands won't work as-is; they require a different library structure.

```shell
./run_all_baseline_experiments.sh
```

This will run Chain-of-Thought (k=1, 3, 5), ReAct, RAFA, ToT, LATS, etc., via the corresponding `taskbench_*_baseline.py` scripts.
```shell
cd scripts/
python taskbench_spiral_method_final.py \
  --run_name test \
  --api_family dailylifeapis \
  --num_problems 10 \
  --seed 50 \
  --model_name mistral \
  --debug_llm_output
```

Available arguments:

- `--run_name`: Name for the experiment run
- `--api_family`: Dataset to use (`dailylifeapis` or `huggingface`)
- `--num_problems`: Number of problems to evaluate
- `--seed`: Random seed for reproducibility
- `--model_name`: Model to use (e.g., `mistral`, `llama_3`, `phi`)
- `--debug_llm_output`: Enable verbose LLM output logging
Or run both benchmarks end-to-end:

```shell
./run_all_sota_experiments.sh
./run_all_ablation_experiments.sh
```

This will sweep over standard MCTS budgets and disable components (Planner, Simulator, Critic) to quantify their impact.
- `scripts/taskbench_spiral.py`
  Implements the SPIRAL agent:
  - Planner: proposes actions via LLM prompts
  - Simulator: predicts the next observation
  - Critic: scores plan progress
- Baseline scripts (`taskbench_cot_baseline.py`, `taskbench_react_baseline.py`, etc.)
  Wrap existing state-of-the-art methods for fair comparison.
- `utils/generic_client.py`
  For public use: a Hugging Face-based implementation providing a `HuggingFaceChatClient` to interface with open-source LLMs. Use this for running experiments without IBM infrastructure.
- `utils/ritz_client.py`
  IBM internal client for RITS/WatsonX endpoints (used in the paper's experiments).
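Having both clients behind a common chat interface is what makes the backends swappable. The sketch below shows that general shape with a stub backend so it runs anywhere; the actual `HuggingFaceChatClient` lives in `utils/generic_client.py` and wraps a real model, so treat the method name `chat` and everything else here as illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class StubChatClient:
    """Toy stand-in for a chat client; a real one would call an LLM."""
    model_id: str
    history: list = field(default_factory=list)

    def chat(self, prompt: str) -> str:
        self.history.append({"role": "user", "content": prompt})
        # A real client would run inference here; we echo for illustration.
        reply = f"[{self.model_id}] ack: {prompt}"
        self.history.append({"role": "assistant", "content": reply})
        return reply

client = StubChatClient(model_id="mistralai/Mistral-7B-Instruct-v0.3")
print(client.chat("Propose the next tool call."))
```

Because the Planner, Simulator, and Critic only need a text-in, text-out call, any client exposing this shape can back all three roles.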
All result aggregation, tables, and figures are in `analysis/`:

- `analysis_baseline_performance.ipynb`
- `analysis_cost_benefit.ipynb`
- `analysis_ablations.ipynb`
- `analysis/sota_*`

Use these notebooks to reproduce the tables and plots in the paper and appendix.
Detailed hyperparameters are in Appendix B of the paper:

| Component | Default Value | CLI Argument |
|---|---|---|
| MCTS Budget (K) | 50 iterations | `--mcts_iterations` |
| Max Tree Depth | 8 | `--max_depth` |
| Exploration Constant C | 1.0 (UCT) | – |
| Planner Temperature | 0.0 | – |
| Simulator Temperature | 0.2 | – |
| Random Seeds | 42 (default) | `--seed` |
| Max Workers | CPU count | `--max_workers` |
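The exploration constant C feeds the standard UCT selection rule used to pick which child node to expand next. SPIRAL's exact node scoring is described in the paper; this is a generic sketch of textbook UCT with C = 1.0, matching the table's default:

```python
import math

def uct_score(child_value, child_visits, parent_visits, c=1.0):
    """Standard UCT: mean value plus exploration bonus.
    Unvisited children get infinite priority so each is tried once."""
    if child_visits == 0:
        return math.inf
    exploit = child_value / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# Pick the child maximizing UCT among (total_value, visits) pairs.
children = [(3.0, 4), (1.0, 1), (0.0, 0)]
parent_visits = 5
best = max(range(len(children)),
           key=lambda i: uct_score(children[i][0], children[i][1], parent_visits))
print(best)  # 2  (the unvisited child is explored first)
```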
For public use, models are accessed via Hugging Face Transformers. See `utils/generic_client.py` for the implementation.

| Model Name | Hugging Face Model ID |
|---|---|
| `llama_3` | `meta-llama/Meta-Llama-3-70B-Instruct` |
| `mistral` | `mistralai/Mistral-7B-Instruct-v0.3` |
| `phi` | `microsoft/Phi-3-mini-4k-instruct` |
| `deepseek_v2_5` | `deepseek-ai/DeepSeek-V2-Lite` |
| `qwen2_5_72b_instruct` | `Qwen/Qwen2-72B-Instruct` |
Paper experiments: Results reported in the paper used IBM's internal RITS/WatsonX infrastructure with models including Llama 4 Maverick 17B, Mistral Large, and other proprietary endpoints.
See Appendix B in the arXiv paper for full details.
If you find this work useful, please cite our paper:

```bibtex
@article{zhang2025spiral,
  title={SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search},
  author={Zhang, Yifan and Ganapavarapu, Giridhar and Jayaraman, Srideepika and Agrawal, Bhavna and Patel, Dhaval and Fokoue, Achille},
  journal={arXiv preprint arXiv:2512.23167},
  year={2025}
}
```

This project is released under the MIT License. See LICENSE for details.