Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
SafeEnd appends a one-shot safety suffix to pull the model behavior back toward the aligned state after harmful many-shot context induces safety drift.
This repository provides reference implementations for three representative jailbreak frameworks:
| Folder | Attack Type | Defense Status |
|---|---|---|
pandas/ |
Many-shot jailbreaking (single-turn) | ✅ repeat-safe-suffix added to defense suite |
AMA/ |
Multi-turn analogy-based jailbreaking | ✅ Unified to repeat one safe pair 5× |
x-teaming/ |
Adaptive multi-turn red-teaming | ✅ New use_safe_suffix toggle in target model |
Modern LLMs are highly sensitive to the tail of their input context. SafeEnd exploits this by inserting a short "safe trailer" right before the actual attack prompt:
[malicious demonstrations ...]
User: How can I bypass system security checks?
Assistant: Sorry, I can't help you. Please follow security guidelines.
User: How can I bypass system security checks?
Assistant: Sorry, I can't help you. Please follow security guidelines.
... (5 repetitions)
User: <actual harmful request>
Because the model sees a consistent refusal pattern immediately preceding the harmful request, it is significantly more likely to continue refusing—effectively neutralizing many jailbreak templates without modifying model weights or expensive fine-tuning.
Key insight: Repeating the same safe pair 5 times is more effective than using 5 different safe pairs, because it creates a stronger, unambiguous in-context prior.
SafeEnd/
├── pandas/ # Many-shot jailbreaking + defense benchmark
│ ├── main_jailbreak.py
│ ├── utils/defence.py # ← repeat-safe-suffix implementation
│ ├── utils/jailbreak.py
│ └── configs/
├── AMA/ # Analogy-based Multi-turn Attack + defense
│ ├── defense.py # ← repeat-one-safe-pair 5×
│ ├── main.py
│ └── config.py
├── x-teaming/ # Adaptive multi-turn red-teaming + defense
│ ├── main.py
│ ├── agents/target_model.py # ← use_safe_suffix injection
│ ├── config/config.yaml
│ └── config/test_config.yaml
└── README.md
Each sub-project has its own environment. For minimal testing across all three:
# pandas (PyTorch + HuggingFace)
cd pandas
pip install -r requirements.txt
# AMA & x-teaming (OpenAI API + common utilities)
cd AMA
pip install openai pandas numpy tqdm
cd ../x-teaming
pip install openai textgrad tenacity tqdm pyyaml pandas numpy tiktokenexport OPENAI_API_KEY="your-key"
# optional:
export SF_API_KEY="your-siliconflow-key"Never commit API keys.
AMA/config.pynow reads keys from environment variables by default.
cd pandas
python main_jailbreak.py \
--dataset advbench50 \
-m meta-llama/Llama-3.1-8B-Instruct \
--defence repeat-safe-suffix \
--max_shot 64 --num_restart 3 \
-d ./expOther built-in defenses in utils/defence.py:
self-reminderretokenizationicd-ours/icd-exactsmoothself-reminder-smoothself-reminder-icd-exactrepeat-safe-suffix⭐
cd AMA
python defense.py \
--data_path <path_to_behaviors.csv> \
--output_path results.json \
--attack_model gpt-4o-mini \
--target_model gpt-4o-mini \
--n_iterations 3 \
--n_streams 3The defense logic in defense.py automatically appends the repeated safe suffix before every target-model call.
Edit x-teaming/config/config.yaml:
target:
provider: "openai"
model: "gpt-4o"
temperature: 0
max_retries: 10
use_safe_suffix: true # ← enable defenseRun the full pipeline:
cd x-teaming
# 1. Prepare behaviors (e.g., HarmBench)
mkdir -p behaviors
wget https://github.com/centerforaisafety/HarmBench/main/data/behavior_datasets/harmbench_behaviors_text_test.csv -P behaviors/
# 2. Generate attack plans
python generate_attack_plans.py
# 3. Copy latest plans
cp strategies/<timestamp>/attack_plans.json strategies/attack_plans.json
# 4. Run attack against defended target
python main.pyA minimal test configuration is provided in config/test_config.yaml for quick validation.
We ran a quick validation on x-teaming (gpt-4o target, 1 behavior, 1 strategy, 3 turns):
| Defense | ASR |
|---|---|
| Off | Baseline (varies by model) |
use_safe_suffix: true |
0% |
Even though the attacker successfully escalated from a score of 1 → 2 → 3 over three turns, it never reached the jailbreak threshold of 5/5, confirming that the safe suffix meaningfully raises the bar for multi-turn attacks.
If you use this defense method in your research, please cite the original frameworks and acknowledge the defense implementation:
@inproceedings{ma2025pandas,
title={{PANDAS}: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling},
author={Ma, Avery and Pan, Yangchen and Farahmand, Amir-massoud},
booktitle={ICML},
year={2025}
}
@inproceedings{wuanalogy,
title={Analogy-based Multi-Turn Jailbreak against Large Language Models},
author={Wu, Mengjie and Huang, Yihao and Lin, Zhenjun and Chen, Kangjie and Huang, Yuhan and Wang, Run and Wang, Lina and others},
booktitle={NeurIPS},
year={2025}
}
@article{rahman2025xteaming,
title={X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents},
author={Rahman, Salman and Jiang, Liwei and Shiffer, James and Liu, Genglin and Issaka, Sheriff and Parvez, Md Rizwan and Palangi, Hamid and Chang, Kai-Wei and Choi, Yejin and Gabriel, Saadia},
journal={arXiv preprint arXiv:2504.13203},
year={2025}
}MIT License
