This repository tests and compares different approaches to defending AI agents against prompt injection attacks — malicious instructions embedded in external content (documents, emails, API responses) that try to hijack an LLM's behaviour.
The research originated from the RFP Responder multi-agent solution, which extracts requirements from RFP documents and submits them to an LLM for answering. Adversaries could embed prompt injection payloads inside those documents, making the security of the extraction pipeline critical.
Prompt-Injection-Testing/
├── README.md # This file
├── security_agent-Prompt_INJECTION_And_Benign_DATASET.jsonl # Shared test dataset
├── Ollama/ # LLM-based inline security agent
│ ├── README.md
│ ├── ARCHITECTURE.md
│ ├── security_agent.py
│ └── test_security_agent.py
└── MAF-FIDES/ # FIDES content-labelling approach
    ├── README.md
    ├── ARCHITECTURE.md
    ├── fides_security_agent.py
    ├── test_fides_agent.py
    └── requirements.txt
security_agent-Prompt_INJECTION_And_Benign_DATASET.jsonl
A curated dataset of 500 labelled prompts (250 malicious, 250 benign) used by both test harnesses to enable direct comparison.
| Field | Description |
|---|---|
| `id` | Unique identifier (e.g. `pi-001`) |
| `prompt` | The raw text to classify |
| `label` | `malicious` or `benign` |
| `attack_type` | `code_execution`, `obfuscation`, `jailbreaking`, `data_leakage`, `role_playing`, or `none` |
| `context` | Human-readable description of the attack or query |
| `response` | Expected agent response |
Attack type distribution:
| Attack Type | Count |
|---|---|
| code_execution | 146 |
| obfuscation | 61 |
| data_leakage | 18 |
| jailbreaking | 17 |
| role_playing | 8 |
| none (benign) | 250 |
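The distribution above can be checked against the file with a few lines of Python. This is a minimal sketch that assumes one JSON object per line with the fields listed in the table; `summarise` and `summarise_file` are hypothetical helpers and are not part of either test harness.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def summarise(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count records per label and per attack_type in JSONL text lines."""
    labels: Counter = Counter()
    attack_types: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        labels[record["label"]] += 1
        attack_types[record["attack_type"]] += 1
    return labels, attack_types


def summarise_file(path: str) -> Tuple[Counter, Counter]:
    """Convenience wrapper that reads the JSONL dataset from disk."""
    with open(path, encoding="utf-8") as handle:
        return summarise(handle)
```

Calling `summarise_file("security_agent-Prompt_INJECTION_And_Benign_DATASET.jsonl")` from the repository root should give counts that match the tables, provided the file follows the schema described here.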
Folder: Ollama/
An inline fire break that sits between the document-extraction step and the downstream execution agent in the RFP Responder pipeline. Every extracted requirement is scanned before it is passed forward. If the security agent detects a prompt injection attack, the pipeline is aborted immediately — the payload never reaches a downstream LLM that could act on it.
RFP document → extract requirements → [Security Agent] ──malicious──► ABORT
                                                       └──benign────► downstream agent
- The LLM receives the raw content and applies a detailed threat-detection system prompt.
- Two-phase analysis: per-node scan + full-structure scan for split-payload attacks.
- On malicious detection: pipeline halts (exit code 2 in standalone mode; `is_malicious: true` in A2A mode).
- On benign verdict: content is passed through to the next pipeline stage.
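The two-phase analysis and the abort behaviour listed above can be sketched as follows. This is an illustration under stated assumptions, not the code in `security_agent.py`: the `classify` callable stands in for the LLM call, and `scan_requirements` and `gate` are hypothetical names.

```python
import sys
from typing import Callable, Sequence

# Hypothetical classifier signature: returns True when the text looks malicious.
Classifier = Callable[[str], bool]


def scan_requirements(nodes: Sequence[str], classify: Classifier) -> bool:
    """Return True if any single node, or all nodes read together, is flagged."""
    # Phase 1: per-node scan.
    for node in nodes:
        if classify(node):
            return True
    # Phase 2: full-structure scan, so a payload split across nodes is seen whole.
    return classify("\n".join(nodes))


def gate(nodes: Sequence[str], classify: Classifier) -> Sequence[str]:
    """Pass benign content through; abort with exit code 2 on detection."""
    if scan_requirements(nodes, classify):
        sys.exit(2)
    return nodes
```

The second phase matters because each fragment of a split payload can look harmless on its own, so a per-node scan alone would let the combined instruction through.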
See Ollama/README.md and Ollama/ARCHITECTURE.md.
Folder: MAF-FIDES/
An implementation of Microsoft's FIDES (Flow Integrity Deterministic Enforcement System) approach from the Microsoft Agent Framework.
Rather than asking an LLM to detect attacks in raw content, FIDES prevents injection structurally:
- All external input is labelled `UNTRUSTED`.
- A middleware layer hides untrusted content behind an opaque variable reference before it reaches the main LLM.
- The main LLM never sees raw untrusted text; it only sees `[UNTRUSTED_CONTENT_REF: var_xxxxxxxx]`.
- When classification is needed, the agent calls a `quarantined_llm` tool that processes the hidden content in complete isolation with no tool access.
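The hiding step can be illustrated with a small stand-alone sketch. It does not use the Microsoft Agent Framework API; `UntrustedStore`, `hide`, and `quarantined_call` are hypothetical names, and the 8-hex-character identifier is an assumption based on the placeholder format shown above.

```python
import secrets
from typing import Callable, Dict


class UntrustedStore:
    """Hypothetical store that hides untrusted text behind opaque references."""

    PREFIX = "[UNTRUSTED_CONTENT_REF: "
    SUFFIX = "]"

    def __init__(self) -> None:
        self._hidden: Dict[str, str] = {}

    def hide(self, content: str) -> str:
        """Store the content and return the placeholder the main LLM would see."""
        var_id = f"var_{secrets.token_hex(4)}"  # 4 random bytes -> 8 hex characters
        self._hidden[var_id] = content
        return f"{self.PREFIX}{var_id}{self.SUFFIX}"

    def quarantined_call(self, reference: str, classify: Callable[[str], str]) -> str:
        """Resolve a placeholder and hand the raw text only to the isolated classifier."""
        var_id = reference[len(self.PREFIX):-len(self.SUFFIX)]
        return classify(self._hidden[var_id])
```

In this sketch the main agent only ever handles the string returned by `hide`, so an instruction embedded in the original content has no path into its prompt. Only the `classify` callable, which stands in for the quarantined LLM, receives the raw text.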
See MAF-FIDES/README.md and MAF-FIDES/ARCHITECTURE.md.
| Dimension | Ollama Approach | FIDES Approach |
|---|---|---|
| Pipeline role | Inline fire break — aborts the pipeline on detection | Inline gate — blocks downstream tool calls on detection |
| On malicious detection | Pipeline halted immediately (abort / exit code 2) | Downstream agent actions blocked by policy enforcement |
| On benign verdict | Content passes through to the next pipeline stage | Content passes through; main agent proceeds normally |
| Defence mechanism | Probabilistic detection — LLM classifies raw content | Structural prevention (hiding) + probabilistic quarantine |
| Raw content seen by main LLM | Yes — sentinel LLM reads the raw payload | Never — raw payload is hidden before any LLM sees it |
| Injection vector | Sentinel LLM may be tricked by a sufficiently clever payload | Structurally closed for main agent; quarantine LLM is isolated |
| Classification method | Direct LLM analysis with security system prompt | Isolated quarantine LLM with explicit data-framing |
| False negative risk | Higher — novel attacks may fool the sentinel LLM | Lower — quarantine framing and isolation reduce susceptibility |
| False positive risk | Moderate | Moderate |
| Explainability | Full scratchpad reasoning in output | Full scratchpad reasoning from quarantine LLM + middleware event log |
Both harnesses produce the same set of metrics (accuracy, precision, recall, F1, confusion matrix) from the same dataset, making results directly comparable.
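All five metrics follow from the four confusion-matrix counts. The sketch below treats `malicious` as the positive class, which is an assumption; `binary_metrics` is a hypothetical helper and the harnesses may compute or report these values differently.

```python
from typing import Dict, Sequence


def binary_metrics(expected: Sequence[str], predicted: Sequence[str],
                   positive: str = "malicious") -> Dict[str, float]:
    """Compute confusion-matrix counts, accuracy, precision, recall, and F1."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    tp = fp = fn = tn = 0
    for exp, pred in zip(expected, predicted):
        if pred == positive and exp == positive:
            tp += 1  # attack correctly flagged
        elif pred == positive:
            fp += 1  # benign prompt wrongly flagged
        elif exp == positive:
            fn += 1  # attack missed
        else:
            tn += 1  # benign prompt correctly passed
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For a security gate, recall is the figure to watch first: a false negative means a payload reached the downstream agent, while a false positive only blocks a legitimate requirement.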
# Ollama approach (from the repository root)
cd Ollama
python test_security_agent.py --limit 20 # quick test
python test_security_agent.py # full 500-prompt run
# FIDES approach (from the repository root)
cd MAF-FIDES
pip install -r requirements.txt
python test_fides_agent.py --limit 20 # quick test
python test_fides_agent.py # full 500-prompt run
Both scripts accept `--limit N`, `--start N`, and `--output path/to/results.json`.
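One plausible way these three flags interact is sketched below. The flag names come from this README, but their exact semantics are assumptions: here `--start` is a zero-based offset into the dataset and `--limit` caps the number of prompts run from that offset. `build_parser` and `select` are hypothetical helpers, not code from either script.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the three options named in the README."""
    parser = argparse.ArgumentParser(description="Run a prompt-injection test harness")
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum number of prompts to run (default: all)")
    parser.add_argument("--start", type=int, default=0,
                        help="index of the first prompt (assumed zero-based)")
    parser.add_argument("--output", default=None,
                        help="path for the JSON results file")
    return parser


def select(records: Sequence[dict], start: int = 0,
           limit: Optional[int] = None) -> Sequence[dict]:
    """Return the slice of records chosen by --start and --limit."""
    end = None if limit is None else start + limit
    return records[start:end]
```

Under these assumptions, `--start 100 --limit 50` would run prompts 100 through 149, which is convenient for resuming an interrupted full run.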
- Ollama running locally at `http://localhost:11434`
- Granite 4 model pulled: `ollama pull granite4:latest`
- Python 3.11+, `pip install openai`
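Before a long run, the first two prerequisites can be checked with a short script. This sketch uses only the standard library and queries Ollama's `/api/tags` endpoint, which lists locally pulled models; `model_is_pulled` and `check_ollama` are hypothetical helper names and are not part of this repository.

```python
import json
import urllib.request


def model_is_pulled(tags_payload: dict, model: str = "granite4:latest") -> bool:
    """Check a parsed /api/tags response for the given model name."""
    return any(entry.get("name") == model for entry in tags_payload.get("models", []))


def check_ollama(base_url: str = "http://localhost:11434",
                 model: str = "granite4:latest") -> bool:
    """Return True if Ollama answers at base_url and lists the model."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
            payload = json.load(response)
    except (OSError, ValueError):  # unreachable server, timeout, or invalid JSON
        return False
    return model_is_pulled(payload, model)
```

If `check_ollama()` returns `False`, the likely causes are that the Ollama server is not running or that `ollama pull granite4:latest` has not been run yet.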