steveh250/Prompt-Injection-Testing

Prompt Injection Testing

Purpose

This repository tests and compares different approaches to defending AI agents against prompt injection attacks — malicious instructions embedded in external content (documents, emails, API responses) that try to hijack an LLM's behaviour.

The research originated from the RFP Responder multi-agent solution, which extracts requirements from RFP documents and submits them to an LLM for answering. Adversaries could embed prompt injection payloads inside those documents, making the security of the extraction pipeline critical.


Repository Structure

Prompt-Injection-Testing/
├── README.md                                          # This file
├── security_agent-Prompt_INJECTION_And_Benign_DATASET.jsonl  # Shared test dataset
├── Ollama/                                            # LLM-based inline security agent
│   ├── README.md
│   ├── ARCHITECTURE.md
│   ├── security_agent.py
│   └── test_security_agent.py
└── MAF-FIDES/                                         # FIDES content-labelling approach
    ├── README.md
    ├── ARCHITECTURE.md
    ├── fides_security_agent.py
    ├── test_fides_agent.py
    └── requirements.txt

Shared Dataset

security_agent-Prompt_INJECTION_And_Benign_DATASET.jsonl

A curated dataset of 500 labelled prompts (250 malicious, 250 benign) used by both test harnesses to enable direct comparison.

| Field | Description |
|---|---|
| id | Unique identifier (e.g. pi-001) |
| prompt | The raw text to classify |
| label | malicious or benign |
| attack_type | code_execution, obfuscation, jailbreaking, data_leakage, role_playing, or none |
| context | Human-readable description of the attack or query |
| response | Expected agent response |

Attack type distribution:

| Attack Type | Count |
|---|---|
| code_execution | 146 |
| obfuscation | 61 |
| data_leakage | 18 |
| jailbreaking | 17 |
| role_playing | 8 |
| none (benign) | 250 |
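
The field list and distribution above can be checked against the file itself. The following is a minimal sketch, assuming each line of the JSONL file is one JSON object carrying the six documented fields; `load_dataset` and `summarise` are hypothetical helper names and are not part of either test harness.

```python
import json
from collections import Counter
from pathlib import Path

# Field names taken from the table in this README.
REQUIRED_FIELDS = {"id", "prompt", "label", "attack_type", "context", "response"}


def load_dataset(path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            records.append(record)
    return records


def summarise(records):
    """Return (label counts, attack_type counts) as Counter objects."""
    labels = Counter(r["label"] for r in records)
    attacks = Counter(r["attack_type"] for r in records)
    return labels, attacks
```

Calling `summarise(load_dataset("security_agent-Prompt_INJECTION_And_Benign_DATASET.jsonl"))` from the repository root should, if the assumptions hold, reproduce the counts in the table above.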

Approaches Compared

1. Ollama — Inline LLM Fire Break

Folder: Ollama/

An inline fire break that sits between the document-extraction step and the downstream execution agent in the RFP Responder pipeline. Every extracted requirement is scanned before it is passed forward. If the security agent detects a prompt injection attack, the pipeline is aborted immediately — the payload never reaches a downstream LLM that could act on it.

RFP document → extract requirements → [Security Agent] ──malicious──► ABORT
                                                        └──benign────► downstream agent
  • The LLM receives the raw content and applies a detailed threat-detection system prompt.
  • Two-phase analysis: per-node scan + full-structure scan for split-payload attacks.
  • On malicious detection: pipeline halts (exit code 2 in standalone mode; is_malicious: true in A2A mode).
  • On benign verdict: content is passed through to the next pipeline stage.
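
The two-phase scan and abort behaviour described above can be sketched as follows. This is an illustration only, not the code in security_agent.py: the real agent calls an LLM, whereas here the classifier is any callable returning True for malicious text, and `toy_classifier`, `scan_requirements`, and `fire_break` are hypothetical names. Only the exit code 2 and the `is_malicious` key come from this README.

```python
import sys
from typing import Callable, Sequence

EXIT_MALICIOUS = 2  # exit code this README gives for standalone mode

# Hypothetical classifier type: returns True when the text is judged malicious.
Classifier = Callable[[str], bool]


def toy_classifier(text: str) -> bool:
    """Keyword stand-in for the LLM call, for demonstration only."""
    normalised = " ".join(text.lower().split())
    return "ignore previous instructions" in normalised


def scan_requirements(requirements: Sequence[str], is_malicious: Classifier) -> dict:
    """Phase 1 scans each item; phase 2 scans the joined text for split payloads."""
    for index, text in enumerate(requirements):
        if is_malicious(text):
            return {"is_malicious": True, "phase": "per-node", "index": index}
    if is_malicious("\n".join(requirements)):
        return {"is_malicious": True, "phase": "full-structure", "index": None}
    return {"is_malicious": False, "phase": None, "index": None}


def fire_break(requirements: Sequence[str], is_malicious: Classifier) -> list:
    """Return the requirements unchanged, or exit the process on detection."""
    verdict = scan_requirements(requirements, is_malicious)
    if verdict["is_malicious"]:
        print(f"ABORT: injection detected ({verdict['phase']} scan)", file=sys.stderr)
        sys.exit(EXIT_MALICIOUS)
    return list(requirements)
```

The second phase matters because a payload split across two requirements can look harmless item by item; joining the items gives the classifier a chance to see the reassembled instruction.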

See Ollama/README.md and Ollama/ARCHITECTURE.md.


2. MAF-FIDES — Content Labelling + Quarantine Isolation

Folder: MAF-FIDES/

An implementation of Microsoft's FIDES (Foundational Integration Defense for Execution Security) approach from the Microsoft Agent Framework.

Rather than asking an LLM to detect attacks in raw content, FIDES prevents injection structurally:

  • All external input is labelled UNTRUSTED.
  • A middleware layer hides untrusted content behind an opaque variable reference before it reaches the main LLM.
  • The main LLM never sees raw untrusted text; it only sees [UNTRUSTED_CONTENT_REF: var_xxxxxxxx].
  • When classification is needed, the agent calls a quarantined_llm tool that processes the hidden content in complete isolation with no tool access.
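
The hiding step in the list above can be illustrated with a small sketch. It does not use the Microsoft Agent Framework API: `VariableStore` and `make_quarantined_llm` are hypothetical names, the eight-hex-character id is an assumption based on the `var_xxxxxxxx` pattern, and restricting the tool's return value to a fixed vocabulary is an added design assumption rather than something this README states.

```python
import secrets
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class VariableStore:
    """Holds untrusted text keyed by opaque variable ids (hypothetical helper)."""

    _values: dict = field(default_factory=dict)

    def hide(self, untrusted_text: str) -> str:
        """Store the text and return only a reference for the main LLM to see."""
        var_id = f"var_{secrets.token_hex(4)}"  # assumed format: 8 hex characters
        self._values[var_id] = untrusted_text
        return f"[UNTRUSTED_CONTENT_REF: {var_id}]"

    def reveal(self, var_id: str) -> str:
        """Resolve a reference; meant to be called only by the quarantined tool."""
        return self._values[var_id]


def make_quarantined_llm(store: VariableStore, classify: Callable[[str], str]):
    """Build a tool that classifies hidden content without exposing it."""

    def quarantined_llm(var_id: str) -> str:
        verdict = classify(store.reveal(var_id))
        # Assumption: fail closed, so free-form output cannot carry the payload back.
        return verdict if verdict in {"malicious", "benign"} else "malicious"

    return quarantined_llm
```

In this sketch the main agent only ever handles the reference string and the one-word verdict, which is the structural property the approach relies on.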

See MAF-FIDES/README.md and MAF-FIDES/ARCHITECTURE.md.


Key Distinction Between Approaches

| Dimension | Ollama Approach | FIDES Approach |
|---|---|---|
| Pipeline role | Inline fire break: aborts the pipeline on detection | Inline gate: blocks downstream tool calls on detection |
| On malicious detection | Pipeline halted immediately (abort / exit code 2) | Downstream agent actions blocked by policy enforcement |
| On benign verdict | Content passes through to the next pipeline stage | Content passes through; main agent proceeds normally |
| Defence mechanism | Probabilistic detection: LLM classifies raw content | Structural prevention (hiding) + probabilistic quarantine |
| Raw content seen by main LLM | Yes: sentinel LLM reads the raw payload | Never: raw payload is hidden before the main LLM sees it; only the quarantined LLM reads it |
| Injection vector | Sentinel LLM may be tricked by a sufficiently clever payload | Structurally closed for main agent; quarantine LLM is isolated |
| Classification method | Direct LLM analysis with security system prompt | Isolated quarantine LLM with explicit data-framing |
| False negative risk | Higher: novel attacks may fool the sentinel LLM | Lower: quarantine framing and isolation reduce susceptibility |
| False positive risk | Moderate | Moderate |
| Explainability | Full scratchpad reasoning in output | Full scratchpad reasoning from quarantine LLM + middleware event log |

Running the Comparisons

Both harnesses produce the same set of metrics (accuracy, precision, recall, F1, confusion matrix) from the same dataset, making results directly comparable.
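
For reference, the metrics named above can be derived from expected and predicted labels as in the sketch below. It treats malicious as the positive class, which is an assumption about how the harnesses score results, and `binary_metrics` is a hypothetical helper rather than a function from either test script.

```python
def binary_metrics(expected, predicted, positive="malicious"):
    """Compute confusion-matrix counts and the derived scores for two label lists."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")

    tp = fp = tn = fn = 0
    for exp, pred in zip(expected, predicted):
        if pred == positive:
            if exp == positive:
                tp += 1  # attack correctly flagged
            else:
                fp += 1  # benign prompt wrongly flagged
        else:
            if exp == positive:
                fn += 1  # attack missed
            else:
                tn += 1  # benign prompt correctly passed

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "confusion_matrix": {"tp": tp, "fp": fp, "tn": tn, "fn": fn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For a security gate, recall (the share of attacks caught) and the false negative count are usually the figures to compare first, since a missed injection reaches the downstream agent.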

# Ollama approach (run from the repository root)
cd Ollama
python test_security_agent.py --limit 20        # quick test
python test_security_agent.py                   # full 500-prompt run

# FIDES approach (run from the repository root)
cd MAF-FIDES
pip install -r requirements.txt
python test_fides_agent.py --limit 20           # quick test
python test_fides_agent.py                      # full 500-prompt run

Both scripts accept --limit N, --start N, and --output path/to/results.json.
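
One plausible reading of these three options is sketched below with argparse. The flag names come from this README, but their exact semantics in the real scripts are not stated here: treating --start as a zero-based offset into the dataset and --limit as a maximum record count are assumptions, and `build_parser` and `select_records` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented flags (semantics assumed, see lead-in)."""
    parser = argparse.ArgumentParser(description="Run a prompt injection test harness")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="maximum number of prompts to run (default: all)")
    parser.add_argument("--start", type=int, default=0, metavar="N",
                        help="assumed zero-based index of the first prompt to run")
    parser.add_argument("--output", default=None, metavar="PATH",
                        help="where to write the JSON results")
    return parser


def select_records(records, start=0, limit=None):
    """Return the slice of records selected by --start and --limit."""
    end = None if limit is None else start + limit
    return records[start:end]
```

Under these assumptions, `--start 100 --limit 20` would run records 100 to 119, which makes it possible to resume or spot-check part of a long run.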


Prerequisites

  • Ollama running locally at http://localhost:11434
  • Granite 4 model pulled: ollama pull granite4:latest
  • Python 3.11+ with the openai package installed: pip install openai
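
Before a full run it can help to confirm the first two prerequisites. The sketch below queries Ollama's model-listing endpoint (GET /api/tags) and checks for the model name used in this README; `fetch_tags` and `model_available` are hypothetical helpers, and the check assumes the server reports the model under the exact name granite4:latest.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # address given in the prerequisites


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Return the parsed JSON from Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_available(tags_payload: dict, model: str = "granite4:latest") -> bool:
    """Check whether the named model appears in a /api/tags response."""
    return any(entry.get("name") == model for entry in tags_payload.get("models", []))
```

With the server running, `model_available(fetch_tags())` returning False suggests the model still needs to be pulled; a connection error from `fetch_tags` suggests Ollama is not listening at that address.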
