System Design
Trust Model Diagram
User / dApp
Natural language prompt or transaction intent
AI Interpretation Layer
LLM (GPT / Claude)
Parses natural language intent into structured transaction proposals. This layer can be manipulated by adversarial prompts.
▸Converts text → transaction struct
▸Infers chain, token, amount, recipient
▸Vulnerable to prompt injection attacks
Policy Firewall Engine
PromptShield Core
Evaluates every proposed transaction against a configurable ruleset. The single source of truth for what is and isn't allowed.
▸Rule-based policy evaluation
▸Amount, address & slippage limits
▸Prompt injection pattern detection
▸Immutable audit log generation
BLOCKED
Transaction rejected. Private key never accessed.
✗ Audit event written
✗ User notified
✗ No chain interaction
ALLOWED
Transaction forwarded to wallet layer for signing.
✓ Audit event written
✓ Wallet signer invoked
✓ Broadcast to chain
OWS Wallet Layer
Private Key Isolation
Signs and broadcasts only pre-approved transactions. The private key is never exposed to the AI or policy layers.
▸Private key isolated from AI layer
▸Signs only firewall-approved txs
▸Broadcasts to target chain
Layer breakdown
Three layers. One trust model.
Each layer has a clearly defined responsibility and trust level. The boundaries are explicit and enforced in code.
AI Interpretation Layer
The source of risk
The AI layer transforms natural language into structured transaction proposals. This is where prompt injection attacks enter the system — a maliciously crafted prompt can manipulate the LLM into generating dangerous transaction parameters.
Threat Vector
Prompt injection, jailbreaks, override instructions, and social engineering can cause the AI to propose unsafe transactions.
Capabilities
LLM parses user intent into action, token, amount, recipient
Adversarial prompts can override system instructions
Output is a structured JSON transaction proposal
Must be treated as untrusted input by downstream layers
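To make "untrusted input" concrete, here is a minimal sketch of what validating the AI layer's JSON proposal might look like. The field names (`chain`, `action`, `token`, `amount`, `recipient`) are illustrative, not the actual PromptShield schema:

```typescript
// Hypothetical proposal shape -- names are illustrative assumptions.
interface TxProposal {
  chain: string;
  action: string;     // e.g. "transfer" or "swap"
  token: string;
  amount: string;     // decimal string to avoid float precision loss
  recipient: string;
}

// The proposal originates from an LLM, so downstream layers must
// structurally validate it before doing anything else with it.
function parseProposal(raw: string): TxProposal {
  const p = JSON.parse(raw) as Partial<TxProposal>;
  for (const field of ["chain", "action", "token", "amount", "recipient"] as const) {
    if (typeof p[field] !== "string" || p[field] === "") {
      throw new Error(`malformed proposal: missing ${field}`);
    }
  }
  return p as TxProposal;
}
```

Schema validation like this only guarantees well-formedness; whether the transaction is *safe* is the policy engine's job, described next.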
Policy Firewall Engine
The enforcement boundary
The policy engine is the single gate between AI output and wallet execution. Every transaction proposal must pass a ruleset evaluation — regardless of what the AI was told or what the user requested.
Capabilities
Configurable rules: amount caps, address allowlists, slippage limits
Prompt injection pattern detection
Immutable audit log for every evaluation
Returns structured verdict with reasons and violated rules
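A rule evaluation of this shape can be sketched as follows, assuming the three rule types named above (amount caps, address allowlists, slippage limits). All names here are illustrative, not the real PromptShield API:

```typescript
// Structured verdict: machine-readable reasons double as audit-log content.
interface Verdict {
  allowed: boolean;
  violations: string[];
}

// Illustrative policy configuration.
interface Policy {
  maxAmount: number;
  allowedRecipients: Set<string>;
  maxSlippageBps: number;
}

function evaluate(
  tx: { amount: number; recipient: string; slippageBps: number },
  policy: Policy,
): Verdict {
  const violations: string[] = [];
  if (tx.amount > policy.maxAmount) violations.push("amount_cap_exceeded");
  if (!policy.allowedRecipients.has(tx.recipient)) violations.push("recipient_not_allowlisted");
  if (tx.slippageBps > policy.maxSlippageBps) violations.push("slippage_limit_exceeded");
  return { allowed: violations.length === 0, violations };
}
```

The key property: the verdict depends only on the transaction parameters and the ruleset, never on the prompt text that produced them, so a jailbroken LLM cannot talk its way past the gate.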
OWS Wallet Layer
The execution boundary
The OWS wallet layer is responsible for signing and broadcasting approved transactions. The private key is completely isolated from the AI and policy layers; the wallet acts only on policy-approved requests.
Capabilities
Private key stored in secure local enclave
Signs only firewall-approved transactions
Broadcasts to target blockchain
Never exposed to AI inference context
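The isolation boundary can be illustrated with a private class field: the key is unreachable from outside the wallet, and signing refuses anything that lacks an approving verdict. This is a sketch, not the OWS implementation; the "signature" is a stand-in for real ECDSA signing:

```typescript
class OwsWallet {
  // ECMAScript private field: not reachable from outside this class,
  // and never serialized into any AI-visible context.
  #privateKey: string;

  constructor(privateKey: string) {
    this.#privateKey = privateKey;
  }

  sign(tx: object, verdict: { allowed: boolean }): string {
    if (!verdict.allowed) {
      throw new Error("refused: transaction was not approved by the policy engine");
    }
    if (this.#privateKey.length === 0) {
      throw new Error("wallet not initialized");
    }
    // A real implementation would produce an ECDSA signature over the
    // serialized transaction using the private key; stand-in output here.
    return `signed(${JSON.stringify(tx)})`;
  }
}
```

Because the check lives inside the wallet itself, even a compromised caller cannot obtain a signature for a rejected transaction.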
Key Isolation
Why the private key must never reach the AI layer
In AI-powered wallet architectures, there's a tempting shortcut: give the LLM direct signing capability so it can "complete tasks autonomously." This is catastrophic from a security standpoint.
LLMs are probabilistic systems. They can be manipulated by sufficiently clever prompts — this is well-documented in the research literature. If the private key is accessible in the AI's context window or callable via the AI's tools, a single successful jailbreak can drain a wallet.
PromptShield enforces a hard boundary: the AI can only propose. It cannot sign. The policy engine is the gatekeeper, and the OWS wallet only responds to policy-approved requests. The key never moves.
Without PromptShield
✗ AI has direct signing access
✗ One jailbreak = wallet drained
✗ No audit trail
✗ No policy enforcement
With PromptShield
✓ AI only proposes transactions
✓ Policy engine enforces every rule
✓ Private key fully isolated
✓ Immutable audit log for all decisions
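The propose → evaluate → sign flow above can be wired end to end in a few lines. Everything here is an illustrative stand-in (the AI stub, the rule, the closure-held key), but it shows the structural point: the key lives only inside the wallet closure and is never in the AI's path:

```typescript
type Proposal = { recipient: string; amount: number };
type FlowVerdict = { allowed: boolean; reason?: string };

// AI layer stub: in reality an LLM call whose output is untrusted.
function aiPropose(_prompt: string): Proposal {
  return { recipient: "0xabc", amount: 250 };
}

// Policy gate: the only path from a proposal to a signature.
function policyGate(p: Proposal): FlowVerdict {
  if (p.amount > 1000) return { allowed: false, reason: "amount_cap_exceeded" };
  return { allowed: true };
}

// Wallet factory: the key exists only inside this closure.
function makeWallet(_privateKey: string) {
  return (p: Proposal, v: FlowVerdict): string => {
    if (!v.allowed) throw new Error(`blocked: ${v.reason}`);
    return `signed:${p.recipient}:${p.amount}`; // stand-in for a real signature
  };
}

const sign = makeWallet("0xsecret");
const proposal = aiPropose("send 250 USDC to 0xabc");
const verdict = policyGate(proposal);
const signature = sign(proposal, verdict);
```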
Design Principles
Security principles
Zero Trust on AI Output
The AI layer is treated as completely untrusted. Even a legitimate AI response is validated against policy before execution.
Separation of Concerns
Intent parsing, policy evaluation, and key management are fully isolated layers with no shared context.
Explicit Policy Rules
Security is codified in explicit, auditable rules — not implicit LLM judgment. Rules can be updated independently of the AI model.
Immutable Audit Trail
Every policy evaluation — blocked or allowed — is logged with full context. Compliance and forensics are built-in.
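One common way to make a log tamper-evident is hash chaining, where each entry commits to the previous one. This sketch assumes that technique; it is not a claim about how PromptShield's audit log is stored:

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  verdict: "ALLOWED" | "BLOCKED";
  detail: string;
  prevHash: string;
  hash: string;
}

class AuditLog {
  private entries: AuditEntry[] = [];

  append(verdict: "ALLOWED" | "BLOCKED", detail: string): AuditEntry {
    const prevHash = this.entries.length
      ? this.entries[this.entries.length - 1].hash
      : "genesis";
    // Each hash covers the previous hash, so editing any record
    // invalidates every entry after it.
    const hash = createHash("sha256").update(prevHash + verdict + detail).digest("hex");
    const entry = { verdict, detail, prevHash, hash };
    this.entries.push(entry);
    return entry;
  }

  // Recompute the chain from the start; any tampering breaks verification.
  verify(): boolean {
    let prev = "genesis";
    for (const e of this.entries) {
      const h = createHash("sha256").update(prev + e.verdict + e.detail).digest("hex");
      if (e.prevHash !== prev || e.hash !== h) return false;
      prev = e.hash;
    }
    return true;
  }
}
```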
See it defend against a real attack
Run preset jailbreak scenarios in the interactive demo and watch the policy engine respond.