How We Test AI: LLM & GenAI Security Methodology at Anvil Secure

How We Test AI:  LLM & GenAI Security Methodology at Anvil Secure

By George Damiris

Overview

The company's methodology for testing LLM and GenAI services is based on industry best practices as well as hands-on experience testing AI agents and models across multiple platforms and providers, combined with published industry frameworks: the OWASP Top 10 for Large Language Models and the OWASP LLM Security Verification Standard. Testing is performed through a combination of manual adversarial testing, semi-automated tooling, and proprietary research tooling developed internally.

The scope of assessment focuses on the customer's deployed model configuration and custom software stack โ€” not the underlying cloud infrastructure managed by the provider. Within the shared responsibility model, this covers the application layer, agent architecture, integrations, and AI-specific attack surfaces.

The company's methodology is focused on the following high-level areas:

  • Data: Protecting sensitive data used for training and inference, ensuring it's anonymized and compliant with regulations like GDPR.
  • Model: Securing the specific AI model in use, protecting it from attacks like adversarial inputs or model poisoning.
  • Access: Managing access controls to ensure only authorized users and applications can access the AI model and its data. Additionally, ensuring access controls are implemented to prevent the AI model from being exploited to bypass the intended authorization.
  • Applications: Securing the applications and agents built on top of the AI platform, as well as how they are configured and used.

Approach and Scope Definition

As each system has unique capabilities, integrations and risk exposure, our approach is never generic or checklist driven. Each project is tailored based on its architecture, trust boundaries and business risks of your deployment.

Rather than testing the model in isolation, we evaluate the entire AI execution chain: inputs, logic, permissions, and system impact. This ensures we discover systemic weaknesses, not just surface-level vulnerabilities.

Our objective is simple: identify how your AI could be manipulated, measure the real business impact, and provide clear, architecture-aligned remediation guidance that strengthens both security and operational resilience.

Scope is never assumed. Prior to testing, the company performs a structured threat modeling tailored to the target system. This produces the test case inventory used throughout the engagement. We map your AI ecosystem end-to-end:

  • Identify business purposes and Operational Context: We define the intended business function of the system, the level of autonomy granted to the agent, user roles interacting with the system, and the critical workflows it supports.
  1. Identify the Model in Use
    We document the model provider, model family, and version used by the service. Model capabilities, context limits, safety mechanisms, and update cadence influence the system's security posture.
  2. Map Architecture and Components
    We document the system architecture including orchestration layers, RAG pipelines, memory stores, vector databases, tool integrations, APIs, plugins, MCP integrations, external services, and human-in-the-loop oversight points.
  3. Identify MCP Integrations and Capabilities
    We identify connected MCP servers and the tools or services they expose to the agent. We document what operations these tools allow and what systems they interact with.
  4. Identify Trust Boundaries
    We determine where data or control crosses security domains, such as user input channels, external content sources, MCP servers, inter-agent communication, and interactions with internal systems.
  5. Classify Assets and Sensitive Data
    We identify sensitive assets accessible to the system, including credentials, API tokens, system prompts, proprietary knowledge bases, personal data, and other regulated information. We evaluate the potential operational, financial, or reputational impact if vulnerabilities were exploited.
  • Analyze Agent Capabilities and Permissions
    We evaluate what the agent can read, write, execute, or trigger through internal tools, MCP services, APIs, and integrated platforms.
  • Enumerate AI-Specific Attack Vectors
    We identify potential attack paths specific to LLM and agentic architectures, including prompt injection, jailbreak attempts, malicious MCP tool usage, RAG poisoning, memory manipulation, autonomous goal hijacking, tool misuse, and context leakage.

Test Case Development

Based on the threats we identified, we develop targeted test scenarios to evaluate whether the system can be manipulated or exploited in practice. To design these scenarios, we analyze:

  1. Accepted Input Channels
    We identify all input vectors accepted by the system, including user prompts, documents, web content, APIs, structured inputs, and other external data sources.
  • Interactions with other Systems
    We analyze how the agent's outputs are used and what actions it can trigger, including interactions with internal systems, APIs, tools, or automated workflows.
  • Analyze Tool and MCP Interactions
    We identify available tools and MCP services the agent can invoke, including the parameters accepted by these tools and the systems they interact with.
  1. Input and Output Guardrails
    We evaluate the presence and effectiveness of guardrails such as prompt filtering, policy enforcement layers, response validation, tool invocation restrictions, and monitoring mechanisms.
  • Define Expected Secure Behavior
    We define what the correct behavior should be (e.g., refusal, filtered output, blocked tool call). This allows you to objectively determine success or failure.
  1. Define Attack Method / Technique
    We design specific attack methods aligned with the threat model, including prompt injection, indirect prompt injection via retrieved content, RAG poisoning, manipulation of MCP tool parameters, memory manipulation, and multi-step workflow abuse.
  2. Define Observation and Validation Criteria
    We identify what indicators confirm vulnerability: Guardrail bypass, Data leakage, Unauthorized tool invocation, Instruction override.
  3. Impact Assessment
    We describe the potential business or system impact if the behavior were exploited in a real attack.

Rather than generic probing, each technique is applied deliberately based on what your system's architecture and trust model makes exploitable.

Testing covers techniques across the following categories:

Roleplay
  • Defined Personas
  • Virtual AI
  • Antagonistic Entities Split
  • Research & Testing
  • Joking Pretext
  • Simulations
  • Game
  • CTF
  • Reporter
  • Student
Benign Context Framing
  • Fictional Framing
  • Language Completion Games
  • Synonym Word Usage
Privilege Escalation / Persuasion / Cognitive Manipulation/Overload & Attention Misalignment
  • Sudo / Admin Mode
  • Jailbroken Model Simulation
  • Typographical Authority Simulation
  • Logical, Eventual, Quantification-Based Persuasion
  • Authority, Norm-Based Persuasion
  • Emotional, Reciprocity, Commitment Persuasion
  • Instruction Repetition
  • Urgency, Scarcity Persuasion
  • Manipulative, Coercive Persuasion
  • Distractor Instructions
  • Mathematical / Decomposition Attacks
  • Indirect Task Deflection
  • Context Saturation
  • Templating
  • Meta Prompting
  • Imperative Emphasis
  • Cognitive Overload
  • Syntax-Based Input
  • Crescendo Attacks
  • Instruction Override
  • Semantic Confusion
  • Refusal Suppression
Encoding & Obfuscation & Structuring
  • Random Character Insertion/Replacement
  • Encrypted/Encoded Input and Output
  • Surface Obfuscation
  • Token Splitting
  • Payload Splitting
  • Language Blind Spotting
  • Translated Language
  • Semantic Rewriting / Linguistic Encoding / Embedded Prompting
    • Sentence Level
    • Token Level
    • Low Resource Languages
    • Base64
    • Ascii Art
  • Repeating Output / Tokens
  • Output Pruning
  • Syntax Based Output
  • Output Creative
  • Stop-Token Prevention
  • Summarization
  • Output Natural Language
  • Output Termination
Goal Conflicting Attacks
  • Prefix Injection
  • Instruction Masking
    • Text To Completion as Instruction
  • Refusal Suppression
  • Context Ignoring
  • Assumption Of Responsibility
  • Objective Juxtaposition
Uncategorized
  • Tokenization Confusion
  • Broken-Token
  • Conversation Spoofing
  • Control Token Injection/Spoofing
  • System Prompt Spoofing
  • Chain-Of-Thought Spoofing
  • Policy Puppetry
  • Tool Spoofing
  • Context Window Separation
  • Overflow Induced Amnesia
  • Context with Needle In Haystack Injection
  • Token-Aware Prompting
  • Zero-Shot Prompting
  • Defined Dictionary
  • Attack Stacking
  • Attack Concatenation
  • Indirect Visibility
  • Multi-Turn
  • Invisible Ascii Characters
  • Abuse of Prompt Structure for Agent Behaviors
  • History Replay
  • History Sniffing
  • Chained Prompt Injections
  • Punctuation
  • Non-Punctuation Alternative Tokenization
  • Misspelling
  • Context Window Overflow
  • Conditional Prompt Injection
Multi-Modal Image
  • Text in image
  • Image Downscaling
  • Obfuscation / Encoding / Steganographic

Exfiltration techniques are used to test whether user data can be exfiltrated to attacker-controlled endpoints. These techniques include, but are not limited to, the following:

  • Hyperlink Unfurling
  • Markdown
  • Tool/Agent abuse

AI model specific Denial-of-Service testing techniques include:

  • Repeated Single and Multi-Tokens
  • Context-Window Overflow
  • Extended Reasoning
  • Time Consuming Background Tasks
  • Mutating Availability
  • Inhibiting Availabilities
  • Disrupting Search Queries

The techniques listed above are non-exhaustive and do not cover all categories of issues AI systems can contain.

The methodology follows industry best practices and incorporates the security principles and threat categories defined by the Model Context Protocol Security Initiative:

MCP Server
  • Prompt injection
  • Confused Deputy
  • Tool Poisoning
  • Credential and Token Exposure
  • Insecure Server Configuration
  • Supply Chain Attacks
  • Excessive Permissions and Scope Creep
  • Data Exfiltration
  • Context Spoofing and Manipulation
  • Insecure Communication
MCP Client
  • Malicious Server Connection
  • Insecure credential storage
  • UI/UX Deception
  • Insufficient Server Validation
  • Client-Side Data Leakage
  • Excessive Permission Granting
  • Client-Side Code Execution
  • Insecure Communication Handling
  • Session and State Management Failures
  • Update and Patch Management

These top risks are evaluated using our prompt-based techniques in addition to others, such as using ANSI terminal escape codes for making prompt injection text invisible in the terminal.

About the Author

George Damiris is a Security Engineer at Anvil Secure. He specializes in web application and network security assessments, with experience identifying vulnerabilities across modern and complex attack surfaces.

In recent years, his work has focused heavily on AI security, with a recent specialty in AI Red Teaming, GenAI security, and the security assessment of agentic systems. His interests include adversarial testing of LLM-powered applications, offensive security research, prompt injection testing, and evaluating the security boundaries of autonomous and AI-driven workflows.

Tools

aqlmap - A tool to extract information from ArangoDB through AQL injection. See the introductory blogpost.


awstracer - An Anvil CLI utility that will allow you to trace and replay AWS commands.


awssig - Anvil Secure's Burp extension for signing AWS requests with SigV4.


ByteBanter - A Burp Suite extension that leverages LLMs to generate context-aware payloads for Burp Intruder. See the introductory blogpost.


dawgmon - Dawg the hallway monitor: monitor operating system changes and analyze introduced attack surface when installing software. See the introductory blogpost.


GhidraGarminApp - A Ghidra processor and loader for Garmin watch applications. See the introductory blogpost.


HANAlyzer - A tool that automates SAP HANA security checks and outputs clear HTML reports. See the introductory blogpost.


IPAAutoDec - A tool that decrypts IPA files end-to-end via SSH. See the introductory blogpost.


nanopb-decompiler - Our nanopb-decompiler is an IDA python script that can recreate .proto files from binaries compiled with 0.3.x, and 0.4.x versions of nanopb. See the introductory blogpost.


OffTempo - A Burp Suite extension for statistical timing side-channel analysis. See the introductory blogpost.


PQCscan - A scanner that can determine whether SSH and TLS servers support PQC algorithms. See the introductory blogpost.


SAPCARve - A utility Python script for manipulating SAP's SAR archive files. See the introductory blogpost.


ulexecve - A tool to execute ELF binaries on Linux directly from userland. See the introductory blogpost.


usb-racer - A tool for pentesting TOCTOU issues with USB storage devices.

Recent Posts