How We Test AI: LLM & GenAI Security Methodology at Anvil Secure

By Anvil SecureOn May 27, 2026May 27, 20260 Comments

By George Damiris

Overview

The company's methodology for testing LLM and GenAI services is based on industry best practices as well as hands-on experience testing AI agents and models across multiple platforms and providers, combined with published industry frameworks: the OWASP Top 10 for Large Language Models and the OWASP LLM Security Verification Standard. Testing is performed through a combination of manual adversarial testing, semi-automated tooling, and proprietary research tooling developed internally.

The scope of assessment focuses on the customer's deployed model configuration and custom software stack — not the underlying cloud infrastructure managed by the provider. Within the shared responsibility model, this covers the application layer, agent architecture, integrations, and AI-specific attack surfaces.

The company's methodology is focused on the following high-level areas:

Data: Protecting sensitive data used for training and inference, ensuring it's anonymized and compliant with regulations like GDPR.
Model: Securing the specific AI model in use, protecting it from attacks like adversarial inputs or model poisoning.
Access: Managing access controls to ensure only authorized users and applications can access the AI model and its data. Additionally, ensuring access controls are implemented to prevent the AI model from being exploited to bypass the intended authorization.
Applications: Securing the applications and agents built on top of the AI platform, as well as how they are configured and used.

Approach and Scope Definition

As each system has unique capabilities, integrations and risk exposure, our approach is never generic or checklist driven. Each project is tailored based on its architecture, trust boundaries and business risks of your deployment.

Rather than testing the model in isolation, we evaluate the entire AI execution chain: inputs, logic, permissions, and system impact. This ensures we discover systemic weaknesses, not just surface-level vulnerabilities.

Our objective is simple: identify how your AI could be manipulated, measure the real business impact, and provide clear, architecture-aligned remediation guidance that strengthens both security and operational resilience.

Scope is never assumed. Prior to testing, the company performs a structured threat modeling tailored to the target system. This produces the test case inventory used throughout the engagement. We map your AI ecosystem end-to-end:

Identify business purposes and Operational Context: We define the intended business function of the system, the level of autonomy granted to the agent, user roles interacting with the system, and the critical workflows it supports.

Identify the Model in Use
We document the model provider, model family, and version used by the service. Model capabilities, context limits, safety mechanisms, and update cadence influence the system's security posture.
Map Architecture and Components
We document the system architecture including orchestration layers, RAG pipelines, memory stores, vector databases, tool integrations, APIs, plugins, MCP integrations, external services, and human-in-the-loop oversight points.
Identify MCP Integrations and Capabilities
We identify connected MCP servers and the tools or services they expose to the agent. We document what operations these tools allow and what systems they interact with.
Identify Trust Boundaries
We determine where data or control crosses security domains, such as user input channels, external content sources, MCP servers, inter-agent communication, and interactions with internal systems.
Classify Assets and Sensitive Data
We identify sensitive assets accessible to the system, including credentials, API tokens, system prompts, proprietary knowledge bases, personal data, and other regulated information. We evaluate the potential operational, financial, or reputational impact if vulnerabilities were exploited.

Analyze Agent Capabilities and Permissions
We evaluate what the agent can read, write, execute, or trigger through internal tools, MCP services, APIs, and integrated platforms.
Enumerate AI-Specific Attack Vectors
We identify potential attack paths specific to LLM and agentic architectures, including prompt injection, jailbreak attempts, malicious MCP tool usage, RAG poisoning, memory manipulation, autonomous goal hijacking, tool misuse, and context leakage.

Test Case Development

Based on the threats we identified, we develop targeted test scenarios to evaluate whether the system can be manipulated or exploited in practice. To design these scenarios, we analyze:

Accepted Input Channels
We identify all input vectors accepted by the system, including user prompts, documents, web content, APIs, structured inputs, and other external data sources.

Interactions with other Systems
We analyze how the agent's outputs are used and what actions it can trigger, including interactions with internal systems, APIs, tools, or automated workflows.
Analyze Tool and MCP Interactions
We identify available tools and MCP services the agent can invoke, including the parameters accepted by these tools and the systems they interact with.

Input and Output Guardrails
We evaluate the presence and effectiveness of guardrails such as prompt filtering, policy enforcement layers, response validation, tool invocation restrictions, and monitoring mechanisms.

Define Expected Secure Behavior
We define what the correct behavior should be (e.g., refusal, filtered output, blocked tool call). This allows you to objectively determine success or failure.

Define Attack Method / Technique
We design specific attack methods aligned with the threat model, including prompt injection, indirect prompt injection via retrieved content, RAG poisoning, manipulation of MCP tool parameters, memory manipulation, and multi-step workflow abuse.
Define Observation and Validation Criteria
We identify what indicators confirm vulnerability: Guardrail bypass, Data leakage, Unauthorized tool invocation, Instruction override.
Impact Assessment
We describe the potential business or system impact if the behavior were exploited in a real attack.

Rather than generic probing, each technique is applied deliberately based on what your system's architecture and trust model makes exploitable.

Testing covers techniques across the following categories:

Roleplay

Defined Personas
Virtual AI
Antagonistic Entities Split
Research & Testing
Joking Pretext

Simulations
Game
CTF
Reporter
Student

Benign Context Framing

Fictional Framing
Language Completion Games
Synonym Word Usage

Privilege Escalation / Persuasion / Cognitive Manipulation/Overload & Attention Misalignment

Sudo / Admin Mode
Jailbroken Model Simulation
Typographical Authority Simulation
Logical, Eventual, Quantification-Based Persuasion
Authority, Norm-Based Persuasion
Emotional, Reciprocity, Commitment Persuasion
Instruction Repetition
Urgency, Scarcity Persuasion
Manipulative, Coercive Persuasion
Distractor Instructions

Mathematical / Decomposition Attacks
Indirect Task Deflection
Context Saturation
Templating
Meta Prompting
Imperative Emphasis
Cognitive Overload
Syntax-Based Input
Crescendo Attacks
Instruction Override
Semantic Confusion
Refusal Suppression

Encoding & Obfuscation & Structuring

Random Character Insertion/Replacement
Encrypted/Encoded Input and Output
Surface Obfuscation
Token Splitting
Payload Splitting
Language Blind Spotting
Translated Language
Semantic Rewriting / Linguistic Encoding / Embedded Prompting
- Sentence Level
- Token Level
- Low Resource Languages
- Base64
- Ascii Art

Repeating Output / Tokens
Output Pruning
Syntax Based Output
Output Creative
Stop-Token Prevention
Summarization
Output Natural Language
Output Termination

Goal Conflicting Attacks

Prefix Injection
Instruction Masking
- Text To Completion as Instruction

Refusal Suppression
Context Ignoring
Assumption Of Responsibility
Objective Juxtaposition

Uncategorized

Tokenization Confusion
Broken-Token
Conversation Spoofing
Control Token Injection/Spoofing
System Prompt Spoofing
Chain-Of-Thought Spoofing
Policy Puppetry
Tool Spoofing
Context Window Separation
Overflow Induced Amnesia
Context with Needle In Haystack Injection
Token-Aware Prompting

Zero-Shot Prompting
Defined Dictionary
Attack Stacking
Attack Concatenation
Indirect Visibility
Multi-Turn
Invisible Ascii Characters
Abuse of Prompt Structure for Agent Behaviors
History Replay
History Sniffing
Chained Prompt Injections
Punctuation
Non-Punctuation Alternative Tokenization
Misspelling
Context Window Overflow
Conditional Prompt Injection

Multi-Modal Image

Text in image
Image Downscaling
Obfuscation / Encoding / Steganographic

Exfiltration techniques are used to test whether user data can be exfiltrated to attacker-controlled endpoints. These techniques include, but are not limited to, the following:

Hyperlink Unfurling
Markdown
Tool/Agent abuse

AI model specific Denial-of-Service testing techniques include:

Repeated Single and Multi-Tokens
Context-Window Overflow
Extended Reasoning
Time Consuming Background Tasks
Mutating Availability
Inhibiting Availabilities
Disrupting Search Queries

The techniques listed above are non-exhaustive and do not cover all categories of issues AI systems can contain.

The methodology follows industry best practices and incorporates the security principles and threat categories defined by the Model Context Protocol Security Initiative:

MCP Server

Prompt injection
Confused Deputy
Tool Poisoning
Credential and Token Exposure
Insecure Server Configuration

Supply Chain Attacks
Excessive Permissions and Scope Creep
Data Exfiltration
Context Spoofing and Manipulation
Insecure Communication

MCP Client

Malicious Server Connection
Insecure credential storage
UI/UX Deception
Insufficient Server Validation

Client-Side Data Leakage
Excessive Permission Granting
Client-Side Code Execution
Insecure Communication Handling
Session and State Management Failures
Update and Patch Management

These top risks are evaluated using our prompt-based techniques in addition to others, such as using ANSI terminal escape codes for making prompt injection text invisible in the terminal.

George Damiris is a Security Engineer at Anvil Secure. He specializes in web application and network security assessments, with experience identifying vulnerabilities across modern and complex attack surfaces.

In recent years, his work has focused heavily on AI security, with a recent specialty in AI Red Teaming, GenAI security, and the security assessment of agentic systems. His interests include adversarial testing of LLM-powered applications, offensive security research, prompt injection testing, and evaluating the security boundaries of autonomous and AI-driven workflows.

How We Test AI: LLM & GenAI Security Methodology at Anvil Secure

Approach and Scope Definition

Test Case Development

About the Author

Tools

Recent Posts