How We Use LLMs in Secure Code Review

By Tao Sauvage

Overview

This post explains how Anvil Secure uses LLMs as analyst-guided tools during secure code review engagements, not as autonomous reviewers. It covers the process-focused workflow, its advantages over single-pass automation, and the results of a structured experiment comparing both approaches across three deliberately vulnerable open-source codebases.

At Anvil, we increasingly encounter LLMs across our clients' Software Development Life Cycles (SDLCs). For teams operating at scale, it is entirely rational to optimize those systems for throughput, consistency, and low human overhead. Anthropic's recent Mythos Preview announcement, and the broader discussion it has already intensified around AI-assisted cybersecurity, suggest that trend will only accelerate.

Our constraints are different. In a security review, especially on a large and complex codebase, depth matters more than throughput. That is why we use LLMs less as autonomous reviewers and more as analyst-guided tools: planning the work, decomposing the problem, externalizing state, and validating each phase before moving forward.

Our COO captured that principle well in a recent blog post:

"But tools are not substitutes for professional judgment. They can surface patterns and move quickly, but they cannot replace experience, context, or accountability."

In this post, I explain how our process-focused workflow works in practice, why we prefer it during security reviews, and where it differs from automation-first review embedded in an SDLC.

The limits of automation-first review

A key concept with LLMs is the context window: the amount of text a model can consider at once when generating its next output. It includes the current prompt, prior conversations, and any documents or code provided to the model.

Research on LLMs has shown that larger context windows do not automatically produce better outcomes. As the context window grows, it becomes noisier and models often become less reliable at retrieving the right details, following earlier instructions consistently, and maintaining coherent reasoning.

LLM review works best when the scope is narrow. A pull request is often small enough for the model to reason about in depth and identify potential issues, such as boundary errors or injection risks. What it more often misses is the broader attack surface: trust boundaries, multi-step exploit chains, and vulnerabilities that only emerge through the interactions between components.

In theory, you could try to load a much larger portion of the codebase into the context. In practice, you either hit context limits or end up with a noisy working set that degrades the model's usefulness.

Frontier cyber-focused systems are improving quickly. Mythos, for example, explicitly highlights stronger performance on exploit chaining and other multi-step attack scenarios. We keep monitoring developments like these and updating our methodology accordingly.

Our process-focused approach

Our approach takes advantage of newer coding workflows in which models can explore files, maintain a plan, use external tools, and revise their approach as new information appears. In practice, our workflow is built around planning, task decomposition, file-based memory, phased execution, tight scoping, and explicit human checkpoints.

I recently performed an engagement for a client on a product that will be open-sourced in the future. The system was complex and the codebase was large. We were authorized to use LLMs as part of our toolkit, and my approach went as follows:

I usually start with a short prompt, often just a few sentences, explaining that we're starting a security review and defining the scope. Often, I scope the session to a specific subsystem, such as secure boot, a REST API, or a specific daemon. I ask the model to explore the code, read the existing documentation, map how that subsystem is integrated into the larger system, and draft a review plan that starts with a threat model.

I then ask the model to execute the plan one phase at a time, which gives me a chance to review each intermediate result before moving forward. That keeps the process deliberate and makes it easy to steer when needed. For instance, I might clarify that component X is out of scope or that vulnerability class Y is not relevant to our client's business.

Crucially, I ask the model to externalize its working state into Markdown files: a master plan, then separate notes or reports for each phase. Those files become the source of truth for the session rather than the raw chat history. As the session grows and the working context becomes harder for the model to use reliably, I can explicitly re-anchor it by asking it to re-read the master plan and the reports from the relevant earlier phases before continuing.
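The file-based memory described above can be sketched as a small helper. This is a minimal illustration, not the tooling used on the engagement: the file names (`00-master-plan.md`, `01-...md`), the note headings, and the functions `init_review_notes` and `reanchor_prompt` are all hypothetical.

```python
from pathlib import Path


def slugify(name: str) -> str:
    """Turn a phase name such as 'REST API' into 'rest-api'."""
    return "-".join(name.lower().split())


def init_review_notes(root: Path, phases: list[str]) -> Path:
    """Create a master plan plus one notes file per phase (hypothetical layout)."""
    root.mkdir(parents=True, exist_ok=True)
    lines = ["# Master Plan", "", "## Phases", ""]
    for index, name in enumerate(phases, start=1):
        note = root / f"{index:02d}-{slugify(name)}.md"
        if not note.exists():  # never overwrite notes from an earlier phase
            note.write_text(
                f"# Phase {index}: {name}\n\n## Findings\n\n## Open Questions\n",
                encoding="utf-8",
            )
        lines.append(f"- [ ] Phase {index}: {name} ({note.name})")
    plan = root / "00-master-plan.md"
    plan.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return plan


def reanchor_prompt(root: Path, phase_numbers: list[int]) -> str:
    """Build an instruction asking the model to re-read the plan and selected phase notes."""
    files = ["00-master-plan.md"]
    for number in phase_numbers:
        files.extend(sorted(p.name for p in root.glob(f"{number:02d}-*.md")))
    listing = "\n".join(f"- {name}" for name in files)
    return (
        "Before continuing, re-read these files and treat them as the source of truth:\n"
        + listing
    )
```

Calling `reanchor_prompt(root, [1, 3])` would list the master plan and the notes for phases 1 and 3, which is the kind of explicit re-anchoring message an analyst might send once the chat history becomes unreliable.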

This workflow requires the human analyst to stay actively involved, unlike automation-first review designed to minimize oversight. It is harder to scale, but for a boutique security consultancy that is not a drawback. It is precisely where human judgment, context, and accountability have the most value.

Experiment

As an experiment, I compare two ways of using an LLM to review the same codebase. The first is a single-pass review driven entirely by the initial prompt, with no intermediate steering. The second is the process-focused review described above. To be clear, this is not a controlled reproduction of a production SDLC pipeline. Still, I consider it a practical comparison between prompt-centric and process-centric usage that others should be able to replicate on their own machines.

I will compare the two approaches not just on the number of findings, but also on coverage, precision, and the quality of reasoning behind each finding.

Prior work in adjacent settings points in a similar direction: simple prompting can help on narrow tasks, but staged workflows become more useful once the review depends on broader repo context, cross-file reasoning, or iterative validation.

I am using open-source applications that are deliberately vulnerable, including projects listed in the OWASP Vulnerable Web Applications Directory, because it lets me select targets across a range of codebases and technology stacks. This avoids involving client code or undisclosed vulnerabilities.

These targets are smaller and more intentionally vulnerable than the systems we review during real engagements, so the experiment should be read as illustrative rather than definitive. In some cases, that smaller scope may even help the single-pass baseline and make the multi-stage breakdown less relevant.

I selected:

  • Juice Shop
  • AndroGoat
  • IoTGoat

Before running the experiment, I built a target-by-target validation list of expected root causes from public project material and manual review. That produced 91 expected root causes across all targets. Because those target-level sets were not derived in exactly the same way, the per-target comparisons are more meaningful than the aggregate percentage.

I counted both raw findings and unique root causes, since multiple findings can sometimes collapse to the same underlying vulnerability pattern.

For the one-shot baseline, I repeated that single-prompt condition three times per target. In the results below, the one-shot counts represent the union of supported, non-duplicate findings across those three runs rather than the output of a single attempt. This gives the baseline the benefit of three attempts rather than one and reduces the chance that an unusually weak pass dominates the comparison.
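The union step for the one-shot baseline, and the distinction between raw findings and unique root causes, can be illustrated with a short sketch. The `Finding` record, the root-cause labels, and the file paths are hypothetical placeholders; in the experiment, deduplication and validation were manual analyst judgments rather than a keyed lookup.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    root_cause: str  # analyst-assigned label (hypothetical)
    location: str    # file path or route (hypothetical)
    supported: bool  # True if the claim survived manual validation


def union_of_runs(runs: list[list[Finding]]) -> list[Finding]:
    """Merge runs, keeping the first supported finding per (root_cause, location)."""
    seen: set[tuple[str, str]] = set()
    merged: list[Finding] = []
    for run in runs:
        for finding in run:
            key = (finding.root_cause, finding.location)
            if finding.supported and key not in seen:
                seen.add(key)
                merged.append(finding)
    return merged


# Three illustrative runs against the same target.
runs = [
    [Finding("sql-injection", "example/login.py", True)],
    [
        Finding("sql-injection", "example/login.py", True),   # duplicate of run 1
        Finding("sql-injection", "example/search.py", True),  # same root cause, new location
        Finding("weak-hash", "example/crypto.py", True),
    ],
    [Finding("open-redirect", "example/redirect.py", False)],  # unsupported, dropped
]

merged = union_of_runs(runs)
root_causes = {finding.root_cause for finding in merged}
print(len(merged), len(root_causes))  # 3 2
```

In this toy data the union holds three non-duplicate findings but only two root causes, which is why the results report both counts.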

For the process-focused condition, I ran one phased review per target using the workflow described above. I approved one phase at a time and moved on to the next without additional direction. That differs from real engagements, where steering is often the critical part of the process, but I limited intervention here to keep the comparison cleaner. I did not repeat the process-focused condition three times, mainly because it was materially more time-consuming to supervise and validate. As a result, this experiment says more about relative coverage than about run-to-run stability in the process-focused workflow.

The exact prompts used for both conditions are included in the appendices.

Results

Across all three targets, the process-focused run produced more non-duplicate findings and more unique root causes than the one-shot baseline, even though the one-shot results reflect the union of three runs.

| Target | One-shot findings | Process-focused findings | One-shot root causes | Process-focused root causes |
|---|---|---|---|---|
| Juice Shop | 13 | 19 | 13 | 18 |
| AndroGoat | 9 | 22 | 9 | 22 |
| IoTGoat | 5 | 7 | 5 | 6 |
| Total | 27 | 48 | 27 | 46 |

Looking only at unique root causes, expected-set coverage moved from 27/91 (29.7%) in one-shot to 46/91 (50.5%) in process-focused. Precision stayed high in both conditions, although the process-focused runs included two unique claims I could not substantiate. Reasoning quality was consistently stronger in the process-focused condition, scoring 3 versus 2 for the one-shot baseline across all three targets, with better source specificity and clearer exploitability framing.
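The coverage arithmetic can be reproduced with a few lines of Python. The root-cause counts come from the results table and the expected-set sizes from Appendix A. The per-target percentages are derived here rather than reported in the post, and they assume that every reported root cause maps to an item in that target's expected set, as the aggregate figure does.

```python
# Expected root causes per target (Appendix A) and unique root causes found (results table).
expected = {"Juice Shop": 45, "AndroGoat": 33, "IoTGoat": 13}
one_shot = {"Juice Shop": 13, "AndroGoat": 9, "IoTGoat": 5}
process_focused = {"Juice Shop": 18, "AndroGoat": 22, "IoTGoat": 6}


def coverage(found: int, total: int) -> float:
    """Percentage of expected root causes identified, rounded to one decimal place."""
    return round(100 * found / total, 1)


for target, total in expected.items():
    print(target, coverage(one_shot[target], total), coverage(process_focused[target], total))

total_expected = sum(expected.values())
print(
    "Total",
    coverage(sum(one_shot.values()), total_expected),         # 29.7
    coverage(sum(process_focused.values()), total_expected),  # 50.5
)
```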

The gap was smaller on IoTGoat, likely because the target itself was smaller and its relevant attack surface was less expansive than in the other two cases.

Even in this relatively lightweight setup, a phased, stateful workflow produced meaningfully better coverage and stronger analysis than a pure single-prompt pass. The gain appears to come less from pushing the model harder and more from structuring the work: gathering context selectively, producing reviewable intermediate outputs, and revisiting prior state when needed.

Appendix A - notes related to the methodology

The 91-item expected set was built differently across targets. For Juice Shop, the official material includes multiple challenge items that collapse to the same underlying issue, so I reduced those to a code-level upper bound of 45 distinct root causes. For AndroGoat and IoTGoat, I relied more directly on published vulnerabilities and challenge material, resulting in 33 and 13 expected root causes respectively.

I also stripped comments, walkthrough hints, and challenge labels from the local copies where practical. A leakage risk still remains, since a model could rely on prior knowledge or public documentation outside the local source tree. I instructed both conditions to rely only on the local code and treated any residual leakage risk as shared across both approaches.

All runs in this experiment used GPT-5.4-Mini with Medium reasoning effort. I kept the model and reasoning setting fixed across both approaches to isolate workflow effects as much as possible.

For this experiment, I treat coverage as the number of distinct expected root causes identified, precision as the proportion of non-duplicate claims that remained supported after manual validation, and reasoning quality as a qualitative score based on source specificity, exploitability framing, and causal clarity.
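As a minimal sketch of the coverage and precision definitions above, the two metrics could be computed from validated claim records as follows. The `Claim` structure and the sample data are hypothetical. Reasoning quality is a qualitative analyst score and is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    root_cause: str        # analyst-assigned label (hypothetical)
    supported: bool        # survived manual validation
    in_expected_set: bool  # matches an item on the validation list


def precision(claims: list[Claim]) -> float:
    """Proportion of non-duplicate claims that remained supported."""
    if not claims:
        raise ValueError("precision is undefined without claims")
    return sum(claim.supported for claim in claims) / len(claims)


def coverage(claims: list[Claim]) -> int:
    """Number of distinct expected root causes identified."""
    return len({c.root_cause for c in claims if c.supported and c.in_expected_set})


# Illustrative, already de-duplicated claims for one target.
claims = [
    Claim("sql-injection", supported=True, in_expected_set=True),
    Claim("weak-hash", supported=True, in_expected_set=True),
    Claim("verbose-errors", supported=True, in_expected_set=False),
    Claim("open-redirect", supported=False, in_expected_set=False),
]

print(precision(claims))  # 0.75
print(coverage(claims))   # 2
```

Note that a supported claim outside the expected set, such as `verbose-errors` here, raises precision without adding to coverage.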

As a small additional check, I repeated the one-shot baseline three times per target to get a rough sense of variance in the prompt-centric condition. I did not repeat the process-focused workflow to the same extent because it required substantially more analyst time, so these figures should not be read as a comparative stability claim. Even within that limited check, the one-shot baseline showed modest overlap across runs:

  • Juice Shop: 3 of 13 root causes appeared in all three runs
  • AndroGoat: 5 of 9
  • IoTGoat: 4 of 5
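The "N of M" figures above compare the root causes found in every run against the union across runs. A minimal sketch of that calculation, using hypothetical labels rather than the experiment's data:

```python
def overlap(runs: list[set[str]]) -> tuple[int, int]:
    """Return (root causes present in every run, root causes present in any run)."""
    if not runs:
        return (0, 0)
    in_all_runs = set.intersection(*runs)
    in_any_run = set.union(*runs)
    return (len(in_all_runs), len(in_any_run))


# Three illustrative one-shot runs with hypothetical root-cause labels.
runs = [
    {"sql-injection", "weak-hash", "hardcoded-secret"},
    {"sql-injection", "hardcoded-secret", "path-traversal"},
    {"sql-injection", "hardcoded-secret"},
]

stable, total = overlap(runs)
print(f"{stable} of {total}")  # 2 of 4
```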

Appendix B - one-shot prompt

You are a senior security reviewer performing a one-shot static security review of the attached source tree.

Target:
- Name: [APP_NAME]
- Version/tag: [APP_VERSION]
- Repository: [REPOSITORY_URL]

Rules:
- Use only the attached source code and project files.
- Do not use external writeups, public solution guides, internet search, or prior challenge knowledge.
- Do not ask follow-up questions.
- Do not propose a review plan.
- Do not include generic best-practice commentary unless it is tied to a concrete source location.
- Report only issues supported by specific code, configuration, manifest, or build files.
- Prefer exact file paths and line numbers. If line numbers are unavailable, provide the file path plus function, class, route, component, or configuration key.
- Treat this as a static source review. Do not assume dynamic testing was performed.
- Do not force findings into a predefined taxonomy.
- Do not invent findings to satisfy a quota.

Look for concrete, source-backed vulnerabilities, including but not limited to:
- authentication and authorization flaws
- input validation and injection issues
- insecure cryptography or secret handling
- insecure storage of sensitive data
- unsafe file, network, IPC, or command execution behavior
- insecure defaults, debug paths, or dangerous configuration
- platform-specific issues relevant to the target
- dependency, build, or deployment risks visible in the source tree

Output only the report below.

## Vulnerability Report

### Assumptions And Coverage Limits
Briefly state any important limitations in 3 bullets or fewer.

### Findings
| Severity | Title | Location | Impact | Evidence / Reasoning | Confidence |
|---|---|---|---|---|---|

Severity must be Low, Medium, or High.
Impact must be one sentence.
Evidence / Reasoning must explain why the source supports the finding in no more than two sentences.
Confidence must be Low, Medium, or High.

### Coverage Notes
List the main subsystems or file areas you reviewed, and any major areas you could not assess from the attached source.

Appendix C - process-focused prompt

You are helping perform a process-focused security review of the local source tree for the target application.

Target:
- Name: [APP_NAME]
- Version/tag: [APP_VERSION]
- Repository: [REPOSITORY_URL]

Rules:
- Use only the local source code and project files.
- Do not use external writeups, public solution guides, internet search, or prior challenge knowledge.
- Start by exploring the project structure and reading relevant documentation, manifests, build files, routes, entrypoints, and security-sensitive components.
- Write a review plan that begins with a threat model.
- Break the review into phases with clear objectives, expected files or subsystems, and checkpoint questions for the human reviewer.
- Do not execute the full review in the first response.
- Externalize durable state into Markdown notes for the master plan and each review phase.
- Findings must be source-backed and include concrete locations.

Initial output:

    # Process-Focused Review Plan: [APP_NAME]

    ## Target Metadata

    ## Project Overview

    ## Initial Threat Model

    ## Review Phases

    ## Human Checkpoints

    ## Evidence And Findings Rules

After the plan is approved, execute one phase at a time and update the notes before moving to the next phase.

About the Author

Tao Sauvage is Director of Research at Anvil Secure. He loves finding vulnerabilities in anything he gets his hands on, especially when it involves embedded systems, reverse engineering and code review.

His previous research projects covered mobile OS security, resulting in multiple CVEs for Android, and wind farm equipment, with the creation of a proof-of-concept “worm” targeting Antaira systems.

He used to be a core developer of the OWASP OWTF project, an offensive web testing framework, and maintain CANToolz, a Python framework for black-box CAN bus analysis.
