AI Coding · Architecture · Security

No AI Coding Agent Builds Secure Applications. Here's What the Data Actually Shows.

March 12, 2026 · 12 min read

DryRun Security published a study yesterday that deserves more attention than the headline suggests.

The headline — "Anthropic's Claude Generates the Most Unresolved Security Flaws in AI-Built Applications" — is accurate, but it frames the finding as a competition between agents. That framing misses the more important result, which appears in a single sentence buried toward the end of the report: no agent produced a fully secure application.

Not Claude. Not Codex. Not Gemini. None of them.

This is not a story about which AI coding tool is worst. It is a story about a structural failure that affects every major AI coding agent in production today — and what that failure reveals about what is actually missing from the autonomous agent stack.

What the Study Found

DryRun's methodology is worth understanding because it is different from benchmark evaluations. They did not ask agents to solve coding puzzles or pass test suites. They had Claude, Codex, and Gemini each build two full applications — a family allergy tracking web app and a browser-based racing game — through sequential pull requests, mirroring how real engineering teams build software over time. Each PR was analyzed before the next feature was implemented.

The results, by agent:

Codex finished with the fewest vulnerabilities and demonstrated something the other agents did not: self-correction behavior. It identified and remediated some vulnerabilities it had introduced during earlier PRs. This is architecturally significant — more on that shortly.

Gemini introduced vulnerabilities early and removed some through later modifications, but still finished with several high-severity findings.

Claude produced the highest number of unresolved high-severity flaws in the final codebases. Despite writing functionally correct security code in individual PRs, it accumulated the most unresolved issues overall.

The finding that applies to all three is this: four vulnerability classes appeared in every final codebase, across both applications, regardless of which agent built them. All four were authentication-related. The pattern was the same in every case: the agent implemented authentication middleware — correctly, locally — but failed to apply it consistently across the system. REST API endpoints were protected. WebSocket endpoints were not. Form handlers had validation. Equivalent data paths through different entry points did not.
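To make the failure mode concrete, here is a minimal sketch in Python. All names are hypothetical, chosen to echo the allergy-tracker scenario; this is an illustration of the pattern the study describes, not code from the study. The auth middleware is correct where it is applied — the gap is the second entry point that writes the same data and never gets it.

```python
def require_auth(handler):
    """Middleware: reject any request without a valid session token."""
    def wrapped(request):
        if request.get("token") != "valid-token":
            return {"status": 401}
        return handler(request)
    return wrapped

@require_auth
def update_allergy_rest(request):
    """REST endpoint: the middleware is applied here."""
    return {"status": 200, "saved": request["payload"]}

def update_allergy_ws(message):
    """WebSocket handler: same write path, middleware never applied."""
    return {"status": 200, "saved": message["payload"]}

unauthenticated = {"payload": {"allergen": "peanut"}}
print(update_allergy_rest(unauthenticated)["status"])  # 401: rejected
print(update_allergy_ws(unauthenticated)["status"])    # 200: accepted, the gap
```

A file-level scan of the REST handler finds correct authentication; nothing in that file hints that an equivalent, unprotected path exists elsewhere.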

The Deeper Pattern

The authentication consistency failure is not a code quality problem. It is an architectural awareness problem.

Every agent that DryRun tested could write correct authentication code. The evidence is in the study — the middleware was implemented, the logic was right, the individual PR would pass a behavioral code review. What the agents lacked was not the ability to write secure code. What they lacked was a system-level understanding of where that security code needed to be applied.

This is a specific and important distinction. Behavioral verification — running tests, scanning individual files for vulnerabilities — cannot catch this failure class. A scanner that checks whether authentication middleware exists will find it. The problem is in what the scanner cannot see: the other endpoints, in other files, added in other PRs, that process equivalent data without the same protection.

Catching this requires something different: a model of the system's architecture that tracks which patterns have been applied where, and enforces consistency across the whole. Not a scanner. An architectural constraint layer.

The Codex self-correction behavior points to what this looks like in practice. Codex's agent loop appears to include some mechanism for checking its current output against previous work. It is not a perfect architectural verifier — it still finished with vulnerabilities — but it produced materially better outcomes than agents without this feedback mechanism. The difference is not model quality. It is agent loop architecture.
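A feedback loop of this shape can be sketched in a few lines. This is a toy model of the idea, not Codex's actual loop (which is not public): after every applied change, re-run a system-wide check and queue remediation tasks for anything it flags, including issues introduced by earlier steps. The task and system representations here are hypothetical.

```python
def agent_loop(tasks, system, apply_task, check):
    """Apply tasks; after each one, turn every system-wide
    violation into a new remediation task (queued at most once)."""
    queued = set()
    while tasks:
        apply_task(system, tasks.pop(0))
        for violation in check(system):
            if violation not in queued:
                queued.add(violation)
                tasks.append(("fix", violation))
    return system

# Toy system: each endpoint maps to the set of patterns applied to it.
system = {}

def apply_task(sys, task):
    if task[0] == "add":          # ("add", endpoint): new, unprotected endpoint
        sys[task[1]] = set()
    elif task[0] == "fix":        # ("fix", endpoint): apply the missing pattern
        sys[task[1]].add("auth")

def check(sys):
    """Flag every endpoint missing authentication."""
    return [ep for ep, patterns in sys.items() if "auth" not in patterns]

result = agent_loop([("add", "/api/users"), ("add", "/ws/users")],
                    system, apply_task, check)
print(result)  # {'/api/users': {'auth'}, '/ws/users': {'auth'}}
```

The point of the sketch is the loop shape: `check` runs over the whole accumulated system, not just the current change, so drift introduced in PR 3 can still be caught and queued for repair in PR 7.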

Why This Matters Beyond the Security Headline

The DryRun study was published on the same day that Check Point Research documented active CVEs in Claude Code: CVE-2025-59536 and CVE-2026-21852, covering remote code execution and API token exfiltration triggered by malicious repository project files. Any developer who clones an untrusted repository and opens it with Claude Code is exposed.

These two findings — systemic security inconsistency in agent-built applications, and active supply chain exploitation of the most widely used coding tool — describe the same underlying problem from two directions.

The DryRun finding shows what happens when agents operate without architectural awareness: they implement security locally, miss it globally, and the accumulated risk compounds with every PR.

The Claude Code CVEs show what happens when agents trust their environment without validation: adversaries embed malicious instructions in configuration files, and the agent executes them.

Both failures reduce to the same architectural gap: agents operating without a validated, authoritative model of the system they are working in and the constraints they must respect.

What Most People Miss

The conversation about AI coding agent security tends to focus on prompt injection — getting an agent to do something it should not by embedding instructions in the input. Prompt injection is real and important. But the DryRun and CVE findings point to failure modes that are structurally more severe.

Authentication inconsistency is not an injection attack. It is the natural outcome of an agent that reasons locally without system-level context. The agent is doing exactly what it was asked to do — implement the feature, add authentication where the task specifies it. The gap is not in the agent's execution. It is in the task's specification and the agent's inability to independently verify that its implementation is globally consistent.

This is the same failure mode that shows up in multi-agent architectural drift (locally correct, globally inconsistent), in the DORA bug rate data (+9% bugs with 90% AI adoption), and in the MSR 2026 PR acceptance data (74.1% of agentic PRs accepted without modification). The pattern is consistent: agents optimize locally, miss globally, and the gap compounds at scale.

What is missing is not a better prompt or a better model. It is a system that maintains an authoritative understanding of the architecture — what patterns have been applied, where they must be applied, and what consistency rules must hold — and makes that understanding available to agents during development, not just to reviewers afterward.

What Successful Platforms Must Do

The DryRun findings define the requirement precisely: platforms that want to produce secure agent-built software need a constraint enforcement layer that checks security pattern consistency across the system, not just within individual files or PRs.

Concretely: when an agent adds authentication middleware to a REST endpoint, the constraint layer should check whether every other endpoint in that system that handles equivalent data has the same protection. Not by scanning the code. By maintaining a model of which security patterns are supposed to apply to which system components, and verifying that the model is satisfied.
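One way to sketch such a constraint layer, under the assumptions above (all names and the data-classification scheme are hypothetical): a declarative model of which security patterns must hold for which kinds of components, plus a check that compares the model against the system's current state.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """One entry point in the system, tagged with the class of data it handles."""
    name: str
    data_class: str                         # e.g. "user-health-data"
    patterns: set = field(default_factory=set)  # security patterns actually applied

# The constraint model: every component handling this data class
# must carry these patterns, regardless of file, PR, or protocol.
REQUIRED = {"user-health-data": {"auth-middleware", "input-validation"}}

def check(components):
    """Return (component name, missing patterns) for every violation."""
    violations = []
    for c in components:
        missing = REQUIRED.get(c.data_class, set()) - c.patterns
        if missing:
            violations.append((c.name, missing))
    return violations

system = [
    Component("POST /allergies", "user-health-data",
              {"auth-middleware", "input-validation"}),
    Component("ws://allergies", "user-health-data",
              {"input-validation"}),        # auth was never applied here
]

print(check(system))  # [('ws://allergies', {'auth-middleware'})]
```

Note what the check keys on: the data class, not the file or the protocol. That is what lets it flag the WebSocket endpoint even though the REST endpoint, viewed in isolation, is fully correct.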

This is the architectural verification use case — not abstract, not theoretical, but documented in a controlled study published yesterday with real applications and real vulnerability counts.

The platforms that embed this layer — queryable by the agent during development, not just by a scanner after the PR — will produce the self-correction behavior that made Codex's output better than its competitors. Not because of a better model. Because of a better architecture.

Where This Is Heading

The DryRun study will not be the last of its kind. As empirical evaluation of agent-built software becomes standard practice — which it will, following OWASP's Top 10 for Agentic Applications and the growing enterprise compliance pressure — the finding that no agent builds secure applications will become a procurement-level concern.

The organizations that address this proactively — by embedding architectural constraint enforcement and security consistency checking into their agent pipelines before the compliance requirements crystallize — will be in a different position than those that respond reactively.

The security evidence is now empirical. The architectural solution is known. The gap between them is the product that needs to be built.
