AI Coding · Benchmarks · Architecture

75% of AI Coding Agents Break Working Code Over Time. Here Is Why That Number Matters.

March 13, 2026 · 11 min read

Three papers published or covered yesterday measure the same underlying failure from three different angles. The number that ties them together is 75.

Alibaba's SWE-CI benchmark, released on arXiv this week, evaluated 18 models from 8 providers on 100 real-world codebase maintenance tasks — each spanning an average of 233 days of evolution history and 71 consecutive commits. The finding: 75% of AI coding agents break working code over time. Agents produce initial patches that pass. Problems surface later in the CI loop, as subsequent changes interact with the earlier ones.

The SWE-EVO benchmark, which received broader coverage yesterday, supplies the framing for why. GPT-5 with OpenHands — the best available model on the best available open scaffolding — scores 65% on SWE-Bench Verified when tasks involve a single isolated issue. On SWE-EVO tasks, which require coordinated changes across an average of 21 files over multiple development iterations, the same model scores 21%. The gap between short-horizon and long-horizon performance is 44 percentage points.

And an ETH Zurich paper, covered by InfoQ on March 12, tested the industry's primary response to this problem — giving agents more repository context via AGENTS.md files — and found that LLM-generated context files make things worse in 5 of 8 evaluation settings, while adding 20% or more to cost.

The picture these three papers paint together is specific and important: AI coding agents fail systematically at long-horizon engineering tasks, the performance gap is larger than benchmarks suggest, and the standard fix makes the problem worse.

What Each Paper Shows

SWE-CI is the most architecturally significant of the three. Previous benchmarks measured whether an agent could resolve a single GitHub issue. SWE-CI measures something closer to what engineering actually looks like: can an agent maintain code quality across dozens of consecutive changes, over a development period measured in months?

The 75% regression rate answers that question with uncomfortable precision. "Break working code" here means that code which passed the CI suite before the agent's changes fails afterward — not immediately, but as the sequence of agent-driven modifications accumulates. Agents produce patches that are locally correct. The regression appears later, when a subsequent change interacts with an assumption the earlier patch silently introduced.

This is architectural drift made measurable. And crucially, it is the majority outcome, not an edge case.

SWE-EVO documents the same phenomenon from a different angle. The 44-point gap between Verified (65%) and long-horizon (21%) is not a gap in model intelligence — GPT-5 is a frontier model. It is a gap in coordination capability. When a task requires interpreting high-level requirements, planning changes across 21 files, and iterating while preserving existing functionality, current agents collapse.

The ETH Zurich study then shows that the natural solution engineers reach for — give the agent better instructions — is empirically counterproductive. More context, more instructions, more detail: the LLM-generated AGENTS.md files that practitioners spend time crafting reduce success rates while inflating costs.

The Deeper Pattern

These three findings share an underlying cause.

Current AI coding agents are optimized for a specific type of task: read the issue, find the relevant file, make the change, verify the tests pass. This is what SWE-Bench Verified measures. It is also the narrowest possible slice of real software engineering.

What the agents are not equipped for is reasoning about the consequences of their changes for the rest of the system. When an agent modifies a data model in file A, does it understand that file B makes assumptions about that model's structure, that file C has a cache that will be invalidated, and that the test suite for file D does not cover the interaction between A and D's data path? The answer, empirically, is no — and SWE-CI proves it by measuring what happens to the system when the agent does not track these consequences across a sequence of PRs.

The AGENTS.md finding is instructive about why adding instructions does not fix this. An instruction file can tell an agent "follow the authentication pattern in auth.py." It cannot tell the agent, for every new endpoint it creates, whether the authentication pattern has been applied — because the agent would have to check the entire application state to answer that question, and a text file is not the right interface for checking application state.

What the agent actually needs is not more instructions. It is a structured model of the system — what components exist, what constraints they must satisfy, what invariants must hold across the whole — and the ability to query that model during development. Not read it once at the start of a session. Query it, at each decision point, to verify that the action it is about to take is consistent with the system's architectural rules.

This is the distinction between unstructured context (what AGENTS.md provides) and structured constraint queries (what a component registry and architectural truth layer provide). ETH Zurich proves the former hurts. The SWE-CI and SWE-EVO data prove the problem it was supposed to solve is real and large.
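What a structured constraint query might look like can be sketched in a few lines. Everything here is hypothetical — the registry shape, the constraint names, and the `query`/`verify` interface are illustrations of the idea, not any existing product's API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Constraint:
    """A rule a component must satisfy, e.g. 'endpoints require auth'."""
    name: str
    check: Callable[[dict], bool]  # takes a proposed change, returns pass/fail

@dataclass
class Registry:
    """Hypothetical architectural truth layer: components and their constraints."""
    constraints: dict = field(default_factory=dict)  # component -> [Constraint]

    def register(self, component: str, constraint: Constraint) -> None:
        self.constraints.setdefault(component, []).append(constraint)

    def query(self, component: str) -> list[str]:
        """Answer 'what constraints apply to this component?'"""
        return [c.name for c in self.constraints.get(component, [])]

    def verify(self, component: str, change: dict) -> list[str]:
        """Return the constraints a proposed change would violate."""
        return [c.name for c in self.constraints.get(component, [])
                if not c.check(change)]

# Register one invariant: every new endpoint must declare an auth handler.
registry = Registry()
registry.register("api", Constraint(
    "endpoints-require-auth",
    lambda change: change.get("auth_handler") is not None,
))

# An agent queries at the decision point, instead of re-reading a context file.
registry.query("api")                                  # ['endpoints-require-auth']
registry.verify("api", {"path": "/orders"})            # violation reported
registry.verify("api", {"path": "/orders",
                        "auth_handler": "auth.check"})  # no violations
```

The point of the sketch is the interface, not the implementation: the agent asks a specific question and gets a machine-checkable answer, rather than a paragraph of prose it may or may not apply correctly.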

Why This Matters for Engineers and Builders

If you are building software with AI coding agents, these findings have specific operational implications.

The benchmark scores you use to select your model or tooling are measured on single-issue tasks. The engineering you are actually doing is long-horizon. If you are running agents on a codebase over weeks or months — which, increasingly, is exactly what teams are doing — the relevant performance number is not 65%. It is closer to 21%, and the 75% regression rate applies to your codebase too.

The AGENTS.md files you have carefully crafted are, with high probability, making your agents slightly worse. The ETH Zurich recommendation is to omit LLM-generated context files entirely, and to keep human-written instructions minimal — non-inferable details only, such as specific build commands or tooling configurations. Everything else you are putting in AGENTS.md is noise the agent is spending tokens and attention to process, at a cost to reasoning quality.

And the CI integration you run to verify agent output is catching less than you probably assume. SWE-CI's 75% regression rate is a statement about the limitations of existing CI as a verification mechanism for agent-driven development. Tests catch what tests cover. They do not catch architectural drift in paths that lack test coverage — which is most of the real failure surface.

What Most People Miss

The natural reaction to the 75% regression finding is to ask which agent is safest — which one breaks the least code. SWE-CI provides that data: Claude Opus leads throughout the evaluation period, with GLM-5 as a strong second. But the more important observation is that even the leading model still breaks code over time. The problem is not which agent you choose. The problem is that the failure mode is structural, not agent-specific.

What most teams miss is that the 75% regression rate is a property of the interaction between agents and a codebase over time, not a property of any individual agent. The agent makes a locally correct decision. Then another agent, or the same agent in a future session, makes another locally correct decision. The regression is in the interaction between the two — and no agent currently has a mechanism for tracking that interaction across sessions and PRs.

This is why better models, better prompts, and richer context files do not solve the problem. The failure is in the architecture of how agents relate to the systems they build. The fix is infrastructure that maintains a persistent, authoritative model of the system's state and constraints — one that every agent queries before acting and that every verification step checks afterward.

What the Research Points Toward

The SWE-CI paper introduces a new evaluation paradigm: measure agent quality not within a single PR but across the CI loop over a development history. This is the right framework. Behavioral PR review catches bugs in the current change. CI-loop verification tracks whether the system degrades over time. Architectural constraint enforcement tracks whether the system's structural rules are preserved.

These are three distinct verification layers. Behavioral review (Claude Code Review) is now productized. CI-loop verification is now the research standard. Architectural constraint enforcement remains the open gap.

The ETH Zurich finding points to the interface design for closing that gap: not richer text instructions, but structured queries. Agents asking specific, structured questions — "what constraints apply to this component?", "is this pattern consistently applied across the system?" — and receiving precise, structured answers from an authoritative truth layer.
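The second of those queries can also be made concrete. This is a sketch under stated assumptions — the data shape (each component mapped to the set of patterns it applies) and the function name are invented for illustration:

```python
def pattern_consistency(components: dict[str, set[str]], pattern: str) -> dict:
    """Answer 'is this pattern consistently applied across the system?'
    with a precise, structured result rather than prose."""
    missing = sorted(name for name, applied in components.items()
                     if pattern not in applied)
    return {"pattern": pattern,
            "consistent": not missing,
            "missing_from": missing}

# Hypothetical snapshot of system state, assumed to be maintained
# by the truth layer rather than described in a context file.
system = {
    "orders_endpoint":  {"auth", "rate-limit"},
    "users_endpoint":   {"auth"},
    "reports_endpoint": {"rate-limit"},  # auth pattern never applied here
}

pattern_consistency(system, "auth")
# -> {'pattern': 'auth', 'consistent': False, 'missing_from': ['reports_endpoint']}
```

An answer of this shape is directly actionable: the agent knows exactly which component violates the pattern before it repeats the omission in the code it is about to write.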

The 75% regression rate is not the final state of AI-driven software development. It is the measurement of where the field currently is, without the infrastructure that would prevent it. Building that infrastructure is the next meaningful problem in the space.

Early Access

Build with architecture-aware AI.

Pilaro is in early access. Join the waitlist and we'll reach out when your spot is ready.

Join the Waitlist