Carnegie Mellon researchers just published a study that changes how engineering teams should think about autonomous coding agents. The key finding is less about what agents can do than about what they leave behind.
The paper, "AI IDEs or Autonomous Agents? Measuring the Impact of Coding Agents on Software Development," presented at MSR 2026, is the first large-scale causal study of autonomous agents on real open-source repositories. The researchers tracked 401 repositories that adopted autonomous agents (Claude Code, Codex, Devin, Jules, and others) as their first observable AI tool, and 117 repositories that added agents after already using AI IDE assistants like Copilot or Cursor. They used staggered difference-in-differences with propensity score matching to isolate the impact of agent adoption on both velocity and quality over six months.
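The difference-in-differences logic behind that design can be illustrated with a toy two-group version. This is a minimal sketch on synthetic data, not the paper's specification: the real study uses staggered adoption timing, propensity score matching, and actual repository metrics, all assumed away here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic panel: 200 repos observed for 12 months.
# Half "adopt an agent" at month 6 (a single shared adoption date,
# unlike the paper's staggered design).
n_repos, n_months = 200, 12
repo = np.repeat(np.arange(n_repos), n_months)
month = np.tile(np.arange(n_months), n_repos)
treated = (repo < n_repos // 2).astype(float)  # agent-adopting repos
post = (month >= 6).astype(float)              # months after adoption

# Outcome: per-repo baseline + shared time trend + a true adoption
# effect of +30 units + noise. DiD should recover the +30.
baseline = rng.normal(100, 10, n_repos)[repo]
true_effect = 30.0
y = (baseline + 2.0 * month
     + true_effect * treated * post
     + rng.normal(0, 5, repo.size))

# OLS: y ~ 1 + treated + post + treated:post. With two binary factors,
# the interaction coefficient equals the difference-in-differences of
# the four cell means -- the causal estimate under parallel trends.
X = np.column_stack([np.ones_like(y), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"DiD estimate: {coef[3]:.1f} (true effect: {true_effect})")
```

The per-repo baseline and the shared time trend both cancel out of the interaction term, which is exactly why the design can separate adoption effects from pre-existing differences between repositories.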
The velocity result gets attention. The quality result matters more.
The Velocity Story Is Conditional
If a team has never used AI coding tools before and adopts autonomous agents, the velocity gains are real: +36% commits and +76% lines added on average, sustained through six months. This is not a benchmark score. It is a causal estimate from real repositories with matched controls.
But those gains are conditional. If a team already uses AI IDEs like Copilot or Cursor and then adds autonomous agents, the pattern changes. There is a short-lived bump at adoption (+16-28% commits in the first two months), then a reversion toward zero. By month six, commits are down 35% and lines added are down 61%.
The paper ties this reversal to conditions common in AI-mature repositories: heavier review burden, higher codebase complexity, and more merge friction. In environments that already captured AI-assisted productivity gains, autonomous agents can introduce coordination overhead faster than they add throughput.
For teams already operating with AI IDEs, this is an important signal: the productivity case for autonomous agents is not additive.
[Figure: MSR 2026, agent-first repositories (n=401). "Velocity fades. Complexity debt stays." Percent change from pre-adoption baseline, by month relative to first agentic PR. Callouts: +76% peak velocity gain at adoption; +49% cognitive complexity by month 5; −35% commits in IDE-first repos by month 6.]
Source: Agarwal, He, Vasilescu (2026), "AI IDEs or Autonomous Agents? Measuring the Impact of Coding Agents on Software Development" (MSR 2026).
The Quality Story Is Not Conditional
Here is the universal finding: regardless of prior AI usage and regardless of whether velocity goes up or down, autonomous agent adoption is associated with persistent, compounding quality risk.
Static analysis warnings increase by 18-19% across repositories. Cognitive complexity rises by 35-43%. Both effects are statistically significant, both appear at adoption, and neither regresses over time.
In AI-naive repositories, complexity climbs to roughly +49% by month five. In AI-experienced repositories, elevated complexity appears even before formal agent adoption and continues rising through month six.
The paper names this agent-induced complexity debt: agents accelerate the introduction of code that raises long-term cognitive and maintenance load, even when velocity benefits fade.
This is not primarily duplication debt; the paper finds only small and inconsistent duplication effects. The larger problem is structural complexity: autonomous feature generation spans multiple files and relationships, and the architecture becomes harder to reason about with each cycle.
What Makes This Different From Earlier Warnings
Two points make this study different.
First, it is causal. The difference-in-differences design with propensity score matching controls for repository age, contributor count, stars, activity trajectory, and other confounders. The complexity increase is measured as the effect of agent adoption, not a loose correlation.
Second, it weakens the most common counterargument: "mature teams will handle this better." Repositories with prior AI IDE experience still accumulate comparable complexity debt. Prior AI tool maturity does not protect against agent-induced complexity growth.
That means a "we will grow into it" strategy is not supported by evidence. This is a structural problem that needs structural controls.
The Three-Paper Pattern
Three independent datasets now describe the same failure from different angles.
SWE-CI (Alibaba, arXiv 2603.03823) reports that 75% of AI coding agents break working code over time in long-horizon maintenance tasks. The regression often appears in interactions between consecutive agent-generated PRs.
The ETH Zurich AGENTS.md study found that repository instruction files, often used to improve agent context, reduced performance in most settings and increased cost.
MSR 2026 adds the causal layer: velocity gains are conditional and front-loaded, while quality costs are persistent and compounding.
Taken together, these findings point to the same architectural gap: agents optimize locally, miss globally, and the cost compounds over time.
What This Means Operationally
If a team is new to AI coding tools, velocity gains are real, but so is a complexity tax that starts immediately. The practical question becomes whether review and refactoring capacity can offset that accumulation. Current evidence suggests most teams do not offset it consistently.
If a team already uses AI IDEs, the velocity upside of autonomous agents appears weaker and can become negative over time. The quality case for an architectural constraint layer is just as strong, and often more urgent.
In both cases, safeguards need to sit inside the workflow, not after it. Complexity accumulates PR by PR. Preventing it requires tracking trajectories across PRs, not only checking one PR at a time.
The Open Gap
The MSR 2026 paper is clear about what remains unresolved: it identifies complexity-aware review and architectural enforcement as necessary, but does not evaluate which tools measurably reduce the accumulation slope.
That is now the key product and research question: if +39% complexity growth is the baseline for unmanaged agent adoption, what is the trajectory under enforced architectural constraints at PR time?
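One concrete form "enforced architectural constraints at PR time" could take is a CI check that rejects imports crossing forbidden layer boundaries. The sketch below is a hypothetical illustration: the `myapp.*` layer names and the rule table are invented, and a real gate would load its rules from repo config and run over each changed file.

```python
import ast

# Hypothetical layering rules: prefixes a given layer must not import.
# A real PR-time gate would load these from repository configuration.
FORBIDDEN = {
    "myapp.domain": ("myapp.infrastructure", "myapp.api"),  # domain stays pure
    "myapp.api": ("myapp.infrastructure",),                 # api goes via domain
}

def import_violations(layer: str, source: str) -> list[str]:
    """Return imports in `source` that a module in `layer` may not make."""
    banned = FORBIDDEN.get(layer, ())
    bad = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        bad += [n for n in names if n.startswith(banned)]
    return bad

# A PR-time gate would run this over each changed file and fail the
# build on any violation, before a local win lands on the main branch.
snippet = ("from myapp.infrastructure.db import Session\n"
           "import myapp.domain.rules\n")
print(import_violations("myapp.domain", snippet))
```

Checks of this shape encode architectural intent as machine-enforceable rules, which is the kind of structural control the study's findings argue for; whether they flatten the complexity slope is exactly the unanswered measurement question.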
SWE-CI, ETH Zurich, and MSR 2026 now define the problem with enough precision to narrow the solution space. The answer is not better prompts or more context files. It is infrastructure that maintains an authoritative model of architectural state and enforces constraints before local wins compound into system debt.