Agent-CoEvo: Code and Tests Should Evolve Together — Multi-Agent Framework Outperforms on SWE-bench
A new multi-agent framework called Agent-CoEvo argues that software repair should not be framed as optimizing code under fixed tests, but as coevolving code and tests together — achieving state-of-the-art results on SWE-bench Lite and SWT-bench Lite.
The Problem With Current AI Code Repair
Most LLM-based repair systems use a linear pipeline:
Bug Report → Generate Fix → Run Tests → Pass/Fail
Tests are treated as immutable correctness oracles. But real software engineers don't work this way — they routinely discover that the tests themselves contain bugs, encode missing assumptions, or misinterpret the failure conditions they were meant to capture.
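The fixed-oracle pipeline above can be sketched as a simple loop. All names here (`generate_patch`, `run_tests`) are illustrative stand-ins, not any real system's API:

```python
# Sketch of the conventional linear repair loop: the test suite is a
# fixed oracle that never changes while patches are retried.

def linear_repair(bug_report, tests, generate_patch, run_tests, max_iters=5):
    """Propose patches until the fixed test suite passes, or give up."""
    for _ in range(max_iters):
        patch = generate_patch(bug_report)   # LLM proposes a candidate fix
        if run_tests(patch, tests):          # tests are treated as ground truth
            return patch                     # accepted the moment tests pass
    return None                             # exhausted budget; tests never questioned
```

Note that nothing in this loop can recover if the tests themselves are wrong — the failure mode the paper targets.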
The Insight
"Repository-level issue resolution is fundamentally not optimization under fixed tests, but search over evolving behavioral constraints."
When fixing code, the behavioral constraints (tests) should evolve alongside the fix.
Agent-CoEvo Framework
A coevolutionary multi-agent system where:
- Code agents propose and refine patches
- Test agents propose and refine test modifications
- Mutual evaluation — patches are scored against candidate tests, and tests against candidate patches
- Semantic recombination — the strongest elements of competing candidates are merged
- Iterative refinement — Both code and tests improve together
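The loop described by these bullets can be sketched as follows. This is a minimal illustration in the spirit of the framework, not the paper's implementation: every function name (`propose_patch`, `propose_tests`, `mutual_score`, `recombine`) is an assumed placeholder for an agent or evaluator:

```python
import random

# Hypothetical coevolutionary loop: code candidates and test candidates
# refine each other, are mutually scored, and the best are recombined.

def coevolve(issue, code_pool, test_pool, propose_patch, propose_tests,
             mutual_score, recombine, generations=10):
    """Evolve patches and tests together; return the best (patch, tests) pair."""
    best, best_score = None, float("-inf")
    for _ in range(generations):
        # 1. Each population refines itself against the other's current state.
        code_pool = [propose_patch(issue, p, test_pool) for p in code_pool]
        test_pool = [propose_tests(issue, t, code_pool) for t in test_pool]
        # 2. Mutual evaluation: score every (patch, tests) pairing.
        scored = [((p, t), mutual_score(p, t))
                  for p in code_pool for t in test_pool]
        (p, t), score = max(scored, key=lambda x: x[1])
        if score > best_score:
            best, best_score = (p, t), score
        # 3. Semantic recombination: fold the winners back into both pools.
        code_pool.append(recombine(p, random.choice(code_pool)))
        test_pool.append(recombine(t, random.choice(test_pool)))
    return best
```

The key structural difference from the linear pipeline is that `test_pool` is mutable state inside the search, so a flawed test can be outcompeted rather than blocking every patch.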
Results
| Evaluation | Agent-CoEvo result |
|---|---|
| SWE-bench Lite repair success | State of the art |
| SWT-bench Lite repair success | State of the art |
| Test reproduction quality | Improved |
Why This Matters
- Real-world relevance — Matches how actual engineers fix bugs
- AI code tools — Directly applicable to Claude Code, Codex, Copilot
- SWE-bench — The standard benchmark for AI code repair
- Paradigm shift — From "code-only optimization" to "coevolution of implementation and specification"