How Do We Make Codebases Agent-Ready?
The push to bring autonomy to software engineering has become one of the most significant transformations in how we build software. While phrases like "autonomous engineering" can sound buzzwordy, there are concrete insights here for organizations building products in this space, teams adopting AI coding tools, and anyone trying to make their engineering org successful with agents.
A key point up front: none of this is specific to any single product. These are universal patterns that will shape how software gets built, and they apply to whatever AI tools you're using.
Software 2.0: A New Programming Paradigm
Andrej Karpathy's framework for understanding AI's impact on programming provides a useful mental model. The shift from Software 1.0 to Software 2.0 represents a fundamental change in how we think about building systems.
Software 1.0: Automation via Specification
- Write explicit algorithms by hand
- If you can specify the rules, you can automate it
- Key question: "Is the algorithm fixed and easy to specify?"
Software 2.0: Automation via Verification
- Specify objectives and search program space via gradient descent
- If you can verify it, you can optimize it
- Key question: "Is the task verifiable?"
"The environment has to be resettable (you can start a new attempt), efficient (a lot of attempts can be made), and rewardable (there is some automated process to reward any specific attempt that was made)." — Andrej Karpathy
The most interesting thing here is that the frontier of what AI systems can solve is really just a function of whether you can specify an objective and search the space of possible solutions. We're used to building software purely via specification: the algorithm does this, input is X, output is Y. Shifting to automation via verification opens up different possibilities for what we can build.
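To make those three properties concrete, here's a minimal sketch in Python. The interface and names (`VerifiableEnvironment`, `attempt`, `reward`) are my own illustration, not tied to any particular framework:

```python
from typing import Protocol


class VerifiableEnvironment(Protocol):
    """Illustrative interface for automation via verification."""

    def reset(self) -> None:
        """Resettable: start a fresh attempt from a known state."""
        ...

    def attempt(self, candidate: str) -> None:
        """Efficient: applying a candidate must be cheap enough to run many times."""
        ...

    def reward(self) -> float:
        """Rewardable: automatically score the current attempt, e.g. 0.0-1.0."""
        ...


def search(env: VerifiableEnvironment, candidates: list[str]) -> tuple[str, float]:
    """Try many candidates and keep the best-scoring one."""
    best, best_score = "", float("-inf")
    for candidate in candidates:
        env.reset()
        env.attempt(candidate)
        score = env.reward()
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Anything you can wrap in this shape, you can throw search at.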
The Asymmetry of Verification
There's a concept that's pretty intuitive to anyone familiar with P versus NP: many tasks are much easier to verify than they are to solve. This asymmetry is fundamental to understanding where AI can excel.
| Verification Type | Examples |
|---|---|
| Easy to Verify | Sudoku, Math, Code Tests |
| Symmetric (verify ≈ solve) | Arithmetic, Data Processing |
| Hard to Verify | Essays, Hypotheses, Creative Work |
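A concrete instance of the asymmetry: checking a finished Sudoku grid takes a dozen lines and runs in microseconds, while producing a solution requires real search. A minimal verifier, as a sketch:

```python
def is_valid_sudoku(grid: list[list[int]]) -> bool:
    """Verify a completed 9x9 Sudoku: every row, column, and 3x3 box must
    contain the digits 1-9 exactly once. Checking is trivial and fast;
    producing the solution in the first place is the hard part."""
    digits = set(range(1, 10))
    rows = (set(row) for row in grid)
    cols = (set(col) for col in zip(*grid))
    boxes = (
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in range(0, 9, 3)
        for c in range(0, 9, 3)
    )
    return all(group == digits for group in (*rows, *cols, *boxes))
```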
Five Properties of Verifiable Tasks
- Objective Truth — Clear correctness criteria
- Fast to Verify — Seconds, not hours
- Scalable — Parallel verification possible
- Low Noise — Strong signal quality
- Continuous Reward — Ability to rank quality (30%, 70%, 100%)
"The ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI." — Jason Wei
The most interesting easy-to-verify problems hit all five properties: there's an objective truth, they're quick to validate, they're scalable (you can validate many in parallel), they have low noise, and they provide continuous signals; not just a binary yes/no, but gradients like 30%, 70%, 100% correct.
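In code, a continuous signal can be as simple as running tests individually and reporting the fraction that passes instead of a single exit code. A rough sketch, assuming pytest is installed and each test ID can run independently:

```python
import subprocess


def graded_reward(test_ids: list[str]) -> float:
    """Continuous reward: run each test on its own and return the fraction
    that passes (0.0-1.0) instead of a single all-or-nothing exit code."""
    if not test_ids:
        return 0.0
    passed = 0
    for test_id in test_ids:
        result = subprocess.run(["pytest", "-q", test_id], capture_output=True)
        passed += result.returncode == 0
    return passed / len(test_ids)

# 7 of 10 tests passing yields 0.7 — a gradient an agent can climb,
# rather than just "the suite failed".
```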
Why Software Development is the Frontier
Software development is highly verifiable. This is the frontier. It's why software development agents are the most advanced agents in the world right now. There's been so much work over the last 20-30 years around automated validation and verification of software.
The Eight Pillars of Verification Infrastructure
| Pillar | Description |
|---|---|
| 01 Testing | Unit, integration, E2E tests |
| 02 Documentation | Specs, APIs, architecture |
| 03 Code Quality | Linters, formatters, types |
| 04 Build Systems | Reproducible compilation |
| 05 Dev Environment | Easy setup, consistency |
| 06 Observability | Logs, metrics, tracing |
| 07 Security | Scanning, policies, secrets |
| 08 Standards | Conventions, patterns, style |
Software engineering has spent decades building verification infrastructure. This accumulated infrastructure makes code one of the most favorable domains for AI agents.
Think of it as a checklist: Do you have automated validation for the format of your code? Do you have linters? For professional software engineers, these seem obvious. But the question is whether you can go a step further.
Do you have linters so opinionated that a coding agent will always produce code at the level your senior engineers would produce? Do you have tests that fail when AI slop is introduced and pass when high-quality AI code is introduced?
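What such an opinionated check might look like in practice: a test that scans the codebase and fails on patterns your senior engineers wouldn't tolerate. The rules below (bare excepts, missing type annotations) and the `src/` layout are illustrative stand-ins; encode your own bar:

```python
# test_conventions.py — an illustrative "opinionated" check, not a standard tool.
import ast
from pathlib import Path

SRC = Path("src")  # hypothetical source directory


def test_no_slop_patterns() -> None:
    violations: list[str] = []
    for path in SRC.rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            # Bare `except:` clauses silently swallow errors.
            if isinstance(node, ast.ExceptHandler) and node.type is None:
                violations.append(f"{path}:{node.lineno} bare except")
            # Public functions must be fully type-annotated.
            if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
                args = node.args.args + node.args.kwonlyargs
                if node.returns is None or any(a.annotation is None for a in args):
                    violations.append(f"{path}:{node.lineno} missing annotations on {node.name}()")
    assert not violations, "Opinionated check failed:\n" + "\n".join(violations)
```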
The Problem: Most Codebases Lack Sufficient Verifiability
These additional layers of validators are things that most codebases actually lack, because humans are pretty good at handling most of this without automated validation.
What Humans Can Handle
- 60% test coverage ("I'll test manually")
- Outdated docs ("I'll ask the team")
- No linters/formatters ("I'll review it")
- Flaky builds ("Just retry it")
- Complex setup ("I'll help onboard")
- Missing observability ("I'll check logs")
- No security scanning ("We'll catch it later")
- Inconsistent patterns ("I know the history")
What Breaks AI Agents
- ✕ No tests → can't validate correctness
- ✕ No docs → makes wrong assumptions
- ✕ No quality checks → generates bad code
- ✕ Unreliable builds → can't verify changes
- ✕ Complex setup → can't reproduce environment
- ✕ No observability → can't debug failures
- ✕ No security checks → introduces vulnerabilities
- ✕ No standards → creates inconsistency
Most organizations have partial infrastructure across the eight pillars. AI agents need systematic coverage to succeed.
Your company may be at 50% or 60% test coverage, and that's good enough because humans will test manually. You may have a flaky build that fails every third time, and everyone secretly hates it but no one says anything. These are accepted norms in large codebases.
As you scale to organizations with thousands of engineers, a bar of maybe 50-60% becomes the norm, and most software orgs can scale like that. But when you start introducing AI agents into your software development lifecycle (not just interactive coding, but review, documentation, testing), these gaps break the agents' capabilities.
From Verification to Specification
The traditional loop of understanding a problem, designing a solution, coding it out, and testing shifts when you have rigorous validation. It becomes a process of specifying constraints, generating solutions, verifying with automated validation and your own intuition, then iterating.
| Approach | Flow |
|---|---|
| Traditional | Understand → Design → Code → Test |
| Specification-Driven | Specify → Generate → Verify → Iterate |
With strong verification, you can search for solutions instead of crafting them by hand.
This move from traditional development to specification-driven development is bleeding into all the different tools. Many tools have spec mode, plan mode, or are entire IDEs oriented around this specification-driven flow.
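In its simplest form, specification-driven development means the spec exists as something executable before the code does. A sketch, where `my_app.text.slugify` is a hypothetical function the agent is expected to implement against these tests:

```python
# test_slugify_spec.py — the spec exists first, as executable tests.
# my_app.text.slugify is hypothetical and may not exist yet; the agent's job
# is to generate an implementation that makes these pass.
import pytest

from my_app.text import slugify


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("Hello, World!", "hello-world"),
        ("  spaces   everywhere  ", "spaces-everywhere"),
        ("Déjà vu", "deja-vu"),
        ("", ""),
    ],
)
def test_slugify_spec(raw: str, expected: str) -> None:
    assert slugify(raw) == expected
```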
Specifications + Validation = Reliable AI-Generated Code
| Approach | Description |
|---|---|
| Traditional | Prompt agent → Generate code → Hope it works (Unreliable, hard to debug, doesn't scale) |
| Specification-Driven | Write specs + tests → Generate code → Validate → Iterate (Reliable, debuggable, scales to complex tasks) |
How It Works:
- Define Specs — Tests, types, expected behavior
- Generate Solutions — Agent produces multiple candidates
- Validate & Select — Run tests, pick the best verified code (see the sketch below)
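A minimal sketch of that loop, where `generate` and `verify` are hypothetical callables standing in for your agent runner and your test/lint suite:

```python
from typing import Callable


def best_verified_candidate(
    generate: Callable[[str, int], str],  # hypothetical: (spec, seed) -> candidate code
    verify: Callable[[str], float],       # hypothetical: runs tests/linters, returns 0.0-1.0
    spec: str,
    n_candidates: int = 5,
    threshold: float = 1.0,
) -> str | None:
    """Generate several candidates for one spec, score each against the
    verification suite, and keep the best one that clears the bar."""
    scored = []
    for seed in range(n_candidates):
        candidate = generate(spec, seed)
        scored.append((verify(candidate), candidate))
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None
```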
When specs, generation, and validation come together, that's how you build reliable, high-quality solutions. So what's the best decision for an organization? Spending 45 days comparing every coding tool to determine that one is slightly better because it scores 10% higher on benchmarks? Or changing organizational practices so that all coding agents can succeed, and then picking the one your developers like?
What Becomes Possible with Strong Verification
When you have validation criteria, you can introduce far more complex AI workflows. If you can't automatically validate whether a PR is reasonably successful or has code that won't break production, you're not going to parallelize several agents at once. You're not going to decompose large-scale modernization projects into subtasks.
Autonomous SDLC Processes
| Process | Description | Verification |
|---|---|---|
| Large Task Decomposition | Migration/modernization projects broken into verifiable subtasks | Per-subtask tests, integration tests, specs |
| Task Parallelization | Run multiple agents in parallel on independent tasks | Isolated test suites, no cross-dependencies |
| Code Review | Automated, context-aware reviews for every PR | Linters, tests, security scans |
| QA & Test Generation | Generate test scenarios, validate edge cases | Tests must pass, coverage metrics |
| Incident Response | Analyzes errors, proposes fixes, validates solutions | Error logs, monitoring metrics, tests |
| Documentation Automation | Keep docs in sync with code | Docs build, examples run, links work |
Each process leverages your verification infrastructure. The better your specs, tests, and validation, the more reliably these processes run.
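For example, task parallelization only pays off when each subtask carries its own isolated verification. A sketch of the pattern, with placeholder `run_agent` and `run_subtask_tests` functions standing in for your agent runner and per-subtask test suites:

```python
from concurrent.futures import ThreadPoolExecutor


def run_agent(task: str) -> str:
    """Placeholder: dispatch one independent subtask to a coding agent and
    return the branch it produced. Wire this to your actual agent runner."""
    return f"agent/{task}"


def run_subtask_tests(branch: str) -> bool:
    """Placeholder: run only the isolated test suite for that branch.
    Wire this to your actual CI or test runner."""
    return True


def parallelize(tasks: list[str]) -> dict[str, bool]:
    """Run independent subtasks in parallel; each is accepted or rejected
    purely on its own verification, so agents never block each other."""
    with ThreadPoolExecutor(max_workers=len(tasks) or 1) as pool:
        branches = list(pool.map(run_agent, tasks))
    return {task: run_subtask_tests(branch) for task, branch in zip(tasks, branches)}
```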
If a single task execution ("I would like to get this done, here's exactly how, here's how to validate") doesn't work nearly 100% of the time, you can forget about successfully using more complex workflows at scale.
The Developer's Evolving Role
For high-quality AI-generated code review, you need documentation your AI systems can draw on. Agents will get better at figuring out whether to run lint or tests. They'll get better at finding solutions without explicit pointers. They'll improve at search. But they won't get better at conjuring validation criteria out of thin air.
This is why we believe software developers will continue to be heavily involved in building software. The role shifts to curating the environment, the garden that software grows from. You're setting the constraints. You're building automations and introducing ever more opinionated structure into these systems.
If your company doesn't have all eight pillars of verification, there's a lot of work you can do without any procurement cycle or new tools.
The Virtuous Cycle
Here's a quote I heard recently from an engineer who has taken AI systems very seriously: "A slop test is better than no test." Slightly controversial, but the argument is that just having something there (something that passes when changes are correct and roughly matches the spec) means people will enhance it. They'll upgrade it. And other agents will notice these tests. They'll follow the patterns. The more opinionated you get, the faster the cycle continues.
What you should be thinking about is: which feedback loops in your organization are you cultivating? Better agents make the environment better, which makes the agents better, which gives you more time to make the environment better. This is the new DevX loop that organizations can invest in, and it enhances every tool you procure.
It shifts your mental model about what you're investing in. Instead of just thinking of headcount as the input to engineering projects ("we need 10 more people to solve this problem"), you can now invest in this environment feedback loop that makes people significantly more successful. One opinionated engineer leaning into this way of working can meaningfully change the velocity of the entire business.
The Path Forward
The best coding agents take advantage of these validation loops. If your coding agent isn't proactively seeking out linters, tests, and other validation criteria, it won't be as good as one that does.
Four Steps to Autonomous Systems
- Assess — Measure your verification infrastructure across all eight pillars (a rough sketch follows this list)
- Improve — Systematically enhance verifiability through automated fixes
- Deploy — Roll out AI agents that leverage your specifications
- Iterate — Continuous feedback loop improves both specs and agents
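The first step can start as something embarrassingly simple. A rough self-assessment sketch that checks a repo for surface-level signals of the eight pillars; the marker files are common conventions rather than a standard, so adjust them to your stack:

```python
# assess_pillars.py — a rough self-assessment, not a standard tool.
from pathlib import Path

PILLAR_MARKERS = {
    "Testing": ["tests", "pytest.ini", "jest.config.js"],
    "Documentation": ["docs", "README.md", "openapi.yaml"],
    "Code Quality": ["ruff.toml", ".eslintrc.json", ".pre-commit-config.yaml"],
    "Build Systems": ["Makefile", "pyproject.toml", "Dockerfile"],
    "Dev Environment": [".devcontainer", "docker-compose.yml", ".envrc"],
    "Observability": ["otel-collector.yaml", "logging.conf"],
    "Security": [".github/dependabot.yml", "SECURITY.md"],
    "Standards": ["CONTRIBUTING.md", ".editorconfig", "CODEOWNERS"],
}


def assess(repo: Path) -> dict[str, bool]:
    """Return which pillars have at least one visible marker in the repo."""
    return {
        pillar: any((repo / marker).exists() for marker in markers)
        for pillar, markers in PILLAR_MARKERS.items()
    }


if __name__ == "__main__":
    coverage = assess(Path("."))
    for pillar, present in coverage.items():
        mark = "OK " if present else "-- "
        print(mark + pillar)
    print(f"{sum(coverage.values())}/8 pillars have visible infrastructure")
```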
Key Takeaways
- Verifiability is the key to AI agent effectiveness
- Specifications enable modular, reliable AI systems
- Software development is highly verifiable when infrastructure exists
- Results are measurable through objective metrics
- Most codebases lack sufficient verification infrastructure
- The investment today compounds into 5-7x improvements
The Future Is Now
We are still really early in our journey of using software development agents. But imagine this world: the moment a customer issue comes in and a bug is filed, that ticket is picked up, a coding agent executes on it, feedback is presented to a developer, they click approve, the code is merged and deployed to production. All in a feedback loop that takes maybe an hour or two.
That will be possible. We're all skeptical about fully autonomous flows, but it's technically feasible today. The limiter is not the capability of the coding agent; it's your organization's validation criteria.
This is an investment that, made today, will make your organization not 1.5x or 2x better, but 5x, 6x, 7x better. It's an unfortunate story in some ways, because it means you have to invest in it. It's not something AI will magically give you. It's a choice.
If you make it now, you will be in the top 1-5% of organizations in terms of engineering velocity. You will out-compete everybody else in the field.


