How Do We Make Codebases Agent-Ready?
The push to bring autonomy to software engineering has become one of the most significant transformations in how we build software. While phrases like "autonomous engineering" can sound buzzwordy, there are concrete insights here for organizations building products in this space, teams adopting AI coding tools, and anyone trying to make their engineering org successful with agents.
A key point up front: none of this is specific to any single product. These are universal patterns that will shape how software gets built, and they apply to whatever AI tools you're using.
Software 2.0: A New Programming Paradigm
Andrej Karpathy's framework for understanding AI's impact on programming provides a useful mental model. The shift from Software 1.0 to Software 2.0 represents a fundamental change in how we think about building systems.
Software 1.0: Automation via Specification
- Write explicit algorithms by hand
- If you can specify the rules, you can automate it
- Key question: "Is the algorithm fixed and easy to specify?"
Software 2.0: Automation via Verification
- Specify objectives and search program space via gradient descent
- If you can verify it, you can optimize it
- Key question: "Is the task verifiable?"
"The environment has to be resettable (you can start a new attempt), efficient (a lot of attempts can be made), and rewardable (there is some automated process to reward any specific attempt that was made)." — Andrej Karpathy
The most interesting thing here is that the frontier of what AI systems can solve is really just a function of whether you can specify an objective and search the space of possible solutions. We're used to building software purely via specification: the algorithm does this, input is X, output is Y. Shifting to automation via verification opens up different possibilities for what we can build.
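To make those three properties concrete, here's a minimal sketch in Python. The interface and names (`VerifiableEnvironment`, `attempt`, `reward`) are my own illustration, not tied to any particular framework:

```python
from typing import Protocol


class VerifiableEnvironment(Protocol):
    """Illustrative interface for automation via verification."""

    def reset(self) -> None:
        """Resettable: start a fresh attempt from a known state."""
        ...

    def attempt(self, candidate: str) -> None:
        """Efficient: applying a candidate must be cheap enough to run many times."""
        ...

    def reward(self) -> float:
        """Rewardable: automatically score the current attempt, e.g. 0.0-1.0."""
        ...


def search(env: VerifiableEnvironment, candidates: list[str]) -> tuple[str, float]:
    """Try many candidates and keep the best-scoring one."""
    best, best_score = "", float("-inf")
    for candidate in candidates:
        env.reset()
        env.attempt(candidate)
        score = env.reward()
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Anything you can wrap in this shape, you can throw search at.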
The Asymmetry of Verification
There's a concept that's pretty intuitive to anyone familiar with P versus NP: many tasks are much easier to verify than they are to solve. This asymmetry is fundamental to understanding where AI can excel.
| Verification Type | Examples |
|---|---|
| Easy to Verify | Sudoku, Math, Code Tests |
| Symmetric (verify ≈ solve) | Arithmetic, Data Processing |
| Hard to Verify | Essays, Hypotheses, Creative Work |
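A concrete instance of the asymmetry: checking a finished Sudoku grid takes a dozen lines and runs in microseconds, while producing a solution requires real search. A minimal verifier, as a sketch:

```python
def is_valid_sudoku(grid: list[list[int]]) -> bool:
    """Verify a completed 9x9 Sudoku: every row, column, and 3x3 box must
    contain the digits 1-9 exactly once. Checking is trivial and fast;
    producing the solution in the first place is the hard part."""
    digits = set(range(1, 10))
    rows = (set(row) for row in grid)
    cols = (set(col) for col in zip(*grid))
    boxes = (
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in range(0, 9, 3)
        for c in range(0, 9, 3)
    )
    return all(group == digits for group in (*rows, *cols, *boxes))
```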
Five Properties of Verifiable Tasks
- Objective Truth — Clear correctness criteria
- Fast to Verify — Seconds, not hours
- Scalable — Parallel verification possible
- Low Noise — Strong signal quality
- Continuous Reward — Ability to rank quality (30%, 70%, 100%)
"The ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI." — Jason Wei
The most interesting easy-to-verify problems hit all five properties: there's an objective truth, they're quick to validate, they're scalable (you can validate many in parallel), they have low noise, and they provide continuous signals; not just a binary yes/no, but gradients like 30%, 70%, 100% correct.
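In code, a continuous signal can be as simple as running tests individually and reporting the fraction that passes instead of a single exit code. A rough sketch, assuming pytest is installed and each test ID can run independently:

```python
import subprocess


def graded_reward(test_ids: list[str]) -> float:
    """Continuous reward: run each test on its own and return the fraction
    that passes (0.0-1.0) instead of a single all-or-nothing exit code."""
    if not test_ids:
        return 0.0
    passed = 0
    for test_id in test_ids:
        result = subprocess.run(["pytest", "-q", test_id], capture_output=True)
        passed += result.returncode == 0
    return passed / len(test_ids)

# 7 of 10 tests passing yields 0.7 — a gradient an agent can climb,
# rather than just "the suite failed".
```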
Why Software Development is the Frontier
Software development is highly verifiable. This is the frontier. It's why software development agents are the most advanced agents in the world right now. There's been so much work over the last 20-30 years around automated validation and verification of software.
The Eight Pillars of Verification Infrastructure
| Pillar | Description |
|---|---|
| 01 Testing | Unit, integration, E2E tests |
| 02 Documentation | Specs, APIs, architecture |
| 03 Code Quality | Linters, formatters, types |
| 04 Build Systems | Reproducible compilation |
| 05 Dev Environment | Easy setup, consistency |
| 06 Observability | Logs, metrics, tracing |
| 07 Security | Scanning, policies, secrets |
| 08 Standards | Conventions, patterns, style |
Software engineering has spent decades building verification infrastructure. This accumulated infrastructure makes code one of the most favorable domains for AI agents.
Think of it as a checklist: Do you have automated validation for the format of your code? Do you have linters? For professional software engineers, these seem obvious. But the question is whether you can go a step further.
Do you have linters so opinionated that a coding agent will always produce code at the level your senior engineers would produce? Do you have tests that fail when AI slop is introduced and pass when high-quality AI code is introduced?
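What such an opinionated check might look like in practice: a test that scans the codebase and fails on patterns your senior engineers wouldn't tolerate. The rules below (bare excepts, missing type annotations) and the `src/` layout are illustrative stand-ins; encode your own bar:

```python
# test_conventions.py — an illustrative "opinionated" check, not a standard tool.
import ast
from pathlib import Path

SRC = Path("src")  # hypothetical source directory


def test_no_slop_patterns() -> None:
    violations: list[str] = []
    for path in SRC.rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            # Bare `except:` clauses silently swallow errors.
            if isinstance(node, ast.ExceptHandler) and node.type is None:
                violations.append(f"{path}:{node.lineno} bare except")
            # Public functions must be fully type-annotated.
            if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
                args = node.args.args + node.args.kwonlyargs
                if node.returns is None or any(a.annotation is None for a in args):
                    violations.append(f"{path}:{node.lineno} missing annotations on {node.name}()")
    assert not violations, "Opinionated check failed:\n" + "\n".join(violations)
```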
The Problem: Most Codebases Lack Sufficient Verifiability
These additional layers of validators are things that most codebases actually lack, because humans are pretty good at handling most of this without automated validation.
What Humans Can Handle
- 60% test coverage ("I'll test manually")
- Outdated docs ("I'll ask the team")
- No linters/formatters ("I'll review it")
- Flaky builds ("Just retry it")
- Complex setup ("I'll help onboard")
- Missing observability ("I'll check logs")
- No security scanning ("We'll catch it later")
- Inconsistent patterns ("I know the history")
What Breaks AI Agents
- ✕ No tests → can't validate correctness
- ✕ No docs → makes wrong assumptions
- ✕ No quality checks → generates bad code
- ✕ Unreliable builds → can't verify changes
- ✕ Complex setup → can't reproduce environment
- ✕ No observability → can't debug failures
- ✕ No security checks → introduces vulnerabilities
- ✕ No standards → creates inconsistency
Most organizations have partial infrastructure across the eight pillars. AI agents need systematic coverage to succeed.
Your company may be at 50% or 60% test coverage, and that's good enough because humans will test manually. You may have a flaky build that fails every third time, and everyone secretly hates it but no one says anything. These are accepted norms in large codebases.
As you scale to organizations with thousands of engineers, a bar of maybe 50-60% becomes the norm, and most software orgs can scale like that. But when you start introducing AI agents into your software development lifecycle (not just interactive coding, but review, documentation, testing), these gaps break the agents' capabilities.
From Verification to Specification
The traditional loop of understanding a problem, designing a solution, coding it out, and testing shifts when you have rigorous validation. It becomes a process of specifying constraints, generating solutions, verifying with automated validation and your own intuition, then iterating.
| Approach | Flow |
|---|---|
| Traditional | Understand → Design → Code → Test |
| Specification-Driven | Specify → Generate → Verify → Iterate |
With strong verification, you can search for solutions instead of crafting them by hand.
This move from traditional development to specification-driven development is bleeding into all the different tools. Many tools have spec mode, plan mode, or are entire IDEs oriented around this specification-driven flow.
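In its simplest form, specification-driven development means the spec exists as something executable before the code does. A sketch, where `my_app.text.slugify` is a hypothetical function the agent is expected to implement against these tests:

```python
# test_slugify_spec.py — the spec exists first, as executable tests.
# my_app.text.slugify is hypothetical and may not exist yet; the agent's job
# is to generate an implementation that makes these pass.
import pytest

from my_app.text import slugify


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("Hello, World!", "hello-world"),
        ("  spaces   everywhere  ", "spaces-everywhere"),
        ("Déjà vu", "deja-vu"),
        ("", ""),
    ],
)
def test_slugify_spec(raw: str, expected: str) -> None:
    assert slugify(raw) == expected
```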
Specifications + Validation = Reliable AI-Generated Code
| Approach | Description |
|---|---|
| Traditional | Prompt agent → Generate code → Hope it works (Unreliable, hard to debug, doesn't scale) |
| Specification-Driven | Write specs + tests → Generate code → Validate → Iterate (Reliable, debuggable, scales to complex tasks) |
How It Works:
- Define Specs — Tests, types, expected behavior
- Generate Solutions — Agent produces multiple candidates
- Validate & Select — Run tests, pick the best verified code (see the sketch below)
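A minimal sketch of that loop, where `generate` and `verify` are hypothetical callables standing in for your agent runner and your test/lint suite:

```python
from typing import Callable


def best_verified_candidate(
    generate: Callable[[str, int], str],  # hypothetical: (spec, seed) -> candidate code
    verify: Callable[[str], float],       # hypothetical: runs tests/linters, returns 0.0-1.0
    spec: str,
    n_candidates: int = 5,
    threshold: float = 1.0,
) -> str | None:
    """Generate several candidates for one spec, score each against the
    verification suite, and keep the best one that clears the bar."""
    scored = []
    for seed in range(n_candidates):
        candidate = generate(spec, seed)
        scored.append((verify(candidate), candidate))
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None
```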
When specs, generation, and validation come together, that's how you build reliable, high-quality solutions. So what's the best decision for an organization? Spending 45 days comparing every coding tool to determine that one is slightly better because it scores 10% higher on benchmarks? Or changing organizational practices so that all coding agents can succeed, and then picking the one your developers like?
What Becomes Possible with Strong Verification
When you have validation criteria, you can introduce far more complex AI workflows. If you can't automatically validate whether a PR is reasonably successful or has code that won't break production, you're not going to parallelize several agents at once. You're not going to decompose large-scale modernization projects into subtasks.
Autonomous SDLC Processes
| Process | Description | Verification |
|---|---|---|
| Large Task Decomposition | Migration/modernization projects broken into verifiable subtasks | Per-subtask tests, integration tests, specs |
| Task Parallelization | Run multiple agents in parallel on independent tasks | Isolated test suites, no cross-dependencies |
| Code Review | Automated, context-aware reviews for every PR | Linters, tests, security scans |
| QA & Test Generation | Generate test scenarios, validate edge cases | Tests must pass, coverage metrics |
| Incident Response | Analyzes errors, proposes fixes, validates solutions | Error logs, monitoring metrics, tests |
| Documentation Automation | Keep docs in sync with code | Docs build, examples run, links work |
Each process leverages your verification infrastructure. The better your specs, tests, and validation, the more reliably these processes run.
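For example, task parallelization only pays off when each subtask carries its own isolated verification. A sketch of the pattern, with placeholder `run_agent` and `run_subtask_tests` functions standing in for your agent runner and per-subtask test suites:

```python
from concurrent.futures import ThreadPoolExecutor


def run_agent(task: str) -> str:
    """Placeholder: dispatch one independent subtask to a coding agent and
    return the branch it produced. Wire this to your actual agent runner."""
    return f"agent/{task}"


def run_subtask_tests(branch: str) -> bool:
    """Placeholder: run only the isolated test suite for that branch.
    Wire this to your actual CI or test runner."""
    return True


def parallelize(tasks: list[str]) -> dict[str, bool]:
    """Run independent subtasks in parallel; each is accepted or rejected
    purely on its own verification, so agents never block each other."""
    with ThreadPoolExecutor(max_workers=len(tasks) or 1) as pool:
        branches = list(pool.map(run_agent, tasks))
    return {task: run_subtask_tests(branch) for task, branch in zip(tasks, branches)}
```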
If a single task execution ("I would like to get this done, here's exactly how, here's how to validate") doesn't work nearly 100% of the time, you can forget about successfully using more complex workflows at scale.
The Developer's Evolving Role
For high-quality AI-generated code review, you need documentation your AI systems can draw on. Agents will get better at figuring out whether to run lint or tests. They'll get better at finding solutions without explicit pointers. They'll improve at search. But they won't get better at conjuring validation criteria out of thin air.
This is why we believe software developers will continue to be heavily involved in building software. The role shifts to curating the environment, the garden that software grows from. You're setting the constraints. You're building automations and introducing ever more opinionated structure into these systems.
If your company doesn't have all eight pillars of verification, there's a lot of work you can do without any procurement cycle or new tools.
The Virtuous Cycle
Here's a quote I heard recently from an engineer who has taken AI systems very seriously: "A slop test is better than no test." Slightly controversial, but the argument is that just having something there (something that passes when changes are correct and roughly matches the spec) means people will enhance it. They'll upgrade it. And other agents will notice these tests. They'll follow the patterns. The more opinionated you get, the faster the cycle continues.
What you should be thinking about is: which feedback loops in your organization are you cultivating? Better agents make the environment better, which makes the agents better, which gives you more time to make the environment better. This is the new DevX loop that organizations can invest in, and it enhances every tool you procure.
It shifts your mental model about what you're investing in. Instead of just thinking of headcount as the input to engineering projects ("we need 10 more people to solve this problem"), you can now invest in this environment feedback loop that makes people significantly more successful. One opinionated engineer leaning into this way of working can meaningfully change the velocity of the entire business.
The Path Forward
The best coding agents take advantage of these validation loops. If your coding agent isn't proactively seeking out linters, tests, and other validation criteria, it won't be as good as one that does.
Four Steps to Autonomous Systems
- Assess — Measure your verification infrastructure across all eight pillars (a rough sketch follows this list)
- Improve — Systematically enhance verifiability through automated fixes
- Deploy — Roll out AI agents that leverage your specifications
- Iterate — Continuous feedback loop improves both specs and agents
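The first step can start as something embarrassingly simple. A rough self-assessment sketch that checks a repo for surface-level signals of the eight pillars; the marker files are common conventions rather than a standard, so adjust them to your stack:

```python
# assess_pillars.py — a rough self-assessment, not a standard tool.
from pathlib import Path

PILLAR_MARKERS = {
    "Testing": ["tests", "pytest.ini", "jest.config.js"],
    "Documentation": ["docs", "README.md", "openapi.yaml"],
    "Code Quality": ["ruff.toml", ".eslintrc.json", ".pre-commit-config.yaml"],
    "Build Systems": ["Makefile", "pyproject.toml", "Dockerfile"],
    "Dev Environment": [".devcontainer", "docker-compose.yml", ".envrc"],
    "Observability": ["otel-collector.yaml", "logging.conf"],
    "Security": [".github/dependabot.yml", "SECURITY.md"],
    "Standards": ["CONTRIBUTING.md", ".editorconfig", "CODEOWNERS"],
}


def assess(repo: Path) -> dict[str, bool]:
    """Return which pillars have at least one visible marker in the repo."""
    return {
        pillar: any((repo / marker).exists() for marker in markers)
        for pillar, markers in PILLAR_MARKERS.items()
    }


if __name__ == "__main__":
    coverage = assess(Path("."))
    for pillar, present in coverage.items():
        mark = "OK " if present else "-- "
        print(mark + pillar)
    print(f"{sum(coverage.values())}/8 pillars have visible infrastructure")
```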
Key Takeaways
- Verifiability is the key to AI agent effectiveness
- Specifications enable modular, reliable AI systems
- Software development is highly verifiable when infrastructure exists
- Results are measurable through objective metrics
- Most codebases lack sufficient verification infrastructure
- The investment today compounds into 5-7x improvements
The Future Is Now
We are still really early in our journey of using software development agents. But imagine this world: the moment a customer issue comes in and a bug is filed, that ticket is picked up, a coding agent executes on it, feedback is presented to a developer, they click approve, the code is merged and deployed to production. All in a feedback loop that takes maybe an hour or two.
That will be possible. We're all skeptical about fully autonomous flows, but it's technically feasible today. The limiter is not the capability of the coding agent; it's your organization's validation criteria.
This is an investment that, made today, will make your organization not 1.5x or 2x better, but 5x, 6x, 7x better. It's an unfortunate story in some ways, because it means you have to invest in it. It's not something AI will magically give you. It's a choice.
If you make it now, you will be in the top 1-5% of organizations in terms of engineering velocity. You will out-compete everybody else in the field.


