A response to Chris Riccomini’s “The Waymo Rule for AI-Generated Code.”
Chris Riccomini argues that AI-generated code should be held to the same bar as a Waymo: it doesn’t have to be perfect, just better than you. He’s moved past formal-methods-plus-AI and landed somewhere pragmatic: English specs, ordinary languages, ordinary tests, plus a layered, Antithesis-style QA stack. It’s a good stack.
One sentence caught my attention: “It has earned my trust.” He’s referring to GPT-5.4’s capabilities. That sentence is both the premise behind the Waymo rule and the surface reason he dismisses formal methods: not because verification is wrong in principle, but because AI is now good enough, and the QA stack good enough with it, that the overhead isn’t worth it.
I think that last assumption, that the overhead isn’t worth it because the alternatives are good enough, is what actually drives his conclusion, and it is worth examining.
The trust feedback loop
Earned trust compounds. As capability rises, trust rises with it — and review falls. Review is expensive; if you trust, why spend resources on it? The scope creeps outward: what you trusted for UI logic, you start trusting for the auth layer. The test suite gets entrusted to AI. Automated reviews pit model against model. Many bugs are caught because many bugs are written.
Chris sees this and his prescription is the Antithesis stack: property-based testing, deterministic simulation, etc. These tools are genuinely powerful — they catch a class of bug that neither AI-generated tests nor human review reliably find. Short term, we want more of this, not less. So far so good.
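To make the idea concrete, here is a minimal sketch of what property-based testing does, hand-rolled in TypeScript rather than using Antithesis's actual tooling (the generator, the seed, and the property under test are all illustrative): instead of asserting on a handful of hand-picked inputs, you generate many random ones and check an invariant over all of them, with a deterministic seed so failures reproduce.

```typescript
// Minimal property-based testing sketch (illustrative, not any real tool's API).
// Generate many random inputs and assert an invariant over all of them.

function makeRng(seed: number): () => number {
  // Tiny deterministic LCG so every failure is reproducible -- the same
  // idea that deterministic simulation testing scales up to whole systems.
  let state = seed;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

function randomArray(rng: () => number, maxLen = 20): number[] {
  const len = Math.floor(rng() * maxLen);
  return Array.from({ length: len }, () => Math.floor(rng() * 1000));
}

// Property: reversing an array twice is the identity.
function propertyHolds(trials: number): boolean {
  const rng = makeRng(42);
  for (let i = 0; i < trials; i++) {
    const xs = randomArray(rng);
    const roundTripped = [...xs].reverse().reverse();
    if (JSON.stringify(roundTripped) !== JSON.stringify(xs)) return false;
  }
  return true;
}

console.log(propertyHolds(1000)); // → true
```

A single run exercises a thousand inputs a human would never write down, which is exactly the class of coverage neither AI-generated unit tests nor human review reliably provides.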
But this kind of QA finds bugs after the code is written and running. Formal verification makes the most expensive, most devastating classes of bugs (broken state machines, data corruption, business-rule violations, etc.) impossible by construction.
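A small compile-time analogue of “impossible by construction,” sketched in TypeScript (illustrative only; tools like Dafny prove far richer properties than a type system can): model a lifecycle as a discriminated union so an illegal transition fails to type-check rather than failing in production.

```typescript
// Hypothetical order lifecycle. "Shipped without payment" is not a
// representable value, so that bug cannot be written, only rejected
// by the compiler -- a by-construction guarantee, not an after-the-fact test.

type Order =
  | { state: "draft"; items: string[] }
  | { state: "paid"; items: string[]; paymentId: string }
  | { state: "shipped"; items: string[]; paymentId: string; tracking: string };

// The only way to obtain a "shipped" order is through a "paid" one.
function ship(order: Extract<Order, { state: "paid" }>, tracking: string): Order {
  return { ...order, state: "shipped", tracking };
}

const paid: Order = { state: "paid", items: ["book"], paymentId: "p_1" };
if (paid.state === "paid") {
  const shipped = ship(paid, "TRK123");
  console.log(shipped.state); // → "shipped"
}
// ship({ state: "draft", items: [] }, "TRK124") would not compile.
```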
The gap in Chris’ argument
His move away from the promise of formal methods + AI rests on three beliefs:
- the specification language is just English
- the AI’s programming languages are just the languages we have today
- the verification is just tests
Two of these are already true of the formal-methods path. The third is not mutually exclusive with it.
English specs. LLMs are capable of formalizing English into formal specifications. We’re still optimizing the translation, but the capability is there. Neither people nor AI need to learn a new spec language.
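To show the shape of that translation with a hypothetical example (not any real tool’s output): the English requirement “a withdrawal never drives an account balance below zero” formalizes as an invariant over states and transitions. In a spec language this would be a proved property; here it is written as an executable TypeScript predicate just to make the target of the English-to-spec step visible.

```typescript
// English: "A withdrawal never drives an account balance below zero."
// Formalized: for all accounts with balance >= 0 and all amounts,
// withdraw(acct, amount).balance >= 0.

interface Account { balance: number }

function withdraw(acct: Account, amount: number): Account {
  // The guard the invariant forces into existence: reject rather than overdraw.
  if (amount < 0 || amount > acct.balance) return acct;
  return { balance: acct.balance - amount };
}

// The invariant as a checkable predicate (a verifier would prove it
// for all inputs; here it can only be evaluated per input).
function invariantHolds(acct: Account, amount: number): boolean {
  return withdraw(acct, amount).balance >= 0;
}

console.log(invariantHolds({ balance: 100 }, 40));  // → true
console.log(invariantHolds({ balance: 100 }, 250)); // → true (withdrawal rejected)
```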
Familiar languages. Dafny has been around for a while, and AI writes it about as well as it writes a slightly less mainstream language like OCaml. Dafny compiles to JavaScript. We’re also building a compiler that models TypeScript in Dafny. In both cases, the verified code and the running code come from the same source.
Tests as verification. They’re not in competition. Point tests at what can’t be formally verified; build what can be verified from the ground up so that testing resources are not wasted there. The right workflow doesn’t add overhead; it shifts where the work happens, and the most expensive failures never ship to begin with.
The Waymo analogy breaks down
The analogy points at something true: perfect is the wrong bar. But Waymos fail one car at a time. In software, one bug in auth hits many users simultaneously, usually silently at first, and may cascade from there. LLMs also anchor on context: given code with errors, a model may catch them, or it may build on them.
“Better than the median human” is a reasonable bar when failures are independent and don’t self-propagate. It’s less useful when the tail is fat and one bug (that could have been prevented) can delete a company.
Where this leaves us
We’re not against AI-generated code — everything we build is accelerated by it. And testing has its place; there’s more code that can’t be cleanly verified than can be, especially in web dev. We need robust testing (specifically outside the AI loop) in all those places.
But in light of how far formal verification has come, where it can now handle the failure modes that matter most — state machines, permissions, business rules — without the large overhead Chris has implied, this conclusion feels unnecessarily broad:
“I just don’t think that the future of AI-generated code is one in which AIs write code in a language with formal methods properties and then verify it with a model checker.”
The direction is right, but the conclusion is misguided. Writing off formal verification in mainstream AI programming would be reasonable if the cost were so high as to be prohibitive. But the truth is, anyone writing a React app can ship it today with some formal guarantees about their logic, just by installing an npm package.