Formal verification has never been part of web development culture. We think it should be.
Generative AI has changed who builds software and how. Less technical creators can now ship working products because LLMs can write functional code. Experienced engineers move faster because LLMs handle the tedious parts, freeing them to focus on architecture and design.
Using AI for code generation as it exists today means choosing between two imperfect options: trust the output and iterate as bugs are discovered in production, or review the code manually to a greater or lesser extent, at which point the heavy supervision diminishes the productivity gains.
We recognize and want a third option: specify intent upfront, translate that into a mathematical property, and verify that all generated code satisfies it automatically. For the less technical creator this means far fewer bugs shipped, and for the experienced engineer this means far less supervision beyond the expression of intent and definition of specs.
The human states what should be true. The machine proves that it’s true. Code that cannot be proven to be true does not compile and is not surfaced to the user at all.
To enter this space, we are starting with UI state: foundational to modern web apps, deceptively difficult to get right, and an area where LLMs consistently struggle.
The Core Idea: How Verified State Evolution Works
If every state transition preserves a global invariant, then every state reachable through the system is correct — by construction.
We demonstrate the core idea with one of the most basic examples of UI state: undo/redo. We’re not here to re-invent replay — it’s a solved problem. We’re using replay as an anchor to demonstrate how verification might apply to a familiar methodology, and then imagine how it might extend to harder problems.
Applying verification to replay
Consider a counter that can’t go negative. The state invariant is simple: n ≥ 0. A user increments twice, undoes once, then decrements. What should redo do?
In linear undo (implemented by most applications) redo should do nothing. The decrement created a new timeline; the old future is gone. But a subtle bug (forgetting to clear the future on new actions) would let redo jump back to 2 — a state from a timeline where the user incremented, not decremented. No crash. No invalid state. Just silently wrong.
Now consider the same counter, but with the behavior defined upfront. The flow looks like this:
The human expresses intent in natural language:
“I need a counter for a shopping cart quantity. It can’t go below zero. Undo/redo should work like most apps — if I undo and then do something new, I can’t redo the old action anymore.”
The LLM translates this to a structured spec:
- Model: integer (quantity)
- Invariant: quantity ≥ 0
- Actions: increment, decrement
- Semantics: linear undo — new actions clear the future
The human reviews and confirms or modifies. Then the LLM generates the implementation in Dafny, a verification-aware language. The proof requires:
- Every state in history satisfies `n ≥ 0`
- `Do(action)` sets `future = []`
The second requirement isn’t a lint rule or a code review comment. It’s part of the specification. If the LLM then incorrectly generates `future: h.future` instead of `future: []`, the proof fails. The code doesn’t compile and doesn’t surface for review.
The bug from the previous example — silently jumping to the wrong timeline — is not caught at runtime or by tests. Instead, because that implementation cannot be proven, the code cannot be generated.
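Concretely, a minimal Dafny sketch of those two requirements might look like the following (the names `Hist` and `Do`, and the counter actions, are illustrative assumptions, not the repository’s actual kernel code):

```dafny
// Illustrative sketch only; names and shapes are assumptions.
datatype Hist = Hist(past: seq<int>, now: int, future: seq<int>)
datatype Action = Increment | Decrement

// Requirement 1: every state in the history satisfies n >= 0.
predicate Inv(h: Hist) {
  h.now >= 0
  && (forall i :: 0 <= i < |h.past| ==> h.past[i] >= 0)
  && (forall i :: 0 <= i < |h.future| ==> h.future[i] >= 0)
}

predicate Valid(h: Hist, a: Action) {
  a == Increment || h.now > 0
}

function Do(h: Hist, a: Action): (r: Hist)
  requires Inv(h) && Valid(h, a)
  ensures Inv(r)          // requirement 1: the new history is still valid
  ensures r.future == []  // requirement 2: a new action clears the redo stack
{
  var next := if a == Increment then h.now + 1 else h.now - 1;
  // Keeping the old future here (future := h.future) would fail the proof.
  h.(past := h.past + [h.now], now := next, future := [])
}
```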
The LLM then acts less like a generator of the most plausible implementation and more like a search for the correct implementation.
From toy example to real applications
A counter is trivial: one invariant, two functions. But the structure — specify, prove, compile — scales to domains where the invariants are global and the bugs are subtle:
- Kanban boards where cards must exist in exactly one column
- Capability delegation with transitive permissions
- Collaborative documents with referential integrity
Beyond Undo: Offline Collaboration
Replay handles a single user’s timeline. But what happens when multiple users edit the same board and one goes offline?
Alice drags card X to “Done.” Bob, working offline, drags the same card to position 2 in “In Progress”, right after card Y. When Bob reconnects, his action references a board state that no longer exists. Card Y might have moved. The column might be full.
Most systems pick one of two bad options:
- Reject. Bob’s action fails. His work disappears. Safe, but infuriating, especially if his intent (“put X in In Progress”) was perfectly valid against the current state.
- Rebase naively. Reinterpret Bob’s action against the new board. “Position 2” now means something different. Card X lands in the wrong place, or worse, if the naive logic has bugs, it duplicates, vanishes, or violates the column’s WIP limit.
The verified approach is different: try to preserve intent, prove what you accept.
When Bob’s action arrives, the server asks: can we interpret this in a way that satisfies the invariant? It tries Bob’s original placement first. If that fails (card Y is gone), it tries fallbacks — “at end of column,” “before the first card.” Only if every interpretation would violate the invariant does it reject.
The kernel proves: every accepted action produces a valid state. The server never accepts a corrupted board, no matter how stale the client or how concurrent the edits.
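A hedged sketch of that acceptance step, with a toy model standing in for the board (the names `FirstAcceptable`, `Apply`, and the one-field `Model` are illustrative assumptions, not the repository’s multi-collaboration kernel):

```dafny
// Illustrative sketch only: try candidate interpretations in order and
// accept the first one whose result satisfies the invariant; reject otherwise.
datatype Option<T> = None | Some(value: T)
datatype Model = Model(n: int)      // stand-in for the board state
datatype Action = Add(delta: int)   // stand-in for a board edit

predicate Inv(m: Model) { m.n >= 0 }
function Apply(m: Model, a: Action): Model { Model(m.n + a.delta) }

function FirstAcceptable(m: Model, candidates: seq<Action>): (r: Option<Model>)
  // The kernel-level guarantee: anything accepted satisfies the invariant.
  ensures r.Some? ==> Inv(r.value)
  decreases |candidates|
{
  if |candidates| == 0 then
    None                                  // no interpretations left: reject
  else if Inv(Apply(m, candidates[0])) then
    Some(Apply(m, candidates[0]))         // first acceptable interpretation wins
  else
    FirstAcceptable(m, candidates[1..])   // fall back to the next interpretation
}
```

The candidate list plays the role of Bob’s original placement followed by the fallbacks (“at end of column,” “before the first card”).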
This is where replay becomes infrastructure. The same architecture — kernel, domain, proof obligation — handles problems that can’t be tested exhaustively.
The next section walks through how the methodology is built to scale to complex domains.
Implementation: Kernels, Domains & AppCore
The Architecture
Our system separates what’s generic from what’s application-specific. The fundamental capability here is that Dafny code can be verified and compiled to JavaScript. The specific architecture we’ve implemented here (separating kernels, domains, and AppCore) is one pattern for organizing that capability. Its benefit is reuse: prove a kernel once, plug in many domains. But it’s not the only way. A simpler project might verify a single module without the kernel abstraction. This separation of concerns is a design choice, not a requirement of the approach.
Kernels
Kernels are generic, reusable modules that encapsulate a pattern of state evolution and prove it correct once:
- A kernel is parameterized over a domain — it doesn’t know what the state looks like or what the actions mean, only that the domain satisfies certain proof obligations
- The kernel’s guarantees follow from those obligations — if the domain proves its transitions preserve the invariant, the kernel guarantees that all states it produces (including through undo, redo, sync, or conflict resolution) also satisfy that invariant
- Once a kernel is proved, it never needs to be touched again — new domains just plug in
We’ve built three kernels so far: replay (undo/redo), authority (client-server synchronization), and multi-collaboration (offline clients with conflict resolution). Each encapsulates a different pattern but all follow the same structure.
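A hedged sketch of what “parameterized over a domain” can look like as Dafny modules (module and member names are illustrative assumptions; the repository’s actual layout may differ):

```dafny
// Illustrative sketch only; not the repository's actual kernel code.
abstract module DomainSig {
  type Model
  type Action
  predicate Inv(m: Model)
  predicate Valid(m: Model, a: Action)
  function Apply(m: Model, a: Action): Model
  // The proof obligation every domain owes the kernel:
  lemma ApplyPreservesInv(m: Model, a: Action)
    requires Inv(m) && Valid(m, a)
    ensures Inv(Apply(m, a))
}

abstract module ReplayKernel {
  import D : DomainSig

  datatype Hist = Hist(past: seq<D.Model>, now: D.Model, future: seq<D.Model>)

  predicate HistInv(h: Hist) {
    D.Inv(h.now)
    && (forall i :: 0 <= i < |h.past| ==> D.Inv(h.past[i]))
    && (forall i :: 0 <= i < |h.future| ==> D.Inv(h.future[i]))
  }

  function Do(h: Hist, a: D.Action): Hist
    requires D.Valid(h.now, a)
  {
    Hist(h.past + [h.now], D.Apply(h.now, a), [])
  }

  // Proved once, for every domain that satisfies DomainSig; the remaining
  // obligations are routine facts about sequence concatenation.
  lemma DoPreservesHistInv(h: Hist, a: D.Action)
    requires HistInv(h) && D.Valid(h.now, a)
    ensures HistInv(Do(h, a))
  {
    D.ApplyPreservesInv(h.now, a);
  }
}
```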
Domains
Domains are application logic — the part that’s unique to each app:
- At minimum, a domain defines `Model` (the data shape), `Action` (things users can do), and an invariant `Inv` (rules that must always be true of the model); see the sketch after this list
- More sophisticated kernels require more from domains. For example, the multi-collaboration kernel also needs `Rebase` (how to transform stale actions), `Candidates` (fallback interpretations when the original action fails), and `Explains` (what counts as preserving user intent)
- The domain interface is a contract with the kernel: define what it asks for, prove the obligations it requires, and the kernel guarantees its properties
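For a sense of what a richer domain’s minimum pieces look like, here is a hedged sketch of a Kanban-style model and invariant (illustrative assumptions only; the actual KanbanDomain.dfy also defines how actions apply and carries the supporting proofs):

```dafny
// Illustrative sketch only; not the actual KanbanDomain.dfy.
module KanbanSketch {
  type CardId = int
  datatype Board = Board(columns: seq<seq<CardId>>)  // card ids, in order, per column
  datatype Action =
    | AddCard(col: nat, card: CardId)
    | MoveCard(card: CardId, toCol: nat, toPos: nat)

  // "No duplicate cards": a card id occurs at most once anywhere on the board,
  // which also forces each card to live in exactly one place.
  predicate Inv(b: Board) {
    forall i, j, p, q ::
      0 <= i < |b.columns| && 0 <= j < |b.columns| &&
      0 <= p < |b.columns[i]| && 0 <= q < |b.columns[j]| &&
      b.columns[i][p] == b.columns[j][q]
      ==> i == j && p == q
  }

  // Apply, Valid, and ApplyPreservesInv are where the bulk of the proof
  // effort lives; they are omitted from this sketch.
}
```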
AppCore
AppCore is the API layer that wires a domain to a kernel and exposes it to the application:
- It instantiates the generic kernel with a specific domain
- It exposes the methods the UI will actually call: `Init`, `Dispatch`, `Undo`, `Redo`, `Present`
- It handles the translation between the domain’s action types and the kernel’s generic interface
Think of it this way: the kernel is a template, the domain is business logic, and AppCore is the glue that produces a ready-to-use JavaScript module.
Responsibility by Layer
Each layer trusts the one below it.
File Structure
A typical verified domain example:
```
Replay.dfy        # Generic kernel (proved once, never touched)
KanbanDomain.dfy  # Domain: model, actions, invariant, proofs
# AppCore: Wires Kanban domain to Replay kernel, exposes API
compile.sh        # Dafny → JavaScript compilation
Kanban.cjs        # Compilation output: what the app imports
kanban/           # React app imports the compiled module
```
The .dfy files are where verification happens. The .cjs file is what ships. The React app imports the compiled module and calls its methods.
The Proof Obligation
Taking the replay kernel as an example, each domain must prove one thing:
```dafny
lemma ApplyPreservesInv(m: Model, a: Action)
  requires Inv(m)
  requires Valid(m, a)
  ensures Inv(Apply(m, a))
```
In plain terms: if the rules hold before an action, they hold after. That’s the only obligation for this specific kernel. Other kernels may include more obligations. In addition, each domain may have domain-specific guarantees beyond the kernel.
Proof Effort Scales with Complexity
The effort scales with domain complexity, not with the kernel’s logic. A counter domain (~50 lines of proofs) and the Kanban domain (~1,400 lines of proofs) share the same kernel.
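To make that asymmetry concrete, the counter domain’s entire obligation amounts to roughly the following hedged sketch (illustrative names; the actual domain file differs in detail):

```dafny
// Illustrative counter domain: the single proof obligation is discharged
// automatically, so the lemma body stays empty.
module CounterSketch {
  datatype Model = Model(quantity: int)
  datatype Action = Increment | Decrement

  predicate Inv(m: Model) { m.quantity >= 0 }
  predicate Valid(m: Model, a: Action) { a == Increment || m.quantity > 0 }

  function Apply(m: Model, a: Action): Model
    requires Valid(m, a)
  {
    match a
    case Increment => Model(m.quantity + 1)
    case Decrement => Model(m.quantity - 1)
  }

  lemma ApplyPreservesInv(m: Model, a: Action)
    requires Inv(m) && Valid(m, a)
    ensures Inv(Apply(m, a))
  {
    // Nothing to write: Dafny's SMT automation proves this directly.
  }
}
```

The Kanban domain carries the same obligation, but discharging it requires the lemmas about sequences, maps, and orderings discussed under Challenges.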
Same Methodology, Different Shapes
To show this generalizes, we implemented three domains with the same replay kernel:
| | Counter | Kanban | Delegation |
|---|---|---|---|
| Model | Single integer | Columns, cards, positions | Subjects, grants, edges |
| Structure | Primitive value | Ordered sequences | Directed graph |
| Key operation | Increment/decrement | Move card | Delegate capability |
| Invariant | n ≥ 0 | No duplicate cards | Referential integrity |
| Proof effort | ~50 lines | ~1,400 lines | ~600 lines |
| Kernel | Replay | Replay | Replay |
The domains look nothing alike. The methodology is identical:
- Define `Model`, `Action`, and the invariant `Inv`
- Prove transitions preserve `Inv`
- Plug into the kernel
- Compile to JavaScript
Undo/redo works automatically across these different data shapes, with the same guarantees.
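For the third shape in the table, a hedged sketch of what the delegation domain’s referential-integrity invariant might look like (illustrative names; the real domain’s proofs run to roughly 600 lines):

```dafny
// Illustrative sketch only; not the actual delegation domain.
module DelegationSketch {
  type Subject = int
  datatype Grant = Grant(grantor: Subject, grantee: Subject)
  datatype Model = Model(subjects: set<Subject>, grants: set<Grant>)

  // Referential integrity: every grant edge points at subjects that exist.
  predicate Inv(m: Model) {
    forall g :: g in m.grants ==> g.grantor in m.subjects && g.grantee in m.subjects
  }
}
```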
Try It & Contribute
The code is open. The demos work. The methodology is documented.
Most importantly, the paradigm shift is real.
- GitHub repository — kernels, domains, React integration, and Dafny source
- Dafny documentation — learn more about the verification language
- Get in touch — if you’re interested in collaborating or have a hard problem for us
Challenges
Technical Limitations
Verification doesn’t cover integration
Verification guarantees that the Dafny code is correct. It doesn’t guarantee that everything around it is wired correctly.
Dafny’s JavaScript compilation is functional but not battle-hardened. Bugs can exist in:
- The Dafny-to-JS compiler itself
- The runtime library that compiled code depends on
- The glue code connecting AppCore to the React layer
- Serialization and deserialization at system boundaries
The kernel logic is proved, but the integration is not. We still need to write tests for the boundaries to verify that React dispatches the right actions, that the compiled module loads correctly, and that the API behaves as expected end-to-end. Verification reduces the surface area of what needs testing and human supervision; it doesn’t eliminate testing entirely.
Performance is not a verification target
Verification proves correctness, not efficiency. The compiled JavaScript uses Dafny’s runtime abstractions and is not hand-optimized. For UI state management — where operations are infrequent and data structures are small — this is rarely a problem. For hot paths, high-frequency updates, or large datasets, the overhead may matter.
Additionally, when given a sophisticated specification that also encodes performance requirements, the LLM may not implement those requirements in full if it cannot find a proof.
Profiling still applies. If performance becomes an issue, the verified kernel can serve as a reference implementation while a hand-tuned version is developed and tested against it.
Methodology Limitations
Proof effort scales with domain complexity
The counter domain is ~50 lines. The Kanban domain is ~1,400 lines of proofs. This isn’t accidental — the Kanban invariant spans multiple data structures (columns, cards, orderings), and proving that operations preserve it requires lemmas about sequence manipulation, map updates, and their interactions.
For domains with richer invariants — distributed systems, financial calculations, access control policies — the proof burden grows. At some point, the effort to prove correctness may exceed the effort to test thoroughly. The methodology is most valuable where:
- Bugs are costly (data corruption, security, financial loss)
- The state space is large (testing is unlikely to cover it)
- The invariants are expressible (you can state what “correct” means)
Not every domain hits this sweet spot.
Specification correctness is assumed, not verified
Proofs guarantee that the implementation satisfies the specification. They don’t guarantee that the specification is what you actually wanted.
If the invariant is wrong — too weak, too strong, or missing a case — the proofs will happily verify buggy code. “No duplicate cards” is a reasonable invariant for a Kanban board. But if the real requirement was “no duplicate cards with the same title” and you specified ID uniqueness instead, verification won’t catch the mismatch.
The human still owns the spec. Verification shifts the review burden from “is this code correct?” to “is this specification correct?” It’s a smaller surface, but not zero.
LLMs may modify your specification
There’s a subtler issue: LLMs, when faced with a spec they can’t prove, will sometimes adjust the spec itself. This isn’t inherently bad. In our experience, these modifications often exposed ambiguities or over-constraints we hadn’t noticed. The model might weaken a precondition that was unnecessarily strict, or generalize an invariant that was too narrow. The resulting implementation was still reliable.
But it requires vigilance. The human must review the final specification. Did the model prove what you asked, or something subtly different?
Adoption Barriers
LLM capability is a hard requirement
We used Claude Code (with Opus 4.5) to write the kernels, domains, proofs, and React apps. This worked because the LLM was able to generate Dafny code, interpret verification errors, and iterate until proofs went through (proofs often take more than 10 minutes to run).
The trajectory and speed of LLM improvements point to a near future where formal verification is much more accessible and mainstream. However, as of December 2025, only some frontier-level models can reliably write verified Dafny code. Smaller or less capable models struggle with the feedback loop of “proof failed → diagnose → fix → retry.” The methodology assumes access to a highly capable model.
The ecosystem is immature
Dafny is a research language with industrial users, but it’s not mainstream. Compared to TypeScript or Python:
- Community resources are sparse (fewer Stack Overflow answers, tutorials, libraries)
- Hiring developers with Dafny experience is effectively impossible
While the methodology presented here is ready for power users and teams willing to invest in the learning curve, broader adoption will require better tooling, more documentation, and ideally, LLMs that can bridge the knowledge gap.
Future Work
Surfacing specifications for human review
Currently, the LLM generates both the formal specification and the implementation. The human reviews the final artifact, but by then the spec is already embedded in the code. A more transparent and efficient workflow would surface the generated domain obligations (i.e., the intended behavior), along with any kernel-specific requirements, back to the user in a human-readable form before proof generation begins.
This would allow iteration on intent before committing to a lengthy proof search. The user could refine invariants, add edge cases, or reject/revise a misunderstood requirement early.
Verification-aware tooling for non-experts
The methodology currently assumes familiarity with formal methods concepts or access to an LLM that can abstract them away. Broader adoption would benefit from:
- IDE integration that explains proof failures in plain language
- Visual tools for exploring state spaces and invariant violations
- Templates and scaffolding for common domain patterns (CRUD, permissions, workflows)
Spec drift detection
As noted in Challenges, LLMs may modify specifications during proof generation. Tooling that diffs the original user intent against the final proven spec — highlighting what changed and why — would increase trust and reduce silent failures. This could be as simple as a structured changelog or as sophisticated as a semantic diff over formal specifications.
Why Dafny?
For the technically curious: we chose Dafny over alternatives (Rocq, Lean, TLA+) for practical reasons:
- JavaScript compilation is first-class — not a research prototype or third-party tool
- Semi-automation — Dafny dispatches verification conditions to SMT solvers
- Designed for program verification — we’re proving code correct, not mathematical theorems
As verification tooling and LLMs mature, other languages may become viable.
About
This work is a collaboration between:
- Nada Amin — Assistant Professor at Harvard, specializing in programming languages and AI
- Fernanda Graciolli — Developer specializing in React apps and product design