In web development, formal verification of implementation code must account for both soundness and completeness.
- A sound system doesn’t accept invalid actions.
- A complete system doesn’t reject valid actions.
We’ve been building toward correct-by-construction systems — using Dafny to formally verify that implementation code matches spec before it even compiles. Our focus has been largely on whether the structure of the code itself preserves a generic and/or domain-specific invariant. Our system mathematically proves the absence of any bug that doesn’t match the specifications.
Thus far, formal verification of programs has been used mainly in domains where correctness is cut and dry, like strict security rules or blockchain. In these cases, we want the user’s actions to be constrained by very rigid rules. But this changes when we move into the world of more mainstream web development, where flexibility is a feature of user experience.
Our bet at Midspiral is that formal verification can be brought into mainstream web development. And for everyday apps, the user experience relies heavily on the flexibility of the system to allow actions that achieve the user’s goal, even when the literal request can’t be fulfilled exactly as specified. This is what we call “intent preservation.” The pattern is already pervasive in web development: When a user navigating with a map takes a different turn than what they were instructed, the map reroutes because correctness means ultimately getting the user to their destination regardless of their path.
This creates a verification challenge: proving that a system is flexible enough. Not just that it won’t accept invalid actions — but that it won’t reject valid ones.
In this post, we focus on proving completeness through an intent envelope that bakes flexibility into formal verification. We use property-based testing as a comparison assurance check, and show that property-based testing struggles to catch completeness bugs while proofs catch them structurally.
The Intent Envelope
To prove completeness, we first need to define what counts as a valid interpretation of the user’s request. That definition is what we call the intent envelope.
In collaborative systems, for example, users send actions based on stale state. A client might say “move card A after card B” while card B has already moved elsewhere. The server can’t always do exactly what was asked — but it can try to preserve the intent. If card B is gone, putting it at the end of the column still gets the card to the list the user intended, even if not in the exact position.
In our system, an intent envelope defines what outcomes preserve the user’s intent. It’s not a list of actions — it’s a predicate. The developer defines the rule in natural language and the LLM generates the code.
A candidate preserves intent if the card ends up in the target column — either at the exact position requested, or at the end.
function explains(orig, candidate) {
return candidate.cardId === orig.cardId
&& candidate.toColumn === orig.toColumn
&& (candidate.placement === orig.placement
|| candidate.placement.type === 'AtEnd');
}
The predicate takes two actions — the original request and a candidate alternative — and returns true if the candidate preserves the intent of the original. The user wanted card1 in column A. The original placement puts it there. AtEnd puts it there. Both satisfy explains. Both are in the envelope.
The envelope could be wider or richer depending on what the developer considers acceptable. The developer writes the spec in natural language and the LLM generates the predicate:
Wider envelope:
If the card ends up in the target column, the position doesn’t matter.
function explains(orig, candidate) {
return candidate.cardId === orig.cardId
&& candidate.toColumn === orig.toColumn;
}
Richer envelope:
A candidate preserves intent if:
- It places the card exactly where requested, OR
- It places the card at the end (unless the user specifically asked for the start), OR
- It places the card at the start (if the user asked for start or before the first card), OR
- It places the card adjacent to the original position, OR
- It uses After(prev) when the user said Before(x), since those are the same position
function explains(orig, candidate) {
if (candidate.cardId !== orig.cardId) return false;
if (candidate.toColumn !== orig.toColumn) return false;
// Exact match always valid
if (deepEqual(candidate.placement, orig.placement)) return true;
// AtEnd is valid if original was relative placement
if (candidate.placement.type === 'AtEnd'
&& orig.placement.type !== 'AtStart') return true;
// AtStart is valid if original was AtStart or Before(first)
if (candidate.placement.type === 'AtStart'
&& (orig.placement.type === 'AtStart'
|| orig.placement.type === 'Before')) return true;
// Adjacent positions valid if anchor exists but moved
if (isAdjacentTo(candidate.placement, orig.placement)) return true;
// Before(x) equivalent to After(prev(x))
if (candidate.placement.type === 'After'
&& orig.placement.type === 'Before'
&& candidate.placement.anchor === getPrev(orig.placement.anchor)) return true;
return false;
}
The developer defines, in natural language, what behavior is acceptable (i.e., the intent envelope). The LLM generates both the formal predicate (explains) and the implementation (candidates). The proof (also generated by the LLM) guarantees the implementation covers the predicate. If the developer decides to narrow or widen the envelope, they simply add to the natural language spec; the LLM + proof handle the rest.
The completeness guarantee follows directly from this definition: if any action in the envelope would succeed without violating invariants, accept. If something in it works, the server should find it.
The Verification Gap
In practice, teams building collaborative systems think carefully about intent preservation. There are design docs, product specs, code review. The question isn’t whether developers think about what should count as valid, it’s whether their decisions are guaranteed in the implementation.
A design doc might say: “If the user’s requested placement fails, fall back to placing the card at the end of the column.” A developer implements candidates to match. Code review confirms it looks right. Tests verify state invariants hold.
Six months later, someone adds an optimization: “Skip the AtEnd fallback for columns with 3+ cards to reduce redundant checks”. The change looks reasonable, passes review. All tests still pass because state invariants still hold.
But the intent envelope drifted from the implementation.
Where Property-Based Tests Fall Short
To test this, we built a kanban board implementation and intentionally introduced a subtle bug. The correct kernel always includes AtEnd as a fallback:
function candidates(state, action) {
if (action.type !== 'MoveCard') return [action];
return [action, { ...action, placement: { type: 'AtEnd' } }];
}
The buggy kernel adds a heuristic — it skips AtEnd when the target column has 3+ cards:
function candidates(state, action) {
if (action.type !== 'MoveCard') return [action];
const targetCards = getCardsInColumn(state, action.toColumn);
if (targetCards.length < 3) {
return [action, { ...action, placement: { type: 'AtEnd' } }];
}
return [action]; // skip fallback
}
In practice, this specific bug — hardcoding 3 when the WIP limit is 10 — would likely be caught in code review, but we’re not testing whether humans notice bad constants. We’re testing whether property-based tests catch completeness violations at all.
We ran 50,000 random action sequences against the buggy kernel with zero failures.
The tests passed because they check invariants — cards partitioned correctly, no duplicates, WIP limits respected. This isn’t a bug that corrupts state. Instead, it’s a bug that causes over-rejection: valid actions get rejected when they could have been accepted. A kernel that rejects everything would pass all safety tests.
Property tests ask: “If you accept an action, is the new state valid?” They don’t ask: “Should you have accepted the action?” That question requires knowing what should be accepted — the intent envelope.
Writing Completeness Tests
The buggy heuristic fails in a specific collaborative scenario:
- Column A has 4 cards: card1, card2, card3, card4 (WIP limit is 10)
- User X moves card2 to column B — column A now has 3 cards
- User Y (stale state) tries to reposition card1 before card2
- Server receives:
MoveCard(card1, columnA, Before(card2)) Before(card2)fails — card2 isn’t there. AtEnd would succeed, but the heuristic skipped it.
We set up a targeted test which also checks at runtime, but instead of checking for validity, it checks for completeness: “if a valid interpretation exists, did dispatch find it?” Here, we knew exactly what we were targeting because we injected the bug. In practice, you’d need to anticipate every way completeness could be violated — which means encoding the intent envelope into your tests.
// 1. Define the envelope as a testable function
function explains(orig, candidate) {
return candidate.cardId === orig.cardId
&& candidate.toColumn === orig.toColumn
&& (deepEqual(candidate.placement, orig.placement)
|| candidate.placement.type === 'AtEnd');
}
// 2. Enumerate all candidates in the envelope
function allCandidatesInEnvelope(action) {
return [
action,
{ ...action, placement: { type: 'AtEnd' } }
];
}
// 3. Test the completeness property
forAll(state, action, () => {
const validExists = allCandidatesInEnvelope(action)
.some(c => tryStep(state, c).ok);
if (validExists) {
assert(dispatch(state, action).ok);
}
});
For this envelope, it might be reasonable to do so. You define explains, enumerate the candidates, and test that dispatch accepts whenever any candidate would succeed.
With this completeness test, we caught the bug 100% of the time.
But notice what you’ve done: you’ve written the spec. explains defines what’s valid. allCandidatesInEnvelope lists them. If these are correct and your tests cover enough states, you’ll catch completeness bugs.
The proof does the same work differently. You write Explains (the predicate). You write Candidates (the implementation — what dispatch actually tries). This is the same as testing.
The difference is that, in testing, you enumerate the envelope explicitly:
function allCandidatesInEnvelope(action) {
// Must list EVERYTHING explains would accept
return [action, { ...action, placement: { type: 'AtEnd' } }];
}
If you miss one, your test won’t catch bugs related to it; your enumeration must match your predicate perfectly.
Writing Proofs
In proofs, you universally quantify over the envelope:
lemma CandidatesComplete(m: Model, orig: Action, aGood: Action)
requires Explains(orig, aGood)
requires TryStep(m, aGood).Ok?
ensures aGood in Candidates(m, orig)
The lemma says “for any aGood such that Explains(orig, aGood)…” — the prover checks this holds for all such actions without you listing them. You don’t enumerate. You describe the rule, and the prover verifies the implementation covers it. The structure is checked directly at construction time, not at runtime.
In Summary
- Property testing asks: did an action in this sample corrupt the state?
- Completeness testing asks: did an action in this sample violate the intent envelope?
- Proofs ask: does the structure of my implementation cover the spec?
Completeness is hard for property testing: you need to define what “valid” means, enumerate it, and then run enough samples to hit every edge case (which is impossible when the state space is infinite).
Proofs verify the structure directly — no sampling, no enumeration. If your Candidates function doesn’t cover what Explains defines, the proof fails — no matter how rare the edge case.