claimcheck: Narrowing the Gap between Proof and Intent


As LLMs get better at writing code, the programmer’s job changes. It becomes more about describing what you want to implement and less about how you want to implement it.

This is already happening. Spec-driven development with AI is becoming quite common: developers use natural language to structure requirements and program behaviors, and leave the implementation up to the AI. (Yes, testing, constraints, sanity checks, and review are all still important parts of the process.)

The delegation of implementation to AI raises two big questions.

  • How do we know AI-generated code is correct, and not just superficially working?
  • How do we know the implementation actually captures what the programmer intended?

Our recent work tackles the first question by adding formal verification to the AI coding loop. We force the AI to mathematically prove its code is correct. But that still leaves the second question: does the specification we proved actually match what we meant?

That’s what claimcheck addresses. It’s a simple tool that narrows the gap between “the code is guaranteed correct” and “the code makes the intended guarantees.”

The core technique is round-trip informalization: translate the formal specification back into plain English, without ever showing the translator the original requirement, then compare that translation against the original intent to check whether they actually match.
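To make the round trip concrete, here is a toy example of our own devising (not code from the claimcheck repo): a trivial Dafny lemma, the kind of back-translation the first step might produce, and the comparison that follows.

```dafny
// A toy round trip (our own illustration, not code from the claimcheck repo).

function Abs(x: int): int {
  if x < 0 then -x else x
}

// The informalizer sees only this lemma, never the English requirement.
lemma AbsNonNegative(x: int)
  ensures Abs(x) >= 0
{
}

// Back-translation (plausible informalizer output):
//   "For any integer x, Abs(x) is greater than or equal to zero."
// Comparison: judged against the requirement "the absolute value is never
// negative"; a semantic match, so the claim would be confirmed.
```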

How it works concretely

Claimcheck tests each requirement-lemma pair in two steps:

Step 1: Informalize

Send the Dafny lemma to an LLM (Haiku by default, at low temperature) and ask it to describe in plain English what the lemma guarantees. The model never sees the original requirement; it works only from the formal code.

Step 2: Compare

Send both the original requirement and the back-translated description to a second LLM (Sonnet 4.5) and ask whether they express the same guarantee.

If the back-translation matches the requirement semantically, the claim is confirmed. If it diverges (the lemma proves something weaker, stronger, or just different), the claim is disputed.

The structural separation is the key design choice. Because the informalizer never sees the requirement, it can’t anchor on the original English and parrot it back. It has to independently derive what the formal code means.

In practice, both steps are batched into 2 API calls per file: one batch-informalize call, one batch-compare call.

What claimcheck caught

We asked Claude Code to build a verified election tally system: a Count function that tallies ballots, with 11 lemmas covering bounds, vote effects, aggregation, unanimity, and monotonicity.

Claude wrote the Dafny code and planted an intentionally incorrect lemma. The requirement said “adding a ballot can’t decrease a tally,” but the lemma merely proved Count(...) >= 0, which is trivially true since counts are natural numbers. A faithful lemma would compare the tally before and after the ballot is added; this one was a tautology masquerading as monotonicity. It’s true by definition that you can’t have negative votes, regardless of what the code does, so the lemma proves nothing about system behavior.
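Here is the contrast sketched in Dafny. The Ballot and Count definitions are our reconstruction for illustration, not the repo’s exact code, but the planted lemma has the shape described above:

```dafny
datatype Ballot = Ballot(choice: string)

// Illustrative tally function: counts ballots cast for a given candidate.
function Count(ballots: seq<Ballot>, candidate: string): nat {
  if |ballots| == 0 then 0
  else (if ballots[0].choice == candidate then 1 else 0)
       + Count(ballots[1..], candidate)
}

// The planted lemma: verifies with an empty body, because Count returns
// a nat and a nat is >= 0 by definition. A tautology.
lemma TallyNonNegative(ballots: seq<Ballot>, candidate: string)
  ensures Count(ballots, candidate) >= 0
{
}

// What the requirement actually asked for: a before-vs-after comparison.
lemma TallyMonotone(ballots: seq<Ballot>, b: Ballot, candidate: string)
  ensures Count(ballots + [b], candidate) >= Count(ballots, candidate)
{
  if |ballots| == 0 {
    assert ballots + [b] == [b];
  } else {
    assert (ballots + [b])[1..] == ballots[1..] + [b];
    TallyMonotone(ballots[1..], b, candidate);
  }
}
```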

The code was internally consistent (machine-verified), and every proof passed. The result: the implementation came with mathematical guarantees.

But what exactly did we guarantee?

We ran claimcheck. Result: 9 guarantees confirmed, 2 disputed.

The first catch was the one we planted on purpose. Claimcheck flagged it as wrong-property.

The second catch was one we didn’t expect. The requirement said “order of ballots doesn’t affect the tally,” and the lemma proved that prepending vs. appending a ballot gives the same count. That’s one specific reordering, not an arbitrary permutation. It looks right to a reviewer at a glance, and Dafny doesn’t care: it verified fine. But claimcheck flagged it as narrowed-scope: the lemma proves something weaker than what the requirement claims.
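Reusing the Count function from the sketch above, the narrowed lemma looks roughly like this (again a reconstruction, not the repo’s exact code):

```dafny
// Proves one specific reordering: a single ballot moved from the front
// of the sequence to the back. Verifies, but says nothing about
// arbitrary permutations.
lemma OrderIrrelevant(ballots: seq<Ballot>, b: Ballot, candidate: string)
  ensures Count([b] + ballots, candidate) == Count(ballots + [b], candidate)
{
  if |ballots| == 0 {
    assert [b] + ballots == [b] && ballots + [b] == [b];
  } else {
    assert ([b] + ballots)[1..] == ballots;
    assert (ballots + [b])[1..] == ballots[1..] + [b];
    OrderIrrelevant(ballots[1..], b, candidate);
    assert ([b] + ballots[1..])[1..] == ballots[1..];
  }
}
```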

We fixed both (rewrote OrderIrrelevant using multisets for full permutation invariance, and removed the tautology) and re-ran claimcheck: 10 confirmed, 0 disputed. One iteration.
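For reference, the strengthened property can be stated over multisets, which covers every permutation at once. This is an illustrative statement only, with the proof body elided (one standard route is to define counting over multisets and show Count agrees with it); the lemma name is ours:

```dafny
// Any two ballot sequences with the same multiset of ballots (i.e., any
// permutation of one another) produce the same tally. Statement sketch
// only; the proof body is elided, so Dafny treats this as an assumption.
lemma OrderIrrelevantPermutation(a: seq<Ballot>, b: seq<Ballot>, candidate: string)
  requires multiset(a) == multiset(b)
  ensures Count(a, candidate) == Count(b, candidate)
```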

What we measured

We tested across 5 domains (counter, kanban, colorwheel, canon, delegation-auth) with 36 requirement-lemma pairs, including 8 deliberately bogus lemmas: tautologies, weakened postconditions, narrowed scope. These are the kinds of subtle spec drift that are hard to spot in code review or catch with tests.

The two-pass approach got 96.3% accuracy across 108 comparisons (3 runs × 36 pairs) in about 108 seconds per run, using just 2 API calls per run. Single-prompt mode (one model does both passes sequentially) dropped to 86.1% accuracy and took 353 seconds with 36 individual calls. Piping through Claude Code’s general-purpose system prompt: 69.4%, 693 seconds.

The 96.3% figure comes from running against our own dafny-replay projects, not a formal evaluation. It’s a development benchmark, useful for comparing modes (two-pass vs. single-prompt vs. Claude Code) but not a claim about general reliability.

The accuracy gap is still worth noting. Structural separation (two different models, informalizer blind to the requirement) outperforms prompt-level separation (one model told not to peek ahead) by 10 points. And both outperform no separation at all. Batching also gives roughly a 3× speedup over individual calls.

Where this fits

We built claimcheck as a Claude Code Hackathon project, but it slots directly into our own workflow, where Claude Code writes Dafny proofs against natural-language requirements. The proofs verify, but the intent gap is still real.

claimcheck is one layer in our ongoing attempt to map natural language requirements to formal guarantees. It gives us a fast, cheap, surface-level check. It’s not a full guarantee, but a flag that goes up when something looks off.

Whether LLM-based informalization is reliable enough for higher-stakes use, and whether the structural-separation trick generalizes to other formal/informal comparison tasks, are still open questions.

What this doesn’t solve

claimcheck is an LLM comparing an LLM’s translation to a natural-language requirement. It’s not a proof. It’s a probabilistic sanity check that requires human review and approval.

A few specific limitations worth naming:

  • The informalization step can be wrong. If the LLM misreads the Dafny, the comparison step ends up judging the requirement against a wrong description. We haven’t yet measured how often Haiku misinterprets Dafny lemmas in isolation.

  • The comparison step can be wrong. Sonnet might say two descriptions match when they don’t, or flag a discrepancy that isn’t real. The 96.3% accuracy means roughly 1 in 27 comparisons is incorrect. That’s not a rate to bet a production system on without human review, especially when the ultimate goal is a guarantee (i.e., 100% coverage).

  • Coverage is not checked. claimcheck verifies claims that exist in the mapping file. It doesn’t tell you whether the mapping is complete, whether there are requirements with no corresponding lemma, or code behavior with no corresponding requirement.

  • The approach assumes the Dafny lemma is self-contained enough for an LLM to informalize. Lemmas that depend heavily on external definitions, module imports, or complex invariant chains may not translate well out of context.

We view these not as blockers, but as challenges worth tackling as we move closer to formally verified programming.

Repo: github.com/metareflection/claimcheck