Last week we released dafny-replay, a methodology that uses formal verification to prove properties of state transitions in web apps, and presented its clear benefits and real tradeoffs. We stress-tested it on single-feature example apps designed to flesh out dafny-replay. This time, I wanted to build a very simple but fully functional application.
I’m a mere React developer. I know more about Dafny and formal verification now than I did six months ago, but I’m not an expert. So this write-up reflects the experience of someone who would build a colorwheel app rather than someone who would design a formal verification methodology.
This is a build log, not a tutorial. It takes you through my journey of building an app with a completely unfamiliar workflow. The main thread is an uninterrupted technical narrative, punctuated by hidden “Dear Diary” moments you can dig into if you’d like to read my reflections and insights.
What I Built
ColorWheel is a 5-color palette generator. You pick a mood and a harmony type, and it generates a palette that satisfies both constraints. You can adjust individual colors, make minor shifts to the whole palette, re-generate it, undo/redo (partially broken; more on that later), and preview contrast ratios.
There’s nothing exotic feature-wise — although this is an app I have personally wanted for a while. The interesting part is under the hood: the state management is written in Dafny, mathematically verified, and compiled to JavaScript. Every action — generate, adjust, undo, redo, etc — goes through a verified kernel that guarantees the state can never be corrupted.
I won’t walk through features. You can try it or read the code. The rest of this post is about what it was like to build it.
Designing the Spec
I fired up Claude Code and got to work. I started with a purposefully vague prompt:
I want to create a new app using this methodology - see @README.md, @HOWTO.md and @Replay.dfy (as example). For now, all I need is a spec. We need to go back and forth on the spec to ensure that the functionality of the colorwheel web app is how I want it.
What followed was a back-and-forth of 10 questions from Claude Code before any code was written. Each question led to more questions.
Some of the questions:
- “How do you define ‘soft’?” → Led to the mood bounds table (Vibrant: S 70-100, L 50-70; Soft/Muted: S 20-50, L 60-80; etc.)
- “Should palette size be fixed or variable?” → Fixed at 5, which required deciding on a strategy for the remaining colors if harmony rules produce fewer colors
- “What happens when mood and harmony constraints conflict?” → Best effort generation, then show actual values (the palette might not satisfy both perfectly)
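To make the mood bounds concrete, here is a minimal JavaScript sketch of the check, using the Vibrant and Soft/Muted ranges from the table above. The field names and table shape are my illustration, not the actual Dafny spec:

```javascript
// Mood bounds from the table above (S = saturation %, L = lightness %).
// Hypothetical shape; the real spec encodes these in Dafny.
const MOOD_BOUNDS = {
  Vibrant: { sMin: 70, sMax: 100, lMin: 50, lMax: 70 },
  Soft:    { sMin: 20, sMax: 50,  lMin: 60, lMax: 80 },
};

// A color satisfies a mood if its saturation and lightness fall
// inside that mood's bounds.
function colorSatisfiesMood(color, mood) {
  const b = MOOD_BOUNDS[mood];
  return b.sMin <= color.s && color.s <= b.sMax
      && b.lMin <= color.l && color.l <= b.lMax;
}
```

The same bright color can satisfy one mood and violate another, which is exactly why conflicting mood/harmony constraints needed a "best effort" rule.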
Dear Diary
Going into this build, I didn’t have a very clear idea in my head of how these elements (mood, harmony, etc) would be represented or how their relationships would be handled. In my head was a visualization of a color generator app that I want and can never find exactly as I want it. Claude’s questions helped me refine my vision and begin to see not just the end product but understand each building block required to get there.
Once those 10 questions were answered, Claude wrote an initial Dafny spec. The spec itself revealed holes in our app’s functionality: many questions were still open (hue distribution, adjustment clamping, linked vs. independent mode behaviors, initial state, color representation).
We continued. Some questions had simple answers; others (like hue distribution) required research on my part so that I could give a reasoned response grounded in what I was actually envisioning.
Dear Diary
Quick note to say that I learned more about color theory in 15 minutes during this session than I ever had before, because the bar here was high. I was giving information that would turn into a spec that the implementation would always satisfy. If I messed up at this level, things would get harder down the line. To me, this responsibility felt good. In the past I might have delegated the choice to the LLM, since I didn’t know much about the topic; as long as the result looked good and worked, I’d have been happy with it. Here, I felt like I was flying the plane. I didn’t only care that it landed; I was responsible for it landing well.
Claude folded my answers in and took another stab at a Dafny spec. Again, more open questions. It’s important to note that Claude’s questions didn’t come from some high-level representation of what the colorwheel app should do. Rather, it had formal specs to review, and from those formal specs it derived the gaps we still needed to fill. So, even before any verification was done, the process of writing the spec was already guiding the LLM toward a better-quality product.
Dear Diary
Maybe, by now, you think I would be feeling impatient and want to see something on a screen already, click around, etc. But this wasn’t the case. I found this process of thinking through the details delightful.
After a final round of open questions, ColorWheelSpec.dfy was complete. I asked Claude to create a DESIGN.md file cataloging all the decisions we had made so far (essentially a natural-language translation of the spec). This let me 1) review my decisions one more time; and 2) verify that the spec reflected my intentions, since DESIGN.md was translated directly from it.
Verifying the Spec
Once the first version of the spec was complete, the spec itself needed to verify. I asked Claude to run dafny verify ColorWheelSpec.dfy and fix any errors.
The first run hit 19 errors — all the same issue: both Mood and Harmony have a Custom variant, and Dafny couldn’t tell which one I meant. The compiler refused to proceed until I was explicit: Mood.Custom vs Harmony.Custom. After fixing those, the second run revealed deeper issues: missing preconditions (calling functions that require 5 colors without proving the model has 5 colors) and potential index-out-of-bounds errors in Normalize. A few rounds of fixes later: 14 verified, 0 errors.
TypeScript with strict settings would catch some of these at compile time too (the ambiguous constructors, though not the out-of-bounds indexing). The difference is that preconditions become explicit declarations in the spec — requires |m.colors| == 5 — rather than implicit assumptions.
(The real Dafny wins come later when proving invariants hold across all possible state transitions. For now, at the spec level, it’s mostly about forcing explicitness.)
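To illustrate the difference, here is a hypothetical JavaScript sketch. In plain JS the "5 colors" precondition is either an implicit assumption that fails silently or an explicit guard you have to remember to write; Dafny's requires |m.colors| == 5 makes the caller prove it statically. Names below are illustrative:

```javascript
// Implicit assumption: silently returns undefined-ish garbage if the
// model doesn't actually have 5 colors.
function lastColorHue(model) {
  return model.colors[4].h;
}

// The closest JS analogue to Dafny's `requires |m.colors| == 5`:
// an explicit runtime guard, enforced per-call instead of proved once.
function lastColorHueChecked(model) {
  if (model.colors.length !== 5) {
    throw new Error(`expected 5 colors, got ${model.colors.length}`);
  }
  return model.colors[4].h;
}
```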
First Integration
With the spec complete, I deviated from the process we initially proposed. Instead of immediately running the proofs, I decided to create the React app, hook it up, and run it. While the spec looked good on paper, I wanted to make sure the functionality made sense to a user. If anything in the spec sounded good in theory but fell flat in practice, I could catch it before running lengthy proofs.
This turned out to be a great decision.
Claude created a domain file ColorWheelDomain.dfy that wires into our replay kernel, then created the React app with Vite and all the relevant UI components.
I ran the app in the browser. It looked like an app. I clicked Generate to create a new palette. Nothing happened.
None of the buttons did anything. No crash, no errors.
We assumed the failure was at the integration level, so we added strategic logging to pinpoint the issue:
// What does the Dafny model actually look like in JS?
// - Are fields named what we expect, or do they have prefixes like `dtor_`?
// - Are values plain numbers or BigNumber objects?
// - Is `colors` an array or a Dafny Seq?
console.log('Raw model from Dafny:', model);
// Is the action being constructed correctly?
// - Are we using the right factory method (create_Vibrant vs Mood_Vibrant)?
// - Are parameters wrapped in BigNumber where needed?
// - Does the action object have the shape Dafny expects?
console.log('Dispatching action:', action);
And there it was — two issues:
- TypeError: seeds[_0_i].isLessThanOrEqualTo is not a function
- A Dafny seq<int> is not a JavaScript array
Claude solved both quickly, and when the dev server restarted, the app worked.
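Both fixes amounted to conversion helpers at the JS boundary. Here is a hypothetical sketch: Dafny's JS runtime represents integers as BigNumber objects (which expose .toNumber()) and seq<T> as its own sequence type; the helper names are mine, and the sequence is mocked here as any iterable:

```javascript
// Unwrap a Dafny integer: plain numbers pass through, BigNumber
// instances are converted via their .toNumber() method.
function toJsNumber(v) {
  return typeof v === 'number' ? v : v.toNumber();
}

// Flatten a Dafny sequence of integers into a plain JS array.
// Dafny's JS sequences are iterable, so Array.from works, with
// toJsNumber applied to each element.
function toJsArray(dafnySeq) {
  return Array.from(dafnySeq, toJsNumber);
}
```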
Spec-UI Mismatch
With the app running, I started clicking around, and something felt off. The “Generate” button barely changed the colors, while the “Randomize Base Hue” button gave me completely new palettes. That was backwards from what I expected and intended. The code was actually doing exactly what it said: “Generate” anchors to the current base hue and regenerates saturation/lightness values; “Randomize Base Hue” picks a new hue entirely, which cascades into a completely different palette. The code was correct, but the labels were wrong, so the fix was cosmetic, not logical:
- “Generate” → renamed to “Shift” (same base hue, new S/L values, small shift in all colors)
- “Randomize Base Hue” → renamed to “Re-generate” (new base hue = brand new palette)
This took about 10 minutes to sort out. Here’s the significance of this seemingly trivial step: I ensured the end experience matched the spec before investing in writing proofs. If I’d gone straight from spec to proofs, I would have spent time proving behavior that didn’t match my actual intent. By testing the UI first, I caught a spec-UX gap early. In this particular case the logic was sound, and the cosmetic label change would not have necessitated a new proof, but in other likely scenarios a mismatch between UI behavior and initial intent would affect the proof and require re-proving the logic.
Separating the Proofs
With the app running and the UI aligned with my intent, it was time to write the actual proofs. But first, an architectural question: where should the proofs live?
The previous dafny-replay apps placed the proofs directly in the domain files. Unfortunately, proofs are long, messy and complicated, whereas the domain files are straightforward and readable. I wanted to keep the proof separate. When initially writing the ColorWheelDomain.dfy I specifically prompted Claude to not write any proofs yet. Instead, it added an {:axiom} stub as a placeholder.
Now that it was time for proofs, we flipped the dependency:
Before:
ColorWheelDomain.dfy (has {:axiom} — ugly)
After:
ColorWheelSpec.dfy
↓
ColorWheelProof.dfy (proofs live here)
↓
ColorWheelDomain.dfy (clean — just delegates to Proof)
All the proof complexity lives in ColorWheelProof.dfy. The file that gets compiled stays clean.
The Actual Proof
Now for the real work: proving that every state transition preserves the invariant. To encourage thoroughness, we prompted Claude: “Feel free to use assume {:axiom} false to do the proofs in steps. Use dafny verify to check your work.”
The strategy was to break Inv into components and prove each separately:
- NormalizeEnsuresValidBaseHue — hue in [0, 360)
- NormalizeEnsuresColorCount — exactly 5 colors
- NormalizeEnsuresValidColors — all colors have valid h/s/l
- NormalizeEnsuresValidContrastPair — indices in [0, 5)
- NormalizeEnsuresMoodConstraint — mood satisfaction
- NormalizeEnsuresHarmonyConstraint — hue pattern matching
Five of them verified immediately. One didn’t: NormalizeEnsuresMoodConstraint.
I asked Claude to keep working on it. Many attempts failed:
- Added {:fuel Normalize, 2} — didn’t help
- Used calc statements — didn’t help
- Asserted field-by-field equality — all passed except mood
Eventually Claude identified the root cause: quantifiers inside functions create barriers.
// Inside Normalize:
var finalMood := if forall i :: ColorSatisfiesMood(normalizedColors[i], m.mood) then ...
// In our proof:
var check := forall i :: ColorSatisfiesMood(nc[i], m.mood);
Both forall expressions range over equal sequences, but Dafny’s SMT solver treats them as different because they appear in different scopes.
The fix was trivial. It was a change to the spec, but a well-reasoned one that didn’t weaken it: extract the forall into a named predicate:
predicate AllColorsSatisfyMood(colors: seq<Color>, mood: Mood) {
forall i | 0 <= i < 5 :: ColorSatisfiesMood(colors[i], mood)
}
Now both Normalize and the proof reference the same predicate. Dafny connects them. Proof verifies. Zero axioms.
All proofs took about 20 minutes.
Dear Diary
This is where I had to trust Claude. I don’t fully understand why quantifiers inside functions create barriers for the SMT solver. But Claude explained the issue, proposed a fix, and the proof went through. What I did understand was that the fix was a structural (not logical) change to the spec: adding a named predicate. I reviewed the change, it made sense, and I accepted it. This felt like the right division of labor: Claude handles the proof machinery; I verify the spec still says what I want.
Building Out the UI
With proofs complete, I turned back to the app. Since I cannot read proofs, I asked Claude to create a document showing me what has been proved and what has not. It looked like this.
Satisfied, I moved on to the UI implementation. I noticed that the doc above mentioned ValidContrastPair, something I had spec’ed out but that wasn’t in the UI. So I asked Claude to examine the UI against the spec and the implementation and find any discrepancies between what is logically implemented and what is visible in the UI. It found that four actions had been fully spec’ed, proved, and implemented, but were not exposed in the UI:
| Feature | Spec | Proof | UI |
|---|---|---|---|
| AdjustColor | ✓ | ✓ | ✗ |
| SetAdjustmentMode | ✓ | ✓ | ✗ |
| SelectContrastPair | ✓ | ✓ | ✗ |
| SetColorDirect | ✓ | ✓ | ✗ |
Dear Diary
At this point it was a relief to see that the plumbing for these actions was intact, and the only missing piece was a hook into the UI (by far the easiest part!). In the future, when first setting up the React app, I’ll be more specific in my prompt. My prompt to Claude at the UI setup stage was essentially, “Create a React app.” I should maintain better control over this step and treat it much like a separate, albeit not formal, spec that corresponds to the formal one.
Hooking in these actions was incredibly easy. I told Claude to do it, and it did it. Because all the logic was verified and completely decoupled from the UI code, my confidence levels were running high.
The Modularization Question
Unfortunately, the app looked terrible. It was just a mess of ugly, stacked, inconsistently sized components. Knowing my logic was verified, and that anything that broke while restructuring components would not be logic-related, I confidently dove into a full redesign. After a pretty detailed design prompt, Claude came back with an even worse version. This surprised me, because when I create apps using Claude directly in React, they actually look quite good (assuming the design prompt is good).
Then, I realized that the entire application was inside App.jsx. Claude was attempting to fix a messy file, making an even further mess out of it instead.
Dear Diary
This tastes of sweet validation for our position paper stating that modularity matters when using LLMs to generate code.
Naturally, I wanted to refactor the code into discrete components, but I worried: would modularizing break the verified logic?
The answer: no. The Dafny state is pure functional — model and action in, new model out. It doesn’t care how many React components exist.
The pattern: share model and dispatch via React context, then pass slices down as props. Each component receives the data it needs and dispatches actions back up. Local UI state (slider positions, which panel is open) stays local to the component. The verified state flows through context; everything else is just React.
After splitting into separate components for each section (generate, adjust, palette, contrast, etc.), the app became much cleaner, both architecturally and visually.
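Stripped of the React specifics, the pattern reduces to a single store wrapped around the pure verified Apply function. Here is a framework-free sketch with hypothetical names (in the real app, React context distributes model and dispatch; subscribe stands in for re-rendering):

```javascript
// Minimal store around a pure apply function: model and action in,
// new model out. All verified logic lives in `apply`; the store only
// holds the current model and notifies listeners.
function createStore(initialModel, apply) {
  let model = initialModel;
  const listeners = [];
  return {
    getModel: () => model,
    dispatch(action) {
      model = apply(model, action);        // pure transition
      listeners.forEach(fn => fn(model));  // re-render stand-in
    },
    subscribe: fn => listeners.push(fn),
  };
}
```

Because apply never mutates its inputs, splitting the UI into more components changes nothing about the verified core.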
More Integration Friction
Building out the UI surfaced more integration issues. Same pattern as before: BigNumber conversions, action signature mismatches.
One example: I added an AdjustPalette action to the spec (adjusts all colors by the same delta — always linked, by definition). The Dafny change was trivial, the proof passed immediately. But hooking it into React broke everything again.
// Component was calling:
dispatch(App.GeneratePalette(model.baseHue, mood, harmony));
// But Dafny requires randomSeeds:
dispatch(App.GeneratePalette(model.baseHue, mood, harmony, randomSeeds()));
Dafny is deterministic — randomness must be injected from JavaScript. The wrapper can hide BigNumber conversion, but it can’t hide missing parameters.
These bugs were annoying but never subtle. They threw errors immediately. Fix, refresh, continue.
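One way to make these loud errors rarer is to centralize the injection so components can't forget the seeds. A hypothetical sketch (randomSeeds and the wrapper are my illustration, not the app's actual code):

```javascript
// Non-determinism lives only on the JS side; Dafny stays deterministic.
// Generates `count` random hue-sized integers in [0, 360).
function randomSeeds(count = 5) {
  return Array.from({ length: count }, () => Math.floor(Math.random() * 360));
}

// Wrap an action creator that needs fresh randomness, so every call
// site gets seeds appended automatically.
function withSeeds(actionCreator) {
  return (...args) => actionCreator(...args, randomSeeds());
}
```

A component would then call the wrapped creator with only the domain arguments, and the seeds parameter can never go missing.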
Dear Diary
Here’s the thing about these integration bugs: they’re loud. The app either works or it throws an error in your face. There’s no “works in dev, breaks in prod after 1000 users” situation. Once the integration layer is fixed and the app runs, I can trust the verified logic underneath. The bridge is finite and visible. The bugs I’m not shipping — state corruption, invariant violations — those are the invisible ones. And those are gone by construction.
The Ghost Invariant
With the UI complete (and now delightfully providing me with the exact functionality I have been looking for for months), one thing caught my attention:
function Apply(m: Model, a: Action): Model {
if Inv(m) then CWSpec.Apply(m, a) else m // Runtime check!
}
The invariant was being checked on every action at runtime. But wait — we just spent 20 minutes proving that the invariant is always preserved. If Init() satisfies Inv, and every Step preserves Inv, then by induction all reachable states satisfy Inv. The runtime check always passes. This is unnecessary and contradicts the core promise: prove once, trust forever. We were paying for the proof and checking at runtime.
The fix was two lines:
- Make Inv a ghost predicate (erased at compile time — it only exists for verification):
ghost predicate Inv(m: Model) { ... }
- Remove the redundant check:
function Apply(m: Model, a: Action): Model {
CWSpec.Apply(m, a)
}
Ran dafny verify — all 26 proof obligations still passed. The proofs never depended on the extra runtime check; invariant preservation is already established for Apply itself.
Note that there are cases where we might want extra runtime checks. One example: if there is low trust in the client’s initial state, a runtime check ensures all invariants hold for that state before the app proceeds. So runtime checks are not to be universally avoided; they should just be added only when circumstances require them.
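Such a low-trust boundary check might look like this hypothetical sketch, where invHolds mirrors part of the spec's Inv in plain JavaScript (names and shape are illustrative):

```javascript
// Partial mirror of the spec's Inv: exactly 5 colors, hues in [0, 360).
// Runs once, at the boundary, on untrusted persisted state.
function invHolds(model) {
  return model != null
      && Array.isArray(model.colors)
      && model.colors.length === 5
      && model.colors.every(c => c.h >= 0 && c.h < 360);
}

// Trust persisted state (e.g. from localStorage) only if the invariant
// holds; otherwise fall back to the verified initial state.
function loadInitialModel(persisted, freshInit) {
  return invHolds(persisted) ? persisted : freshInit();
}
```

After this one check, the proved transitions keep the invariant true forever, so no per-action check is needed.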
What I Shipped (And What I Didn’t)
I shipped a minimalistic color palette generator app that works. The verified state management does exactly what the spec says.
What I didn’t ship:
- State corruption bugs — impossible by construction (caveat: only the ones included in the spec)
- Edge cases I forgot to test — the proofs cover all inputs, not just the ones I imagined
- Silent failures — if something breaks, it breaks loudly at the integration layer
What I did ship:
- A broken undo/redo — every slider tick creates a history entry, making undo tedious. This isn’t a bug in the verified logic; it’s exactly how the logic says it should work. It’s a gap in what I specified: I never defined what constitutes a meaningful undo step. That’s a design decision, not a proof. This is a WIP.
The methodology held up. The friction was real — and largely due to the immaturity of the process — but it feels surmountable. And the bugs that remain are the visible kind — UI misalignments, missing features, design gaps. Not the invisible kind that surface after user #4,382 hits a state you never imagined or tested.
Try It Yourself
If you want to build something with verified state:
- Start with the spec. Spend time here. It’s worth it.
- Hook into UI before proofs — catch intent mismatches early.
- Expect integration friction — document what you learn, and submit an issue or PR.
And, do get in touch with questions or suggestions!
This is early. The tooling will get better. But the core idea — specify upfront, prove exhaustively, ship without fear — that part works today.