I commissioned a build last week. Eleven slices, multiple verifiers, a real piece of infrastructure that ten other things now sit on top of. The agent on the other side read my one-page spec and refused to start.

The refusal was three lines.

You declared this as C-Heavy. You named no security verifier. You listed a standard that doesn't exist on disk. Resubmit.

What the closeout layer hides

I'd been shipping work for months where "this looks good to me" was the gate. Author wrote the thing, author marked it done, author moved on. Once I started watching closely I could see the failure mode: the closeout layer was outrunning the verification layer. Things got shipped that weren't shipped. Specs got accepted that hadn't been built. The agent layer mostly drifts in the direction of confident-sounding completion, because that's the local minimum.

The build I'm describing didn't drift, and I think I now understand why.

Three things did the load-bearing work

None of them on their own would have been enough.

Acceptance criteria as mechanical pass/fail. Not "this should feel coherent." Not "responsive across breakpoints." Specific things a verifier can run: this table exists, this endpoint returns 200, this visual baseline matches at this viewport, this row appears in the registry within 30 seconds of insert. Each criterion was either a probe, an observation, or a value check. Anything aspirational got bounced back at intake. The author of the spec (me) had to do the work of making the criteria testable before the executor would touch them. That work is annoying. It's also where most of the wins live, because vague criteria are how silent failure enters.

The author of the work cannot authorise their own completion. This sounds obvious. It is not how most agent setups work. By default, the model writing the code is also the model reporting whether the code works. Both jobs feel like the same job from inside the session. They are not the same job. The fix is mechanical: a different agent has to read the hand-back, run the named probes, and authorise GREEN, AMBER, or RED. The author hands back at GREY. They don't get to flip to GREEN. On the build I'm describing, one slice went to RED at attestation because the verifier ran the runtime probe and it failed where the author had claimed it passed. The work re-entered the loop. Nobody got grovellingly apologetic. The system just kept moving.

The work must pass a runtime probe, not just look right in the artefact. A migration that exists as a SQL file is not applied. A daemon defined in a launchd plist is not running. A row that says "completed" in a tracker is not state. The probe is the proof. For migrations, it's a row in the registry table. For daemons, it's launchctl print showing a live PID. For endpoints, it's a smoke request returning 200. For UI, it's the rendered DOM and a visual baseline match. Configuration in repo is not configuration that works at runtime. Anyone claiming "applied" has to reference the probe result, not the artefact.

Eleven slices. Single-pass on ten of them. Five iterations on slice five because mobile parity needed visual self-check at the breakpoint I'd forgotten to include. Six security amendments from a separate agent role got merged before main accepted the work. 520-plus visual baselines, 30-plus tests, no claim accepted without a paired evidence path. The contract closed itself the moment the last probe came back green, because there was no human in the loop trying to decide if it was "good enough." It either passed the named gates or it didn't.

What I keep coming back to

How cheap the three constraints are to specify and how hard they are to bypass. Together they remove the part of a build where someone has to feel pressure to call it done. Pressure is how bad work ships. Mechanical gates absorb that pressure into the substrate. You stop being the person who decides; you become the person who designed the test.

This shape transfers. Boss to report. Founder to contractor. Future-you to past-you. Anything where one party commissions another to produce work and verify it. Three things on the front of every commission:

  1. Acceptance criteria must be testable mechanical pass/fail. Not aspirational language. Each one is something an independent reader can run or observe.
  2. The author of the work cannot authorise it as complete. Someone other than the author signs off, against the named criteria, with evidence paths.
  3. The work must pass a runtime probe. Not "looks right in the doc." A live check that proves the work works in the world.

The contracts I write now look different than they did three months ago. They take twenty more minutes to draft. They save days I don't have to spend chasing claims that turned out to mean nothing. The first time you get refused at intake, you'll be annoyed. The annoyance fades faster than you expect. The yield doesn't.

The agent on the other side wasn't being difficult. It was doing the work the system was asking it to do, which is to refuse vague specs. That's the only way the contract closes itself.