Treating autonomous agents as untrusted participants: what the Claude Code harness suggests for on-chain mechanism design

This continues a line I have been posting here since early May about the gap between what a protocol can enforce on-chain and what actually happens off-chain, and how good mechanism design narrows that gap by making the dishonest action unprofitable rather than by trusting anyone to be honest. A recent event outside our usual subject matter gave me a clean, large external example of the same principle, and I think it is worth bringing back on topic because autonomous agents are about to become first-class participants in the systems we design here.

In late March the full source of an AI coding tool was exposed by accident, and several groups published analyses of it. The detail relevant to us is structural. The part of the system that calls the model and decides what to do is tiny: by one community estimate, under two percent of the codebase. I would not lean on the exact figure, because it depends entirely on how you classify code, but the qualitative split is not in dispute. The overwhelming majority of the engineering is not the intelligence. It is the apparatus around the intelligence: a default-deny permission layer between the agent and any state-changing action, a context-management pipeline so the agent does not lose track of its objective, isolation so that parallel agents cannot corrupt each other, and explicit checkpoints where a privileged action is held for approval.
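To make the default-deny and held-for-approval ideas concrete, here is a minimal sketch in Python. It is my own illustration, not code from the exposed tool: `PermissionLayer`, `Decision`, `execute`, and the action names are all hypothetical. The point it shows is that anything not explicitly listed is refused, and privileged actions are parked until someone approves them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    HOLD = "hold"   # privileged action: wait for explicit approval
    DENY = "deny"


@dataclass
class PermissionLayer:
    # Hypothetical rule sets; the action names used below are illustrative only.
    allowed: set = field(default_factory=set)
    needs_approval: set = field(default_factory=set)

    def check(self, action: str) -> Decision:
        if action in self.needs_approval:
            return Decision.HOLD
        if action in self.allowed:
            return Decision.ALLOW
        # Default-deny: anything not explicitly listed is refused.
        return Decision.DENY


def execute(layer: PermissionLayer, action: str, approved: bool = False) -> str:
    """Run an action only if the layer allows it, or holds it and it was approved."""
    decision = layer.check(action)
    if decision is Decision.ALLOW:
        return "executed"
    if decision is Decision.HOLD:
        return "executed" if approved else "held"
    return "denied"


layer = PermissionLayer(allowed={"read_file"}, needs_approval={"write_file"})
print(execute(layer, "read_file"))    # listed as allowed
print(execute(layer, "write_file"))   # privileged, no approval yet
print(execute(layer, "delete_repo"))  # never listed, so refused
```

Note that the agent's own judgment never appears in `check`. The layer is deterministic and sits outside whatever proposes the action, which is the property that matters for the argument here.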

Read that as a mechanism design problem and it is familiar. The agent is a participant that will take the locally attractive action, including the harmful one, unless the surrounding structure removes the reward for doing so. The permission layer is a clearing rule. The isolation is the same property we want when we prevent one actor’s pending action from leaking into another’s. The held-for-approval checkpoint is a commitment device. None of the apparatus tries to make the participant more virtuous. It assumes self-interest and fallibility as given and constrains the action space until the bad outcomes stop paying.

This is the position I have been arguing with respect to the off-chain gap. You cannot close it by asking participants to behave, because the incentive to defect is structural and intentions are not load-bearing. You close it by changing the structure so the defection is not profitable. I have been referring to this stance as augmenting the invariant rather than replacing the participant, and I think the leaked harness is an unusually concrete demonstration of it at a scale most of us do not get to inspect.

The reason this belongs on ethresearch rather than in a general AI venue is the direction it is heading. Autonomous agents are already acting as searchers, solvers, and intent executors, and the share of on-chain activity initiated by non-human participants is rising. We tend to model those agents as rational and well-specified. The harness analysis is a reminder that the people running real agents in production do not trust them that way at all. They wrap them in deterministic constraints because they expect the agent to occasionally do the wrong thing with full confidence.

If that is the right operating assumption, it changes how we should specify mechanisms that agents participate in. A few questions I do not have settled answers to:

Should incentive-compatibility analysis for agent-facing mechanisms include a fallibility term, where the participant plays a non-best-response action with non-trivial probability, rather than assuming a clean rational actor? The standard equilibrium argument weakens if a meaningful fraction of participants are confidently wrong rather than strategically adversarial.
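A toy version of what a fallibility term could look like, as a sketch only. Assume a two-action game in which the counterparty intends to play "honest" but plays "defect" by mistake with probability eps. The payoff numbers and the function names below are made up for illustration. The question is how large eps can get before honesty stops being my best response. This is close in spirit to trembling-hand arguments from game theory, though I am not claiming it is the right formalization for any real mechanism.

```python
def expected_payoff(payoff, my_action, eps):
    """Expected payoff when the counterparty plays 'honest' with probability
    1 - eps and 'defect' (by mistake, not by strategy) with probability eps."""
    return ((1 - eps) * payoff[(my_action, "honest")]
            + eps * payoff[(my_action, "defect")])


def honest_is_best_response(payoff, eps):
    """True if playing honest is at least as good as defecting at error rate eps."""
    return (expected_payoff(payoff, "honest", eps)
            >= expected_payoff(payoff, "defect", eps))


def max_tolerable_error(payoff):
    """Largest eps at which honest is still a best response.

    gain: what honesty earns over defection when the counterparty is honest.
    loss: what honesty costs relative to defection when the counterparty errs.
    Honest is a best response iff (1 - eps) * gain >= eps * loss.
    """
    gain = payoff[("honest", "honest")] - payoff[("defect", "honest")]
    loss = payoff[("defect", "defect")] - payoff[("honest", "defect")]
    if gain < 0:
        raise ValueError("honest is not a best response even at eps = 0")
    if loss <= 0:
        return 1.0  # honesty is never punished by the counterparty's errors
    return gain / (gain + loss)


# Hypothetical payoffs to me, keyed by (my action, counterparty action).
PAYOFF = {
    ("honest", "honest"): 3,
    ("defect", "honest"): 2,
    ("honest", "defect"): -5,
    ("defect", "defect"): 0,
}

threshold = max_tolerable_error(PAYOFF)
print(f"honest stays a best response while eps <= {threshold:.4f}")
```

With these numbers the mechanism is incentive compatible in the clean rational model (eps = 0) but stops being so once counterparties err a bit more than one time in six. A mechanism whose honest equilibrium only survives at very small eps is arguably fragile in an agent-heavy environment, even if the textbook analysis says it is fine.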

When the harness that constrains an agent lives off-chain and the mechanism it participates in lives on-chain, we have reproduced the same on-chain/off-chain gap one level up. The constraint and the action are enforced in different trust domains. Is there a design where the agent’s permission envelope is itself committed on-chain, so that the constraint and the action share an enforcement domain?
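One shape such a design could take, sketched off-chain in Python so the logic is easy to read. Everything here is an assumption for illustration: the envelope fields (`allowed_actions`, `max_value`), the function names, and the use of SHA-256 as a stand-in for whatever hash the chain actually provides. The operator publishes a hash of the envelope, and the mechanism accepts an action only if the revealed envelope matches the committed hash and the action falls inside it.

```python
import hashlib
import json


def commit_envelope(envelope: dict) -> str:
    """Hash a permission envelope. Canonical JSON (sorted keys, fixed
    separators) so the same envelope always produces the same digest."""
    encoded = json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def action_permitted(commitment: str, envelope: dict, action: dict) -> bool:
    """Stand-in for the on-chain check: the revealed envelope must match the
    stored commitment, and the action must fall inside that envelope."""
    if commit_envelope(envelope) != commitment:
        return False  # revealed envelope is not the one that was committed
    if action["type"] not in envelope["allowed_actions"]:
        return False  # default-deny on action type
    return action["value"] <= envelope["max_value"]


# Hypothetical envelope for a solver agent.
envelope = {"allowed_actions": ["swap"], "max_value": 100}
commitment = commit_envelope(envelope)  # this digest is what would live on-chain

print(action_permitted(commitment, envelope, {"type": "swap", "value": 50}))
print(action_permitted(commitment, envelope, {"type": "swap", "value": 500}))

# Loosening the envelope after the fact fails the commitment check.
tampered = dict(envelope, max_value=10**6)
print(action_permitted(commitment, tampered, {"type": "swap", "value": 500}))
```

This commitment is binding but not hiding (there is no salt), which seems acceptable if the envelope is meant to be public anyway. It also says nothing about who is allowed to update the commitment, and that is probably where the hard design questions are.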

And the inverse, which is the part I find most interesting. The harness pattern was discovered by people who could not improve the core component and had to build everything around it. Mechanism design has the same shape. We cannot make participants honest, so we build the structure that makes honesty the profitable move. If the two fields are solving the same problem under different names, what does on-chain mechanism design already know that agent-harness engineering is currently rediscovering by hand?

I would be interested in whether others here see the agent-as-untrusted-participant framing as a useful extension of the incentive-compatibility toolkit, or as a category error. Pushback welcome.