Harness Engineering as Categorical Architecture

Paper 5 is on arXiv. The triple (G, Know, Φ) is the formalism behind every gate, certificate, and adapter Operon has shipped — here is what it claims, what it does not, and what it unlocks for the rest of the work.

Bogdan Banu · May 13, 2026

paper-track post · arXiv:2605.12239

Five papers in, the load-bearing object across Operon — certificates, gates, compilers, adapters, the whole runtime — finally has the formal name it has been wearing implicitly. Paper 5: Harness Engineering as Categorical Architecture (arXiv:2605.12239, posted today) takes the categorical Architecture triple (G, Know, Φ) from de los Riscos et al.’s ArchAgents framework and shows that the four pillars of agent externalization — Memory, Skills, Protocols, Harness — map onto the triple exactly, with no slack and no daylight.

This post is the trailing-edge companion to the paper. It is what I would tell someone who reads the abstract, raises an eyebrow, and asks “what does this actually do that the previous four papers did not?” The answer is short: it turns the per-paper engineering claims into one structural claim with a single preservation theorem, and it locates Operon’s contribution against neighbour formalisms (typed lambda calculi, fixed-substrate frameworks, output-validation harnesses) in a way that survives review.

The triple, in one breath

G is the syntactic wiring — the directed graph of stage names and edges, the part you see when you read a LangGraph spec or a Swarms GraphWorkflow or an organism YAML. It is the part everyone agrees is “the architecture” in the colloquial sense. It is also the least load-bearing component: two graphs with identical G can behave wildly differently if their Know and Φ differ.

Know is the knowledge component — not stored facts, but self-verifiable structural guarantees. Integrity gates, quality-based escalation, supported convergence checks. Every certificate Operon emits (behavioral_stability_windowed, langgraph_state_integrity, dna_repair, dspy_compile_pinned_inputs, agentflow_evolve_pinned_inputs) is a Know-component object. The closure property the paper proves is that Know is preserved by Operon’s compiler functors: rewriting the graph does not break the certificate predicate.

Φ is the interface mapping — the profunctor that says how stages of a given mode (fast / fuzzy / deep) bind to model tiers. In current Operon, Φ is hand-coded in the compiler; in a future Operon it could be learned, but the contract is that Φ’s preservation is what makes mode swaps safe.

The four-pillar mapping, stated bluntly: Memory is coalgebraic state living over Φ. Skills are operad-composed objects living in Know. Protocols are exactly G. The full Harness is the Architecture itself — the whole triple, not any one component. The naming convention I have been using on Twitter and in talks for two years (“harness engineering is its own discipline”) was waiting for this sentence.
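To make the three components concrete, here is a minimal sketch of the triple as a plain data structure. Every name in it (`Architecture`, `modes`, `edges_closed`, the tier labels) is hypothetical and chosen for illustration; it is not Operon's API, and the stage-to-mode annotation is an assumption about where modes are recorded. It shows the point made above about G: two architectures can share identical wiring and still differ in Know and Φ.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Tuple

Edge = Tuple[str, str]


@dataclass
class Architecture:
    """Toy (G, Know, Phi) container. All names are hypothetical, not Operon's API."""
    stages: FrozenSet[str]                 # G: stage names
    edges: FrozenSet[Edge]                 # G: directed edges between stages
    modes: Dict[str, str]                  # assumed stage -> mode annotation
    know: Dict[str, Callable[["Architecture"], bool]] = field(default_factory=dict)
    phi: Dict[str, str] = field(default_factory=dict)  # Phi: mode -> model tier

    def wiring(self) -> Tuple[FrozenSet[str], FrozenSet[Edge]]:
        """Return only the syntactic part, G."""
        return (self.stages, self.edges)

    def tier_for(self, stage: str) -> str:
        """Resolve a stage to a model tier through its mode and Phi."""
        return self.phi[self.modes[stage]]

    def know_holds(self) -> bool:
        """Evaluate every Know predicate against this architecture."""
        return all(pred(self) for pred in self.know.values())


def edges_closed(arch: Architecture) -> bool:
    """Toy structural guarantee: every edge endpoint is a declared stage."""
    return all(s in arch.stages and d in arch.stages for s, d in arch.edges)


G_STAGES = frozenset({"plan", "act"})
G_EDGES = frozenset({("plan", "act")})
MODES = {"plan": "deep", "act": "fast"}

# Same G, different Know and Phi.
guarded = Architecture(G_STAGES, G_EDGES, MODES,
                       know={"edges_closed": edges_closed},
                       phi={"deep": "large", "fast": "small"})
bare = Architecture(G_STAGES, G_EDGES, MODES,
                    know={},
                    phi={"deep": "small", "fast": "small"})
```

Here `guarded.wiring() == bare.wiring()` holds, yet only `guarded` carries a checkable guarantee, and the two route the `plan` stage to different tiers.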

Why property preservation is the discriminating guarantee

The neighbour formalisms make adjacent claims. Liu’s typed lambda calculus λ_A gives type safety and termination, with empirical work showing 94.1% of GitHub agent configurations are structurally incomplete under that formalization. That is a strong result on a different axis. Liu binds at the language layer; Paper 5 binds at the compilation layer. Type safety says “this expression’s shape will not crash.” Property preservation says “this certificate predicate’s truth value survives every compiler functor in Comp(Operon → X).” The two are stackable, not competitive — an obvious follow-up paper is to land both formalisms in one project.

Where Operon’s formalism distinguishes itself is on the question of what survives the harness rewrite. Operon’s preservation is structural replay: the compiler checks identity-of-Know and verifier-replay-of-Know, not output-layer correctness, not model behaviour, not human-readability. This is a smaller claim than the field default (“the agent will produce a correct answer”) and a larger claim than the typed-lambda default (“the program is well-formed”). It is the right scope for the harness-engineering problem because the harness is the only layer where you can structurally guarantee anything in a system whose terminal layer is a non-deterministic model.
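The two checks named above, identity-of-Know and verifier-replay-of-Know, can be sketched as follows. This is a toy under stated assumptions: the `Certificate` fields, the `toy_windowed_stability` verifier, and both compilers are invented for illustration and do not reproduce Operon's certificate schema or its verifiers.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Certificate:
    """Hypothetical certificate: a named predicate plus the evidence it checks."""
    theorem: str
    evidence: Tuple[float, ...]
    threshold: float


def verify_windowed(cert: Certificate) -> bool:
    """Toy verifier: every recorded value stays at or below the threshold."""
    return all(x <= cert.threshold for x in cert.evidence)


VERIFIERS: Dict[str, Callable[[Certificate], bool]] = {
    "toy_windowed_stability": verify_windowed,
}


def replay(cert: Certificate) -> bool:
    """Re-run the registered verifier for this certificate."""
    return VERIFIERS[cert.theorem](cert)


def compile_faithful(stages: Sequence[str], certs: Sequence[Certificate]) -> dict:
    """Rewrites the graph for a made-up target, carrying certificates unchanged."""
    return {"nodes": [f"node::{s}" for s in stages], "certificates": list(certs)}


def compile_lossy(stages: Sequence[str], certs: Sequence[Certificate]) -> dict:
    """Same rewrite, but drops the evidence from each certificate."""
    stripped = [replace(c, evidence=()) for c in certs]
    return {"nodes": [f"node::{s}" for s in stages], "certificates": stripped}


def identity_preserved(before: Sequence[Certificate], after: Sequence[Certificate]) -> bool:
    """Identity check: the certificates are the same objects by value."""
    return list(before) == list(after)


def replay_preserved(before: Sequence[Certificate], after: Sequence[Certificate]) -> bool:
    """Replay check: each certificate's verifier verdict is unchanged."""
    return len(before) == len(after) and all(
        replay(b) == replay(a) for b, a in zip(before, after)
    )


certs: List[Certificate] = [Certificate("toy_windowed_stability", (0.1, 0.2), 0.5)]
faithful = compile_faithful(["plan", "act"], certs)
lossy = compile_lossy(["plan", "act"], certs)
```

In this toy, the lossy compiler still passes replay (an empty evidence window verifies vacuously) but fails identity, which is one reason to run both checks rather than either alone.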

The reference implementation: five compiler functors, three certificates, one preservation theorem

The paper validates the correspondence with a reference implementation in the operon repo. Five external targets: Swarms, DeerFlow, Ralph, Scion, and LangGraph. Each gets a compiler functor — parse_X_topology() on the way in, organism_to_X() on the way out, plus compile_guarded_graph for the LangGraph-native execution path. Three named certificate types are validated for preservation by identity and replay across all five targets: behavioral_stability_windowed, langgraph_state_integrity, dna_repair.
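The in-and-out shape of a compiler functor can be sketched with a round trip against an imaginary target. The target format and both function bodies below are assumptions: `parse_toy_topology` and `organism_to_toy` only follow the `parse_X_topology()` / `organism_to_X()` naming pattern and are not functions from the operon repo.

```python
from typing import Any, Dict

Organism = Dict[str, Any]


def organism_to_toy(organism: Organism) -> Dict[str, Any]:
    """Outbound direction: emit a flat spec for a made-up 'toy' target."""
    return {
        "agents": sorted(organism["stages"]),
        "flows": [f"{src}->{dst}" for src, dst in sorted(organism["edges"])],
        "certificates": list(organism["certificates"]),
    }


def parse_toy_topology(spec: Dict[str, Any]) -> Organism:
    """Inbound direction: rebuild the organism from the toy spec."""
    edges = {tuple(flow.split("->")) for flow in spec["flows"]}
    return {
        "stages": set(spec["agents"]),
        "edges": edges,
        "certificates": list(spec["certificates"]),
    }


organism: Organism = {
    "stages": {"ingest", "review", "publish"},
    "edges": {("ingest", "review"), ("review", "publish")},
    # Certificate names are taken from the post; the payload is reduced to a label.
    "certificates": ["dna_repair", "langgraph_state_integrity"],
}

round_tripped = parse_toy_topology(organism_to_toy(organism))
```

The round trip returning an equal organism is the toy analogue of preservation by identity; it assumes stage names never contain the `->` separator.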

The LangGraph compiler in particular is worth flagging because it pays back the formal investment with concrete observability. It creates one LangGraph node per organism stage using the same per-stage method the native runtime uses — not a re-implementation, but a shared executor. That gives you LangGraph-native tracing, LangSmith integration, and langgraph-cli compatibility for free, without forking the harness logic. The categorical theorem is what makes this safe: the LangGraph rewrite is a compiler functor in Comp, so the Know-component is preserved, so the certificates emitted by the native runtime remain valid in the LangGraph wrapping.
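The shared-executor idea can be illustrated without LangGraph itself. In this sketch a plain dictionary of callables stands in for the node table, and the `Organism` class, `run_stage`, and `compile_nodes` are hypothetical names rather than Operon or LangGraph APIs. The point is only that each node delegates to the same per-stage method the native loop calls.

```python
from typing import Callable, Dict, List

State = Dict[str, List[str]]


class Organism:
    """Toy organism with one executor method shared by both execution paths."""

    def __init__(self, stages: List[str]) -> None:
        self.stages = list(stages)

    def run_stage(self, name: str, state: State) -> State:
        """The single per-stage executor: record the stage in the trace."""
        return {"trace": state["trace"] + [name]}

    def run_native(self, state: State) -> State:
        """Native path: call the executor directly, stage by stage."""
        for name in self.stages:
            state = self.run_stage(name, state)
        return state


def compile_nodes(organism: Organism) -> Dict[str, Callable[[State], State]]:
    """One node per stage; each node closes over the same run_stage method."""
    return {
        name: (lambda state, n=name: organism.run_stage(n, state))
        for name in organism.stages
    }


def run_wrapped(organism: Organism, state: State) -> State:
    """Wrapped path: drive the compiled nodes in stage order."""
    nodes = compile_nodes(organism)
    for name in organism.stages:
        state = nodes[name](state)
    return state


org = Organism(["plan", "act", "check"])
native_result = org.run_native({"trace": []})
wrapped_result = run_wrapped(org, {"trace": []})
```

Because no stage logic is duplicated in the wrapper, there is nothing in the wrapped path that could drift from the native one.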

One escalation experiment with real LLM agents (two models, one task) confirms that the quality-based escalation control path is model-parametric: the structural guarantee survives the model swap, which is exactly the substrate-independence claim under empirical pressure. The experiment is bounded, and the paper is honest about that. It is not the result that earns its keep; the framework is.
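What "model-parametric" means for an escalation path can be sketched with stub models. The `escalate` function, the scorer, and the stubs are all hypothetical; this is not the paper's experimental harness, only an illustration of a control path that takes its models as parameters.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Scorer = Callable[[str], float]


def escalate(task: str, tiers: List[Model], score: Scorer,
             threshold: float) -> Tuple[str, int]:
    """Try tiers in order and return (answer, tier index).

    Structural guarantee of this toy: the returned answer either clears
    the threshold or comes from the last tier.
    """
    if not tiers:
        raise ValueError("at least one model tier is required")
    answer = ""
    for index, model in enumerate(tiers):
        answer = model(task)
        if score(answer) >= threshold:
            return answer, index
    return answer, len(tiers) - 1


# Stub models standing in for real LLM calls.
def weak_a(task: str) -> str:
    return "unsure"


def strong_a(task: str) -> str:
    return f"answer to {task}: 42"


def weak_b(task: str) -> str:
    return "unsure"


def strong_b(task: str) -> str:
    return f"resolved {task}"


def quality(answer: str) -> float:
    """Toy quality score: anything other than 'unsure' counts as good."""
    return 0.0 if answer == "unsure" else 1.0


first_pair = escalate("q1", [weak_a, strong_a], quality, 0.5)
second_pair = escalate("q1", [weak_b, strong_b], quality, 0.5)
```

Both model pairs escalate to tier 1 for the same reason, because the decision depends only on the score and the threshold, not on which models fill the slots.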

Connected to: pinecones and SLAM

Two earlier posts on this blog set up the cross-domain context the paper now formally consumes. Pinecones and the Portable Certificate walks through the Marom, Tibbits, Zardini, Buehler preprint from materials engineering: the same compositional-verification framework, arrived at from a fundamentally different problem (biological mechanics → 4D-printed bilayer composites) and with a fundamentally different validation culture (substrate-validated rather than property-validated). That is the cross-domain hit that should reduce the prior on “this is just an LLM-internal trick.”

SLAM Already Solved Stagnation grounds the wedge mechanism: the structural pattern Operon’s pre-/post-guard implements is a discrete-state port of factor-graph fixed-lag smoothing from robotics SLAM (Kaess et al., recently re-framed by Dellaert as STAG). That post argued the gates implementation has the right citation lineage. This paper argues the gates implementation has the right formalism. SLAM gives the mechanism; Marom-Buehler give the cross-domain corroboration; Paper 5 gives the load-bearing theorem.

What this is not

Marking the bound clearly because the abstraction is attractive enough to invite overreach.

This is not a claim about output equivalence. Operon preserves recorded structural guarantees across compiler functors. It does not claim that two harnesses with the same Know will produce the same model output. They will not. The certificate predicate is the only thing that survives the rewrite; everything else is downstream of model non-determinism.
This is not a typed lambda calculus. Liu’s λ_A binds at the language layer with type-safety and termination as the guarantees. Paper 5 binds at the compilation layer with property-preservation as the guarantee. The two are complementary; stacking them is a clean follow-up paper, not a contradiction in this one.
This is not a replacement for output-validation harnesses. Guardrails AI and the validators-hub family operate downstream of the model output. Operon operates structurally over the harness. They are sibling layers. The paper’s §5 maps the relationship; this post is not the place to relitigate it.
This is not a claim that the empirical escalation experiment generalises. Two models, one task, one escalation control path. The result is “model-parametric in this experiment.” The framework’s value is independent of how that one experiment grows; the experiment is illustrative, not load-bearing.
This is not a wedge change. operon-langgraph-gates v0.1.0 ships StagnationGate and IntegrityGate; that scope is fixed. This paper sharpens the explanation of why those two gates are the right v0.1 surface (they are the cheapest non-trivial Know-component objects under the categorical lens) but does not move the v0.2 line.

What it does unlock

The formal payoff is that the next round of cross-framework adapters — the agentflow L1+L2 pair shipped in v0.39 and the gascity adapter in v0.38 — can be read as compiler-functor instances of preservation, not as bespoke engineering. The L2 hooks (Certificate.from_dspy_compile, Certificate.from_agentflow_compile) become provenance-binding morphisms in a specific subcategory of Comp; their cheap variants (recording-integrity) and their deferred heavy variants (re-execution-check) become two ends of the same morphism family. Each new framework integration is now “another arrow in this diagram,” not “another bespoke contract to write down.”
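One way to picture the cheap and heavy ends of that morphism family is with digests. This sketch assumes that "recording-integrity" means checking that the pinned inputs still match what was recorded, and that "re-execution-check" means re-running the compile step and comparing outputs; both readings are assumptions, and none of these functions are Operon's `Certificate.from_*` hooks.

```python
import hashlib
import json
from typing import Any, Callable, Dict

CompileFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def digest(obj: Any) -> str:
    """Stable SHA-256 digest of a JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def record(inputs: Dict[str, Any], compile_fn: CompileFn) -> Dict[str, str]:
    """Bind provenance: digest the pinned inputs and the compile output."""
    return {
        "inputs_digest": digest(inputs),
        "output_digest": digest(compile_fn(inputs)),
    }


def check_recording_integrity(cert: Dict[str, str], inputs: Dict[str, Any]) -> bool:
    """Cheap variant: the pinned inputs still match the recorded digest."""
    return cert["inputs_digest"] == digest(inputs)


def check_reexecution(cert: Dict[str, str], inputs: Dict[str, Any],
                      compile_fn: CompileFn) -> bool:
    """Heavy variant: also re-run the compile step and compare its output."""
    return (check_recording_integrity(cert, inputs)
            and cert["output_digest"] == digest(compile_fn(inputs)))


def compile_v1(inputs: Dict[str, Any]) -> Dict[str, Any]:
    return {"program": inputs["prompt"].upper()}


def compile_v2(inputs: Dict[str, Any]) -> Dict[str, Any]:
    return {"program": inputs["prompt"].lower()}


pinned = {"prompt": "p", "seed": 7}
cert = record(pinned, compile_v1)
```

The cheap check catches tampered inputs without running anything; only the heavy check notices that the compile step itself has changed, which is why it is plausibly the one that gets deferred.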

The methodological payoff is that the (G, Know, Φ) language gives an honest way to compare frameworks without picking a winner. A framework that ships only G (LangGraph circa 2024) has a real and valuable surface but cannot claim preservation. A framework that ships (G, Know) without Φ (early Operon) can claim preservation under fixed-mode execution but not under mode-swap. A framework that ships the full triple opens up the substrate-independence move. This is descriptive, not prescriptive: most production agent stacks do not need the full triple, and saying so is part of the paper’s honesty.
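The comparison above reduces to a small decision rule, sketched here. The `Claim` enum and `strongest_claim` are hypothetical names; the handling of combinations the paragraph does not discuss (for example Φ without Know) is an assumption, treated here as supporting no preservation claim.

```python
from enum import Enum


class Claim(Enum):
    NONE = "no preservation claim"
    FIXED_MODE = "preservation under fixed-mode execution"
    MODE_SWAP = "preservation under mode swap"


def strongest_claim(ships_g: bool, ships_know: bool, ships_phi: bool) -> Claim:
    """Map the components a framework ships to the strongest claim it can make."""
    if not (ships_g and ships_know):
        # Assumption: without both wiring and Know there is nothing to preserve.
        return Claim.NONE
    return Claim.MODE_SWAP if ships_phi else Claim.FIXED_MODE


g_only = strongest_claim(True, False, False)
g_and_know = strongest_claim(True, True, False)
full_triple = strongest_claim(True, True, True)
```

The rule ranks claims, not frameworks: a `Claim.NONE` result says nothing about whether the framework is the right choice for a given production stack.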

The strategic payoff is that the gates wedge (operon-langgraph-gates v0.1.0) now has a paper-grade citation it can point at when asked “why those two gates and not a different two?” The answer is short: integrity and stagnation are the two cheapest Know-component objects with non-trivial preservation properties under the LangGraph compiler functor. That sentence is now on arXiv, which is worth quite a lot for the next twelve months of conversations with LangGraph users, framework authors, and prospective collaborators.
