Operon v0.25: The Compile–Decompile Loop

Why comparing agent frameworks requires a round-trip through structure, and what Scion taught us about isolation

Bogdan Banu · March 2026 · github.com/coredipper/operon

Release: v0.25.1
Abstract

The multi-agent ecosystem now has at least six serious orchestration frameworks — Swarms, DeerFlow, AnimaWorks, Ralph, A-Evolve, and Scion — each with different coordination philosophies. Operon v0.25 introduces an evaluation harness that answers a question none of them can answer on their own: does structural guidance actually reduce risk? The key mechanism is a compile–decompile round-trip: compile an Operon organism into any framework’s native config, parse it back through an adapter into a source-agnostic intermediate representation, then apply epistemic bounds as a structural linter. This post explains why that loop matters, what the evaluation results show, and what Google’s Scion project reveals about the future of agent isolation.

1. The Round-Trip That Makes Comparison Possible

If you want to compare agent frameworks, you need a common language. Not a common API — the frameworks are too different for that — but a common structural representation that captures what matters: how many agents, how they communicate, and what topology they form.

Operon v0.24 introduced ExternalTopology: a frozen dataclass that any framework’s config can be parsed into. v0.25 closes the loop by adding four compilers that go the other direction: from Operon’s internal representation out to each framework’s native format. This creates a bidirectional bridge:

SkillOrganism
      │ compile
      ▼
Framework Config   (Swarms dict, DeerFlow session, Ralph events, Scion grove)
      │ decompile
      ▼
ExternalTopology   (source-agnostic IR)
      │ analyze
      ▼
AdapterResult      (risk score, warnings, topology advice)
      │ derive
      ▼
RunMetrics         (success probability, token cost, latency, interventions)

The compile step is lossy by design — each framework has different expressivity. A Swarms SequentialWorkflow doesn’t know about cognitive modes. A Ralph hat config doesn’t preserve timeout values. A DeerFlow session collapses all stages into a flat skill list. That’s fine. The point isn’t to preserve every detail; it’s to produce a config that each framework would actually accept, then measure what structural properties survive the round-trip.

The decompile step parses that native config back through the same adapter that handles real framework configs. This means the analysis pipeline doesn’t know or care whether it’s looking at a config that came from Operon or one that was written by hand in Swarms. Same code path, same epistemic bounds, same risk scoring.

Why This Matters

The round-trip turns a subjective question (“which framework is better?”) into a structural one (“which framework preserves more topology information, and does that reduce the risk score?”). The answer is measurable and reproducible.

2. Four Compilers, Six Adapters

Each compiler is a pure function: SkillOrganism → dict. No framework imports. No side effects. The output is a plain, JSON-serializable dictionary that matches what each framework expects as input.

Compiler               | Output format           | Key mapping
-----------------------|-------------------------|------------
organism_to_swarms()   | Swarms workflow config  | Stages → agents, order → edges, cognitive mode → GPT/Claude model tier
organism_to_deerflow() | DeerFlow session config | Lead stage → assistant, rest → sub-agents with skill lists
organism_to_ralph()    | Ralph hat config        | Stages → hats with event transitions, observational mode → backpressure gates
organism_to_scion()    | Scion grove config      | Stages → containerized agents with git worktree isolation + dedicated watcher
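The "plain, JSON-serializable dictionary" contract can be checked mechanically. A minimal sketch; the dict shape here is illustrative, borrowing the same keys ("workflow_type", "agents", "edges") that the Swarms round-trip example below consumes:

```python
import json

# Illustrative compiled output -- not the real organism_to_swarms() result,
# just a dict with the same top-level keys the adapter example uses.
compiled = {
    "workflow_type": "sequential",
    "agents": [{"name": "researcher"}, {"name": "writer"}],
    "edges": [["researcher", "writer"]],
}

# The compiler contract: plain data only, so a JSON round-trip is lossless.
assert compiled == json.loads(json.dumps(compiled))
```

Because the output is plain data, any framework (or any test harness) can consume it without importing Operon.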

The six adapters go the other direction. Each takes a framework’s native config dict and returns an ExternalTopology:

# Swarms round-trip (import path assumed to mirror the organism_to_scion
# example in section 4; the actual exports may live elsewhere)
from operon_ai.convergence import (
    analyze_external_topology,
    organism_to_swarms,
    parse_swarm_topology,
)

compiled = organism_to_swarms(organism)
topology = parse_swarm_topology(
    pattern_name=compiled["workflow_type"],
    agent_specs=compiled["agents"],
    edges=compiled["edges"],
)
result = analyze_external_topology(topology)
print(f"Risk score: {result.risk_score:.2f}")

Every adapter also has a *_to_stages() function that maps framework-native concepts back to Operon’s SkillStage list, enabling full bidirectional template exchange. A Swarms pattern discovered in production can be imported into Operon’s pattern library, analyzed, and re-exported to DeerFlow or Ralph.

3. The Evaluation Harness: Does Guidance Help?

The central question behind v0.25 is empirical: does Operon’s structural guidance actually reduce risk, or do agents self-organize effectively without it?

The evaluation harness answers this by running 20 benchmark tasks across 7 configurations. Each task has a difficulty tier (easy, medium, hard), a set of required agent roles, and an expected topology shape (sequential, parallel, or mixed). The configurations span guided and unguided deployments:

# | Configuration      | Guided? | Compiler
--|--------------------|---------|---------
1 | Operon adaptive    | Yes     | advise_topology + adaptive_skill_organism
2 | Swarms auto        | No      | organism_to_swarms
3 | DeerFlow default   | No      | organism_to_deerflow
4 | AnimaWorks default | No      | parse_animaworks_org
5 | Ralph default      | No      | organism_to_ralph
6 | Scion unguided     | No      | organism_to_scion (no watcher)
7 | Scion + Operon     | Yes     | organism_to_scion + watcher

The MockEvaluator: Real Analysis, Synthetic Execution

The evaluator doesn’t call LLMs. Instead, it uses the compile–decompile round-trip to derive metrics from real structural analysis. For each task × config pair:

  1. Build a SkillOrganism from the task’s required roles
  2. If guided, run advise_topology(error_tolerance=0.01) to inform stage mode assignment
  3. Compile through the config’s compiler to a native dict
  4. Parse back through the adapter to ExternalTopology
  5. Run analyze_external_topology() to get the risk score
  6. Derive metrics: success probability = 1 − risk_score, token cost from stage count, latency from sequential overhead

The noise is controlled: each task-config pair gets a deterministic RNG seeded with SHA-256(seed:task_id:config_id). Same seed, same results. The harness produces ranking tables, per-config aggregate metrics with Wilson 95% confidence intervals, and structural variation summaries.
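Two mechanical pieces of the harness can be sketched directly — a minimal illustration under assumed helper names (the post doesn't show Operon's internals): the SHA-256-derived per-pair RNG and the Wilson 95% interval used in the aggregate tables.

```python
import hashlib
import math
import random

def pair_rng(seed: str, task_id: str, config_id: str) -> random.Random:
    # Deterministic noise: derive a 64-bit seed from SHA-256("seed:task:config")
    digest = hashlib.sha256(f"{seed}:{task_id}:{config_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def wilson_95(successes: int, n: int) -> tuple[float, float]:
    # Wilson score interval for a binomial proportion at z = 1.96
    z, p = 1.96, successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Same (seed, task, config) triple -> identical noise stream, hence
# reproducible rankings across runs
assert pair_rng("7", "t01", "swarms").random() == pair_rng("7", "t01", "swarms").random()
```

The Wilson interval is a deliberate choice over the normal approximation: with only 20 tasks per configuration, success proportions near 0 or 1 would otherwise produce degenerate confidence bounds.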

What the Live Evaluation Found

We ran the evaluation with real LLM providers: Gemini 2.0 Flash (API), Claude Code (CLI), and Codex (CLI). Guided multi-stage pipelines showed a consistent +6.2-point quality improvement over unguided configurations. Single-agent CLI execution showed no effect — confirming that structural guidance helps when there is topology to guide, and has no effect on single-node topologies. The compile–decompile round-trip revealed where structural information gets lost: guidance affects stage mode assignment before compilation, but external compilers discard that information, producing identical topologies for guided and unguided configs.

Live Results: Three Providers, Four Tasks

We ran 4 benchmark tasks (3 easy sequential, 1 hard parallel) through guided and unguided configurations across three real providers. Guided configs use distinct fast/deep models (e.g. gemini-2.0-flash / gemini-2.5-flash); unguided use the same model for both nuclei.

Provider   | Config   | Success | Quality | Tokens | Latency | Risk
-----------|----------|---------|---------|--------|---------|------
Gemini API | unguided | 25%     | 0.375   | 4,713  | 12.7s   | 0.170
Gemini API | guided   | 38%     | 0.438   | 5,212  | 12.6s   | 0.170
Claude CLI | unguided | 0%      | 0.375   | 369    | 26.6s   | 0.050
Claude CLI | guided   | 12%     | 0.438   | 350    | 24.2s   | 0.050
Codex CLI  | unguided | 25%     | 0.625   | 156    | 31.8s   | 0.050
Codex CLI  | guided   | 25%     | 0.625   | 156    | 29.2s   | 0.050

Guidance effect: Gemini and Claude both show a +6.2-point quality improvement with guidance (0.375 → 0.438). Codex shows no difference — expected, since CLI providers run as single-agent (guidance affects multi-stage topology, not single-node execution).

Cross-provider differences are larger than guidance effects. Codex scored highest quality (0.625) despite being the simplest topology. Gemini was fastest (12.7s). Claude produced the best single result (1.00 on guided translation) but was slowest (24–27s). The choice of LLM matters more than the topology configuration for simple tasks — but for complex multi-stage pipelines, structural guidance provides a consistent quality floor.

Risk scores reflect real topology differences. CLI providers get risk = 0.050 (single agent, minimal topology). Gemini multi-stage pipelines get risk = 0.170 (2-stage sequential with error amplification). The risk score correctly identifies that multi-stage pipelines have more structural risk than single-agent execution.

Credit Assignment

Beyond aggregate risk, the harness includes credit assignment: for each run, it attributes outcomes to individual stage contributions. Which stage drove the risk score up? Which one reduced it? This lets you identify structural bottlenecks — the single agent in a pipeline whose error amplification dominates the topology’s risk profile.
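One plausible way to read "credit assignment" structurally is leave-one-out attribution. This is a toy sketch under an assumed compounding risk model, not necessarily Operon's actual algorithm:

```python
def chain_risk(error_rates: list[float]) -> float:
    # Toy model: per-stage error rates compound along a sequential chain
    survive = 1.0
    for err in error_rates:
        survive *= 1.0 - err
    return 1.0 - survive

def stage_credit(error_rates: list[float]) -> list[float]:
    # Leave-one-out attribution: how much total risk disappears when
    # each stage is removed from the chain
    base = chain_risk(error_rates)
    return [
        base - chain_risk(error_rates[:i] + error_rates[i + 1:])
        for i in range(len(error_rates))
    ]

credit = stage_credit([0.02, 0.15, 0.03])
# The middle stage's amplified error dominates the topology's risk profile
assert credit[1] == max(credit)
```

Whatever the exact attribution scheme, the output is the same shape: a per-stage score that singles out the bottleneck agent.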

4. What Scion Reveals About Isolation

Google’s Scion is the newest entrant in this space, and it approaches multi-agent orchestration from a fundamentally different angle. Where Swarms and DeerFlow focus on in-process coordination, Scion starts with container isolation: each agent runs in its own Docker container with its own git worktree and isolated credentials.

Scion’s philosophy is “less is more” — fewer agents, maximum isolation per agent, minimal shared state. This is a direct contrast to the Swarms approach of orchestrating dozens of lightweight agents in a single process. Both are valid; they optimize for different failure modes.

Scion vs. Swarms: Two Philosophies

Dimension                  | Scion                                          | Swarms
---------------------------|------------------------------------------------|-------
Isolation model            | Container + git worktree per agent             | In-process, shared memory
Agent count preference     | Few, heavily isolated                          | Many, lightweight
Communication              | Named message channels                         | Direct function calls / edges
Failure mode optimized for | Blast radius (one agent can't corrupt another) | Coordination overhead (minimize handoff latency)
Observability              | Dedicated watcher agent + OTEL telemetry       | Framework-level logging

The Scion compiler in Operon (organism_to_scion()) produces a “grove” config: each stage becomes a containerized agent with git_worktree: true and credentials: "isolated". A dedicated operon-watcher agent is automatically injected for telemetry monitoring and convergence detection. The watcher is the only agent with shared credentials — it needs to observe all others.

# Scion compilation
from operon_ai.convergence import organism_to_scion

grove = organism_to_scion(organism, runtime="docker")

# Each stage agent gets:
# {
#   "name": "researcher",
#   "template": {"system_prompt": "...", "skills": [...]},
#   "runtime_profile": "medium",
#   "isolation": {
#     "git_worktree": True,
#     "credentials": "isolated"
#   }
# }

# Plus an injected watcher:
# {
#   "name": "operon-watcher",
#   "isolation": {"git_worktree": False, "credentials": "shared"}
# }

The Scion Trade-Off in Numbers

When the evaluation harness compares “Scion unguided” (config 6) against “Scion + Operon” (config 7), it measures whether adding Operon’s structural guidance on top of Scion’s container isolation produces measurably different risk profiles. The two address orthogonal concerns.

These are complementary, not competing. A topology with a 6.4x error amplification bound is dangerous whether its agents run in Docker containers or not. Container isolation prevents runtime cascades; structural analysis prevents logical cascades. You want both.

5. Implications for the Agent Orchestration Ecosystem

Framework authors get a free structural linter

An adapter is a single parse function. Once your framework can produce an ExternalTopology, you get error amplification bounds, sequential penalty estimates, tool density scores, and topology recommendations — all computed from Operon’s epistemic layer, with no changes to your own code.
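A minimal sketch of what such an adapter looks like. The ExternalTopology stand-in here is hypothetical — the real frozen dataclass's fields aren't shown in this post — and parse_myframework is an invented example framework, but the shape of the work is the point: one pure function from a native config dict to the IR.

```python
from dataclasses import dataclass

# Hypothetical stand-in for Operon's ExternalTopology; the real frozen
# dataclass may carry different fields.
@dataclass(frozen=True)
class ExternalTopology:
    pattern: str
    agents: tuple[str, ...]
    edges: tuple[tuple[str, str], ...]

def parse_myframework(config: dict) -> ExternalTopology:
    # The entire adapter: one pure parse over a framework-native config dict
    agents = tuple(a["name"] for a in config["agents"])
    edges = tuple((e["from"], e["to"]) for e in config.get("edges", []))
    return ExternalTopology(config.get("pattern", "sequential"), agents, edges)

topo = parse_myframework({
    "pattern": "sequential",
    "agents": [{"name": "plan"}, {"name": "build"}],
    "edges": [{"from": "plan", "to": "build"}],
})
```

No Operon imports, no framework imports, no side effects — which is why a framework author can ship one without taking on a dependency.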

Template exchange becomes framework-portable

A coordination pattern discovered in Swarms can be imported into Operon’s PatternLibrary, scored against historical run records, and re-exported as a DeerFlow session, a Ralph hat config, or a Scion grove. The pattern travels; the framework binding doesn’t.

Evaluation becomes apples-to-apples

The compile–decompile loop means every framework is evaluated against the same structural analysis pipeline. The risk score doesn’t depend on which framework compiled the config — it depends on the topology that emerged from the compilation. Framework A might produce a deeper chain than Framework B for the same task; the risk score will reflect that structural difference, not an implementation preference.

The “less is more” hypothesis is testable

Scion argues that fewer, more isolated agents are better. The Swarms ecosystem argues that many lightweight agents enable richer coordination. With the evaluation harness, this is no longer a philosophical debate. Compile the same 20 tasks through both, decompile, analyze, compare risk scores. The structural properties are measurable; the trade-offs are quantifiable.

The Deeper Point

Agent orchestration frameworks are not applications — they are compilers. They take a task description and produce a coordination topology. Operon treats them that way: the four compilers formalize the “code generation” step, the six adapters formalize the “decompilation” step, and the analysis pipeline is the type checker. The evaluation harness is a benchmark suite that tests the compiler’s output quality.

6. Future Work: Prompt Optimization and Workflow Generation

v0.25 also introduces two protocol interfaces for capabilities that are not yet fully implemented but whose contracts are now stable: prompt optimization and workflow generation.

These are deliberately defined as Protocols rather than abstract base classes. Any object with the right method signatures satisfies the contract — no inheritance required. This means DSPy optimizers, LangChain generators, or custom implementations can plug in without importing Operon.
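The mechanics of structural typing are worth seeing concretely. This is an illustrative sketch — the protocol name and method signature are assumptions, not v0.25's actual contract:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class PromptOptimizer(Protocol):
    # Hypothetical method signature; the real v0.25 protocol may differ
    def optimize(self, prompt: str) -> str: ...

class TrimOptimizer:
    # Satisfies the protocol structurally -- no inheritance, no Operon import
    def optimize(self, prompt: str) -> str:
        return " ".join(prompt.split())

opt: PromptOptimizer = TrimOptimizer()
assert isinstance(opt, PromptOptimizer)  # structural check via runtime_checkable
```

This is exactly why a DSPy optimizer or a LangChain generator can plug in as-is: if the method signatures line up, the type checker is satisfied without either library knowing Operon exists.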

7. The Numbers

Metric                    | v0.24.1                                           | v0.25.1
--------------------------|---------------------------------------------------|--------
Convergence modules       | 20                                                | 23
Tests                     | 1,474                                             | 1,530
Examples                  | 103                                               | 107
External frameworks       | 5 (Swarms, DeerFlow, AnimaWorks, Ralph, A-Evolve) | 6 (+Scion)
Compilers                 | 4                                                 | 4
Adapters                  | 5                                                 | 6
TLA+ specifications       | 4                                                 | 4
Benchmark tasks           | —                                                 | 20
Evaluation configurations | —                                                 | 7

8. Try It

pip install operon-ai==0.25.1

# Run the mock evaluation harness (no API keys needed)
python examples/104_evaluation_harness.py

# Run live evaluation with real LLMs (requires GEMINI_API_KEY or OPENAI_API_KEY)
# Also supports Claude CLI and Codex CLI if installed
set -a && source .env && set +a
python examples/107_live_evaluation.py

# Try prompt optimization protocols
python examples/105_prompt_optimization_interface.py

# Generate and register a workflow
python examples/106_workflow_generation_interface.py

The mock evaluation harness (example 104) requires no API keys — it uses structural analysis to derive synthetic metrics. The live evaluation (example 107) runs real LLM calls through Gemini API, Claude CLI, and Codex CLI with LLM-as-judge quality scoring.

Full documentation: coredipper.github.io/operon. Convergence companion paper: convergence.