Operon v0.25: The Compile–Decompile Loop
Why comparing agent frameworks requires a round-trip through structure, and what Scion taught us about isolation
Release: v0.25.1

The multi-agent ecosystem now has at least six serious orchestration frameworks — Swarms, DeerFlow, AnimaWorks, Ralph, A-Evolve, and Scion — each with a different coordination philosophy. Operon v0.25 introduces an evaluation harness that answers a question none of them can answer on their own: does structural guidance actually reduce risk? The key mechanism is a compile–decompile round-trip: compile an Operon organism into any framework’s native config, parse it back through an adapter into a source-agnostic intermediate representation, then apply epistemic bounds as a structural linter. This post explains why that loop matters, what the evaluation results show, and what Google’s Scion project reveals about the future of agent isolation.
1. The Round-Trip That Makes Comparison Possible
If you want to compare agent frameworks, you need a common language. Not a common API — the frameworks are too different for that — but a common structural representation that captures what matters: how many agents, how they communicate, and what topology they form.
Operon v0.24 introduced ExternalTopology: a frozen dataclass
that any framework’s config can be parsed into. v0.25 closes the loop
by adding four compilers that go the other direction: from
Operon’s internal representation out to each framework’s
native format. This creates a bidirectional bridge:
SkillOrganism
↓ compile
Framework Config (Swarms dict, DeerFlow session, Ralph events, Scion grove)
↓ decompile
ExternalTopology (source-agnostic IR)
↓ analyze
AdapterResult (risk score, warnings, topology advice)
↓ derive
RunMetrics (success probability, token cost, latency, interventions)
The compile step is lossy by design — each framework has different
expressivity. A Swarms SequentialWorkflow doesn’t
know about cognitive modes. A Ralph hat config doesn’t preserve
timeout values. A DeerFlow session collapses all stages into a
flat skill list. That’s fine. The point isn’t to preserve
every detail; it’s to produce a config that each framework
would actually accept, then measure what structural properties
survive the round-trip.
The decompile step parses that native config back through the same adapter that handles real framework configs. This means the analysis pipeline doesn’t know or care whether it’s looking at a config that came from Operon or one that was written by hand in Swarms. Same code path, same epistemic bounds, same risk scoring.
Why This Matters
The round-trip turns a subjective question (“which framework is better?”) into a structural one (“which framework preserves more topology information, and does that reduce the risk score?”). The answer is measurable and reproducible.
2. Four Compilers, Six Adapters
Each compiler is a pure function: SkillOrganism → dict.
No framework imports. No side effects. The output is a plain, JSON-serializable
dictionary that matches what each framework expects as input.
| Compiler | Output Format | Key Mapping |
|---|---|---|
| organism_to_swarms() | Swarms workflow config | Stages → agents, order → edges, cognitive mode → GPT/Claude model tier |
| organism_to_deerflow() | DeerFlow session config | Lead stage → assistant, rest → sub-agents with skill lists |
| organism_to_ralph() | Ralph hat config | Stages → hats with event transitions, observational mode → backpressure gates |
| organism_to_scion() | Scion grove config | Stages → containerized agents with git worktree isolation + dedicated watcher |
The six adapters go the other direction. Each takes a framework’s
native config dict and returns an ExternalTopology:
# Swarms round-trip
compiled = organism_to_swarms(organism)
topology = parse_swarm_topology(
pattern_name=compiled["workflow_type"],
agent_specs=compiled["agents"],
edges=compiled["edges"],
)
result = analyze_external_topology(topology)
print(f"Risk score: {result.risk_score:.2f}")
Every adapter also has a *_to_stages() function that maps
framework-native concepts back to Operon’s SkillStage
list, enabling full bidirectional template exchange. A Swarms pattern
discovered in production can be imported into Operon’s pattern
library, analyzed, and re-exported to DeerFlow or Ralph.
3. The Evaluation Harness: Does Guidance Help?
The central question behind v0.25 is empirical: does Operon’s structural guidance actually reduce risk, or do agents self-organize effectively without it?
The evaluation harness answers this by running 20 benchmark tasks across 7 configurations. Each task has a difficulty tier (easy, medium, hard), a set of required agent roles, and an expected topology shape (sequential, parallel, or mixed). The configurations span guided and unguided deployments:
| # | Configuration | Guided? | Compiler |
|---|---|---|---|
| 1 | Operon adaptive | Yes | advise_topology + adaptive_skill_organism |
| 2 | Swarms auto | No | organism_to_swarms |
| 3 | DeerFlow default | No | organism_to_deerflow |
| 4 | AnimaWorks default | No | parse_animaworks_org |
| 5 | Ralph default | No | organism_to_ralph |
| 6 | Scion unguided | No | organism_to_scion (no watcher) |
| 7 | Scion + Operon | Yes | organism_to_scion + watcher |
The MockEvaluator: Real Analysis, Synthetic Execution
The evaluator doesn’t call LLMs. Instead, it uses the compile–decompile round-trip to derive metrics from real structural analysis. For each task × config pair:
- Build a SkillOrganism from the task’s required roles
- If guided, run advise_topology(error_tolerance=0.01) to inform stage mode assignment
- Compile through the config’s compiler to a native dict
- Parse back through the adapter to ExternalTopology
- Run analyze_external_topology() to get the risk score
- Derive metrics: success probability = 1 − risk_score, token cost from stage count, latency from sequential overhead
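The final derivation step can be sketched as a small pure function. This is an illustrative sketch only: the RunMetrics field names and the per-stage constants below are assumptions, not the harness’s actual values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunMetrics:
    # Stand-in for Operon's RunMetrics; the real fields may differ.
    success_probability: float
    token_cost: int
    latency_s: float

def derive_metrics(risk_score: float, n_stages: int, sequential_depth: int,
                   tokens_per_stage: int = 1200, hop_latency_s: float = 6.0) -> RunMetrics:
    # Success probability is the complement of the structural risk score;
    # token cost scales with stage count, latency with sequential depth.
    return RunMetrics(
        success_probability=1.0 - risk_score,
        token_cost=n_stages * tokens_per_stage,
        latency_s=sequential_depth * hop_latency_s,
    )

m = derive_metrics(risk_score=0.17, n_stages=2, sequential_depth=2)
# success_probability ≈ 0.83, token_cost = 2400, latency_s = 12.0
```

The point of keeping this step pure is the same as for the compilers: no framework imports, no side effects, so the same derivation applies to every config.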
The noise is controlled: each task-config pair gets a deterministic
RNG seeded with SHA-256(seed:task_id:config_id). Same seed,
same results. The harness produces ranking tables, per-config aggregate
metrics with Wilson 95% confidence intervals, and structural variation
summaries.
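The deterministic seeding scheme can be reproduced with the standard library alone; the function name rng_for and the example task/config ids are illustrative, not Operon identifiers.

```python
import hashlib
import random

def rng_for(seed: str, task_id: str, config_id: str) -> random.Random:
    # Hash "seed:task_id:config_id" with SHA-256 and use the first
    # 8 bytes of the digest as the per-pair RNG seed.
    digest = hashlib.sha256(f"{seed}:{task_id}:{config_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

a = rng_for("42", "summarize-easy", "swarms_auto")
b = rng_for("42", "summarize-easy", "swarms_auto")
assert a.random() == b.random()  # same pair → identical noise stream

c = rng_for("42", "summarize-easy", "scion_guided")
# a different config id yields an independent noise stream
```

Deriving the seed from the full (seed, task, config) triple means reruns are bit-for-bit reproducible while noise stays uncorrelated across cells of the evaluation grid.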
What the Live Evaluation Found
We ran the evaluation with real LLM providers: Gemini 2.0 Flash (API), Claude Code (CLI), and Codex (CLI). Guided multi-stage pipelines showed a consistent +6.2% quality improvement over unguided configurations. Single-agent CLI execution showed no effect — confirming that structural guidance helps when there is topology to guide, and has no effect on single-node topologies. The compile→decompile round-trip revealed where structural information gets lost: guidance affects stage mode assignment before compilation, but external compilers discard that information, producing identical topologies for guided and unguided configs.
Live Results: Three Providers, Four Tasks
We ran 4 benchmark tasks (3 easy sequential, 1 hard parallel) through guided and unguided configurations across three real providers. Guided configs use distinct fast/deep models (e.g. gemini-2.0-flash / gemini-2.5-flash); unguided use the same model for both nuclei.
| Provider | Config | Success | Quality | Tokens | Latency | Risk |
|---|---|---|---|---|---|---|
| Gemini API | unguided | 25% | 0.375 | 4,713 | 12.7s | 0.170 |
| Gemini API | guided | 38% | 0.438 | 5,212 | 12.6s | 0.170 |
| Claude CLI | unguided | 0% | 0.375 | 369 | 26.6s | 0.050 |
| Claude CLI | guided | 12% | 0.438 | 350 | 24.2s | 0.050 |
| Codex CLI | unguided | 25% | 0.625 | 156 | 31.8s | 0.050 |
| Codex CLI | guided | 25% | 0.625 | 156 | 29.2s | 0.050 |
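The success rates above come from small run counts, which is why the harness reports Wilson 95% intervals rather than raw percentages. The interval itself is the standard Wilson score formula, shown here as plain Python rather than Operon code:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval for a binomial proportion (95% when z = 1.96).
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))

lo, hi = wilson_interval(1, 4)  # 25% success over 4 tasks → roughly (0.05, 0.70)
```

With only four tasks per cell, the intervals are wide — which is exactly the information a raw "25%" hides.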
Guidance effect: Gemini and Claude both show +6.2% quality improvement with guidance. Codex shows no difference — expected, since CLI providers run as single-agent (guidance affects multi-stage topology, not single-node execution).
Cross-provider differences are larger than guidance effects. Codex scored highest quality (0.625) despite being the simplest topology. Gemini was fastest (12.7s). Claude produced the best single result (1.00 on guided translation) but was slowest (24–27s). The choice of LLM matters more than the topology configuration for simple tasks — but for complex multi-stage pipelines, structural guidance provides a consistent quality floor.
Risk scores reflect real topology differences. CLI providers get risk = 0.050 (single agent, minimal topology). Gemini multi-stage pipelines get risk = 0.170 (2-stage sequential with error amplification). The risk score correctly identifies that multi-stage pipelines have more structural risk than single-agent execution.
Credit Assignment
Beyond aggregate risk, the harness includes credit assignment: for each run, it attributes outcomes to individual stage contributions. Which stage drove the risk score up? Which one reduced it? This lets you identify structural bottlenecks — the single agent in a pipeline whose error amplification dominates the topology’s risk profile.
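One standard way to implement this kind of attribution — a sketch of the general technique, not necessarily Operon’s exact method — is leave-one-out: recompute the topology risk with each stage removed and credit the stage with the difference. The toy risk model below is entirely illustrative.

```python
def leave_one_out_credit(stages: list[str], risk_fn) -> dict[str, float]:
    # A stage's credit is how much the risk score drops when it is removed.
    # Positive credit → the stage adds risk; negative → it reduces risk.
    baseline = risk_fn(stages)
    return {
        s: baseline - risk_fn([t for t in stages if t != s])
        for s in stages
    }

# Toy risk model: a flat per-stage cost, plus a penalty for an
# error-amplifying "deep_chain" stage (all numbers made up).
def toy_risk(stages: list[str]) -> float:
    return 0.05 * len(stages) + (0.20 if "deep_chain" in stages else 0.0)

credit = leave_one_out_credit(["researcher", "deep_chain", "writer"], toy_risk)
# deep_chain carries ~0.25 of the risk; the other stages ~0.05 each
```

A stage whose credit dominates the total is the structural bottleneck the section describes.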
4. What Scion Reveals About Isolation
Google’s Scion is the newest entrant in this space, and it approaches multi-agent orchestration from a fundamentally different angle. Where Swarms and DeerFlow focus on in-process coordination, Scion starts with container isolation: each agent runs in its own Docker container with its own git worktree and isolated credentials.
Scion’s philosophy is “less is more” — fewer agents, maximum isolation per agent, minimal shared state. This is a direct contrast to the Swarms approach of orchestrating dozens of lightweight agents in a single process. Both are valid; they optimize for different failure modes.
Scion vs. Swarms: Two Philosophies
| Dimension | Scion | Swarms |
|---|---|---|
| Isolation model | Container + git worktree per agent | In-process, shared memory |
| Agent count preference | Few, heavily isolated | Many, lightweight |
| Communication | Named message channels | Direct function calls / edges |
| Failure mode optimized for | Blast radius (one agent can’t corrupt another) | Coordination overhead (minimize handoff latency) |
| Observability | Dedicated watcher agent + OTEL telemetry | Framework-level logging |
The Scion compiler in Operon (organism_to_scion()) produces
a “grove” config: each stage becomes a containerized agent
with git_worktree: true and credentials: "isolated".
A dedicated operon-watcher agent is automatically injected
for telemetry monitoring and convergence detection. The watcher is the
only agent with shared credentials — it needs to observe all others.
# Scion compilation
from operon_ai.convergence import organism_to_scion
grove = organism_to_scion(organism, runtime="docker")
# Each stage agent gets:
# {
# "name": "researcher",
# "template": {"system_prompt": "...", "skills": [...]},
# "runtime_profile": "medium",
# "isolation": {
# "git_worktree": True,
# "credentials": "isolated"
# }
# }
# Plus an injected watcher:
# {
# "name": "operon-watcher",
# "isolation": {"git_worktree": False, "credentials": "shared"}
# }
The Scion Trade-Off in Numbers
When the evaluation harness compares “Scion unguided” (config 6) against “Scion + Operon” (config 7), it measures whether adding Operon’s structural guidance on top of Scion’s container isolation produces measurably different risk profiles. The two address orthogonal concerns:
- Scion alone provides runtime isolation — one agent’s crash or credential leak doesn’t propagate
- Operon on top provides structural analysis — error amplification, sequential penalty, and topology mismatch warnings detected before deployment
These are complementary, not competing. A topology with a 6.4x error amplification bound is dangerous whether its agents run in Docker containers or not. Container isolation prevents runtime cascades; structural analysis prevents logical cascades. You want both.
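To see why a figure like 6.4x is a topology-level property, consider a simple compounding model (illustrative only; Operon’s actual bound computation may differ): if each of d sequential stages multiplies upstream error by a factor k, the chain amplifies error by

A = k^d, e.g. k = 1.45 over d = 5 stages gives A ≈ 1.45^5 ≈ 6.4

Container isolation changes neither k nor d, which is why runtime sandboxing alone cannot substitute for structural analysis.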
5. Implications for the Agent Orchestration Ecosystem
Framework authors get a free structural linter
Writing an adapter is a single parse function. Once your framework can
produce an ExternalTopology, you get error amplification
bounds, sequential penalty estimates, tool density scores, and topology
recommendations — all computed from Operon’s epistemic layer,
with no changes to your own code.
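For illustration, here is roughly what such a parse function might look like, written against a stand-in ExternalTopology. The field names, the toy framework config shape, and the pattern heuristic are all assumptions for the sketch — the real frozen dataclass differs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExternalTopology:
    # Stand-in for Operon's frozen dataclass; real fields may differ.
    agent_names: tuple[str, ...]
    edges: tuple[tuple[str, str], ...]
    pattern: str  # "sequential", "parallel", or "mixed"

def parse_myframework_topology(config: dict) -> ExternalTopology:
    # The single function a framework author writes: map the native
    # config onto agents, communication edges, and a pattern label.
    agents = tuple(a["name"] for a in config["agents"])
    edges = tuple((e["src"], e["dst"]) for e in config.get("links", []))
    is_chain = len(edges) == max(len(agents) - 1, 0)
    return ExternalTopology(agents, edges, "sequential" if is_chain else "mixed")

topo = parse_myframework_topology({
    "agents": [{"name": "planner"}, {"name": "coder"}],
    "links": [{"src": "planner", "dst": "coder"}],
})
# topo.pattern == "sequential"
```

Everything downstream of this function — risk scoring, warnings, topology advice — comes from Operon’s analysis pipeline for free.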
Template exchange becomes framework-portable
A coordination pattern discovered in Swarms can be imported into Operon’s
PatternLibrary, scored against historical run records, and
re-exported as a DeerFlow session, a Ralph hat config, or a Scion grove.
The pattern travels; the framework binding doesn’t.
Evaluation becomes apples-to-apples
The compile–decompile loop means every framework is evaluated against the same structural analysis pipeline. The risk score doesn’t depend on which framework compiled the config — it depends on the topology that emerged from the compilation. Framework A might produce a deeper chain than Framework B for the same task; the risk score will reflect that structural difference, not an implementation preference.
The “less is more” hypothesis is testable
Scion argues that fewer, more isolated agents are better. The Swarms ecosystem argues that many lightweight agents enable richer coordination. With the evaluation harness, this is no longer a philosophical debate. Compile the same 20 tasks through both, decompile, analyze, compare risk scores. The structural properties are measurable; the trade-offs are quantifiable.
The Deeper Point
Agent orchestration frameworks are not applications — they are compilers. They take a task description and produce a coordination topology. Operon treats them that way: the four compilers formalize the “code generation” step, the six adapters formalize the “decompilation” step, and the analysis pipeline is the type checker. The evaluation harness is a benchmark suite that tests the compiler’s output quality.
6. Future Work: Prompt Optimization and Workflow Generation
v0.25 also introduces two protocol interfaces for capabilities that are not yet fully implemented but whose contracts are now stable:
- PromptOptimizer — a @runtime_checkable Protocol for prompt-level tuning. The reference implementation (NoOpOptimizer) is a pass-through; EvolutionaryOptimizer defines the extended protocol for mutation-based optimization with fitness gating. attach_optimizer() wires any implementation into all stages of an organism.
- WorkflowGenerator — a protocol for natural-language-to-topology generation. HeuristicGenerator provides rule-based workflow construction as a baseline; ReasoningGenerator defines the protocol for LLM-powered workflow synthesis. generate_and_register() integrates any generator with the PatternLibrary.
These are deliberately defined as Protocols rather than abstract base classes. Any object with the right method signatures satisfies the contract — no inheritance required. This means DSPy optimizers, LangChain generators, or custom implementations can plug in without importing Operon.
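A minimal illustration of what structural typing buys here, using only the standard library (the optimize method name and signature are assumed for the sketch, not Operon’s actual contract):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class PromptOptimizer(Protocol):
    # Structural contract: any object with a matching method satisfies it.
    def optimize(self, prompt: str) -> str: ...

class ThirdPartyOptimizer:
    """No Operon import, no inheritance — duck-typed compliance."""
    def optimize(self, prompt: str) -> str:
        return prompt.strip()

opt = ThirdPartyOptimizer()
assert isinstance(opt, PromptOptimizer)  # True: checked structurally, not by MRO
```

One caveat worth knowing: runtime_checkable isinstance checks only verify that the method names exist, not their signatures — full signature conformance is a static-checker concern.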
7. The Numbers
| Metric | v0.24.1 | v0.25.1 |
|---|---|---|
| Convergence modules | 20 | 23 |
| Tests | 1,474 | 1,530 |
| Examples | 103 | 107 |
| External frameworks | 5 (Swarms, DeerFlow, AnimaWorks, Ralph, A-Evolve) | 6 (+Scion) |
| Compilers | 4 | 4 |
| Adapters | 5 | 6 |
| TLA+ specifications | 4 | 4 |
| Benchmark tasks | — | 20 |
| Evaluation configurations | — | 7 |
8. Try It
pip install operon-ai==0.25.1
# Run the mock evaluation harness (no API keys needed)
python examples/104_evaluation_harness.py
# Run live evaluation with real LLMs (requires GEMINI_API_KEY or OPENAI_API_KEY)
# Also supports Claude CLI and Codex CLI if installed
set -a && source .env && set +a
python examples/107_live_evaluation.py
# Try prompt optimization protocols
python examples/105_prompt_optimization_interface.py
# Generate and register a workflow
python examples/106_workflow_generation_interface.py
The mock evaluation harness (example 104) requires no API keys — it uses structural analysis to derive synthetic metrics. The live evaluation (example 107) runs real LLM calls through Gemini API, Claude CLI, and Codex CLI with LLM-as-judge quality scoring.
Full documentation: coredipper.github.io/operon. Convergence companion paper: convergence.