What Happens When You Try to Evolve Your Agent Architectures
1. The Question
If your agents have genomes, can you evolve them?
Operon’s Genome class manages immutable configuration traits for individual agents. It tracks genes, mutations, expression levels, and lineage. The question C8 poses: can this same abstraction represent organism-level configurations—the modes, models, and thresholds of a multi-stage pipeline—and can we evolve those configurations to discover better architectures?
This is the natural extension of the biological metaphor. Biology doesn’t just run organisms; it evolves them. If Operon’s abstractions are genuinely biological, they should work at both levels.
2. What We Built
Genome Mapping
Each CandidateConfig (stage modes, models, visibility flags, intervention thresholds) maps to a Genome via ~5 lines of flattening logic. Mode and model become STRUCTURAL genes. Boolean flags become REGULATORY genes. The round-trip is lossless—no type coercion, no special cases.
```python
genome = candidate_to_genome(config)
restored = genome_to_candidate(genome, ...)
assert restored.stage_configs == config.stage_configs  # always True
```
This is the strongest evidence that the gene abstraction generalizes beyond its original per-agent use case.
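The flattening can be pictured with a minimal sketch. The `Gene`/`Genome` fields and the `candidate_to_genome`/`genome_to_candidate` signatures below are illustrative stand-ins, not Operon's actual API; the point is only that namespacing fields as `stage.key` genes makes the round-trip trivially lossless:

```python
# Hypothetical sketch of the ~5-line flattening logic.
# Gene/Genome shapes are illustrative, not Operon's real classes.
from dataclasses import dataclass, field
from enum import Enum

class GeneKind(Enum):
    STRUCTURAL = "structural"   # mode and model choices
    REGULATORY = "regulatory"   # boolean visibility/intervention flags

@dataclass(frozen=True)
class Gene:
    name: str
    kind: GeneKind
    value: object

@dataclass
class Genome:
    genes: dict = field(default_factory=dict)

def candidate_to_genome(stage_configs: dict) -> Genome:
    """Flatten {stage: {field: value}} into namespaced genes."""
    genome = Genome()
    for stage, fields in stage_configs.items():
        for key, value in fields.items():
            kind = GeneKind.REGULATORY if isinstance(value, bool) else GeneKind.STRUCTURAL
            genome.genes[f"{stage}.{key}"] = Gene(f"{stage}.{key}", kind, value)
    return genome

def genome_to_candidate(genome: Genome) -> dict:
    """Invert the flattening; lossless because no value is coerced."""
    out: dict = {}
    for gene in genome.genes.values():
        stage, key = gene.name.split(".", 1)
        out.setdefault(stage, {})[key] = gene.value
    return out

config = {"draft": {"mode": "fast", "visible": True},
          "review": {"mode": "strict", "model": "gemini"}}
assert genome_to_candidate(candidate_to_genome(config)) == config
```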
Scale-Invariant Health Monitoring
The existing EpiplexityMonitor detects epistemic stagnation via Bayesian surprise on embedding similarity. We extended it with a DistanceProvider protocol that plugs in any novelty metric. For configuration space, ConfigHammingDistance counts field-level mismatches. The core formula is unchanged—the monitor triggers STAGNANT/EXPLORING transitions identically to the embedding path.
This confirms that epistemic health monitoring is scale-invariant: the same mechanism works for token-level stagnation (within an organism run) and configuration-level stagnation (across an evolution loop).
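A protocol of roughly this shape is all the monitor needs to swap novelty metrics; the `distance` method name and `ConfigHammingDistance` internals here are a sketch of the idea, not the library's exact interface:

```python
# Sketch of a pluggable novelty metric; names are illustrative.
from typing import Protocol

class DistanceProvider(Protocol):
    """Any metric the monitor can use to score novelty between samples."""
    def distance(self, a: dict, b: dict) -> float: ...

class ConfigHammingDistance:
    """Count field-level mismatches between two flat config dicts."""
    def distance(self, a: dict, b: dict) -> float:
        keys = set(a) | set(b)
        return float(sum(a.get(k) != b.get(k) for k in keys))
```

The monitor's Bayesian-surprise formula never sees which provider is plugged in, which is what makes the STAGNANT/EXPLORING transitions identical across the embedding and configuration paths.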
Dual Stall Detection
We discovered that config novelty alone never triggers the LLM proposer: tournament mutations always produce different configs (high novelty) even when scores stagnate. We therefore added a second signal, a score plateau (best score unimproved for N steps); either signal triggers the switch to LLM reasoning.
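The dual signal can be sketched as follows. `DualStallDetector` and its parameter names are invented for illustration; the logic is just "plateau OR novelty collapse":

```python
# Illustrative sketch of dual stall detection (names are hypothetical).
class DualStallDetector:
    """Stalled if EITHER the best score plateaus OR config novelty collapses."""

    def __init__(self, plateau_window: int = 5, novelty_floor: float = 0.5):
        self.plateau_window = plateau_window
        self.novelty_floor = novelty_floor
        self.best = float("-inf")
        self.steps_since_improvement = 0

    def update(self, score: float, novelty: float) -> bool:
        if score > self.best:
            self.best = score
            self.steps_since_improvement = 0
        else:
            self.steps_since_improvement += 1
        plateaued = self.steps_since_improvement >= self.plateau_window
        stagnant = novelty < self.novelty_floor
        return plateaued or stagnant
```

Under tournament mutation the novelty signal stays high, so in practice it is the plateau branch that hands control to the LLM proposer.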
The LLM Proposer
When stagnation is detected, a Gemini-backed proposer reads evolution history from the filesystem store—full candidate configs plus execution trace metadata—and proposes a new configuration. This implements the Meta-Harness insight: raw filesystem history outperforms compressed feedback.
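A sketch of the rich-context assembly, assuming history is stored as one JSON record per candidate; the file layout, record keys, and `build_proposer_context` name are assumptions for illustration, not the actual store format:

```python
# Hypothetical sketch: turn filesystem evolution history into prompt context.
import json
from pathlib import Path

def build_proposer_context(store_dir: Path, limit: int = 20) -> str:
    """Concatenate recent candidate configs + trace metadata into one prompt block."""
    entries = []
    for path in sorted(store_dir.glob("*.json"))[-limit:]:
        record = json.loads(path.read_text())
        entries.append(json.dumps(
            {"config": record.get("config"),
             "score": record.get("score"),
             "trace": record.get("trace")}, indent=2))
    return "Evolution history (most recent last):\n" + "\n---\n".join(entries)
```

The design choice worth noting: the raw records are passed through nearly verbatim rather than summarized, which is exactly the "raw filesystem history outperforms compressed feedback" insight.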
3. The Good News
| Abstraction | Generalizes? | Evidence |
|---|---|---|
| Genome → CandidateConfig | Yes | Lossless round-trip, 5 lines |
| EpiplexityMonitor + DistanceProvider | Yes | Scale-invariant transitions |
| DesignProblem wrapping | Yes | Natural composition |
| feedback_fixed_point | No | Evolution ≠ convergent iteration |
| TrustRegistry | No | Overkill for 2 proposers |
The biological abstractions that describe structure (genes, health monitoring, composition) generalize cleanly. The ones that describe specific dynamics (fixed-point convergence, trust networks) do not—evolution has different dynamics than the processes they were designed for.
4. The Surprise
| Proposer | Context | Mean Score | N |
|---|---|---|---|
| Tournament | n/a | 0.44 | 9 |
| LLM (compressed history) | index entries only | 0.15 | 2 |
| LLM (rich filesystem context) | configs + traces | 0.49 | 24 |
Rich context improves the LLM proposer more than threefold over compressed history (0.49 vs 0.15), but it only marginally exceeds tournament mutation (0.49 vs 0.44). Random mutation (flip one field, keep the best) is nearly as good as an LLM reasoning about the full evolution trajectory.
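The tournament baseline is simple enough to state completely. `mutate_one_field` and `tournament_step` are hypothetical names sketching the flip-one-field, keep-the-best loop under an assumed per-field choice table:

```python
# Illustrative sketch of the tournament baseline (names are hypothetical).
import random

def mutate_one_field(config: dict, choices: dict) -> dict:
    """Flip a single randomly chosen field to a different allowed value."""
    stage = random.choice(list(config))
    key = random.choice(list(config[stage]))
    alternatives = [v for v in choices[(stage, key)] if v != config[stage][key]]
    child = {s: dict(fields) for s, fields in config.items()}  # shallow copy per stage
    child[stage][key] = random.choice(alternatives)
    return child

def tournament_step(best: dict, best_score: float, score_fn, choices: dict):
    """Keep the better of the current best and a single-field mutant."""
    child = mutate_one_field(best, choices)
    child_score = score_fn(child)
    return (child, child_score) if child_score > best_score else (best, best_score)
```

That this ~15-line loop nearly matches an LLM reading the full trajectory is the surprise the table quantifies.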
5. Phase B: Topology Mutations
We extended the evolution to mutate topology itself—add stages, remove stages, rewire edges. Candidates carry explicit edges (directed wiring pairs). A WiringDiagram + ResourceAwareExecutor executes stages in parallel groups via DAG layering.
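The parallel-group execution rests on standard Kahn-style topological layering. This sketch (`dag_layers` is an invented name, not the `WiringDiagram` API) shows how explicit edges fall into layers whose stages can run concurrently:

```python
# Sketch of DAG layering for parallel stage execution (names are illustrative).
from collections import defaultdict

def dag_layers(stages: list, edges: list) -> list:
    """Group stages into layers; stages within a layer share no dependencies."""
    indegree = {s: 0 for s in stages}
    children = defaultdict(list)
    for src, dst in edges:
        indegree[dst] += 1
        children[src].append(dst)
    layer = [s for s in stages if indegree[s] == 0]  # sources first
    layers = []
    while layer:
        layers.append(layer)
        nxt = []
        for s in layer:
            for c in children[s]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    nxt.append(c)
        layer = nxt
    if sum(len(l) for l in layers) != len(stages):
        raise ValueError("cycle detected in wiring")
    return layers
```

An executor then runs each layer as one parallel group, awaiting all of its stages before starting the next.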
| Proposer | Phase A (config only) | Phase B (topology) |
|---|---|---|
| Tournament | 0.44 | 0.60 |
| LLM | 0.49 | 0.36 |
Tournament improved with topology mutations (0.44 → 0.60). The LLM proposer degraded (0.49 → 0.36). Blind mutation handles structural changes more productively than reasoned proposals.
Why? The LLM spends reasoning capacity on edges that the organism doesn’t fully utilize (shared state still leaks across stages). Meanwhile, tournament mutations that happen to add a useful stage or remove a redundant one improve the pipeline directly.
6. The Honest Conclusion
Biological abstractions generalize to the meta-level as organizing principles, not as optimization advantages.
The Genome mapping, EpiplexityMonitor extension, and DesignProblem wrapping produce clean, composable code. The evolutionary search itself does not outperform random mutation. This is consistent with Ao et al.: without a search space that rewards structural reasoning, the multi-agent approach (LLM proposer) cannot dominate the single-agent baseline (tournament).
The structural guarantee features—immune systems, epiplexity monitoring, developmental gating—remain the core value proposition of biology-inspired agent engineering. C8 confirms they should be the focus of empirical validation, not meta-optimization.
The search code has been moved from the library (operon_ai/convergence/) to the evaluation harness (eval/meta/). The only C8 artifact that stays in the library is DistanceProvider—a genuinely useful extension to epistemic health monitoring.
7. The Categorical Connection
de los Riscos, Corbacho & Arbib (2026) independently developed a category-theoretic framework (ArchAgents) that maps tightly to Operon’s architecture. In their framework: objects are organism architectures, morphisms are structure-preserving translations (our compilers), and agents are monoidal functors (configured organisms). Phase A explores agents within a fixed architecture; Phase B explores architecture morphisms. The finding that blind morphisms outperform reasoned ones suggests that the categorical constraints on valid mutations are more productive than LLM-generated ones.
8. What’s Next
The right question is no longer “can we evolve better organisms?” but “do the structural safety features actually work?”
- Does the immune system catch more errors than ad-hoc validation?
- Does epiplexity monitoring prevent more runaway loops than simple timeouts?
- Does developmental gating prevent premature deployments?
These are the biological features that Operon implements well, and they’re the ones the project claims provide structural guarantees. Testing those claims is the next productive direction.
9. Try It
```bash
pip install operon-ai==0.26.0

# Run the meta-evolution experiment
git clone https://github.com/coredipper/operon
cd operon
python run_meta_evolution.py --provider gemini --llm-proposer gemini \
    --max-iterations 10 --tasks easy_seq_01,easy_seq_02
```
The evolution loop, proposers, and evaluation harness live in eval/meta/. The 57 C8-specific tests are in tests/eval/test_meta_harness.py.