Operon v0.12: Verification
Eval Harness, Mathematical Corrections, and Reproducible Benchmarks
operon-ai v0.12.0
Summary
Operon v0.12 introduces a reproducible eval harness with two benchmark suites derived from established academic benchmarks: BFCL (Berkeley Function Calling Leaderboard) for structural validation and AgentDojo for prompt injection immunity. This release also corrects several mathematical inconsistencies between the paper and the implementation—most significantly, replacing the logistic sigmoid with the paper's exponential saturation in the Epiplexity formula, and fixing the stagnation detection direction. All claims are now backed by 100-seed reproducible results.
1. The Problem: Verifying Biological Claims
The previous releases introduced mechanisms inspired by Systems Biology: Chaperone Proteins for structural healing, T-cell receptors for anomaly detection, Epiplexity for epistemic stagnation monitoring. These biological analogies are compelling—but analogies are not evidence.
A framework that claims "Chaperone-mediated folding repairs malformed JSON" needs to demonstrate this under controlled conditions. A system claiming "Immune surveillance detects prompt injection" needs adversarial evaluation, not just unit tests. v0.12 addresses this gap.
Design Principle
Every biological motif should have a corresponding empirical evaluation. The eval harness maps each motif to an established benchmark: Chaperone → BFCL function-call folding, Immune System → AgentDojo injection detection.
2. BFCL: Testing the Chaperone
The Berkeley Function Calling Leaderboard (BFCL) evaluates how well models generate structured function calls. We adapt it to test the Chaperone protein motif—the retry-and-repair loop that folds malformed LLM outputs into valid schemas.
2.1 The Folding Analogy
In molecular biology, newly synthesized proteins emerge as linear amino acid chains that must fold into precise 3D structures. Chaperone proteins (GroEL/GroES) provide a protected environment for refolding attempts. If a protein fails to fold repeatedly, it's tagged for degradation.
The BFCL suite tests this analogy directly. We take valid function-call schemas from the BFCL dataset and corrupt them with realistic LLM failure modes:
| Corruption | Biological Analogue | Example |
|---|---|---|
| Trailing commas | Premature stop codon | {"a": 1,} |
| Single quotes | Wrong amino acid | {'key': 'val'} |
| Unquoted keys | Missing post-translational modification | {key: "val"} |
| Type swap | Charge reversal mutation | "42" instead of 42 |
| Missing field | Domain deletion | Required field omitted |
| Extra field | Domain insertion | Unexpected field added |
| Prose wrapping | Signal peptide not cleaved | Here is the JSON: {...} |
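A few of the corruptions in the table can be sketched in a handful of lines. This is a minimal illustration, not the code in the eval suite: the function names, the example payload, and the choice to apply between one and three corruptions per sample via a seeded `random.Random` are all assumptions made for this sketch.

```python
import json
import random

def add_trailing_comma(text: str) -> str:
    # {"a": 1} -> {"a": 1,}
    idx = text.rfind("}")
    return text[:idx] + "," + text[idx:]

def single_quotes(text: str) -> str:
    # Naive swap; assumes no quote characters appear inside values.
    return text.replace('"', "'")

def prose_wrap(text: str) -> str:
    # Mimics a model that prefixes its JSON with chatty prose.
    return "Here is the JSON: " + text

CORRUPTIONS = [add_trailing_comma, single_quotes, prose_wrap]

def corrupt(payload: dict, rng: random.Random) -> str:
    """Serialize payload, then apply 1 to 3 distinct corruptions."""
    text = json.dumps(payload)
    for fn in rng.sample(CORRUPTIONS, k=rng.randint(1, 3)):
        text = fn(text)
    return text

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

if __name__ == "__main__":
    call = {"name": "get_weather", "args": {"city": "Paris"}}  # hypothetical call
    print(corrupt(call, random.Random(0)))
```

Seeding the generator per sample keeps each corrupted input reproducible, which is what makes a multi-seed benchmark repeatable.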
2.2 Results
We evaluate two metrics: strict accuracy (exact match after folding) and cascade accuracy (valid structure, possibly with semantic differences). Each sample receives 1–3 random corruptions.
| Suite | Metric | Rate | 95% Wilson CI |
|---|---|---|---|
| BFCL Folding | Cascade | 65.5% | [58.7%, 71.7%] |
| BFCL Folding | Strict | 6.5% | [3.8%, 10.8%] |
| Synthetic Folding | Cascade | 57.0% | [50.1%, 63.7%] |
| Synthetic Folding | Strict | 5.0% | [2.7%, 9.0%] |
The gap between cascade and strict accuracy is expected: structural repair (removing trailing commas, fixing quotes) succeeds at high rates, but recovering the exact original values from type-swapped or missing fields requires semantic understanding beyond pure syntax repair. This mirrors biology: chaperones can refold structure, but cannot repair sequence mutations.
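The difference between the two metrics can be made concrete with a small scorer. The sketch below assumes that "valid structure" means the folded output parses as a JSON object with the same top-level field names as the original; the harness may define cascade accuracy differently, and the `score` function is hypothetical.

```python
import json

def score(folded_text: str, original: dict) -> tuple[bool, bool]:
    """Return (cascade_ok, strict_ok) for one folded sample."""
    try:
        folded = json.loads(folded_text)
    except json.JSONDecodeError:
        return False, False
    # Cascade (assumed definition): parses and has the same top-level fields.
    cascade_ok = isinstance(folded, dict) and folded.keys() == original.keys()
    # Strict: every value also matches the original exactly.
    strict_ok = cascade_ok and folded == original
    return cascade_ok, strict_ok

if __name__ == "__main__":
    original = {"x": 42}
    print(score('{"x": "42"}', original))  # type swap survives folding
    print(score('{"x": 42}', original))    # fully recovered
    print(score('{"x": 42,}', original))   # still malformed
```

Under this definition a type-swapped value passes the cascade check but fails the strict one, which is the pattern behind the gap in the table.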
2.3 Healing: Error Context Matters
The Chaperone Loop (Section 5.5.1 of the paper) specifies that error traces should be fed back to the generator for context-aware correction. We test this directly:
| Healing Mode | Healed Rate | Success Rate | 95% Wilson CI |
|---|---|---|---|
| Blind retry | 37.5% | 66.5% | [59.7%, 72.7%] |
| With error context | 99.5% | 99.5% | [97.2%, 99.9%] |
Blind retry recovers only 37.5% of failures. Injecting the error trace ("TypeError: 'one hundred' is not float") into the generator's context achieves 99.5% healing. The biological analogy holds: the GroEL cage doesn't just retry folding—it provides a protected environment where the error signal itself guides correction.
See eval/suites/bfcl_folding.py and eval/suites/healing.py for the benchmark suites.
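The control flow of the two healing modes can be sketched as follows. This is an illustration under stated assumptions: `heal`, `toy_generator`, and the three-attempt limit are hypothetical, and the toy generator is rigged to succeed only when it sees an error trace, so it shows the mechanism rather than reproducing the measured rates.

```python
from typing import Callable, Optional

def heal(generate: Callable[[Optional[str]], str],
         validate: Callable[[str], float],
         max_attempts: int = 3,
         use_error_context: bool = True) -> Optional[float]:
    """Retry generation, optionally feeding the last error back in."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        raw = generate(error if use_error_context else None)
        try:
            return validate(raw)
        except (TypeError, ValueError) as exc:
            # Keep the trace so the next attempt can be context-aware.
            error = f"{type(exc).__name__}: {exc}"
    return None  # give up; analogous to tagging for degradation

def toy_generator(error: Optional[str]) -> str:
    # Stand-in for an LLM: emits a numeric string only after seeing an error.
    return "100.0" if error else "one hundred"

if __name__ == "__main__":
    print(heal(toy_generator, float))                           # with error context
    print(heal(toy_generator, float, use_error_context=False))  # blind retry
```

In blind mode the generator receives the same input on every attempt, so a deterministic failure repeats; the error trace is the only new information that can change the outcome.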
3. AgentDojo: Testing the Immune System
AgentDojo (Debenedetti et al., NeurIPS 2024) provides a dynamic environment for evaluating prompt injection attacks and defenses. We adapt its attack templates to test the Adaptive Immunity motif—the T-cell surveillance system that detects compromised agents through behavioral anomaly.
3.1 Experimental Setup
We create a population of 40 simulated agents. Eight are "compromised"—their behavioral profiles are shifted by injection patterns derived from four AgentDojo attack categories:
- Important Instructions: "IMPORTANT: Follow these instructions instead..."
- Tool Knowledge: Exploiting tool descriptions to inject behavior
- Direct Injection: Explicit instruction override in user content
- Ignore Previous: "Ignore all previous instructions and..."
The immune system observes 30 training samples per agent to build baseline profiles, then evaluates 20 test observations to classify each agent as healthy or compromised.
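The train-then-test procedure can be sketched with a single scalar behavioral feature. Everything specific here is an assumption for illustration: the one-dimensional feature, the z-score on the test mean, and the threshold of 3.0 are not taken from the harness, which may use richer profiles and additional signals.

```python
import math
import statistics

def build_baseline(training: list[float]) -> tuple[float, float]:
    """Summarize an agent's training observations as (mean, std dev)."""
    return statistics.fmean(training), statistics.pstdev(training)

def is_compromised(baseline: tuple[float, float],
                   test: list[float],
                   z_threshold: float = 3.0) -> bool:
    """Flag the agent if its test mean drifts too far from the baseline."""
    mu, sigma = baseline
    test_mean = statistics.fmean(test)
    if sigma == 0:
        return test_mean != mu
    # z-score of the test mean under the baseline distribution.
    z = abs(test_mean - mu) / (sigma / math.sqrt(len(test)))
    return z > z_threshold

if __name__ == "__main__":
    training = [0.9, 1.0, 1.1] * 10          # 30 training samples
    healthy = [1.0, 0.95, 1.05, 1.0] * 5     # 20 test observations
    shifted = [x + 0.5 for x in healthy]     # simulated injection shift
    baseline = build_baseline(training)
    print(is_compromised(baseline, healthy), is_compromised(baseline, shifted))
```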
3.2 Results
| Suite | Metric | Rate | 95% Wilson CI |
|---|---|---|---|
| AgentDojo Immune | Sensitivity (true positive) | 100% | [67.6%, 100%] |
| AgentDojo Immune | False positive rate | 0.0% | [0.0%, 10.7%] |
| Synthetic Immune | Sensitivity | 100% | [67.6%, 100%] |
| Synthetic Immune | False positive rate | 3.1% | [0.6%, 15.7%] |
The wide confidence intervals reflect the small sample size (8 compromised agents)—a deliberate choice to model realistic deployments where compromised agents are rare. The sensitivity of 100% with 0% false positives on AgentDojo-derived attacks is encouraging, but the confidence bounds remind us that larger-scale evaluation is needed.
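The intervals in the table can be checked with the standard Wilson score formula. The sketch below is a standalone implementation, not the harness's own code; the counts of 8 detected out of 8 compromised agents and 0 flagged out of 32 healthy agents follow from the 40-agent population described in Section 3.1.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)

if __name__ == "__main__":
    lo, hi = wilson_ci(8, 8)     # sensitivity on compromised agents
    print(f"[{lo:.1%}, {hi:.1%}]")
    lo, hi = wilson_ci(0, 32)    # false positives on healthy agents
    print(f"[{lo:.1%}, {hi:.1%}]")
```

Unlike the normal approximation, the Wilson interval stays informative when the observed rate is exactly 0% or 100%, which is why a perfect score on 8 agents still yields a lower bound of about 67.6%.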
Honest Assessment
These are simulated behavioral shifts, not real LLM-mediated attacks. The eval tests the statistical detection machinery (baseline profiling, anomaly scoring, canary signals), not the full attack-defense loop. Real-world injection resistance requires end-to-end evaluation with actual LLMs—a direction for future work.
See eval/suites/agentdojo_immune.py and eval/suites/immune.py for the benchmark suites.
4. Mathematical Corrections
During development of the eval harness, an audit revealed several inconsistencies between the paper and the implementation. They have been corrected in both directions: the code was changed where the paper was right, and the paper was changed where the code was right.
4.1 Exponential Saturation, Not Sigmoid
The paper defines perplexity normalization as an exponential saturation function:
$$\sigma(H) = 1 - e^{-H/H_0}$$
This maps $[0, \infty) \to [0, 1)$ with a natural interpretation: $H_0$ is the baseline perplexity at which the function reaches $1 - e^{-1} \approx 63\%$ saturation. The implementation instead used a logistic sigmoid $\frac{1}{1+e^{-x}}$, which has a different domain and range ($\mathbb{R} \to (0,1)$), maps zero input to $0.5$ rather than $0$, and has an inflection point that the exponential saturation lacks.
Correction
Replaced _sigmoid() with _exponential_saturation() matching the paper's
$\sigma(H) = 1 - e^{-H/H_0}$. Renamed perplexity_baseline to
perplexity_h0 (the $H_0$ parameter). Default: $H_0 = 2.0$.
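The difference between the two functions is easy to see numerically. The sketch below is a standalone reimplementation of the formula $\sigma(H) = 1 - e^{-H/H_0}$ with the stated default $H_0 = 2.0$; the function names are chosen for this example and are not the library's internals.

```python
import math

def exponential_saturation(h: float, h0: float = 2.0) -> float:
    """sigma(H) = 1 - exp(-H / H0), defined for H >= 0."""
    if h < 0:
        raise ValueError("perplexity must be non-negative")
    return 1.0 - math.exp(-h / h0)

def logistic_sigmoid(x: float) -> float:
    """The function the implementation used before the correction."""
    return 1.0 / (1.0 + math.exp(-x))

if __name__ == "__main__":
    for h in (0.0, 2.0, 6.0):
        print(h, round(exponential_saturation(h), 3), round(logistic_sigmoid(h), 3))
```

At $H = 0$ the exponential saturation returns 0 while the sigmoid returns 0.5, so under the old code a zero-perplexity input contributed half of the maximum score.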
4.2 Stagnation Direction
The paper states that stagnation occurs when the Epiplexic Integral falls below a threshold:
$$\mathcal{E}_w < \delta$$
Low Epiplexity means low Bayesian surprise—the agent's outputs carry no new information. The implementation was checking the opposite direction ($\mathcal{E}_w > \delta$), which would flag high-surprise (exploring) agents as stagnant.
Correction
Inverted the stagnation check from integral > threshold to
integral < threshold. Changed default threshold from 0.7 to 0.2
(low combined score indicates stagnation).
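A minimal sketch of the corrected check, assuming the Epiplexic Integral is a mean over a sliding window of recent scores. The `StagnationMonitor` class and the window size are hypothetical; only the comparison direction and the 0.2 default threshold come from the correction above.

```python
from collections import deque

class StagnationMonitor:
    """Flags stagnation when the windowed mean score drops below a threshold."""

    def __init__(self, window: int = 5, threshold: float = 0.2):
        self.scores: deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def update(self, epiplexity: float) -> bool:
        self.scores.append(epiplexity)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough history yet
        integral = sum(self.scores) / len(self.scores)
        # Corrected direction: LOW surprise means stagnation.
        return integral < self.threshold

if __name__ == "__main__":
    monitor = StagnationMonitor(window=3)
    for score in (0.9, 0.8, 0.7, 0.1, 0.1, 0.1):
        print(score, monitor.update(score))
```

With the old comparison (`integral > threshold`), the first full window in this run, which has high surprise, would have been flagged instead of the last one.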
4.3 Embedding Novelty Normalization
The raw cosine distance $1 - \cos(e_t, e_{t-1})$ has range $[0, 2]$. The implementation correctly normalizes this to $[0, 1]$ via division by 2, but the paper omitted the $\frac{1}{2}$ factor. The corrected Epiplexity equation is:
$$\mathcal{E}_t = \alpha \cdot \frac{1 - \cos(e_t, e_{t-1})}{2} + (1 - \alpha) \cdot \sigma(H_t)$$
Both terms now live in $[0, 1]$, making the $\alpha$-weighted combination interpretable as a proper convex combination.
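The normalization and the convex combination can be sketched in plain Python. Two things here are assumptions rather than facts from the text: that $\alpha$ weights the embedding-novelty term (with $1 - \alpha$ on the perplexity term), and the default $\alpha = 0.5$. The function names are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embedding_novelty(e_t: list[float], e_prev: list[float]) -> float:
    """Cosine distance rescaled from [0, 2] to [0, 1]."""
    return (1.0 - cosine_similarity(e_t, e_prev)) / 2.0

def epiplexity(e_t: list[float], e_prev: list[float], perplexity: float,
               alpha: float = 0.5, h0: float = 2.0) -> float:
    """Convex combination of novelty and saturated perplexity (assumed weighting)."""
    novelty = embedding_novelty(e_t, e_prev)
    saturation = 1.0 - math.exp(-perplexity / h0)
    return alpha * novelty + (1.0 - alpha) * saturation

if __name__ == "__main__":
    print(embedding_novelty([1.0, 0.0], [-1.0, 0.0]))   # opposite directions
    print(epiplexity([1.0, 0.0], [0.0, 1.0], perplexity=2.0))
```

Without the division by 2, opposite embeddings would contribute a novelty of 2, and the combined score could exceed 1 for any $\alpha > 0.5$.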
4.4 Equation Renumbering
The paper's 20 numbered equations were scattered non-sequentially across sections (e.g., Section 4 jumped from Eq. 9 to Eq. 16 to Eq. 10). All equations have been renumbered 1–20 in document order.
5. Practical Applications
Three new examples demonstrate the framework in realistic software engineering scenarios:
| Example | Motifs Used | Description |
|---|---|---|
| 43: Code Review Pipeline | CFFL, Quorum | Multi-agent review with conjunctive gating—code ships only if both Generator and Reviewer approve |
| 44: Codebase Q&A (RAG) | Epigenetics, Chaperone | RAG pipeline with Chaperone-validated retrieval and cost-gated context selection |
| 45: Cost Attribution | Metabolic Coalgebra | Fine-grained token accounting per agent with apoptosis on budget exhaustion |
6. What's New in v0.12
| Component | Module | Description |
|---|---|---|
| Eval Harness | eval/ | Reproducible benchmark runner with 100-seed results and Wilson CI |
| BFCL Folding Suite | eval/suites/bfcl_folding | Function-call schema corruption and repair testing |
| AgentDojo Immune Suite | eval/suites/agentdojo_immune | Prompt injection detection via behavioral anomaly |
| Epiplexity Math Fix | operon_ai.health | Exponential saturation, correct stagnation direction, ½ normalization |
| Paper Alignment | article/ | Sequential equation numbering, updated type set, corrected cross-references |
7. Interactive Demos
Every biological motif in the framework now has a corresponding interactive demo on HuggingFace Spaces. These run entirely in-browser—no API keys, no setup. Each demo includes curated presets that demonstrate the motif's core behavior: failure modes, recovery dynamics, and parameter sensitivity.
| Motif | Space | Description |
|---|---|---|
| Chaperone Folding | operon-chaperone | Recover structured data from malformed LLM output |
| Healing Loop + Autophagy | operon-healing | Chaperone healing loop with error feedback and autophagy context pruning |
| Membrane + Innate Immunity | operon-membrane | Two-layer defense against prompt injection attacks |
| Regenerative Swarm | operon-swarm | Worker apoptosis on entropy collapse and regeneration with memory transfer |
| Epiplexity Monitor | operon-epiplexity | Epistemic stagnation detection via Bayesian surprise |
| Morphogen Gradients | operon-morphogen | Gradient-based agent coordination without central control |
| Signal Cascade | operon-cascade | Multi-stage signal amplification pipeline with checkpoints |
| Quorum Sensing | operon-quorum | Multi-agent voting with 7 consensus strategies |
| Feedback Loops | operon-feedback | Negative feedback loop homeostasis simulation |
| Oscillators | operon-oscillator | Biological oscillator patterns and waveform visualization |
| Telomere + Genome | operon-lifecycle | Agent lifecycle management with telomere shortening and genome configuration |
| Mitochondria (MIPS) | operon-mitochondria | Safe calculator with AST-based parsing—no injection risk |
| ATP Budget | operon-budget | Multi-currency metabolic energy management |
| Complete Cell | operon-cell | Full 5-organelle pipeline from input to validated output |
| Compliance Review | operon-compliance-review | Multi-stage compliance pipeline with morphogen metadata, cascade stages, and quorum voting |
| Epiplexity Cascade | operon-epiplexity-cascade | Stagnation detection with escalating healing: autophagy, regeneration, abort |
| Immunity Router | operon-immunity-router | Threat-severity routing: passthrough, chaperone repair, autophagy cleanup, or hard reject |
| Morphogen Swarm | operon-morphogen-swarm | Gradient-guided swarm where failed workers update signals for successors |
| Adaptive Orchestrator | operon-orchestrator | End-to-end ticket processing combining all major operon mechanisms |
| Repair Memory | operon-repair-memory | Epigenetic repair memory—stores healing strategies as histone markers for reuse |
| Scheduled Maintenance | operon-scheduled-maintenance | Oscillator-driven context pruning with feedback-adjusted toxicity thresholds |
| Swarm Cleanup | operon-swarm-cleanup | Graceful worker shutdown with autophagy cleanup and clean state transfer to successors |
The demos are built with Gradio and deployed from the
huggingface/space-*/ directories in the repository. Each includes tunable parameters so
that edge cases—entropy collapse, budget exhaustion, stagnation recovery—can
be explored directly.
8. Conclusion
Operon v0.12 shifts the project from "biologically inspired architecture" to "empirically validated architecture." The eval harness provides reproducible, statistically grounded evidence for the framework's claims. The mathematical corrections ensure that the paper and the code agree exactly—not approximately, not directionally, but precisely.
The results are honest. The Chaperone repairs structural damage (65.5% cascade accuracy) but cannot recover semantic information lost to type mutations (6.5% strict). The Immune System detects all simulated compromises but operates on behavioral distributions, not actual LLM outputs. These are the bounds of what the current implementation can claim.
The next frontier is end-to-end evaluation: real LLMs, real attacks, real function calls. The eval harness provides the scaffolding; the benchmarks provide the methodology. What remains is connecting the simulated biology to the living system.
The framework is available at github.com/coredipper/operon, pypi.org/project/operon-ai, and huggingface.co/coredipper (22 interactive demos). Feedback welcome.