Operon v0.12: Verification

Eval Harness, Mathematical Corrections, and Reproducible Benchmarks

Bogdan Banu · bogdan@banu.be

operon-ai v0.12.0

Summary

Operon v0.12 introduces a reproducible eval harness with two benchmark suites derived from established academic benchmarks: BFCL (Berkeley Function Calling Leaderboard) for structural validation and AgentDojo for prompt injection immunity. This release also corrects several mathematical inconsistencies between the paper and the implementation—most significantly, replacing the logistic sigmoid with the paper's exponential saturation in the Epiplexity formula, and fixing the stagnation detection direction. All claims are now backed by 100-seed reproducible results.

1. The Problem: Verifying Biological Claims

The previous releases introduced mechanisms inspired by Systems Biology: Chaperone Proteins for structural healing, T-cell receptors for anomaly detection, Epiplexity for epistemic stagnation monitoring. These biological analogies are compelling—but analogies are not evidence.

A framework that claims "Chaperone-mediated folding repairs malformed JSON" needs to demonstrate this under controlled conditions. A system claiming "Immune surveillance detects prompt injection" needs adversarial evaluation, not just unit tests. v0.12 addresses this gap.

Design Principle

Every biological motif should have a corresponding empirical evaluation. The eval harness maps each motif to an established benchmark: Chaperone → BFCL function-call folding, Immune System → AgentDojo injection detection.

2. BFCL: Testing the Chaperone

The Berkeley Function Calling Leaderboard (BFCL) evaluates how well models generate structured function calls. We adapt it to test the Chaperone protein motif—the retry-and-repair loop that folds malformed LLM outputs into valid schemas.

2.1 The Folding Analogy

In molecular biology, newly synthesized proteins emerge as linear amino acid chains that must fold into precise 3D structures. Chaperone proteins (GroEL/GroES) provide a protected environment for refolding attempts. If a protein fails to fold repeatedly, it's tagged for degradation.

The BFCL suite tests this analogy directly. We take valid function-call schemas from the BFCL dataset and corrupt them with realistic LLM failure modes:

| Corruption | Biological Analogue | Example |
| --- | --- | --- |
| Trailing commas | Premature stop codon | `{"a": 1,}` |
| Single quotes | Wrong amino acid | `{'key': 'val'}` |
| Unquoted keys | Missing post-translational modification | `{key: "val"}` |
| Type swap | Charge reversal mutation | `"42"` instead of `42` |
| Missing field | Domain deletion | Required field omitted |
| Extra field | Domain insertion | Unexpected field added |
| Prose wrapping | Signal peptide not cleaved | `Here is the JSON: {...}` |
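
The corruption modes above can be sketched as simple string transforms. The snippet below is an illustrative reduction covering three of the seven modes; the function names are hypothetical and the actual suite in eval/suites/bfcl_folding.py may differ:

```python
import json
import random

# Hypothetical sketch of three corruption modes; not the shipped suite.

def corrupt_trailing_comma(s: str) -> str:
    # {"a": 1} -> {"a": 1,}  (premature stop codon)
    return s.replace("}", ",}", 1)

def corrupt_single_quotes(s: str) -> str:
    # Swap double quotes for single quotes, which is invalid JSON.
    return s.replace('"', "'")

def corrupt_prose_wrap(s: str) -> str:
    # Wrap the payload in chatty prose, as LLMs often do.
    return f"Here is the JSON: {s}"

CORRUPTIONS = [corrupt_trailing_comma, corrupt_single_quotes, corrupt_prose_wrap]

def corrupt(sample: dict, n: int = 2, seed: int = 0) -> str:
    """Apply n distinct random corruptions to a valid function-call payload."""
    rng = random.Random(seed)
    s = json.dumps(sample)
    for fn in rng.sample(CORRUPTIONS, n):
        s = fn(s)
    return s
```

Seeding the RNG per sample is what makes the 100-seed runs reproducible: the same seed always yields the same corruption sequence.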

2.2 Results

We evaluate two metrics: strict accuracy (exact match after folding) and cascade accuracy (valid structure, possibly with semantic differences). Each sample receives 1–3 random corruptions.

| Suite | Metric | Rate | 95% Wilson CI |
| --- | --- | --- | --- |
| BFCL Folding | Cascade | 65.5% | [58.7%, 71.7%] |
| BFCL Folding | Strict | 6.5% | [3.8%, 10.8%] |
| Synthetic Folding | Cascade | 57.0% | [50.1%, 63.7%] |
| Synthetic Folding | Strict | 5.0% | [2.7%, 9.0%] |

The gap between cascade and strict accuracy is expected: structural repair (removing trailing commas, fixing quotes) succeeds at high rates, but recovering the exact original values from type-swapped or missing fields requires semantic understanding beyond pure syntax repair. This mirrors biology: chaperones can refold structure, but cannot repair sequence mutations.

2.3 Healing: Error Context Matters

The Chaperone Loop (Section 5.5.1 of the paper) specifies that error traces should be fed back to the generator for context-aware correction. We test this directly:

| Healing Mode | Healed Rate | Success Rate | 95% Wilson CI |
| --- | --- | --- | --- |
| Blind retry | 37.5% | 66.5% | [59.7%, 72.7%] |
| With error context | 99.5% | 99.5% | [97.2%, 99.9%] |

Blind retry recovers only 37.5% of failures. Injecting the error trace ("TypeError: 'one hundred' is not float") into the generator's context achieves 99.5% healing. The biological analogy holds: the GroEL cage doesn't just retry folding—it provides a protected environment where the error signal itself guides correction.
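
The error-context loop can be sketched as follows. This is a minimal illustration assuming a `generate` callable and using JSON validity as the fold check; the real loop in eval/suites/healing.py validates against full schemas:

```python
import json
from typing import Callable, Optional

def heal(generate: Callable[[str], str], prompt: str,
         max_attempts: int = 3) -> Optional[dict]:
    """Retry generation, feeding the validation error back as context."""
    context = prompt
    for _ in range(max_attempts):
        output = generate(context)
        try:
            # Fold check: JSON validity here; schema validation in the real suite.
            return json.loads(output)
        except ValueError as err:
            # Error-context healing: the trace itself guides the next attempt.
            context = f"{prompt}\nPrevious output failed: {err}"
    return None  # tagged for degradation after repeated misfolds
```

A blind-retry baseline is the same loop with `context = prompt` held fixed, which is exactly the 37.5% vs. 99.5% comparison in the table.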

Reference Implementation: See eval/suites/bfcl_folding.py and eval/suites/healing.py for the benchmark suites.

3. AgentDojo: Testing the Immune System

AgentDojo (Debenedetti et al., NeurIPS 2024) provides a dynamic environment for evaluating prompt injection attacks and defenses. We adapt its attack templates to test the Adaptive Immunity motif: the T-cell surveillance system that detects compromised agents through behavioral anomaly detection.

3.1 Experimental Setup

We create a population of 40 simulated agents. Eight are "compromised": their behavioral profiles are shifted by injection patterns derived from four AgentDojo attack categories.

The immune system observes 30 training samples per agent to build baseline profiles, then evaluates 20 test observations to classify each agent as healthy or compromised.
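
The profiling-then-classification step can be sketched with a simple z-test on a single behavioral feature. This is a deliberately reduced illustration (one scalar feature, hypothetical function names); the actual detector in eval/suites/agentdojo_immune.py uses richer multi-dimensional profiles and canary signals:

```python
from statistics import mean, stdev

def build_profile(train: list[float]) -> tuple[float, float]:
    """Baseline profile from training observations: (mean, std), std floored."""
    return mean(train), max(stdev(train), 1e-9)

def is_compromised(profile: tuple[float, float], test: list[float],
                   z_threshold: float = 3.0) -> bool:
    """Flag an agent whose test-window mean drifts from its own baseline."""
    mu, sigma = profile
    z = abs(mean(test) - mu) / (sigma / len(test) ** 0.5)
    return z > z_threshold

# Illustrative baseline: a stable behavioral feature around 0.5.
healthy_train = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50]
profile = build_profile(healthy_train)
```

The key design point survives the simplification: each agent is compared against its own baseline, not a global norm, so heterogeneous healthy agents do not inflate the false positive rate.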

3.2 Results

| Suite | Metric | Rate | 95% Wilson CI |
| --- | --- | --- | --- |
| AgentDojo Immune | Sensitivity (true positive) | 100% | [67.6%, 100%] |
| AgentDojo Immune | False positive rate | 0.0% | [0.0%, 10.7%] |
| Synthetic Immune | Sensitivity | 100% | [67.6%, 100%] |
| Synthetic Immune | False positive rate | 3.1% | [0.6%, 15.7%] |

The wide confidence intervals reflect the small sample size (8 compromised agents)—a deliberate choice to model realistic deployments where compromised agents are rare. The sensitivity of 100% with 0% false positives on AgentDojo-derived attacks is encouraging, but the confidence bounds remind us that larger-scale evaluation is needed.

Honest Assessment

These are simulated behavioral shifts, not real LLM-mediated attacks. The eval tests the statistical detection machinery (baseline profiling, anomaly scoring, canary signals), not the full attack-defense loop. Real-world injection resistance requires end-to-end evaluation with actual LLMs—a direction for future work.

Reference Implementation: See eval/suites/agentdojo_immune.py and eval/suites/immune.py for the benchmark suites.

4. Mathematical Corrections

During development of the eval harness, a comprehensive audit revealed several inconsistencies between the paper and the implementation. These are now corrected in both directions: in some cases the code was changed to match the paper, in others the paper was updated to match the code.

4.1 Exponential Saturation, Not Sigmoid

The paper defines perplexity normalization as an exponential saturation function:

$$\sigma(H) = 1 - e^{-H/H_0}$$

This maps $[0, \infty) \to [0, 1)$ with a natural interpretation: $H_0$ is the baseline perplexity where the function reaches $\approx63\%$ saturation. The implementation was using a logistic sigmoid $\frac{1}{1+e^{-x}}$ instead—a function with different domain ($\mathbb{R} \to (0,1)$), different inflection point, and different asymptotic behavior.

Correction

Replaced _sigmoid() with _exponential_saturation() matching the paper's $\sigma(H) = 1 - e^{-H/H_0}$. Renamed perplexity_baseline to perplexity_h0 (the $H_0$ parameter). Default: $H_0 = 2.0$.

4.2 Stagnation Direction

The paper states that stagnation occurs when the Epiplexic Integral falls below a threshold:

$$\mathcal{E}_{\text{window}} < \delta \implies \text{Stagnation}$$

Low Epiplexity means low Bayesian surprise—the agent's outputs carry no new information. The implementation was checking the opposite direction ($\mathcal{E}_w > \delta$), which would flag high-surprise (exploring) agents as stagnant.

Correction

Inverted the stagnation check from integral > threshold to integral < threshold. Changed default threshold from 0.7 to 0.2 (low combined score indicates stagnation).
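
The corrected check can be sketched as follows, assuming for illustration that the windowed Epiplexic Integral is a mean over recent scores (the implementation's integral may be defined differently):

```python
def is_stagnant(epiplexity_window: list[float], delta: float = 0.2) -> bool:
    """Stagnation = LOW Epiplexity: the windowed integral falls BELOW delta.

    The pre-fix code used `> delta`, which would have flagged high-surprise
    (exploring) agents instead of stagnant ones.
    """
    integral = sum(epiplexity_window) / len(epiplexity_window)
    return integral < delta
```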

4.3 Embedding Novelty Normalization

The raw cosine distance $1 - \cos(e_t, e_{t-1})$ has range $[0, 2]$. The implementation correctly normalizes this to $[0, 1]$ via division by 2, but the paper omitted the $\frac{1}{2}$ factor. The corrected Epiplexity equation is:

$$\hat{\mathcal{E}}_t = \alpha \cdot \tfrac{1}{2}(1 - \cos(e_t, e_{t-1})) + (1-\alpha) \cdot \sigma(H(m_t|m_{<t}))$$

Both terms now live in $[0, 1]$, making the $\alpha$-weighted combination interpretable as a proper convex combination.
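
The corrected estimator can be sketched end to end. This is a self-contained illustration of the equation above; the default `alpha` below is an assumption for demonstration, not the framework's configured value:

```python
from math import exp, sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def epiplexity(e_t: list[float], e_prev: list[float], perplexity: float,
               alpha: float = 0.5, h0: float = 2.0) -> float:
    """Convex combination of normalized embedding novelty and saturated perplexity."""
    novelty = 0.5 * (1.0 - cosine(e_t, e_prev))  # the 1/2 keeps this in [0, 1]
    saturation = 1.0 - exp(-perplexity / h0)     # sigma(H) = 1 - e^(-H/H0)
    return alpha * novelty + (1 - alpha) * saturation
```

Because both terms are bounded by 1, the output is guaranteed to lie in [0, 1] for any alpha in [0, 1], which is what makes the stagnation threshold comparison well defined.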

4.4 Equation Renumbering

The paper's 20 numbered equations were scattered non-sequentially across sections (e.g., Section 4 jumped from Eq. 9 to Eq. 16 to Eq. 10). All equations have been renumbered 1–20 in document order.

5. Practical Applications

Three new examples demonstrate the framework in realistic software engineering scenarios:

| Example | Motifs Used | Description |
| --- | --- | --- |
| 43: Code Review Pipeline | CFFL, Quorum | Multi-agent review with conjunctive gating: code ships only if both Generator and Reviewer approve |
| 44: Codebase Q&A (RAG) | Epigenetics, Chaperone | RAG pipeline with Chaperone-validated retrieval and cost-gated context selection |
| 45: Cost Attribution | Metabolic Coalgebra | Fine-grained token accounting per agent with apoptosis on budget exhaustion |

6. What's New in v0.12

| Component | Module | Description |
| --- | --- | --- |
| Eval Harness | `eval/` | Reproducible benchmark runner with 100-seed results and Wilson CI |
| BFCL Folding Suite | `eval/suites/bfcl_folding` | Function-call schema corruption and repair testing |
| AgentDojo Immune Suite | `eval/suites/agentdojo_immune` | Prompt injection detection via behavioral anomaly |
| Epiplexity Math Fix | `operon_ai.health` | Exponential saturation, correct stagnation direction, ½ normalization |
| Paper Alignment | `article/` | Sequential equation numbering, updated type set, corrected cross-references |

7. Interactive Demos

Every biological motif in the framework now has a corresponding interactive demo on HuggingFace Spaces. These run entirely in-browser—no API keys, no setup. Each demo includes curated presets that demonstrate the motif's core behavior: failure modes, recovery dynamics, and parameter sensitivity.

| Motif | Space | Description |
| --- | --- | --- |
| Chaperone Folding | operon-chaperone | Recover structured data from malformed LLM output |
| Healing Loop + Autophagy | operon-healing | Chaperone healing loop with error feedback and autophagy context pruning |
| Membrane + Innate Immunity | operon-membrane | Two-layer defense against prompt injection attacks |
| Regenerative Swarm | operon-swarm | Worker apoptosis on entropy collapse and regeneration with memory transfer |
| Epiplexity Monitor | operon-epiplexity | Epistemic stagnation detection via Bayesian surprise |
| Morphogen Gradients | operon-morphogen | Gradient-based agent coordination without central control |
| Signal Cascade | operon-cascade | Multi-stage signal amplification pipeline with checkpoints |
| Quorum Sensing | operon-quorum | Multi-agent voting with 7 consensus strategies |
| Feedback Loops | operon-feedback | Negative feedback loop homeostasis simulation |
| Oscillators | operon-oscillator | Biological oscillator patterns and waveform visualization |
| Telomere + Genome | operon-lifecycle | Agent lifecycle management with telomere shortening and genome configuration |
| Mitochondria (MIPS) | operon-mitochondria | Safe calculator with AST-based parsing; no injection risk |
| ATP Budget | operon-budget | Multi-currency metabolic energy management |
| Complete Cell | operon-cell | Full 5-organelle pipeline from input to validated output |
| Compliance Review | operon-compliance-review | Multi-stage compliance pipeline with morphogen metadata, cascade stages, and quorum voting |
| Epiplexity Cascade | operon-epiplexity-cascade | Stagnation detection with escalating healing: autophagy, regeneration, abort |
| Immunity Router | operon-immunity-router | Threat-severity routing: passthrough, chaperone repair, autophagy cleanup, or hard reject |
| Morphogen Swarm | operon-morphogen-swarm | Gradient-guided swarm where failed workers update signals for successors |
| Adaptive Orchestrator | operon-orchestrator | End-to-end ticket processing combining all major operon mechanisms |
| Repair Memory | operon-repair-memory | Epigenetic repair memory: stores healing strategies as histone markers for reuse |
| Scheduled Maintenance | operon-scheduled-maintenance | Oscillator-driven context pruning with feedback-adjusted toxicity thresholds |
| Swarm Cleanup | operon-swarm-cleanup | Graceful worker shutdown with autophagy cleanup and clean state transfer to successors |

The demos are built with Gradio and deployed from the huggingface/space-*/ directories in the repository. Each includes tunable parameters so that edge cases—entropy collapse, budget exhaustion, stagnation recovery—can be explored directly.

8. Conclusion

Operon v0.12 shifts the project from "biologically inspired architecture" to "empirically validated architecture." The eval harness provides reproducible, statistically grounded evidence for the framework's claims. The mathematical corrections ensure that the paper and the code agree precisely, not just approximately or directionally.

The results are honest. The Chaperone repairs structural damage (65.5% cascade accuracy) but cannot recover semantic information lost to type mutations (6.5% strict). The Immune System detects all simulated compromises but operates on behavioral distributions, not actual LLM outputs. These are the bounds of what the current implementation can claim.

The next frontier is end-to-end evaluation: real LLMs, real attacks, real function calls. The eval harness provides the scaffolding; the benchmarks provide the methodology. What remains is connecting the simulated biology to the living system.

The framework is available at github.com/coredipper/operon, pypi.org/project/operon-ai, and huggingface.co/coredipper (22 interactive demos). Feedback welcome.