Operon v0.12: Verification

Eval Harness, Mathematical Corrections, and Reproducible Benchmarks

Bogdan Banu · bogdan@banu.be

operon-ai v0.12.0

Summary

Operon v0.12 introduces a reproducible eval harness with two benchmark suites derived from established academic benchmarks: BFCL (Berkeley Function Calling Leaderboard) for structural validation and AgentDojo for prompt injection immunity. This release also corrects several mathematical inconsistencies between the paper and the implementation—most significantly, replacing the logistic sigmoid with the paper's exponential saturation in the Epiplexity formula, and fixing the stagnation detection direction. All claims are now backed by 100-seed reproducible results.

1. The Problem: Verifying Biological Claims

The previous releases introduced mechanisms inspired by Systems Biology: Chaperone Proteins for structural healing, T-cell receptors for anomaly detection, Epiplexity for epistemic stagnation monitoring. These biological analogies are compelling—but analogies are not evidence.

A framework that claims "Chaperone-mediated folding repairs malformed JSON" needs to demonstrate this under controlled conditions. A system claiming "Immune surveillance detects prompt injection" needs adversarial evaluation, not just unit tests. v0.12 addresses this gap.

Design Principle

Every biological motif should have a corresponding empirical evaluation. The eval harness maps each motif to an established benchmark: Chaperone → BFCL function-call folding, Immune System → AgentDojo injection detection.

2. BFCL: Testing the Chaperone

The Berkeley Function Calling Leaderboard (BFCL) evaluates how well models generate structured function calls. We adapt it to test the Chaperone protein motif—the retry-and-repair loop that folds malformed LLM outputs into valid schemas.

2.1 The Folding Analogy

In molecular biology, newly synthesized proteins emerge as linear amino acid chains that must fold into precise 3D structures. Chaperone proteins (GroEL/GroES) provide a protected environment for refolding attempts. If a protein fails to fold repeatedly, it's tagged for degradation.

The BFCL suite tests this analogy directly. We take valid function-call schemas from the BFCL dataset and corrupt them with realistic LLM failure modes:

| Corruption | Biological Analogue | Example |
| --- | --- | --- |
| Trailing commas | Premature stop codon | `{"a": 1,}` |
| Single quotes | Wrong amino acid | `{'key': 'val'}` |
| Unquoted keys | Missing post-translational modification | `{key: "val"}` |
| Type swap | Charge reversal mutation | `"42"` instead of `42` |
| Missing field | Domain deletion | Required field omitted |
| Extra field | Domain insertion | Unexpected field added |
| Prose wrapping | Signal peptide not cleaved | `Here is the JSON: {...}` |
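
The corruption modes above can be sketched as simple string transforms. The snippet below is an illustrative reduction covering three of the seven modes; the function names are hypothetical and the actual suite in eval/suites/bfcl_folding.py may differ:

```python
import json
import random

# Hypothetical sketch of three corruption modes; not the shipped suite.

def corrupt_trailing_comma(s: str) -> str:
    # {"a": 1} -> {"a": 1,}  (premature stop codon)
    return s.replace("}", ",}", 1)

def corrupt_single_quotes(s: str) -> str:
    # Swap double quotes for single quotes, which is invalid JSON.
    return s.replace('"', "'")

def corrupt_prose_wrap(s: str) -> str:
    # Wrap the payload in chatty prose, as LLMs often do.
    return f"Here is the JSON: {s}"

CORRUPTIONS = [corrupt_trailing_comma, corrupt_single_quotes, corrupt_prose_wrap]

def corrupt(sample: dict, n: int = 2, seed: int = 0) -> str:
    """Apply n distinct random corruptions to a valid function-call payload."""
    rng = random.Random(seed)
    s = json.dumps(sample)
    for fn in rng.sample(CORRUPTIONS, n):
        s = fn(s)
    return s
```

Seeding the RNG per sample is what makes the 100-seed runs reproducible: the same seed always yields the same corruption sequence.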

2.2 Results

We evaluate two metrics: strict accuracy (exact match after folding) and cascade accuracy (valid structure, possibly with semantic differences). Each sample receives 1–3 random corruptions.

| Suite | Metric | Rate | 95% Wilson CI |
| --- | --- | --- | --- |
| BFCL Folding | Cascade | 65.5% | [58.7%, 71.7%] |
| BFCL Folding | Strict | 6.5% | [3.8%, 10.8%] |
| Synthetic Folding | Cascade | 57.0% | [50.1%, 63.7%] |
| Synthetic Folding | Strict | 5.0% | [2.7%, 9.0%] |

The gap between cascade and strict accuracy is expected: structural repair (removing trailing commas, fixing quotes) succeeds at high rates, but recovering the exact original values from type-swapped or missing fields requires semantic understanding beyond pure syntax repair. This mirrors biology: chaperones can refold structure, but cannot repair sequence mutations.

2.3 Healing: Error Context Matters

The Chaperone Loop (Section 5.5.1 of the paper) specifies that error traces should be fed back to the generator for context-aware correction. We test this directly:

| Healing Mode | Healed Rate | Success Rate | 95% Wilson CI |
| --- | --- | --- | --- |
| Blind retry | 37.5% | 66.5% | [59.7%, 72.7%] |
| With error context | 99.5% | 99.5% | [97.2%, 99.9%] |

Blind retry recovers only 37.5% of failures. Injecting the error trace ("TypeError: 'one hundred' is not float") into the generator's context achieves 99.5% healing. The biological analogy holds: the GroEL cage doesn't just retry folding—it provides a protected environment where the error signal itself guides correction.
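
The error-context loop can be sketched as follows. This is a minimal illustration assuming a `generate` callable and using JSON validity as the fold check; the real loop in eval/suites/healing.py validates against full schemas:

```python
import json
from typing import Callable, Optional

def heal(generate: Callable[[str], str], prompt: str,
         max_attempts: int = 3) -> Optional[dict]:
    """Retry generation, feeding the validation error back as context."""
    context = prompt
    for _ in range(max_attempts):
        output = generate(context)
        try:
            # Fold check: JSON validity here; schema validation in the real suite.
            return json.loads(output)
        except ValueError as err:
            # Error-context healing: the trace itself guides the next attempt.
            context = f"{prompt}\nPrevious output failed: {err}"
    return None  # tagged for degradation after repeated misfolds
```

A blind-retry baseline is the same loop with `context = prompt` held fixed, which is exactly the 37.5% vs. 99.5% comparison in the table.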

Reference Implementation: See eval/suites/bfcl_folding.py and eval/suites/healing.py for the benchmark suites.

3. AgentDojo: Testing the Immune System

AgentDojo (Debenedetti et al., NeurIPS 2024) provides a dynamic environment for evaluating prompt injection attacks and defenses. We adapt its attack templates to test the Adaptive Immunity motif: the T-cell surveillance system that detects compromised agents through behavioral anomaly detection.

3.1 Experimental Setup

We create a population of 40 simulated agents. Eight are "compromised": their behavioral profiles are shifted by injection patterns derived from four AgentDojo attack categories.

The immune system observes 30 training samples per agent to build baseline profiles, then evaluates 20 test observations to classify each agent as healthy or compromised.
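
The profiling-then-classification step can be sketched with a simple z-test on a single behavioral feature. This is a deliberately reduced illustration (one scalar feature, hypothetical function names); the actual detector in eval/suites/agentdojo_immune.py uses richer multi-dimensional profiles and canary signals:

```python
from statistics import mean, stdev

def build_profile(train: list[float]) -> tuple[float, float]:
    """Baseline profile from training observations: (mean, std), std floored."""
    return mean(train), max(stdev(train), 1e-9)

def is_compromised(profile: tuple[float, float], test: list[float],
                   z_threshold: float = 3.0) -> bool:
    """Flag an agent whose test-window mean drifts from its own baseline."""
    mu, sigma = profile
    z = abs(mean(test) - mu) / (sigma / len(test) ** 0.5)
    return z > z_threshold

# Illustrative baseline: a stable behavioral feature around 0.5.
healthy_train = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50]
profile = build_profile(healthy_train)
```

The key design point survives the simplification: each agent is compared against its own baseline, not a global norm, so heterogeneous healthy agents do not inflate the false positive rate.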

3.2 Results

| Suite | Metric | Rate | 95% Wilson CI |
| --- | --- | --- | --- |
| AgentDojo Immune | Sensitivity (true positive) | 100% | [67.6%, 100%] |
| AgentDojo Immune | False positive rate | 0.0% | [0.0%, 10.7%] |
| Synthetic Immune | Sensitivity | 100% | [67.6%, 100%] |
| Synthetic Immune | False positive rate | 3.1% | [0.6%, 15.7%] |

The wide confidence intervals reflect the small sample size (8 compromised agents)—a deliberate choice to model realistic deployments where compromised agents are rare. The sensitivity of 100% with 0% false positives on AgentDojo-derived attacks is encouraging, but the confidence bounds remind us that larger-scale evaluation is needed.

Honest Assessment

These are simulated behavioral shifts, not real LLM-mediated attacks. The eval tests the statistical detection machinery (baseline profiling, anomaly scoring, canary signals), not the full attack-defense loop. Real-world injection resistance requires end-to-end evaluation with actual LLMs—a direction for future work.

Reference Implementation: See eval/suites/agentdojo_immune.py and eval/suites/immune.py for the benchmark suites.

4. Mathematical Corrections

During development of the eval harness, a comprehensive audit revealed several inconsistencies between the paper and the implementation. These are now corrected in both directions: in some cases the code was changed to match the paper, in others the paper was updated to match the code.

4.1 Exponential Saturation, Not Sigmoid

The paper defines perplexity normalization as an exponential saturation function:

$$\sigma(H) = 1 - e^{-H/H_0}$$

This maps $[0, \infty) \to [0, 1)$ with a natural interpretation: $H_0$ is the baseline perplexity where the function reaches $\approx63\%$ saturation. The implementation was using a logistic sigmoid $\frac{1}{1+e^{-x}}$ instead—a function with different domain ($\mathbb{R} \to (0,1)$), different inflection point, and different asymptotic behavior.

Correction

Replaced _sigmoid() with _exponential_saturation() matching the paper's $\sigma(H) = 1 - e^{-H/H_0}$. Renamed perplexity_baseline to perplexity_h0 (the $H_0$ parameter). Default: $H_0 = 2.0$.

4.2 Stagnation Direction

The paper states that stagnation occurs when the Epiplexic Integral falls below a threshold:

$$\mathcal{E}_{\text{window}} < \delta \implies \text{Stagnation}$$

Low Epiplexity means low Bayesian surprise—the agent's outputs carry no new information. The implementation was checking the opposite direction ($\mathcal{E}_w > \delta$), which would flag high-surprise (exploring) agents as stagnant.

Correction

Inverted the stagnation check from integral > threshold to integral < threshold. Changed default threshold from 0.7 to 0.2 (low combined score indicates stagnation).
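
The corrected check can be sketched as follows, assuming for illustration that the windowed Epiplexic Integral is a mean over recent scores (the implementation's integral may be defined differently):

```python
def is_stagnant(epiplexity_window: list[float], delta: float = 0.2) -> bool:
    """Stagnation = LOW Epiplexity: the windowed integral falls BELOW delta.

    The pre-fix code used `> delta`, which would have flagged high-surprise
    (exploring) agents instead of stagnant ones.
    """
    integral = sum(epiplexity_window) / len(epiplexity_window)
    return integral < delta
```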

4.3 Embedding Novelty Normalization

The raw cosine distance $1 - \cos(e_t, e_{t-1})$ has range $[0, 2]$. The implementation correctly normalizes this to $[0, 1]$ via division by 2, but the paper omitted the $\frac{1}{2}$ factor. The corrected Epiplexity equation is:

$$\hat{\mathcal{E}}_t = \alpha \cdot \tfrac{1}{2}(1 - \cos(e_t, e_{t-1})) + (1-\alpha) \cdot \sigma(H(m_t|m_{<t}))$$

Both terms now live in $[0, 1]$, making the $\alpha$-weighted combination interpretable as a proper convex combination.
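
The corrected estimator can be sketched end to end. This is a self-contained illustration of the equation above; the default `alpha` below is an assumption for demonstration, not the framework's configured value:

```python
from math import exp, sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def epiplexity(e_t: list[float], e_prev: list[float], perplexity: float,
               alpha: float = 0.5, h0: float = 2.0) -> float:
    """Convex combination of normalized embedding novelty and saturated perplexity."""
    novelty = 0.5 * (1.0 - cosine(e_t, e_prev))  # the 1/2 keeps this in [0, 1]
    saturation = 1.0 - exp(-perplexity / h0)     # sigma(H) = 1 - e^(-H/H0)
    return alpha * novelty + (1 - alpha) * saturation
```

Because both terms are bounded by 1, the output is guaranteed to lie in [0, 1] for any alpha in [0, 1], which is what makes the stagnation threshold comparison well defined.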

4.4 Equation Renumbering

The paper's 20 numbered equations were scattered non-sequentially across sections (e.g., Section 4 jumped from Eq. 9 to Eq. 16 to Eq. 10). All equations have been renumbered 1–20 in document order.

5. Practical Applications

Three new examples demonstrate the framework in realistic software engineering scenarios:

| Example | Motifs Used | Description |
| --- | --- | --- |
| 43: Code Review Pipeline | CFFL, Quorum | Multi-agent review with conjunctive gating: code ships only if both Generator and Reviewer approve |
| 44: Codebase Q&A (RAG) | Epigenetics, Chaperone | RAG pipeline with Chaperone-validated retrieval and cost-gated context selection |
| 45: Cost Attribution | Metabolic Coalgebra | Fine-grained token accounting per agent with apoptosis on budget exhaustion |

6. What's New in v0.12

| Component | Module | Description |
| --- | --- | --- |
| Eval Harness | `eval/` | Reproducible benchmark runner with 100-seed results and Wilson CI |
| BFCL Folding Suite | `eval/suites/bfcl_folding` | Function-call schema corruption and repair testing |
| AgentDojo Immune Suite | `eval/suites/agentdojo_immune` | Prompt injection detection via behavioral anomaly |
| Epiplexity Math Fix | `operon_ai.health` | Exponential saturation, correct stagnation direction, ½ normalization |
| Paper Alignment | `article/` | Sequential equation numbering, updated type set, corrected cross-references |

7. Interactive Demos

Every biological motif in the framework now has a corresponding interactive demo on HuggingFace Spaces. These run entirely in-browser—no API keys, no setup. Each demo includes curated presets that demonstrate the motif's core behavior: failure modes, recovery dynamics, and parameter sensitivity.

| Motif | Space | Description |
| --- | --- | --- |
| Chaperone Folding | operon-chaperone | Recover structured data from malformed LLM output |
| Healing Loop + Autophagy | operon-healing | Chaperone healing loop with error feedback and autophagy context pruning |
| Membrane + Innate Immunity | operon-membrane | Two-layer defense against prompt injection attacks |
| Regenerative Swarm | operon-swarm | Worker apoptosis on entropy collapse and regeneration with memory transfer |
| Epiplexity Monitor | operon-epiplexity | Epistemic stagnation detection via Bayesian surprise |
| Morphogen Gradients | operon-morphogen | Gradient-based agent coordination without central control |
| Signal Cascade | operon-cascade | Multi-stage signal amplification pipeline with checkpoints |
| Quorum Sensing | operon-quorum | Multi-agent voting with 7 consensus strategies |
| Feedback Loops | operon-feedback | Negative feedback loop homeostasis simulation |
| Oscillators | operon-oscillator | Biological oscillator patterns and waveform visualization |
| Telomere + Genome | operon-lifecycle | Agent lifecycle management with telomere shortening and genome configuration |
| Mitochondria (MIPS) | operon-mitochondria | Safe calculator with AST-based parsing; no injection risk |
| ATP Budget | operon-budget | Multi-currency metabolic energy management |
| Complete Cell | operon-cell | Full 5-organelle pipeline from input to validated output |
| Compliance Review | operon-compliance-review | Multi-stage compliance pipeline with morphogen metadata, cascade stages, and quorum voting |
| Epiplexity Cascade | operon-epiplexity-cascade | Stagnation detection with escalating healing: autophagy, regeneration, abort |
| Immunity Router | operon-immunity-router | Threat-severity routing: passthrough, chaperone repair, autophagy cleanup, or hard reject |
| Morphogen Swarm | operon-morphogen-swarm | Gradient-guided swarm where failed workers update signals for successors |
| Adaptive Orchestrator | operon-orchestrator | End-to-end ticket processing combining all major operon mechanisms |
| Repair Memory | operon-repair-memory | Epigenetic repair memory: stores healing strategies as histone markers for reuse |
| Scheduled Maintenance | operon-scheduled-maintenance | Oscillator-driven context pruning with feedback-adjusted toxicity thresholds |
| Swarm Cleanup | operon-swarm-cleanup | Graceful worker shutdown with autophagy cleanup and clean state transfer to successors |

The demos are built with Gradio and deployed from the huggingface/space-*/ directories in the repository. Each includes tunable parameters so that edge cases—entropy collapse, budget exhaustion, stagnation recovery—can be explored directly.

8. Conclusion

Operon v0.12 shifts the project from "biologically inspired architecture" to "empirically validated architecture." The eval harness provides reproducible, statistically grounded evidence for the framework's claims. The mathematical corrections ensure that the paper and the code agree precisely, not just approximately or directionally.

The results are honest. The Chaperone repairs structural damage (65.5% cascade accuracy) but cannot recover semantic information lost to type mutations (6.5% strict). The Immune System detects all simulated compromises but operates on behavioral distributions, not actual LLM outputs. These are the bounds of what the current implementation can claim.

The next frontier is end-to-end evaluation: real LLMs, real attacks, real function calls. The eval harness provides the scaffolding; the benchmarks provide the methodology. What remains is connecting the simulated biology to the living system.

The framework is available at github.com/coredipper/operon, pypi.org/project/operon-ai, and huggingface.co/coredipper (22 interactive demos). Feedback welcome.