Score-Rejection Isn’t Cert-Firing (and n=10 Isn’t Enough)

I ran the same structural-critic experiment twice on the same 10 SWE-bench-lite instances. Two different pass@1 deltas. Zero certificate fires on either run. An honest update from an earlier post that conflated two different signals.
Bogdan Banu · April 21, 2026 (updated)
operon-openhands-gates v0.1.0a3

What changed since the first post

An earlier version of this post (same URL, rewritten) reported a single n=10 run with a null correctness delta and argued “the critic fires, the outcome doesn’t move.” Two things turned out to be wrong about that framing:

  1. The critic didn’t fire. The behavioral_stability_windowed certificate was never emitted on that run. What I was calling “the critic firing” was actually the iterative-refinement loop retrying because CriticResult.score < success_threshold. That’s a different signal from sustained-stagnation cert-firing, and I was conflating them.
  2. A single n=10 run is dominated by run-to-run variance; it is not a stable measurement. I reran the same experiment after shipping a library fix (v0.1.0a3 added [CERT-FIRE] stdout logging to disambiguate this exact question). Same 10 instances, same model, same prompts. Different pass@1: 60% on the first run, 70% on the second. Retry depth also moved: max attempt=2 on run one, max attempt=3 on run two.

The useful finding isn’t either of those specific numbers. It’s the distinction those numbers force: score-rejection and cert-firing are two separate signals in OperonStagnationCritic, and only the first one appears on this slice.

The two signals

When OpenHands’ iterative-refinement loop decides whether to retry a trajectory, it reads CriticResult.score and compares it to a configurable success_threshold. OperonStagnationCritic returns the epiplexic integral as its score — a rolling-window measure of how much the agent’s output embeddings are changing. Low integral ⇒ low score ⇒ retry.

Separately, the critic tracks whether the integral has stayed below self.threshold=0.2 for critical_duration=3 consecutive measurements. When that condition holds, it emits a behavioral_stability_windowed certificate — structured evidence that the agent is in sustained stagnation, not just a temporary dip.

Score-rejection is “the integral is low right now, retry.”
Cert-firing is “the integral has been consistently low, this isn’t going to unstick.”

Both signals are real. Only one of them fired on either of my runs.
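
The difference between the two predicates can be sketched in a few lines of Python. This is a toy model, not the library's implementation: the class name TwoSignalSketch, the helper summarize, the success_threshold value of 0.5, and both integral series are made up for illustration, and the sketch treats every measurement as if the refinement loop inspected it.

```python
from collections import deque


class TwoSignalSketch:
    """Toy model of the two signals (hypothetical, not the library's code)."""

    def __init__(self, success_threshold, cert_threshold=0.2, critical_duration=3):
        self.success_threshold = success_threshold  # loose: "retry now"
        self.cert_threshold = cert_threshold        # tight: "sustained stagnation"
        self.critical_duration = critical_duration
        # Only the last `critical_duration` measurements matter for the cert.
        self._recent = deque(maxlen=critical_duration)

    def observe(self, integral):
        """Return (score_rejected, cert_fired) for one new measurement."""
        self._recent.append(integral)
        score_rejected = integral < self.success_threshold
        cert_fired = (
            len(self._recent) == self.critical_duration
            and all(v < self.cert_threshold for v in self._recent)
        )
        return score_rejected, cert_fired


def summarize(series, success_threshold=0.5):
    """Count score-rejections and cert fires over a series of integrals."""
    sketch = TwoSignalSketch(success_threshold)
    rejections = cert_fires = 0
    for value in series:
        rejected, fired = sketch.observe(value)
        rejections += rejected
        cert_fires += fired
    return rejections, cert_fires


# Made-up series: one oscillates around the tight threshold, one stays under it.
oscillating = [0.15, 0.45, 0.18, 0.60, 0.10, 0.30]
sustained = [0.15, 0.12, 0.18, 0.40]

print(summarize(oscillating))  # (5, 0): many rejections, no certificate
print(summarize(sustained))    # (4, 1): the third consecutive low value fires
```

The oscillating series is rejected by score five times yet never satisfies the three-consecutive-below condition, which is the pattern both runs showed.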

Two runs, same slice

Same setup on both: OpenHands’ CodeActAgent, GPT-5 with reasoning_effort=high, 10 SWE-bench-lite instances (the ones whose Docker images built cleanly on my Mac — 9 django + 1 astropy). Same prompts. Same OperonStagnationCritic(threshold=0.2, window=10, critical_duration=3). Patches evaluated through SWE-bench’s test harness, not judged by the critic.

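Metric                       | Baseline   | Run 1 (v0.1.0a2)           | Run 2 (v0.1.0a3)
-----------------------------|------------|----------------------------|----------------------------
pass@1                       | 60% (6/10) | 60% (6/10)                 | 70% (7/10)
Instances rejected-by-score  | 0          | 6                          | 6
Retry rounds                 | 0          | 6 (5 completed + 1 aborted)| 12 (all 6 reached attempt=3)
Retry-flip improved          | -          | 0                          | 1 (django-11815)
Retry-flip broke             | -          | 0                          | 0
Cumulative cost              | $4.21      | $6.58 (+56%)               | $9.94 (+136%)
Per-retry-round cost         | -          | $0.47                      | $0.48
Certificates emitted         | -          | 0                          | 0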

The score-rejection rate is stable across runs (60%, same 6 instances both times). The retry depth and the outcome delta are not stable — run two went deeper into the retry ladder and actually recovered one instance that run one left unresolved.

Zero cert fires across both runs. On 12 retry rounds in run two with integrals dipping below success_threshold repeatedly, not once did the integral stay below self.threshold=0.2 for three consecutive measurements. The score oscillates; the sustained-stagnation predicate doesn’t hold.
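
The cost figures in the table above can be rechecked with a little arithmetic. The sketch below assumes that run one's per-retry-round cost divides by its 5 completed rounds (excluding the aborted one); that reading is an assumption, chosen because it reproduces the reported $0.47. The function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

BASELINE = Decimal("4.21")


def overhead_pct(total):
    """Cumulative cost overhead over baseline, as a whole percentage."""
    pct = (total - BASELINE) / BASELINE * 100
    return int(pct.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def per_round_cost(total, rounds):
    """Extra spend per retry round, rounded to cents."""
    cost = (total - BASELINE) / rounds
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


run1_total, run2_total = Decimal("6.58"), Decimal("9.94")

print(overhead_pct(run1_total), overhead_pct(run2_total))  # 56 136
# Assumption: run 1 is divided by its 5 completed rounds, run 2 by all 12.
print(per_round_cost(run1_total, 5), per_round_cost(run2_total, 12))  # 0.47 0.48
```

Under that assumption the per-round cost is flat across runs, so the jump from +56% to +136% comes from retry depth rather than from individual retries getting more expensive.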

What that tells us about the critic

1. Score-rejection is doing the retry work

The 60 pp gap in “instances with ≥1 rejection” between baseline and treatment is real and stable. Six of ten instances produced low enough integrals on some measurement that the iterative-refinement loop retried. That’s a measurable behavioral divergence from the default LLM critic, which accepts the first patch on every instance.

But the score is a cheap proxy: any dip below threshold triggers retry, even a brief one. It doesn’t carry the evidential payload of the certificate (which records the specific violating windows). It’s an on/off signal, not a structural attestation.

2. Cert-firing is the structural signal, and it’s strict

The behavioral_stability_windowed theorem says: every window in the last critical_duration rolling windows had a mean severity above the stability threshold. That’s a stronger claim than “the score dipped low once.” On GPT-5’s trajectories, this condition didn’t hold on any of the 10 instances in either run — the integral fluctuates enough that three-consecutive-below is rare.

That’s useful information on its own. It tells me that at this threshold and window size, on this model, the certificate is a stricter predicate than the retry loop uses. The critic has two thresholds to care about: one for “retry now” (loose) and one for “emit a certificate of stagnation” (tight). GPT-5 apparently oscillates above the tight one.

3. Outcome delta rides the noise

On the same slice, same prompts, same critic config, I got +0 pp pass@1 one day and +10 pp the next. That’s a single instance flipping (django-11815 went unresolved → resolved on the second run, via an attempt=3 retry). At n=10 that’s a 10 pp swing from one data point.

Which means: if I’d only run it once, I’d have reported either “null delta” or “+10 pp improvement,” and both would have been wrong as structural claims. They’d be right as run-specific observations, but wrong if quoted as “what the critic does.” n=10 on a deterministic-seeming experiment is still within model-noise territory.

The blog-post-sized takeaway from n=10: the critic’s score-rejection signal is stable across runs (60% rate both times); the downstream outcome effect is not (0 pp to +10 pp swing on identical inputs); the cert predicate is strict enough that it doesn’t fire at all on this model/threshold combination. Any narrative about “does the structural critic improve agent correctness” needs more than one run to adjudicate.
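
One rough yardstick for how little n=10 pins down is a 95% Wilson score interval on each pass@1. The sketch below is not part of the harness, and it treats the 10 instances as independent Bernoulli draws, which is a simplification: the two runs share instances, so the run-to-run noise here comes from model sampling rather than instance sampling.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


for label, resolved in [("run 1", 6), ("run 2", 7)]:
    low, high = wilson_interval(resolved, 10)
    print(f"{label}: {resolved}/10, 95% interval {low:.1%} to {high:.1%}")
```

The intervals come out at roughly 31% to 83% for 6/10 and 40% to 89% for 7/10. They overlap almost entirely, so a one-instance flip cannot separate the two runs.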

What this does and doesn’t say

It says the epiplexic-integral detector is producing a non-trivial signal: 60% rejection rate, stable across runs, with a cumulative cost overhead of +56% to +136% (about 1.6× to 2.4× baseline) depending on retry depth. That’s measurable. The detector is doing work.

It also says the threshold=0.2 + critical_duration=3 combination tuned in Paper 4 §4.3 is calibrated for a different model family (sentence-MiniLM embeddings at n=300). On GPT-5, the sustained-below-0.2 condition is too strict to fire under normal retry pressure — the agent’s outputs oscillate enough that it never stays low for three consecutive measurements. That’s not a bug; it’s a calibration finding. If you want cert fires on GPT-5, threshold or critical_duration needs to be loosened for this model.

It doesn’t say “the retry helps” or “the retry doesn’t help” as a structural claim. On run one it didn’t; on run two it did, by exactly one instance. That’s a question about a distribution, not about a point estimate. The honest answer is “probably close to zero mean effect with high variance at n=10 on GPT-5 django.”

It doesn’t generalize beyond this slice. n=10 is small; the slice is django-heavy because Docker-Desktop-on-Mac’s apt-signature bug blocked the matplotlib/sympy/flask/pytest images during inference; the model is fixed (GPT-5, high-reasoning).

Implications for structural critics

Two that I didn’t fully appreciate before running this twice.

Distinguish the signals in the public API. OperonStagnationCritic.score is a cheap continuous signal for the retry loop; self._certificate is the expensive structural attestation for downstream auditing. The first version of scripts/generate_delta_artifact.py counted “rejections” via retry presence and called them “cert fires.” The v0.1.0a3 update splits them: instances_with_rejection (retry-triggering) and certificates_emitted (cert payload on disk). On these runs, the first is 6 and the second is 0 — surprising, and worth surfacing in the artifact schema.
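
The split between instances_with_rejection and certificates_emitted might look like the sketch below. Only those two field names come from the artifact; the per-instance record layout (instance_id, attempts, certificates), the placeholder instance ids, and the function summarize_run are hypothetical and are not taken from scripts/generate_delta_artifact.py.

```python
def summarize_run(records):
    """Count retry-triggering rejections and cert payloads separately."""
    return {
        # An instance was rejected by score if it needed more than one attempt.
        "instances_with_rejection": sum(1 for r in records if r["attempts"] > 1),
        # A cert fire is counted only when a certificate payload exists.
        "certificates_emitted": sum(len(r["certificates"]) for r in records),
    }


# Hypothetical records shaped like run two: 6 instances retried, no certificates.
records = [
    {
        "instance_id": f"instance-{i:02d}",  # placeholder ids, not real instances
        "attempts": 3 if i < 6 else 1,
        "certificates": [],
    }
    for i in range(10)
]

print(summarize_run(records))
# {'instances_with_rejection': 6, 'certificates_emitted': 0}
```

Counting the two from different evidence (attempt count versus certificate payload) is what keeps a retry from being reported as a cert fire.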

Calibrate per model, not per theorem. The structural-guarantee thesis is model-independent at the theorem layer — behavioral_stability_windowed means the same thing regardless of embedder. But the thresholds that make the theorem useful are model-dependent. threshold=0.2 was meaningful for the sentence-MiniLM substrate that Paper 4 validated at n=300. For a 2026-tier reasoning model with different output distributions, it apparently needs to be looser to fire at all. Fixed thresholds don’t port; the structural guarantee ports, the calibration doesn’t.
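
A per-model calibration pass could start as a small grid sweep over threshold and critical_duration, counting how many instances would have produced a certificate. The sketch below is an illustration only: the integral traces are invented rather than taken from the runs, and would_fire and sweep are hypothetical helpers, not library functions.

```python
def would_fire(trace, threshold, critical_duration):
    """True if the trace has `critical_duration` consecutive values below threshold."""
    streak = 0
    for value in trace:
        streak = streak + 1 if value < threshold else 0
        if streak >= critical_duration:
            return True
    return False


def sweep(traces, thresholds, durations):
    """Map each (threshold, critical_duration) pair to the number of firing instances."""
    return {
        (t, d): sum(would_fire(trace, t, d) for trace in traces.values())
        for t in thresholds
        for d in durations
    }


# Invented integral traces, one per instance; real ones would come from run logs.
traces = {
    "instance_a": [0.15, 0.45, 0.18, 0.60, 0.10, 0.30],
    "instance_b": [0.25, 0.22, 0.28, 0.40, 0.35],
    "instance_c": [0.10, 0.19, 0.50, 0.12, 0.18],
}

for (threshold, duration), fires in sweep(traces, [0.2, 0.3], [2, 3]).items():
    print(f"threshold={threshold} critical_duration={duration}: {fires} instance(s) fire")
```

On these invented traces the 0.2 / 3 setting fires on nothing, while loosening either knob starts to produce fires. A real sweep would need logged integrals per model before any threshold could be recommended.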

What ships

operon-openhands-gates v0.1.0a3 is on PyPI.

Raw artifact + both runs’ reports are in eval/results/. Harness code is in scripts/.

Caveats for citation

  1. At n=10, one instance is worth 10 pp of pass@1, and the two runs on identical inputs differed by exactly that (60% vs. 70%). Any structural claim about the critic’s outcome effect on this slice requires more runs and a distribution, not a point estimate. Treat single-run deltas at this scale as noise unless corroborated.
  2. Sample bias. n=10, 9 django + 1 astropy — the 10 that survived Docker-Desktop-on-Mac’s apt-signature bug. Not representative of full SWE-bench-lite. A Linux-host or --workspace remote rerun at n=30 is the natural follow-up.
  3. Threshold calibration. threshold=0.2 and critical_duration=3 were tuned on sentence-MiniLM at n=300 (Paper 4). On GPT-5 in this harness, that combination didn’t produce a single cert fire across the 22 measurement points recorded on rejected instances. Different model, different ideal threshold.
  4. Single model. GPT-5 only. Claude Sonnet 4.6 or another family would likely show a different score/cert-fire distribution.

What’s next

  1. Threshold sweep on a second model. If Sonnet 4.6 or Claude Opus produces different integral distributions on the same 10 instances, the per-model calibration finding becomes concrete. One extra run at a second model is cheap.
  2. Unbiased n=30 rerun on Linux. Kills the sample-bias caveat and gets the variance estimate onto a real sample.
  3. Sibling replication on operon-langgraph-gates. Same substrate, different harness. Would test whether the cert-vs-score distinction generalizes across frameworks — the strongest evidence for the framework-portability claim I’ve been making.
  4. Terminate-on-fire, not retry-on-fire. Still worth testing, but only on trajectories where the cert actually fires. Until the threshold is recalibrated for GPT-5, there’s nothing on this slice to trigger it.

Honest summary

A structural stagnation critic pointed at GPT-5 on SWE-bench-lite rejects 60% of first-attempt patches by score across two independent runs (stable). The rejected instances get retried and the downstream outcome delta is 0 pp on run one and +10 pp on run two — n=10 noise, not a stable measurement. The certificate — the actual structural attestation that distinguishes this critic from a probabilistic judge — didn’t fire on either run. The threshold calibrated on Paper 4’s MiniLM substrate is too strict for GPT-5’s output distribution.

The useful part isn’t any single number. It’s that two runs on the same inputs tell two different outcome stories, and both are wrong as structural claims.

Code: harness (#3) · eval (#5) · cert stdout log (#6) · attempt-union accounting (#7). Released as operon-openhands-gates v0.1.0a3 on PyPI. Artifact: swebench_lite_delta.md.