When Your Diagnostic Beats Your Result

We built a patch-apply pipeline to disambiguate Phase 2’s “model or harness?” failure mode. The score didn’t move — but the failure became attributable.

Bogdan Banu · April 17, 2026

operon-ai v0.34.5

The question the previous run couldn’t answer

Two weeks ago we shipped SWE-bench Phase 2 and got a clean 0/30 across baseline, organism, and LangGraph on SWE-bench-lite. We framed it as a model-capability ceiling. The trouble: 28 of those 30 submissions never reached a test verdict at all. They were rejected at git apply. The model produced text that looked like diffs and the harness refused them.

That mixes two failure modes:

(1) The model selecting the wrong file path (or inventing one that doesn’t exist in the target repo).
(2) The model writing hunk headers and bodies that git apply can’t consume: placeholder line numbers, mismatched counts, paths doubled with the repo name.

If most of the failures are (1), the fix is grounding the prompt in the actual repository. If they’re (2), grounding doesn’t help — you’re looking at a diff-format-discipline ceiling. The original write-up couldn’t tell the difference, so we couldn’t tell which fix to invest in.

v0.34.5 is the diagnostic. It separates the failure modes. The headline number is still bad — in fact slightly worse on raw count — but the failure is now attributable. That turns out to be the result.

The diagnostic, in three pieces

1. A patch sanitizer that refuses known-bad output

eval/_patch_sanitizer.py sits between extract_patch() and the harness. It runs four passes over each candidate diff:

Path normalization. The model often emits --- a/django/django/forms/foo.py for repo django/django — doubling the repo name onto the real path. The sanitizer strips one django/ when the repo name and owner segment match, or strips the full {owner}/{repo}/ prefix when they don’t. Same logic applies to diff --git, rename from/to, and copy from/to lines — partial normalization would leave inconsistent old/new paths and still fail to apply.
Placeholder hunk rejection. Hunk headers like @@ -XXX,10 +XXX,10 @@ get rejected. git apply would refuse with a cryptic “missing line number” message; the sanitizer catches it earlier and returns "".
Hunk body validation. For each @@ -a,b +c,d @@, the sanitizer walks the body and checks that additions plus context equal d, deletions plus context equal b. Truncated or overlong hunks — declared 10 lines, body has 7 — get rejected. After a long review chain on this code (more on that below), the validator is count-driven: it consumes exactly the declared body length, so adversarial body content shaped like file headers (--- a/foo.py as the deletion of a line whose content starts with -- a/foo.py) can’t fool it.
Bare-empty context repair. Whitespace-stripping editors sometimes turn a single-space context line into an empty line. git apply rejects bare empty lines in hunk bodies; the sanitizer rewrites them back to " " before validation.
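The path-normalization pass in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in eval/_patch_sanitizer.py: the function names strip_repo_prefix and normalize_header_line are hypothetical, and the sketch assumes the caller only passes file-header lines (never hunk-body lines) and that paths contain no spaces.

```python
# Hypothetical sketch of repo-prefix stripping on diff header lines.
PATH_PREFIXES = ("--- ", "+++ ", "rename from ", "rename to ",
                 "copy from ", "copy to ")


def strip_repo_prefix(path: str, repo: str) -> str:
    """Remove a doubled '{owner}/{repo}/' prefix, keeping any a/ or b/ side marker."""
    owner, name = repo.split("/", 1)
    side = ""
    if path.startswith(("a/", "b/")):
        side, path = path[:2], path[2:]
    doubled = f"{owner}/{name}/"
    if path.startswith(doubled):
        # owner == name (django/django): drop a single segment, since the
        # package directory shares the repo name. Otherwise drop owner/name/.
        cut = len(owner) + 1 if owner == name else len(doubled)
        path = path[cut:]
    return side + path


def normalize_header_line(line: str, repo: str) -> str:
    """Apply the same stripping to every header form so old/new paths stay consistent."""
    if line.startswith("diff --git "):
        parts = line.split(" ")
        return " ".join(parts[:2] + [strip_repo_prefix(p, repo) for p in parts[2:]])
    for prefix in PATH_PREFIXES:
        if line.startswith(prefix):
            return prefix + strip_repo_prefix(line[len(prefix):], repo)
    return line
```

Routing every header form through one helper is what keeps the old and new paths in agreement; normalizing only the ---/+++ pair would leave diff --git or rename lines pointing at the doubled path.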

The contract: if sanitize() returns a non-empty string, the patch is well-formed enough for git apply to make a real attempt at applying it. If it returns "", the harness records empty_patch. That is the honest categorization, because the alternative is a silent apply error that gets conflated with infrastructure failures.

2. Repo grounding that puts real files in the prompt

--grounding turns on a second pipeline. For each instance, the script shallow-clones {repo}@{base_commit} via git fetch --depth 1 origin {sha} into .cache/swebench/{owner}__{repo}-{commit12}/. Then a heuristic ranker mines the problem_statement + hints_text for plausible identifiers (.py paths, CamelCase class names, snake_case functions), walks the repo, and ranks Python files by stem-match (+3) and path-match (+1). Top 5 candidate files get their first 200 lines injected into the prompt as a Repository context block.
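A minimal sketch of the ranking step described above, under stated assumptions: the regexes, the function names (mine_identifiers, rank_files, repository_context), and the exact meaning of "stem-match" and "path-match" are guesses for illustration. Only the +3 / +1 weights, the top-5 cutoff, and the 200-line excerpt come from the description.

```python
import re
from pathlib import PurePosixPath

# Assumed identifier shapes: .py paths, CamelCase names, snake_case names.
PY_PATH_RE = re.compile(r"[\w./-]+\.py\b")
CAMEL_RE = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")
SNAKE_RE = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")


def mine_identifiers(text: str) -> set:
    """Collect lowercased candidate identifiers from issue text."""
    idents = set()
    for match in PY_PATH_RE.findall(text):
        idents.add(PurePosixPath(match).stem.lower())
    for match in CAMEL_RE.findall(text) + SNAKE_RE.findall(text):
        idents.add(match.lower())
    return idents


def rank_files(text: str, py_files: list, top_k: int = 5) -> list:
    """Score files: +3 when the file stem equals an identifier, +1 when it only appears in the path."""
    idents = mine_identifiers(text)
    scored = []
    for path in py_files:
        stem = PurePosixPath(path).stem.lower()
        score = 0
        for ident in idents:
            if stem == ident:
                score += 3
            elif ident in path.lower():
                score += 1
        if score:
            scored.append((score, path))
    scored.sort(key=lambda sp: (-sp[0], sp[1]))  # best score first, then path
    return [path for _, path in scored[:top_k]]


def repository_context(sources: dict, ranked: list, max_lines: int = 200) -> str:
    """Build a prompt block from the first max_lines lines of each ranked file."""
    parts = ["Repository context"]
    for path in ranked:
        head = sources[path].splitlines()[:max_lines]
        parts.append(f"### {path}\n" + "\n".join(head))
    return "\n\n".join(parts)
```

In this sketch, rank_files takes the list of Python file paths found by walking the clone, and repository_context takes a mapping from path to file text, so neither function touches the filesystem.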

For organism and LangGraph, the same context flows through every stage — SkillOrganism.run(task) passes task unchanged to each stage’s prompt, so injecting context once reaches localize, edit, and verify.

The sanitizer also gets a tree oracle: when it sees a path that doesn’t exist in the cloned tree, it tries to fuzzy-correct to a unique basename match (e.g. the model wrote django/settings.py, which doesn’t exist at that location: if exactly one settings.py exists elsewhere in the tree, the path is rewritten to it). If the basename has zero or multiple matches, the patch is rejected. Per-file-diff state means create / rename / copy targets pass through unchanged: a file that doesn’t exist at base_commit is the entire point of a creation patch.
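The unique-basename rule can be sketched as below. The function name correct_path and the creates_file flag are hypothetical stand-ins for the per-file-diff state mentioned above, and the file list is assumed to come from walking the cloned tree.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def correct_path(path: str, tree_files: Iterable[str],
                 creates_file: bool = False) -> Optional[str]:
    """Return a usable path, or None when the patch should be rejected.

    creates_file marks create / rename / copy targets, which are expected
    to be absent from the tree and therefore pass through unchanged.
    """
    files = set(tree_files)
    if creates_file or path in files:
        return path
    basename = PurePosixPath(path).name
    matches = [f for f in files if PurePosixPath(f).name == basename]
    # Rewrite only when the basename is unambiguous; zero or several
    # candidates means there is no safe correction.
    return matches[0] if len(matches) == 1 else None
```

Refusing to guess among multiple candidates is the conservative choice here: a wrong rewrite would turn a detectable path error into a patch applied to the wrong file.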

3. Honest reporting that distinguishes failure modes

Two small but consequential additions to the results artifact:

EVAL_RUNTIME_ERROR as a status distinct from sanitizer-rejected empty_patch. When a model call raises (Ollama API timeout, network glitch, OOM), the prediction carries an error_reason field. classify_prediction(error_reason, harness_status) overrides any harness verdict to runtime_error when an error_reason is set — including not_evaluated under --skip-harness and even harness-reported error. The override is unconditional because the harness can’t see exceptions; only we can.
mean_latency_ms divides by completed predictions, not by n. Before this fix, two timeouts on a 10-instance run deflated the baseline mean by 20%: the sum of 8 real latencies was divided by 10.
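Both reporting rules fit in a few lines. The sketch below assumes predictions are plain dicts carrying the error_reason and latency_ms fields named above; the return types and the 0.0 fallback for an all-failed run are assumptions, not the shipped behavior.

```python
from typing import Optional, Sequence


def classify_prediction(error_reason: Optional[str], harness_status: str) -> str:
    """An error_reason always wins: the harness cannot see model-call exceptions."""
    if error_reason:
        return "runtime_error"
    return harness_status


def mean_latency_ms(predictions: Sequence[dict]) -> float:
    """Average latency over completed predictions only (those without an error_reason)."""
    completed = [p["latency_ms"] for p in predictions if not p.get("error_reason")]
    if not completed:
        return 0.0  # assumed fallback when every call failed
    return sum(completed) / len(completed)
```

With eight completed calls and two timeouts recorded at latency_ms=0.0, the mean is taken over the eight real latencies, so the placeholder zeros never enter the average.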

And a CLI mode for keeping the artifact honest after the fact: python -m eval.swebench_phase2 --rewrite-envelope eval/results/swebench_phase2.json re-resolves the model identity against live Ollama, recomputes the post-run digest check, and rewrites the envelope through the same writer the live run uses. The verification tag is taken from the artifact (not from --model), so a stale CLI default can’t silently verify the wrong model.

Sidebar: nine roborev iterations on the diff parser. The patch sanitizer’s diff parser went through nine code-review rounds (#724 through #748) before converging. Each round narrowed the same axis: shape-based heuristics for distinguishing file headers from hunk-body content kept being defeated by adversarial content. The structural fix was count-driven hunk consumption: inside a hunk body, a line prefixed with - is always a delete, regardless of what comes after it. Only at boundaries between hunks do shape checks come back, and only as a full ---/+++/@@ triplet for bare multi-file diffs. The lesson: when parsing untrusted content, length-prefixed formats are robust; delimiter-terminated ones are fragile. Unified diffs are sort-of-both, and the correct parser uses each where it’s available.
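Count-driven consumption can be illustrated with a short validator. This is a sketch of the idea rather than the shipped parser: validate_hunk is a hypothetical name, and the "\ No newline at end of file" marker is deliberately left out to keep the example small.

```python
import re
from typing import List, Optional

# Real line numbers only; a placeholder header such as
# "@@ -XXX,10 +XXX,10 @@" fails to match and is rejected.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def validate_hunk(header: str, following: List[str]) -> Optional[int]:
    """Return how many body lines the hunk consumes, or None if it is malformed."""
    m = HUNK_RE.match(header)
    if m is None:
        return None
    # An omitted count means 1 in unified diff format.
    old_left = int(m.group(2)) if m.group(2) is not None else 1
    new_left = int(m.group(4)) if m.group(4) is not None else 1
    consumed = 0
    while old_left > 0 or new_left > 0:
        if consumed >= len(following):
            return None  # truncated: body ended before the declared counts
        line = following[consumed] or " "  # bare-empty context repair
        tag = line[0]
        if tag == " ":
            old_left -= 1
            new_left -= 1
        elif tag == "-":
            old_left -= 1  # always a delete here, even if it looks like "--- a/foo.py"
        elif tag == "+":
            new_left -= 1
        else:
            return None
        if old_left < 0 or new_left < 0:
            return None  # body disagrees with the header counts
        consumed += 1
    return consumed
```

Because the loop stops exactly when both declared counts reach zero, whatever follows is by construction outside the hunk, and only there does the caller need to ask whether a line looks like a header.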

The grounded rerun

Same model (gemma4:latest, digest c6eb396dbd59, 8B Q4_K_M), same 10 SWE-bench-lite instances, --grounding active across all three conditions:

Condition	Unresolved	Sanitizer-rejected	Runtime error	Evaluated	Mean latency
baseline	1	7	2	1/10	131s
organism	0	10	0	0/10	170s
langgraph	0	10	0	0/10	172s

Comparison with the original (un-grounded, un-sanitized) run:

	Original (v0.34.4)	Grounded (v0.34.5)
Total evaluated (resolved + unresolved)	2/30	1/30
Harness `error` (apply-fail)	20	0
Sanitizer-rejected / `empty_patch`	8	27
Runtime errors (model never returned)	0 (counted as `empty_patch`)	2 (separate status)
Resolved	0	0

Every error from the original run collapsed to either sanitizer-rejected or runtime-error. The sanitizer’s pre-rejection means no malformed patch reaches the harness anymore, so what previously appeared as 20 application failures is now correctly attributed to the model.

The single survivor is django/django-11001 under baseline, which reached the harness as unresolved: the patch applied cleanly, but tests still failed. The same instance also survived in the original run. Something about the issue’s structure (likely a small, localized hunk) makes it more amenable to small-model output. It’s the only positive control we have.

The reframe

Grounding solves failure mode (1). Every prompt now contains the actual files at base_commit, the sanitizer’s tree oracle catches near-miss paths, and the heuristic ranker reliably surfaces the right files (smoke-tested on a real astropy issue: all 5 candidate files came from astropy/modeling/ for a modeling problem).

The fact that 27 of the 28 model-returning submissions still drop to sanitizer-rejection — with the dominant reasons being placeholder hunk headers, mismatched header counts, and paths that don’t exist in the cloned tree even when the right files are right there in the prompt — localizes the bottleneck to (2). At 8B / Q4_K_M, the binding constraint isn’t file selection. It’s the model’s ability to emit format-correct unified diffs.

That’s a real capability statement, not “inconclusive.” Paper 5 §6.3 was retitled accordingly: 8B Format Discipline Is the Ceiling.

Side observations

The organism format leak closed silently

The original run had a 4-vs-0 empty_patch gap between organism and baseline — the three-stage pipeline was leaking critique into where a diff should have been. Phase A’s tightened [edit]-stage instruction (“your entire response for this stage must be a single fenced diff block and nothing else”) eliminated it. In the grounded rerun, both baseline and organism produce 0 extracted patches across the 8 non-timeout cases — the gap is closed in the wrong direction (baseline gained one, organism didn’t), which is consistent with the localize/edit/verify decomposition asking the model to maintain diff format while juggling stage outputs. A discipline tax the single-shot baseline avoids.

Grounding is overhead at 8B

Baseline mean latency went from 44s in the original run to 131s with grounding, organism from 88s to 170s, and langgraph from 90s to 172s. The 30 KB of repository context per prompt roughly triples per-call wall-clock time for baseline and doubles it for the two multi-stage conditions, with zero compensating gain in evaluated instances. Grounding’s cost-benefit changes with stronger models: if format discipline weren’t the binding constraint, the file context could actually translate into better patches. At 8B, it’s overhead.

Two API timeouts and what they teach

Two of the 30 model calls (both baseline, on astropy-12907 and astropy-14995) hit the OpenAI-compatible client’s default timeout. The exception-handler path constructed empty-patch predictions with latency_ms=0.0. In an earlier draft, those were serialized as empty_patch, indistinguishable from sanitizer rejections, and the mean latency was divided by 10 instead of 8. Code review caught it (Roborev #747). The fix was the EVAL_RUNTIME_ERROR status with an explicit error_reason; the test suite now pins the exact two known-failed rows so a regeneration can’t silently lose them.

Importantly, those two instances also failed under organism and langgraph (which completed normally). So the runtime errors don’t refute the format-discipline claim — the model never produced a usable diff for those instances regardless of which condition.

What this run does and does not say

It does say: at 8B / Q4_K_M, the model’s ability to emit git apply-clean unified diffs is the binding constraint on SWE-bench-lite resolution. Three-stage decomposition does not relax it; if anything, it adds a discipline tax. File context in the prompt does not relax it either — the model’s problem isn’t finding the right file.

It does not say: the organism architecture is intrinsically worse than direct prompting. The sample is too small (1 evaluated baseline vs 0 organism) to discriminate. Nor does it say grounding is universally overhead — at this model scale it is, but the same infrastructure should add real value once format discipline isn’t the bottleneck.

It also doesn’t say the structural guarantees of the categorical harness are weakened. Certificate preservation across compilers (Paper 5 §5), priority gating, stagnation detection — these are Know-level properties verified independently of SWE-bench task resolution. The grounded null doesn’t touch them.

The meta-insight

Building a diagnostic localizes the problem before any fix is possible. Phase A and Phase B were infrastructure investments that did not improve the headline number on SWE-bench-lite. They did convert an unattributable failure mode (error — could be model, could be infrastructure) into an attributable one (empty_patch with sanitizer-reject reasons, distinct from runtime_error). That’s the value pattern: when the failure mode isn’t identifiable, building the diagnostic is upstream of any fix. Future runs — stronger models, format-correction retry loops, RL on diff format — now have a clean attribution surface to test against.

The unstated subtext of a lot of agent-evaluation work is “the resolved-rate is the result.” Sometimes it isn’t. Sometimes the result is “we built the thing that lets us tell why the rate is what it is, and now there’s a clean next experiment.”

What’s next

Three named wedges, in decreasing likelihood-of-moving-the-result:

Stronger model. Same code, swap --model qwen3:latest or escape the 8B class via Modal/RunPod for a 70B+. Format discipline is a capability that scales; the bottleneck this run identified is exactly the one that bigger / better-trained models tend to clear first.
Format-correction retry. Re-prompt the model on sanitizer rejection with the specific failure reason (“your hunk header had XXX placeholders — please use real line numbers from the file shown above”). Doubles cost per failure but might recover 20-40% of sanitizer drops. Code change in the runners; not yet shipped.
LLM-localized candidate ranking. Use the organism’s localize stage as the file ranker for baseline too. In smoke tests the heuristic ranker already finds plausible files, so the expected gain is marginal.
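For the format-correction retry wedge, one possible shape is sketched below. Nothing here is shipped: generate_with_retry is a hypothetical helper, generate stands in for the model call, and sanitize is assumed to return a (patch, reason) pair, which differs from the string-returning sanitize() described earlier.

```python
from typing import Callable, Tuple


def generate_with_retry(generate: Callable[[str], str],
                        sanitize: Callable[[str], Tuple[str, str]],
                        prompt: str,
                        max_retries: int = 1) -> str:
    """Re-prompt with the sanitizer's rejection reason; return "" if every attempt fails."""
    patch, reason = sanitize(generate(prompt))
    for _ in range(max_retries):
        if patch:
            break
        # Feed the specific failure back so the model can correct the format.
        retry_prompt = (
            f"{prompt}\n\n"
            f"Your previous diff was rejected: {reason}\n"
            "Return a corrected unified diff only."
        )
        patch, reason = sanitize(generate(retry_prompt))
    return patch
```

With max_retries=1 a rejected instance costs exactly one extra model call, which matches the "doubles cost per failure" estimate above.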

None of these need more harness work. The patch-apply pipeline shipped in v0.34.5 is the substrate they all run on top of.

Honest summary

The shipped artifact is a patch-apply pipeline (sanitizer + grounding + classification) you can drop into any future SWE-bench experiment, plus a clean v0.34.5 baseline showing the format-discipline ceiling for our local 8B model. The headline number is bad and we’re reporting it as bad. The infrastructure paid off as a diagnostic, not as a performance booster — and that’s the result we’re shipping. The next data point is a model swap, not more infrastructure.

Code: PR #53 (sanitizer + prompts), PR #55 (grounding + fuzzy correction + hardened parser), PR #56 (grounded rerun + paper §6.3 + runtime_error classification). Released as operon-ai v0.34.5 on PyPI. Release notes.