When Your Diagnostic Beats Your Result
The question the previous run couldn’t answer
Two weeks ago we shipped SWE-bench Phase 2 and got a clean 0/30 across baseline, organism, and LangGraph on SWE-bench-lite. We framed it as a model-capability ceiling. The trouble: 28 of those 30 submissions never reached a test verdict at all. They were rejected at git apply. The model produced text that looked like diffs and the harness refused them.
That mixes two failure modes:
1. The model selecting the wrong file path (or inventing one that doesn’t exist in the target repo).
2. The model writing hunk headers and bodies that git apply can’t consume: placeholder line numbers, mismatched counts, paths doubled with the repo name.
If most of the failures are (1), the fix is grounding the prompt in the actual repository. If they’re (2), grounding doesn’t help — you’re looking at a diff-format-discipline ceiling. The original write-up couldn’t tell the difference, so we couldn’t tell which fix to invest in.
v0.34.5 is the diagnostic. It separates the failure modes. The headline number is still bad — in fact slightly worse on raw count — but the failure is now attributable. That turns out to be the result.
The diagnostic, in three pieces
1. A patch sanitizer that refuses known-bad output
eval/_patch_sanitizer.py sits between extract_patch() and the harness. It runs four passes over each candidate diff:
- Path normalization. The model often emits --- a/django/django/forms/foo.py for repo django/django, doubling the repo name onto the real path. The sanitizer strips one django/ when the repo name and owner segment match, or strips the full {owner}/{repo}/ prefix when they don’t. The same logic applies to diff --git, rename from/to, and copy from/to lines; partial normalization would leave inconsistent old/new paths and still fail to apply.
- Placeholder hunk rejection. Hunk headers like @@ -XXX,10 +XXX,10 @@ get rejected. git apply would refuse with a cryptic “missing line number” message; the sanitizer catches it earlier and returns "".
- Hunk body validation. For each @@ -a,b +c,d @@, the sanitizer walks the body and checks that additions plus context equal d and deletions plus context equal b. Truncated or overlong hunks (declared 10 lines, body has 7) get rejected. After a long review chain on this code (more on that below), the validator is count-driven: it consumes exactly the declared body length, so adversarial body content shaped like file headers (--- a/foo.py as the deletion of a line whose content starts with -- a/foo.py) can’t fool it.
- Bare-empty context repair. Whitespace-stripping editors sometimes turn a single-space context line into an empty line. git apply rejects bare empty lines in hunk bodies; the sanitizer rewrites them back to " " before validation.
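The count-driven idea behind the last two passes can be sketched roughly as follows. This is a minimal illustration, not the code in eval/_patch_sanitizer.py: the names HUNK_HEADER and validate_hunks are hypothetical, the boundary check is simplified, and it assumes placeholder headers were already rejected by an earlier pass.

```python
import re

# Strict numeric hunk header; a missing count defaults to 1, as in unified diff.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def validate_hunks(lines):
    """Return True if every hunk body matches its declared line counts."""
    i, seen_hunk = 0, False
    while i < len(lines):
        m = HUNK_HEADER.match(lines[i])
        i += 1
        if m is None:
            continue  # file header or other between-hunk line
        seen_hunk = True
        old_left = int(m.group(1)) if m.group(1) is not None else 1
        new_left = int(m.group(2)) if m.group(2) is not None else 1
        # Consume exactly the declared body length; never sniff for headers here.
        while old_left > 0 or new_left > 0:
            if i >= len(lines):
                return False  # truncated hunk
            tag = (lines[i] or " ")[0]  # bare-empty line treated as context
            if tag == " ":
                old_left -= 1
                new_left -= 1
            elif tag == "-":
                old_left -= 1
            elif tag == "+":
                new_left -= 1
            elif tag != "\\":  # "\ No newline at end of file" counts for nothing
                return False
            if old_left < 0 or new_left < 0:
                return False  # body disagrees with the header counts
            i += 1
        # Boundary: a leftover body-shaped line means the hunk was overlong,
        # unless it starts a ---/+++/@@ triplet for the next file.
        if i < len(lines) and lines[i][:1] in (" ", "+", "-"):
            nxt = lines[i:i + 3]
            triplet = (len(nxt) == 3 and nxt[0].startswith("--- ")
                       and nxt[1].startswith("+++ ") and nxt[2].startswith("@@"))
            if not triplet:
                return False
    return seen_hunk
```

Because the inner loop only counts, a deletion line that happens to read --- a/foo.py is consumed as an ordinary delete; header shapes are only examined once the declared counts are exhausted.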
The contract: if sanitize() returns a non-empty string, git apply will at least make a sincere attempt at it. If it returns "", the harness records empty_patch — and that’s honest categorization, because the alternative would have been silent error conflated with infrastructure failures.
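To make the contract concrete, here is a small sketch of the first two passes (path normalization and placeholder rejection) returning "" on rejection. The function names strip_repo_prefix and sanitize_headers are hypothetical, not the shipped API, and the sketch deliberately ignores the header-versus-body ambiguity that the count-driven validator exists to handle.

```python
import re

NUMERIC_HUNK = re.compile(r"^@@ -\d+(,\d+)? \+\d+(,\d+)? @@")


def strip_repo_prefix(path, repo):
    """Undo a doubled repo name, keeping any a/ or b/ side prefix."""
    owner, name = repo.split("/", 1)
    side = ""
    if path.startswith(("a/", "b/")):
        side, path = path[:2], path[2:]
    if owner == name:
        # e.g. django/django: strip one segment only when the name is doubled.
        if path.startswith(f"{name}/{name}/"):
            path = path[len(name) + 1:]
    elif path.startswith(f"{owner}/{name}/"):
        path = path[len(owner) + len(name) + 2:]
    return side + path


def sanitize_headers(patch, repo):
    """Return a cleaned patch, or "" if any hunk header is a placeholder."""
    out = []
    for line in patch.splitlines():
        if line.startswith("@@") and not NUMERIC_HUNK.match(line):
            return ""  # e.g. "@@ -XXX,10 +XXX,10 @@"
        # Simplification: treats every "--- "/"+++ " line as a file header.
        if line.startswith(("--- ", "+++ ")):
            line = line[:4] + strip_repo_prefix(line[4:], repo)
        out.append(line)
    return "\n".join(out) + "\n" if out else ""
```

A caller only has to test the return value for truthiness: a non-empty string goes to the harness, and "" is recorded as empty_patch.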
2. Repo grounding that puts real files in the prompt
--grounding turns on a second pipeline. For each instance, the script shallow-clones {repo}@{base_commit} via git fetch --depth 1 origin {sha} into .cache/swebench/{owner}__{repo}-{commit12}/. Then a heuristic ranker mines the problem_statement + hints_text for plausible identifiers (.py paths, CamelCase class names, snake_case functions), walks the repo, and ranks Python files by stem-match (+3) and path-match (+1). Top 5 candidate files get their first 200 lines injected into the prompt as a Repository context block.
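A rough sketch of the ranking step, under assumptions: the regexes, the function names (mine_terms, rank_files, repository_context), and the exact reading of "stem-match" (a mined term equals the file stem) and "path-match" (a mined term appears somewhere in the relative path) are our illustration here, not necessarily what the shipped ranker does.

```python
import re
from pathlib import Path


def mine_terms(text):
    """Collect lowercase identifiers that might name a file in the repo."""
    stems = {Path(p).stem for p in re.findall(r"[\w/]+\.py\b", text)}
    camel = re.findall(r"\b(?:[A-Z][a-z0-9]+){2,}\b", text)
    snake = re.findall(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b", text)
    return {t.lower() for t in (*stems, *camel, *snake)}


def rank_files(repo_root, text, top_k=5):
    """Score each .py file: +3 per term equal to its stem, +1 per term in its path."""
    root = Path(repo_root)
    terms = mine_terms(text)
    scored = []
    for f in root.rglob("*.py"):
        rel = f.relative_to(root).as_posix()
        stem, low = f.stem.lower(), rel.lower()
        score = sum(3 if t == stem else 1 if t in low else 0 for t in terms)
        if score:
            scored.append((-score, rel))  # negative so sorted() puts best first
    return [rel for _, rel in sorted(scored)[:top_k]]


def repository_context(repo_root, files, max_lines=200):
    """Build the prompt block from the first max_lines of each ranked file."""
    parts = ["Repository context:"]
    for rel in files:
        text = (Path(repo_root) / rel).read_text(errors="replace")
        head = "\n".join(text.splitlines()[:max_lines])
        parts.append(f"### {rel}\n{head}")
    return "\n\n".join(parts)
```

Files that match no mined term are dropped rather than padded in, so a vague issue text can yield fewer than five candidates in this sketch.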
For organism and LangGraph, the same context flows through every stage — SkillOrganism.run(task) passes task unchanged to each stage’s prompt, so injecting context once reaches localize, edit, and verify.
The sanitizer also gets a tree oracle: when it sees a path that doesn’t exist in the cloned tree, it tries to fuzzy-correct to a unique basename match (e.g. the model wrote django/settings.py but the real file is django/conf/global_settings.py — if exactly one settings.py exists in the tree, rewrite). If the basename has zero or multiple matches, reject. Per-file-diff state means create / rename / copy targets pass through unchanged: a file that doesn’t exist at base_commit is the entire point of a creation patch.
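The unique-basename rule is small enough to sketch directly. The function name correct_path, the creates_file flag, and the example tree below are hypothetical; the real sanitizer tracks create / rename / copy state per file diff rather than taking a flag.

```python
def correct_path(path, tree_files, creates_file=False):
    """Return a usable path, or None when the patch should be rejected."""
    if creates_file or path in tree_files:
        return path  # creation targets need not exist at base_commit
    basename = path.rsplit("/", 1)[-1]
    matches = [f for f in tree_files if f.rsplit("/", 1)[-1] == basename]
    # Rewrite only on a unique basename match; zero or many is ambiguous.
    return matches[0] if len(matches) == 1 else None


# Hypothetical tree for illustration only.
tree = {"pkg/conf/settings.py", "pkg/forms/fields.py", "pkg/db/fields.py"}
fixed = correct_path("pkg/settings.py", tree)      # unique basename: rewritten
ambiguous = correct_path("pkg/fields.py", tree)    # two candidates: rejected
```

Rejecting on multiple matches is the conservative choice: guessing between two fields.py files would turn a visible sanitizer rejection into a patch that applies to the wrong file.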
3. Honest reporting that distinguishes failure modes
Two small but consequential additions to the results artifact:
- EVAL_RUNTIME_ERROR as a status distinct from sanitizer-rejected empty_patch. When a model call raises (Ollama API timeout, network glitch, OOM), the prediction carries an error_reason field. classify_prediction(error_reason, harness_status) overrides any harness verdict to runtime_error when an error_reason is set, including not_evaluated under --skip-harness and even harness-reported error. The override is unconditional because the harness can’t see exceptions; only we can.
- mean_latency_ms divides by completed predictions, not by n. Two timeouts on a 10-instance run were deflating the baseline mean by 20% before this fix, because the sum was divided by 10 instead of 8.
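Both rules fit in a few lines. The sketch below assumes predictions are plain dicts with latency_ms and an optional error_reason; the classify_prediction signature follows the description above, but the bodies are our illustration rather than the shipped code.

```python
def classify_prediction(error_reason, harness_status):
    """An exception on our side always wins over whatever the harness saw."""
    if error_reason:
        return "runtime_error"
    return harness_status


def mean_latency_ms(predictions):
    """Average latency over completed predictions only."""
    done = [p["latency_ms"] for p in predictions if not p.get("error_reason")]
    return sum(done) / len(done) if done else 0.0


# Eight completed calls at 1000 ms plus two timeouts recorded with latency 0.0.
preds = [{"latency_ms": 1000.0} for _ in range(8)]
preds += [{"latency_ms": 0.0, "error_reason": "timeout"} for _ in range(2)]
naive = sum(p["latency_ms"] for p in preds) / len(preds)  # divides by n
fixed = mean_latency_ms(preds)                            # divides by completed
```

With the illustrative numbers above, dividing by n gives 800 ms while dividing by completed predictions gives 1000 ms.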
And a CLI mode for keeping the artifact honest after the fact: python -m eval.swebench_phase2 --rewrite-envelope eval/results/swebench_phase2.json re-resolves the model identity against live Ollama, recomputes the post-run digest check, and rewrites the envelope through the same writer the live run uses. The verification tag is taken from the artifact (not from --model), so a stale CLI default can’t silently verify the wrong model.
Inside a hunk body, a line prefixed with - is always a delete, regardless of what comes after it. Only at boundaries between hunks do shape checks come back, and only as a full ---/+++/@@ triplet for bare multi-file diffs. The lesson: when parsing untrusted content, length-prefixed formats are robust; delimiter-terminated ones are fragile. Unified diffs are sort-of-both, and the correct parser uses each where it’s available.
The grounded rerun
Same model (gemma4:latest, digest c6eb396dbd59, 8B Q4_K_M), same 10 SWE-bench-lite instances, --grounding active across all three conditions:
| Condition | Resolved | Unresolved | Sanitizer-rejected | Runtime error | Evaluated | Mean latency |
|---|---|---|---|---|---|---|
| baseline | 0 | 1 | 7 | 2 | 1/10 | 131s |
| organism | 0 | 0 | 10 | 0 | 0/10 | 170s |
| langgraph | 0 | 0 | 10 | 0 | 0/10 | 172s |
Comparison with the original (un-grounded, un-sanitized) run:
| | Original (v0.34.4) | Grounded (v0.34.5) |
|---|---|---|
| Total evaluated (resolved + unresolved) | 2/30 | 1/30 |
| Harness error (apply-fail) | 20 | 0 |
| Sanitizer-rejected / empty_patch | 8 | 27 |
| Runtime errors (model never returned) | 0 (counted as empty_patch) | 2 (separate status) |
| Resolved | 0 | 0 |
Every error from the original run collapsed to either sanitizer-rejected or runtime-error. The sanitizer’s pre-rejection means no malformed patch reaches the harness anymore, so what previously appeared as 20 application failures is now correctly attributed to the model.
django/django-11001 baseline reached the harness as unresolved — patch applied cleanly, tests still failed. That’s also the same instance that survived in the original run. Something about the issue’s structure (likely a small, localized hunk) makes it more amenable to small-model output. It’s the only positive control we have.
The reframe
Grounding solves failure mode (1). Every prompt now contains the actual files at base_commit, the sanitizer’s tree oracle catches near-miss paths, and the heuristic ranker reliably surfaces the right files (smoke-tested on a real astropy issue: all 5 candidate files came from astropy/modeling/ for a modeling problem).
The fact that 27 of the 28 model-returning submissions still drop to sanitizer-rejection — with the dominant reasons being placeholder hunk headers, mismatched header counts, and paths that don’t exist in the cloned tree even when the right files are right there in the prompt — localizes the bottleneck to (2). At 8B / Q4_K_M, the binding constraint isn’t file selection. It’s the model’s ability to emit format-correct unified diffs.
That’s a real capability statement, not “inconclusive.” Paper 5 §6.3 was retitled accordingly: 8B Format Discipline Is the Ceiling.
Side observations
The organism format leak closed silently
The original run had a 4-vs-0 empty_patch gap between organism and baseline — the three-stage pipeline was leaking critique into where a diff should have been. Phase A’s tightened [edit]-stage instruction (“your entire response for this stage must be a single fenced diff block and nothing else”) eliminated it. In the grounded rerun, both baseline and organism produce 0 extracted patches across the 8 non-timeout cases — the gap is closed in the wrong direction (baseline gained one, organism didn’t), which is consistent with the localize/edit/verify decomposition asking the model to maintain diff format while juggling stage outputs. A discipline tax the single-shot baseline avoids.
Grounding is overhead at 8B
Baseline mean latency went from 44s in the original run to 131s with grounding, organism from 88s to 170s, and langgraph from 90s to 172s. The 30 KB of repository context per prompt roughly doubles (organism, langgraph) or triples (baseline) per-call wall-clock, with zero compensating gain in evaluated instances. Grounding’s cost-benefit changes with stronger models: if format discipline weren’t the binding constraint, the file context would actually translate to better patches. At 8B, it’s overhead.
Two API timeouts and what they teach
Two of the 30 model calls (both baseline, on astropy-12907 and astropy-14995) timed out at the OpenAI-compatible client’s default. The exception-handler path constructed empty-patch predictions with latency_ms=0.0. In an earlier draft, those got serialized as empty_patch indistinguishable from sanitizer-rejections, and the mean latency divided by 10 instead of 8. Code review caught it (Roborev #747). The fix was the EVAL_RUNTIME_ERROR status with explicit error_reason; the test suite now pins the exact two known-failed rows so a regeneration can’t silently lose them.
Importantly, those two instances also failed under organism and langgraph (which completed normally). So the runtime errors don’t refute the format-discipline claim — the model never produced a usable diff for those instances regardless of which condition.
What this run does and does not say
It does say: at 8B / Q4_K_M, the model’s ability to emit git apply-clean unified diffs is the binding constraint on SWE-bench-lite resolution. Three-stage decomposition does not relax it; if anything, it adds a discipline tax. File context in the prompt does not relax it either — the model’s problem isn’t finding the right file.
It does not say: the organism architecture is intrinsically worse than direct prompting. The sample is too small (1 evaluated baseline vs 0 organism) to discriminate. Nor does it say grounding is universally overhead — at this model scale it is, but the same infrastructure should add real value once format discipline isn’t the bottleneck.
It also doesn’t say the structural guarantees of the categorical harness are weakened. Certificate preservation across compilers (Paper 5 §5), priority gating, stagnation detection — these are Know-level properties verified independently of SWE-bench task resolution. The grounded null doesn’t touch them.
The meta-insight
This run turned an unattributable failure (harness error, which could be model, could be infrastructure) into an attributable one (empty_patch with sanitizer-reject reasons, distinct from runtime_error). That’s the value pattern: when the failure mode isn’t identifiable, building the diagnostic is upstream of any fix. Future runs (stronger models, format-correction retry loops, RL on diff format) now have a clean attribution surface to test against.
The unstated subtext of a lot of agent-evaluation work is “the resolved-rate is the result.” Sometimes it isn’t. Sometimes the result is “we built the thing that lets us tell why the rate is what it is, and now there’s a clean next experiment.”
What’s next
Three named wedges, in decreasing likelihood-of-moving-the-result:
- Stronger model. Same code, swap --model qwen3:latest or escape the 8B class via Modal/RunPod for a 70B+. Format discipline is a capability that scales; the bottleneck this run identified is exactly the one that bigger / better-trained models tend to clear first.
- Format-correction retry. Re-prompt the model on sanitizer rejection with the specific failure reason (“your hunk header had XXX placeholders, please use real line numbers from the file shown above”). Doubles cost per failure but might recover 20-40% of sanitizer drops. Code change in the runners; not yet shipped.
- LLM-localized candidate ranking. Use the organism’s localize stage as the file ranker for baseline too. Smoke tests show the heuristic ranker already finds plausible files, so this is marginal.
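Since the format-correction retry is not yet shipped, the following is only a sketch of what such a loop might look like. Everything here is hypothetical: patch_with_retry, the generate callable, and in particular a sanitize callable that returns a (patch, reason) pair, which differs from the shipped sanitize() that returns a string.

```python
def patch_with_retry(generate, sanitize, prompt, max_retries=1):
    """Ask for a diff; on sanitizer rejection, re-prompt with the reason."""
    attempt = prompt
    for _ in range(max_retries + 1):
        patch, reason = sanitize(generate(attempt))
        if patch:
            return patch
        # Feed the specific rejection reason back to the model.
        attempt = (f"{prompt}\n\nYour previous diff was rejected: {reason}. "
                   "Please use real line numbers from the file shown above.")
    return ""  # still rejected after all retries: recorded as empty_patch
```

With max_retries=1 the worst case is two model calls per instance, which matches the "doubles cost per failure" estimate above.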
None of these need more harness work. The patch-apply pipeline shipped in v0.34.5 is the substrate they all run on top of.
Honest summary
The shipped artifact is a patch-apply pipeline (sanitizer + grounding + classification) you can drop into any future SWE-bench experiment, plus a clean v0.34.5 baseline showing the format-discipline ceiling for our local 8B model. The headline number is bad and we’re reporting it as bad. The infrastructure paid off as a diagnostic, not as a performance booster — and that’s the result we’re shipping. The next data point is a model swap, not more infrastructure.
Code: PR #53 (sanitizer + prompts), PR #55 (grounding + fuzzy correction + hardened parser), PR #56 (grounded rerun + paper §6.3 + runtime_error classification). Released as operon-ai v0.34.5 on PyPI. Release notes.