Phase C: The 8B Ceiling Holds Cross-Model, and Retry Is the Wrong Lever

22 hours of Ollama, 30 submissions across a second 8B model with format-correction retry active, zero recoveries. A sharper negative than the v0.34.5 post predicted.
Bogdan Banu · April 18, 2026
operon-ai v0.35.0

Recap: the wedge list

The v0.34.5 post localized SWE-bench Phase 2's failure mode to 8B diff-format discipline (not file selection) and named three wedges for moving the number:

  1. Stronger model. Most likely to move the result.
  2. Format-correction retry. Re-prompt on sanitizer rejection with the specific failure reason.
  3. LLM-localized candidate ranking. Marginal.

v0.35.0 runs the intersection of the first two on a second locally-available 8B model: deepseek-r1:8b (DeepSeek-R1 reasoning training distilled onto a Qwen3-8B base, Q4_K_M, 8.2B parameters), with --retry-on-reject active. Not a score-chasing run — 8B was already called out as below the SWE-bench-lite threshold. A test of whether the v0.34.5 ceiling claim generalizes across training regimes, and whether reason-coded retry prompts can recover any of the sanitizer-rejected submissions.

The run

Same 10 SWE-bench-lite instances as v0.34.5, same grounded prompt pipeline, three conditions (baseline, organism, langgraph). New in v0.35.0: on every sanitizer rejection the runner re-prompts the model once with the specific reason code (one of placeholder_hunk, truncated_hunk, overlong_hunk, path_not_found, ambiguous_path, empty_extraction, malformed_metadata), the failed output embedded verbatim, and reason-specific guidance. With _FORMAT_RETRY_MAX = 1, the retry fires at most once per instance per condition.
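The retry flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not operon-ai's code: sanitize_with_reason here is a toy stand-in that only mirrors the tuple-return idea (the real one takes patch, slug, and tree_paths), and build_retry_prompt, GUIDANCE, and submit_with_retry are hypothetical names.

```python
from typing import Callable, Optional, Tuple

FORMAT_RETRY_MAX = 1  # mirrors _FORMAT_RETRY_MAX = 1 from the post

# Guidance strings quoted in the post; the fallback used below is invented.
GUIDANCE = {
    "placeholder_hunk": "use real integer line numbers",
    "path_not_found": "use exact paths from the repository context",
    "empty_extraction": "respond with a single fenced diff block and nothing else",
}


def sanitize_with_reason(patch: str) -> Tuple[Optional[str], Optional[str]]:
    """Toy stand-in: accept anything that looks like a unified diff."""
    if "--- " in patch and "+++ " in patch and "@@" in patch:
        return patch, None
    return None, "empty_extraction"


def build_retry_prompt(original_prompt: str, failed_output: str, reason: str) -> str:
    """Embed the reason code, the failed output verbatim, and guidance."""
    hint = GUIDANCE.get(reason, "return a valid unified diff")
    return (
        f"{original_prompt}\n\n"
        f"Your previous answer was rejected ({reason}).\n"
        f"Previous answer:\n{failed_output}\n\n"
        f"Guidance: {hint}"
    )


def submit_with_retry(prompt: str, call_model: Callable[[str], str]) -> dict:
    """Call the model, then re-prompt at most FORMAT_RETRY_MAX times on rejection."""
    output = call_model(prompt)
    patch, reason = sanitize_with_reason(output)
    retries = 0
    while patch is None and retries < FORMAT_RETRY_MAX:
        retries += 1
        output = call_model(build_retry_prompt(prompt, output, reason))
        patch, reason = sanitize_with_reason(output)
    return {
        "patch": patch,
        "reason": reason,
        "retry_attempted": retries > 0,
        "retry_recovered": retries > 0 and patch is not None,
    }
```

A model that never emits a diff produces two calls and a record with retry_attempted true and retry_recovered false, which is the shape every submission in this run took.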

PR #57 raised the per-call LLM timeout from 120s to 900s so reasoning models' long <think> blocks don't trip the OpenAI-compatible client's default. PR #58 shipped the sanitizer's tuple-return sanitize_with_reason(patch, slug, tree_paths), the _build_retry_prompt helper, the runner-level retry callbacks for all three conditions, and the --output CLI flag so Phase C artifacts don't overwrite v0.34.5's. All of that landed on main before this run started.

Total wall-clock: ~22 hours. Reasoning models are slow — the mean per-call latency was 18–19 minutes across all conditions, vs gemma4's 2–3 minutes in v0.34.5. The 900s timeout absorbed every call: zero runtime errors in the artifact. (v0.34.5 had two, both astropy instances that hit the old 120s ceiling; those completed cleanly in this run.)

Results

0/30 evaluated, 0 retry-recovered patches. Every submission across every condition was sanitizer-rejected. The retry fired on every rejection and every retry was also rejected. No instance crossed the sanitizer under any condition.
Condition   Resolved  Unresolved  Sanitizer-rejected  Runtime error  Evaluated  Mean latency
baseline    0         0           10                  0              0/10       1084 s (18 min)
organism    0         0           10                  0              0/10       1073 s (18 min)
langgraph   0         0           10                  0              0/10       1130 s (19 min)
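
For readers who want to rebuild the per-condition counts from an artifact, a rollup along these lines would do it. The record layout is an assumption: retry_attempted and retry_recovered are field names the post mentions, while condition, status, and the status values are hypothetical. The records are reconstructed from the reported outcome, not loaded from the real artifact.

```python
from collections import Counter

CONDITIONS = ("baseline", "organism", "langgraph")

# Reconstructed from the reported outcome (hypothetical record layout):
# 10 instances per condition, all sanitizer-rejected, all retried, none recovered.
records = [
    {"condition": c, "status": "sanitizer_rejected",
     "retry_attempted": True, "retry_recovered": False}
    for c in CONDITIONS
    for _ in range(10)
]


def rollup(rows):
    """Count outcomes per condition; 'evaluated' means the patch reached the test harness."""
    summary = {}
    for cond in CONDITIONS:
        subset = [r for r in rows if r["condition"] == cond]
        status = Counter(r["status"] for r in subset)
        evaluated = status["resolved"] + status["unresolved"]
        summary[cond] = {
            "resolved": status["resolved"],
            "unresolved": status["unresolved"],
            "sanitizer_rejected": status["sanitizer_rejected"],
            "runtime_error": status["runtime_error"],
            "evaluated": f"{evaluated}/{len(subset)}",
            "retry_recovered": sum(r["retry_recovered"] for r in subset),
        }
    return summary


summary = rollup(records)
for cond, row in summary.items():
    print(cond, row)
```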

Comparison with v0.34.5 gemma4

Metric                               gemma4:latest (v0.34.5)  deepseek-r1:8b (v0.35.0)
Total evaluated                      1/30                     0/30
Sanitizer-rejected                   27                       30
Runtime errors                       2                        0
Resolved                             0                        0
Retry active                         no                       yes
Retry-recovered patches              n/a                      0
Mean latency (baseline, completed)   131 s                    1084 s

Two load-bearing findings

1. The cross-model ceiling is sharp

Both 8B models produce zero resolved instances. Gemma4 crossed the sanitizer once; deepseek-r1 crossed it zero times. The single v0.34.5 "survivor" — django/django-11001 baseline, unresolved — was gemma4-specific. Deepseek-r1 could not produce a git apply-clean diff for that instance under any of the three conditions (baseline, organism, or langgraph).

This matters because the v0.34.5 post flagged the single survivor as "a structural property of that issue (likely a small, localized hunk)." Under that theory, any 8B model should be able to handle django-11001 in baseline. The cross-model data disproves it: the issue isn't structurally easy in a model-independent sense; the single survivor was an idiosyncrasy of gemma4's output distribution on that specific prompt. The 8B-class format-discipline ceiling in v0.34.5 isn't a single-model artifact — it generalizes to at least one other 8B model, a reasoning-distilled one at that.

2. Retry-with-reason-code did not break the ceiling

The retry mechanism isn't broken — the retry callback fired on every rejection, the retry prompts embedded the reason code and the failed output verbatim with reason-specific guidance ("use real integer line numbers" for placeholder_hunk, "use exact paths from the repository context" for path_not_found, etc.), and the v0.35 artifact records every submission's retry_attempted=true with retry_recovered=false.

The reason distribution itself is informative. Across 30 submissions: 26 were empty_extraction (the model's output contained no diff-shaped content at all), 3 were overlong_hunk, and 1 was truncated_hunk. Zero submissions hit placeholder_hunk, path_not_found, or ambiguous_path — the grounding-specific reasons didn't fire even once, consistent with v0.34.5's claim that file selection is not the bottleneck.
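
The distribution above is small enough to check by hand, but a short tally makes the split between grounding-specific and other reasons explicit. The counts are the ones reported in this post; the GROUNDING_REASONS grouping is this sketch's reading of "grounding-specific", not a constant from the codebase.

```python
from collections import Counter

# Counts as reported in the post for the 30 deepseek-r1:8b submissions.
reason_counts = Counter(
    {"empty_extraction": 26, "overlong_hunk": 3, "truncated_hunk": 1}
)

# Reasons the post calls grounding-specific (hypothetical grouping name).
GROUNDING_REASONS = {"placeholder_hunk", "path_not_found", "ambiguous_path"}

total = sum(reason_counts.values())
grounding_hits = sum(
    n for reason, n in reason_counts.items() if reason in GROUNDING_REASONS
)

for reason, n in reason_counts.most_common():
    print(f"{reason:18s} {n:2d}  {n / total:.0%}")
print("grounding-specific rejections:", grounding_hits)
```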

What the retry cannot do is move the model across a capability boundary. At 8B, the dominant failure mode is "doesn't produce diff-shaped output in the first place" (26/30), not "produces diff-shaped output with fixable errors" (4/30). Retry-with-guidance is calibrated for the second regime: it helps models that know what correct looks like and occasionally miss. But deepseek-r1:8b sits mostly in the first regime. Even the empty_extraction retry prompt ("respond with a single fenced diff block and nothing else") didn't bring the model into diff-shape. Retry is a competence-with-lapses lever, not a capability-ceiling lever.
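
To make "diff-shaped" concrete, here is one way an extraction step could decide whether a response contains a candidate patch at all. This is a hypothetical sketch, not the sanitizer shipped in PR #58: extract_diff and its heuristics (strip think blocks, prefer fenced blocks, require file headers plus a hunk header) are assumptions chosen for illustration.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built this way so the example can live inside a fenced block
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
FENCE_RE = re.compile(FENCE + r"(?:diff|patch)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_diff(raw: str) -> Optional[str]:
    """Return the first diff-shaped block in a model response, or None."""
    text = THINK_RE.sub("", raw)  # drop closed reasoning traces before searching
    # Try fenced blocks first, then fall back to the whole unfenced response.
    blocks = [m.group(1) for m in FENCE_RE.finditer(text)] + [text]
    for block in blocks:
        lines = block.splitlines()
        starts = [i for i, line in enumerate(lines) if line.startswith("--- ")]
        if not starts:
            continue
        body = lines[starts[0]:]  # discard any prose before the first header
        has_new_header = any(line.startswith("+++ ") for line in body)
        has_hunk = any(line.startswith("@@ ") for line in body)
        if has_new_header and has_hunk:
            return "\n".join(body) + "\n"
    return None  # the analogue of an empty_extraction rejection
```

A response that only describes the fix in prose returns None here no matter how the retry prompt is worded, which is the first regime described above.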

This is a sharper negative than the v0.34.5 post predicted. That post said retry "might recover 20–40% of sanitizer drops." At 8B, it recovered 0 of 30. The 20–40% band was an intuition about "what retry does to models that are close." Neither gemma4 8B nor deepseek-r1 8B is close. For the retry wedge to earn its 2× cost, we'd need a model that's already occasionally producing valid patches, likely 70B+ or a stronger 30B class.
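
The gap between prediction and outcome can be put in numbers. The sketch below assumes cost is proportional to the number of model calls (it ignores the longer retry prompt) and that every rejection triggers exactly one retry; retry_economics is a hypothetical helper, not part of operon-ai.

```python
def retry_economics(submissions: int, rejected: int, recovered: int) -> dict:
    """Cost and yield of a one-shot retry, counting one model call per attempt."""
    total_calls = submissions + rejected  # every rejection triggers one retry
    return {
        "cost_multiplier": total_calls / submissions,
        "recovery_rate": recovered / rejected if rejected else 0.0,
    }


# This run: 30 submissions, all rejected on the first attempt, none recovered.
observed = retry_economics(submissions=30, rejected=30, recovered=0)

# Recoveries the 20-40% band would have implied for 30 sanitizer drops.
implied = [round(30 * rate) for rate in (0.20, 0.40)]

print(observed)
print("implied recoveries:", implied)
```

Under these assumptions the multiplier comes out at 2.0, and the band would have implied 6 to 12 recovered patches against the 0 observed.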

What this does and does not say

It says: (i) the 8B format-discipline ceiling holds across training regimes, with reasoning-distilled models showing no advantage over instruction-tuned ones for diff format; (ii) targeted retry prompts don't help when the failure is capability-bound, not competence-bound; (iii) the single v0.34.5 "survivor" was model-specific, not issue-structural. The cross-model confirmation is worth more than an additional gemma4 rerun would have been for the generality of the claim.

It does not say retry is universally useless. For models that are close to producing correct output — likely 70B+ via Modal / Groq / Together — the retry pipeline shipped in v0.35 is the instrument to test exactly that. The infrastructure is now there; what's missing is the compute.

It also doesn't say the structural guarantees of the categorical harness are weakened. Certificate preservation across compilers, priority gating, stagnation detection — these are Know-level properties verified independently of SWE-bench task resolution. The grounded null doesn't touch them.

Side observations

Repo-dependent latency is a hidden confounder

Astropy instances ran 25–30 min per call. Django instances ran 6–15 min. Same model, same grounded context injection, same sanitizer, same retry. The difference is issue complexity: astropy's stack traces reference more import paths and class hierarchies, producing longer <think> blocks. Benchmark latency numbers pooled across repos mask this. Worth flagging for future cross-repo cost-benefit claims: "reasoning-model latency is ~18 min per call" pools over a 2× spread that tracks with repo structure, not with reasoning effort.
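
A pooled mean can be split by repo in a few lines. The latencies below are illustrative placeholders picked inside the ranges quoted above (astropy 25–30 min, django 6–15 min); they are not the run's measurements, and the instance mix is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Illustrative placeholder latencies in minutes, chosen inside the reported
# ranges. These are NOT measurements from the Phase C artifact.
calls = [
    ("astropy", 25.0), ("astropy", 28.0), ("astropy", 30.0),
    ("django", 6.0), ("django", 10.0), ("django", 12.0), ("django", 15.0),
]

by_repo = defaultdict(list)
for repo, minutes in calls:
    by_repo[repo].append(minutes)

pooled = mean(minutes for _, minutes in calls)
per_repo = {repo: mean(values) for repo, values in by_repo.items()}

print(f"pooled mean: {pooled:.1f} min")
for repo, avg in per_repo.items():
    print(f"{repo:8s} {avg:.1f} min")
```

Reporting the per-repo means next to the pooled figure keeps a cost estimate from being dominated by whichever repo happens to be over-represented in the sample.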

The timeout fix pays off

v0.34.5's gemma4 run had two baseline timeouts on astropy instances (120s ceiling). Both completed cleanly in this run at 900s. The mean per-call latency in v0.35 is 18 minutes — well past the old ceiling but well under the new one. Raising a default that was set for interactive callers when the use case is an extended-reasoning benchmark isn't a tuning knob, it's a correctness fix.

The reason vocabulary has headroom

Eight reason codes in SANITIZE_REASONS; this run exercised three (empty_extraction, overlong_hunk, truncated_hunk). The remaining codes are unit-tested but haven't fired in live data yet because the specific failure modes (placeholder line numbers, basename-collision paths, etc.) are less common at 8B than empty output, truncation, or overrun. At larger models the distribution may shift: placeholder_hunk in particular is a common failure mode of models that reason about diff structure abstractly instead of grounding in real file content.

What's next

  1. 70B+ cloud rerun. The natural follow-up. Modal charges for GPU after their free tier; Groq's free-tier Llama-3.3-70B is a cheaper first pass. Same code, different --model, same artifact writer. If retry recovers anything at 70B, that's the first positive signal for the retry wedge.
  2. Per-stage retry for organism. Currently a retry re-runs the whole organism. A per-stage retry (only re-run the edit stage) would halve retry cost and sharpen "does decomposition help retry." Not yet shipped.
  3. Paper 4 / Paper 5 arXiv push. The §6.3 Phase C subsection means the Paper 5 claim is now cross-model verified. arXiv-ready.

Honest summary

v0.35.0 ships Phase C: the patch-apply pipeline v0.34.5 described but didn't run, plus a cross-model artifact that confirms the ceiling and a retry null result that tightens the why. The headline number is the same (0 resolved on SWE-bench-lite). The claim is stronger (cross-model, with retry active, zero recoveries). And the instrument to test a bigger model is now on PyPI and ready to point at 70B compute whenever that's available.

Code: PR #57 (timeout fix), PR #58 (retry infrastructure), PR #59 (Phase C artifact + paper §6.3 + release). Released as operon-ai v0.35.0 on PyPI. Release notes.