This page tracks the recent direction of the project.
Focus:
New:
_LLM_TIMEOUT_SECONDS) to absorb reasoning-model <think> blocks; separate 60s _LLM_PROBE_TIMEOUT_SECONDS for the startup reachability check so misconfigurations still surface fast (PR #57).
sanitize_with_reason(patch, slug, *, tree_paths) returning a (cleaned_patch, reason) tuple with 8 machine-readable reason codes in SANITIZE_REASONS: placeholder_hunk, truncated_hunk, overlong_hunk, malformed_metadata, path_not_found, ambiguous_path, empty_extraction (+ empty-string success). The legacy sanitize() is kept as a backward-compat wrapper (PR #58).
_sanitize_for_submission(..., retry_callback=cb) invokes cb(reason, failed_output) -> str on rejection. _build_retry_prompt embeds the reason + failed output + reason-specific guidance. All three runners (run_baseline, run_organism, run_langgraph) accept retry_on_reject: bool = False. _FORMAT_RETRY_MAX = 1 (PR #58).
--retry-on-reject (enables retry), --output PATH (preserves artifacts per-model without overwriting; also honored under --rewrite-envelope). Existing callers see zero behavior change (PR #58, review #755 follow-up).
Findings (deepseek-r1:8b with retry, 10 SWE-bench-lite instances per condition):
django/django-11001 baseline, unresolved) is gemma4-specific. Deepseek-r1 could not produce a git apply-clean diff for that instance under any of the three conditions. The 8B-class format-discipline ceiling is not a single-model artifact.
Paper updates:
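The reject-and-retry flow listed under New for this release (sanitize_with_reason returning a reason code, and a retry callback bounded by _FORMAT_RETRY_MAX = 1) can be sketched roughly as follows. This is a minimal illustration, not the project's code: toy_sanitize_with_reason, sanitize_with_retry, and fake_model are hypothetical names, and only two of the eight reason codes are modelled.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: the retry budget mirrors _FORMAT_RETRY_MAX = 1 from the notes.
FORMAT_RETRY_MAX = 1


def toy_sanitize_with_reason(patch: str) -> Tuple[str, str]:
    """Hypothetical stand-in for sanitize_with_reason().

    Returns (cleaned_patch, reason); an empty reason string means success.
    """
    if not patch.strip():
        return "", "empty_extraction"
    if "@@ -XXX" in patch:
        return "", "placeholder_hunk"
    return patch, ""


def sanitize_with_retry(
    patch: str,
    retry_callback: Optional[Callable[[str, str], str]] = None,
) -> Tuple[str, str]:
    """Sanitize a patch; on rejection, ask the callback for a new attempt."""
    cleaned, reason = toy_sanitize_with_reason(patch)
    attempts = 0
    while reason and retry_callback is not None and attempts < FORMAT_RETRY_MAX:
        attempts += 1
        # The callback receives the reason code and the rejected output.
        patch = retry_callback(reason, patch)
        cleaned, reason = toy_sanitize_with_reason(patch)
    return cleaned, reason


GOOD = "--- a/f.py\n+++ b/f.py\n@@ -1,1 +1,1 @@\n-x = 1\n+x = 2\n"
BAD = "--- a/f.py\n+++ b/f.py\n@@ -XXX,1 +XXX,1 @@\n-x = 1\n+x = 2\n"

calls: List[str] = []


def fake_model(reason: str, failed_output: str) -> str:
    """Hypothetical callback: records the reason and returns a clean patch."""
    calls.append(reason)
    return GOOD
```

Without a callback, sanitize_with_retry(BAD) returns the rejection unchanged, which matches the note that existing callers see no behavior change when retry is left off.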
Focus:
New:
eval/_patch_sanitizer.py: pre-rejects model output that git apply would refuse: placeholder hunks (@@ -XXX,N +XXX,N @@), path doubling (a/django/django/foo.py), truncated/overlong hunks, malformed rename/copy metadata, bare empty context lines. Count-driven hunk consumption so adversarial body content can’t false-match file headers. 47 unit tests lock the contract.
eval/_repo_cache.py + eval/_repo_grounding.py: opt-in via --grounding. Shallow-fetches {repo}@{base_commit} into a local cache; ranks candidate files by issue-text heuristics; injects up to 5 file snippets into the task prompt. The sanitizer also gets a tree oracle and fuzzy-corrects near-miss paths to unique basename matches.
EVAL_RUNTIME_ERROR status distinct from sanitizer-rejected empty_patch; mean_latency_ms divides by completed predictions (was deflated when zero-latency timeout rows were included). Schema test pins specific runtime-error rows so silent regressions can’t ship.
Findings (Phase 2 v2 grounded rerun, 10 instances per condition):
django-11001), 7 sanitizer-rejected, 2 runtime errors (Ollama API timeouts); 1/10 evaluated.
error outcomes (patches that failed at git apply); the rerun has zero. Sanitizer pre-rejection ensures no malformed patch reaches the harness, so the failures are now correctly attributed to the model rather than to infrastructure.
Conclusion (Paper 5 §6.3):
base_commit. The fact that 27 of 28 model-returning submissions are still sanitizer-dropped localizes the bottleneck to the second: at 8B / Q4_K_M, diff-format discipline is the binding constraint, not file selection.
Focus:
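Two of the pre-rejection checks named for eval/_patch_sanitizer.py in the release above, placeholder hunks and path doubling, can be illustrated with a short sketch. The function names and the exact regular expression are assumptions made for illustration; the real sanitizer's rules are not shown in these notes. The tree_paths argument stands in for the tree oracle mentioned there.

```python
import re
from typing import Set

# Matches hunk headers whose line numbers are placeholders,
# e.g. "@@ -XXX,N +XXX,N @@" (assumed pattern, for illustration only).
PLACEHOLDER_HUNK = re.compile(
    r"^@@ -[A-Za-z]+(,\w+)? \+[A-Za-z]+(,\w+)? @@", re.MULTILINE
)


def has_placeholder_hunk(patch: str) -> bool:
    """True if any hunk header uses letters where line numbers belong."""
    return PLACEHOLDER_HUNK.search(patch) is not None


def undouble_path(path: str, tree_paths: Set[str]) -> str:
    """Collapse a doubled leading directory when the repo tree confirms it.

    "a/django/django/foo.py" becomes "a/django/foo.py" only if
    "django/foo.py" exists in tree_paths and the doubled form does not.
    """
    prefix, _, rest = path.partition("/")
    if rest in tree_paths:
        return path  # the path is already valid as written
    parts = rest.split("/")
    if len(parts) >= 3 and parts[0] == parts[1]:
        candidate = "/".join(parts[1:])
        if candidate in tree_paths:
            return f"{prefix}/{candidate}"
    return path
```

Consulting the tree before rewriting keeps the correction from touching repositories that really do contain a directory nested inside a directory of the same name.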
New:
eval/swebench_phase2.py: official swebench.harness.run_evaluation wrapper. Produces eval/results/swebench_phase2.json with per-instance eval_status (resolved, unresolved, error, empty_patch, not_evaluated).
eval/_patch_extraction.py: unified-diff extractor. Accepts bare diffs, fenced ```diff / ```patch blocks, and git metadata (rename, new file, deleted file, /dev/null add/delete). Rejects hunk-only snippets without file headers.
resolved_rate=None and harness_ran=False are reported explicitly instead of being silently counted as 0%.
Findings (later superseded by v0.34.5 grounded rerun):
git apply rejected them), and the actual ceiling is diff-format discipline at the 8B / Q4_K_M scale.
[edit]-stage prompt was tightened to emit a single fenced diff and nothing else.
gemma4:latest (8B Q4_K_M, digest c6eb396dbd59). The original write-up incorrectly described this as “Gemma 4 27B MoE / 4B active”; the model-identity record was added in v0.34.5 to prevent this kind of mis-attribution going forward.
Paper updates:
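The extraction rule described for eval/_patch_extraction.py in this release (accept bare or fenced diffs, reject hunk-only snippets that lack file headers) might look like the sketch below. It is a simplified assumption, not the project's extractor: extract_diff is a hypothetical name, and header-less git metadata such as pure renames is not handled here.

```python
import re
from typing import Optional

# Built indirectly so the example never contains a literal fence marker.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:diff|patch)\n(.*?)" + FENCE, re.DOTALL)


def extract_diff(text: str) -> Optional[str]:
    """Return the unified diff found in text, or None if it has no file headers."""
    match = FENCED_BLOCK.search(text)
    # Prefer the contents of a fenced diff/patch block; else treat text as bare.
    candidate = match.group(1) if match else text
    lines = candidate.splitlines()
    has_old = any(line.startswith("--- ") for line in lines)
    has_new = any(line.startswith("+++ ") for line in lines)
    if not (has_old and has_new):
        # Hunk-only snippet: nothing tells git which file to patch.
        return None
    return candidate
```

Returning None rather than an empty string keeps "nothing usable was extracted" distinct from a patch that is present but empty, in the same spirit as reporting resolved_rate=None instead of 0%.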
Focus:
New:
stages=[[s1, s2], [s3]] runs s1 and s2 concurrently, then s3.
organism_to_langgraph() compiles parallel groups to fork→stages→join topology.
rho=1.000 across 6 configs.
Focus:
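The nested-list form stages=[[s1, s2], [s3]] from this release reads as "run each inner group concurrently, then move to the next group". A minimal sketch of that fork-then-join behavior is shown here, assuming stages are plain callables that take the shared state and return a dict of updates; run_stage_groups and the three toy stages are hypothetical and do not reflect how SkillOrganism schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Stage = Callable[[Dict[str, str]], Dict[str, str]]


def run_stage_groups(groups: List[List[Stage]], state: Dict[str, str]) -> Dict[str, str]:
    """Run each inner list concurrently; groups run one after another."""
    for group in groups:
        if not group:
            continue
        snapshot = dict(state)  # fork: every stage in a group sees the same input
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            results = list(pool.map(lambda stage: stage(snapshot), group))
        for update in results:  # join: merge outputs in declaration order
            state.update(update)
    return state


def s1(state: Dict[str, str]) -> Dict[str, str]:
    return {"s1": "localized"}


def s2(state: Dict[str, str]) -> Dict[str, str]:
    return {"s2": "reproduced"}


def s3(state: Dict[str, str]) -> Dict[str, str]:
    # Runs only after both s1 and s2 have been merged into the state.
    return {"s3": state["s1"] + "+" + state["s2"]}


final = run_stage_groups([[s1, s2], [s3]], {})
```

Handing each group a snapshot means stages in the same group cannot observe each other's output, which is what makes running them in parallel safe in this sketch.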
New:
seed_library_from_atomic_skills(): 5 composable coding skill patterns (localize, edit, test, reproduce, review) from Ma et al. (arXiv:2604.05013). Topology derived via shared _shape_to_topology(): parallel review maps to specialist_swarm, sequential skills to skill_organism.
get_atomic_skill_patterns(): returns deep copies of the built-in catalog
Key insight:
organism.run() transfers all guarantees because they reside in the harness.
Focus:
New:
organism_to_langgraph(): compile SkillOrganism to a LangGraph StateGraph. Wraps organism.run() as a single node: all structural guarantees (CertificateGate, WatcherComponent, VerifierComponent, halt_on_block) enforced by the organism’s own run loop.
run_organism_langgraph(): compile + execute in one call with certificate verification
execute_deerflow(): single-agent execution via DeerFlow’s create_deerflow_agent (LangGraph runtime)
guarded_graph.py: earlier per-stage approach (superseded by langgraph_compiler but kept for reference)
pip install operon-ai[deerflow]
Findings:
SkillOrganism.run() logic as LangGraph nodes caused 8 rounds of review fixes. Wrapping organism.run() directly eliminated all divergence bugs: 215 lines instead of 520.
StateGraph is structurally isomorphic to Operon’s wiring diagram model (nodes = stages, conditional edges = interventions).
Focus:
New:
RunContext: typed dict subclass wrapping shared_state with property accessors for watcher interventions, verifier signals, and telemetry events. Supports custom WatcherConfig.state_key.
deerflow_to_topology() / swarms_to_topology(): decompilers enabling compile→decompile round-trips with certificate preservation
ExternalTopology.capabilities: structured per-agent capability annotations with EXTERNAL_CAPABILITY_MAP (27 tool→Capability mappings) for ToolDensity theorem
TelemetryProbe enrichment: run_start event includes organism config (stage_count, stage_names, mode_assignments, certificate_theorems)
hard_par_08: subtle bug detection eval task (off-by-one, TOCTOU, float precision, exception handling). 21 benchmark tasks total.
Findings:
hard_par_08 discriminates: phi3:mini scores 0.72, gemma4 scores 1.00 (delta = 0.28). Hints in the prompt are for the judge, not the examinee.
Focus:
New:
VerifierComponent: rubric-based quality evaluation for stage outputs (adaptive immune / B-cell analogy). Emits WatcherSignal(source="verifier") that triggers ESCALATE on low quality
CertificateGateComponent: pre-execution DNARepair.scan() in on_stage_start(), which halts before the LLM call if genome corruption is detected (G1/S DNA damage checkpoint)
SkillOrganism.run() (enables CertificateGate)
on_stage_result() for correct signal ordering
--judge-url and --judge-model flags for e2e eval
Findings:
Focus:
New:
eval/e2e_real_agent.py: evaluation harness with --model, --max-tokens, --tasks, --repetitions flags
max_tokens (aligned with live evaluator)
Findings:
collect_certificates() on SkillOrganism, verify_compiled() for post-compilation verification
New:
Certificate framework: self-verifiable structural guarantees with derivation-replay verify()
certify() on QuorumSensingBio (no-false-activation), MTORScaler (no-oscillation), ATP_Store (priority gating)
run_topology_validation.py)
Focus:
New:
QuorumSensingBio: autoinducer signal accumulation with temporal decay (KEGG map02024), auto-calibrated thresholds via categorical certificate (de los Riscos et al. Prop 5.1)
MTORScaler: AMPK ratio + rate-of-change sensing with hysteresis (KEGG hsa04152), adaptive worker scaling
eval/benchmarks/): metabolism, quorum sensing, epiplexity (all three biological wins)
Focus:
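"Signal accumulation with temporal decay", as listed for QuorumSensingBio in this release, can be pictured as a level that grows with each deposit and halves after a fixed interval, with activation once it crosses a threshold. The class below is a hypothetical sketch of that idea only; the half_life parameter and the fixed threshold are assumptions, and the auto-calibrated thresholds mentioned above are not modelled.

```python
from dataclasses import dataclass


@dataclass
class DecayingSignal:
    """Accumulates signal that decays exponentially between deposits."""

    half_life: float   # time units for the level to halve (assumed parameter)
    threshold: float   # activation threshold (fixed here, not auto-calibrated)
    level: float = 0.0
    last_t: float = 0.0

    def _decay_to(self, t: float) -> None:
        # Apply exponential decay for the time elapsed since the last update.
        elapsed = max(0.0, t - self.last_t)
        self.level *= 0.5 ** (elapsed / self.half_life)
        self.last_t = max(self.last_t, t)

    def deposit(self, amount: float, t: float) -> None:
        """Add signal at time t, after decaying the existing level."""
        self._decay_to(t)
        self.level += amount

    def is_active(self, t: float) -> bool:
        """True if the decayed level at time t still meets the threshold."""
        self._decay_to(t)
        return self.level >= self.threshold


signal = DecayingSignal(half_life=10.0, threshold=1.5)
signal.deposit(1.0, t=0.0)
signal.deposit(1.0, t=0.0)
```

Two simultaneous deposits push the level above the threshold, but one half-life later the same signal has decayed below it, so a burst of agreement has to be sustained to keep the quorum active.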
New:
FilesystemOptimizer protocol, distinct from C7's EvolutionaryOptimizer
EvolutionLoop: meta-harness glue (DesignProblem wrapping, EpiplexityMonitor stall detection)
CandidateConfig / StageConfig with lossless Genome round-trip
TournamentMutator + LLMProposer hybrid proposer strategy
EvolutionStore: candidate-first filesystem persistence with index.jsonl
DistanceProvider protocol for EpiplexityMonitor (scale-invariant epistemic health)
ConfigHammingDistance for config-space novelty measurement
run_meta_evolution.py CLI runner with --llm-proposer gemini support
Findings:
Note: C8 meta-optimization code moved from operon_ai/convergence/ to eval/meta/ — experimental evaluation code, not part of the library. DistanceProvider remains in operon_ai/health/.
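For the config-space novelty measurement that ConfigHammingDistance provides, a Hamming-style distance counts the settings on which two candidate configurations disagree. The function below is a sketch under the assumption that configs are flat key/value mappings and that the result is normalized to the range 0 to 1; the actual class's interface and normalization are not described in these notes.

```python
from typing import Any, Mapping


def config_hamming_distance(a: Mapping[str, Any], b: Mapping[str, Any]) -> float:
    """Fraction of keys (over the union of both configs) whose values differ.

    A key present in only one config counts as a difference.
    """
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    missing = object()  # sentinel that never equals a real config value
    differing = sum(1 for k in keys if a.get(k, missing) != b.get(k, missing))
    return differing / len(keys)


base = {"model": "gemma4", "temperature": 0.2}
variant = {"model": "gemma4", "temperature": 0.7}
```

Dividing by the number of keys keeps the score comparable between small and large configurations, which is one simple way to read "scale-invariant" in the DistanceProvider description.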
Focus:
New:
LiveEvaluator: runs real LLM calls through SkillOrganism pipelines
cli_handler() (Claude Code, Codex)
FilesystemOptimizer, HarnessSearchDP, Pareto convergence, causal diagnosis
Focus:
New:
MockEvaluator: evaluation harness with structural variation and credit assignment
PromptOptimizer, EvolutionaryOptimizer, NoOpOptimizer: prompt optimization protocols
attach_optimizer: attach optimizer to SkillStage
WorkflowGenerator, ReasoningGenerator, HeuristicGenerator: workflow generation protocols
generate_and_register: generate workflow and register in PatternLibrary
Focus:
New:
organism_to_swarms(), managed_to_swarms(): compile organism to Swarms workflow config
organism_to_deerflow(), managed_to_deerflow(): compile organism to DeerFlow session config
organism_to_ralph(), managed_to_ralph(): compile organism to Ralph event-driven hat config
organism_to_scion(), managed_to_scion(): compile organism to Scion containerized grove config
DistributedWatcher with InMemoryTransport and HttpTransport (webhook payload stub): transport-abstracted convergence detection
operon_watcher_node(): LangGraph-compatible convergence detection node
create_watcher_config(): helper for LangGraph watcher configuration
Focus:
New:
operon_ai.convergence package with 12 modules
ExternalTopology, AdapterResult: shared adapter types
analyze_external_topology(): epistemic theorems as structural linter
seed_library_from_swarms/deerflow/acg_survey: catalog seeding
skill_to_template(), template_to_skill(): bidirectional DeerFlow skill bridge
hybrid_skill_organism(): library-first + LLM generator fallback
PrimingView: multi-channel SubstrateView subclass (immutable via MappingProxyType)
HeartbeatDaemon: idle-time consolidation via WatcherComponent extension
AsyncOrganizer, async_stage_handler(): Fork/Join within stages
DesignProblem, compose_series/parallel, feedback_fixed_point: Zardini co-design
prompt_optimizer hook on SkillStage (interface for future DSPy integration)
parse_ralph_config(), ralph_hats_to_stages(): Ralph adapter
parse_aevolve_workspace(), aevolve_skills_to_stages(): A-Evolve adapter
seed_library_from_ralph/aevolve: catalog seeding
EvolutionGating.tla: TLA+ spec for evolution loop
Focus:
New:
cli_handler(): factory that wraps any CLI command as a SkillStage handler
cli_organism(): convenience for multi-CLI workflows via managed_organism
CLIResult: structured output with stdout, stderr, returncode, latency, timed_out
_action_type convention in handler output for signaling FAILURE to the watcher
parse_json(), parse_lines()
examples/83_cli_stage_handler.py
Focus:
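The cli_handler() idea from this release, a factory that turns a CLI command into a stage handler returning structured output, can be approximated with the standard library's subprocess module. The sketch below is an assumption about the general shape only: make_cli_handler and CLIResultSketch are hypothetical names, the fields simply mirror those listed for CLIResult, and passing the task text on stdin is a choice made for this example.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CLIResultSketch:
    """Hypothetical mirror of the fields listed for CLIResult."""

    stdout: str
    stderr: str
    returncode: int
    latency: float
    timed_out: bool


def make_cli_handler(command: List[str], timeout: float = 30.0) -> Callable[[str], CLIResultSketch]:
    """Return a handler that feeds the task text to command on stdin."""

    def handler(task: str) -> CLIResultSketch:
        start = time.monotonic()
        try:
            proc = subprocess.run(
                command, input=task, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # The child is killed by subprocess.run before this is raised.
            return CLIResultSketch("", "", -1, time.monotonic() - start, True)
        return CLIResultSketch(
            proc.stdout, proc.stderr, proc.returncode, time.monotonic() - start, False
        )

    return handler


# Usage: wrap a tiny Python one-liner that upper-cases whatever it reads.
shout = make_cli_handler(
    [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
)
```

Reporting a timeout as a flag on the result, rather than raising, lets a caller such as a watcher treat slow tools as one more outcome to classify.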
managed_organism() factory wiring the full stack
consolidate() convenience function
New:
ManagedOrganism, ManagedRunResult: full-stack organism with run/consolidate/export/scaffold
managed_organism(): batteries-included factory with sensible defaults
consolidate(): one-call sleep consolidation
advise_topology() gains optional library and fingerprint params
examples/82_managed_organism.py
Focus:
New:
histone_to_bitemporal(), episodic_to_bitemporal(): memory bridge adapters
Focus:
New:
DevelopmentController, DevelopmentConfig, DevelopmentalStage, DevelopmentStatus
CriticalPeriod, StageTransition, stage_reached()
Plasmid.min_stage: developmental gating on tool acquisition
SocialLearning.scaffold_learner() + ScaffoldingResult
examples/80_developmental_staging.py: lifecycle progression and gating
examples/81_critical_periods.py: teacher-learner scaffolding
Focus:
New:
SocialLearning, PeerExchange, TrustRegistry, AdoptionResult, AdoptionOutcome
curiosity_escalation_threshold
examples/78_social_learning.py: template sharing with trust
examples/79_curiosity_driven_exploration.py: curiosity-driven escalation
Focus:
New:
CognitiveMode enum, resolve_cognitive_mode() helper
SleepConsolidation, ConsolidationResult, CounterfactualResult
counterfactual_replay(): static analysis of corrected facts
mode_balance() for System A/B distribution
examples/76_cognitive_modes.py: mode annotations and watcher balance
examples/77_sleep_consolidation.py: full consolidation cycle
Focus:
New:
AdaptiveSkillOrganism, AdaptiveRunResult: compose-run-record lifecycle wrapper
adaptive_skill_organism(): public factory for adaptive assembly
assemble_pattern(): convert PatternTemplate into runnable topology
ExperienceRecord: cross-run intervention memory on WatcherComponent
record_experience(), retrieve_similar_experiences(), recommend_intervention()
examples/74_adaptive_assembly.py: full adaptive loop
examples/75_experience_driven_watcher.py: experience-driven recommendations
Focus:
New:
PatternLibrary, TaskFingerprint, PatternTemplate, PatternRunRecord
WatcherComponent, WatcherConfig, WatcherSignal, SignalCategory
InterventionKind, WatcherIntervention: run-loop intervention types
examples/72_pattern_repository.py: register, score, and retrieve templates
examples/73_watcher_component.py: signal classification and interventions
Focus:
New:
SubstrateView: frozen read-only envelope for substrate queries
SkillStage fields: read_query, fact_extractor, emit_output_fact, fact_tags
SkillOrganism.substrate: optional BiTemporalMemory for auditable shared facts
examples/71_bitemporal_skill_organism.py: enterprise workflow with substrate
Focus:
New:
BiTemporalMemory, BiTemporalFact, BiTemporalQuery, FactSnapshot, CorrectionResult
examples/69_bitemporal_memory.py: core API demo
examples/70_bitemporal_compliance_audit.py: enterprise audit scenario
Focus:
Related writing:
Focus:
Related writing: