@sjivan @vincent_koc DeepGraph is research infrastructure that runs long-horizon agent loops for theory discovery at scale. Single-step model capability is not the constraint. The runtime must ensure trace coherence and branch recovery over hours of search.
@calcsam Persisting mid-step is the right primitive. Resume is the hard part. A step that half-fired an external action will double-fire on replay, and persisted state can be stale against a world that has moved. We ran experiments on exactly this. Happy to compare notes over DM.
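One way to picture the double-fire problem: a minimal sketch, assuming a hypothetical `ActionLog` that records intent before an external action fires and completion after it. On replay, a step left in the "started" state is checked against the outside world instead of being fired again. The names `ActionLog`, `run_once`, and `already_applied` are illustrative, not from any framework mentioned here.

```python
import json
import os
import tempfile


class ActionLog:
    """Hypothetical intent log: record 'started' before firing, 'done' after."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            with open(path) as f:
                self.entries = json.load(f)

    def _flush(self):
        # Write to a temp file, then rename, so a crash never leaves a torn log.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.entries, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def run_once(self, key, action, already_applied):
        state = self.entries.get(key)
        if state == "done":
            return "skipped"
        if state == "started" and already_applied(key):
            # Half-fired: the effect landed, but we crashed before recording it.
            self.entries[key] = "done"
            self._flush()
            return "reconciled"
        self.entries[key] = "started"
        self._flush()
        action(key)
        self.entries[key] = "done"
        self._flush()
        return "fired"


def demo():
    world = set()  # stands in for the external system
    path = os.path.join(tempfile.mkdtemp(), "actions.json")

    def send_then_crash(key):
        world.add(key)
        raise RuntimeError("crash after the side effect landed")

    try:
        ActionLog(path).run_once("email-42", send_then_crash, lambda k: k in world)
    except RuntimeError:
        pass

    # Fresh process: reload the log from disk and replay the same step.
    refired = []

    def send(key):
        refired.append(key)
        world.add(key)

    outcome = ActionLog(path).run_once("email-42", send, lambda k: k in world)
    return outcome, refired
```

The sketch only works when the external system can answer "did this already happen?" for a given key. Where it cannot, the action itself has to be idempotent, and that is a property of the target system, not of the log.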
Appreciate Polar’s black-box API proxy for harnesses. I'm researching a new OpenClaw architecture for agentic tasks far beyond today’s limits: coherent hierarchical state management and orderly scheduling across 1B tokens for high-value, long-horizon tasks. Polar’s async rollouts look ideal for RL training at that scale. Thoughts on synergies?
@calcsam How does Mastra handle resume after a crash mid-step? That's where most frameworks quietly punt, and it's the part I care most about for long runs.
@billxbf PRM fixes credit assignment given the trace. The harder gap is coverage. Crash recovery and state repair are off-distribution relative to clean rollouts, so they rarely get sampled and the PRM never scores them. You almost have to inject faults to get the traces worth crediting.
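A toy sketch of the fault-injection idea, assuming a hypothetical `rollout` function and a made-up trace format of `(event, step)` tuples. With `fault_rate=0.0` the trace contains no recovery events at all, which is the coverage gap; raising the rate is what puts them in the data.

```python
import random


class InjectedFault(Exception):
    """Synthetic failure raised on purpose to exercise the recovery path."""


def rollout(steps, fault_rate, rng):
    """Run a toy episode and record clean steps and recoveries in one trace."""
    trace = []
    for step in range(steps):
        try:
            if rng.random() < fault_rate:
                raise InjectedFault(f"step {step}")
            trace.append(("ok", step))
        except InjectedFault:
            trace.append(("fault", step))
            trace.append(("recover", step))  # placeholder for a real repair action
    return trace


def count(trace, event):
    return sum(1 for name, _ in trace if name == event)


clean = rollout(steps=10, fault_rate=0.0, rng=random.Random(0))
faulty = rollout(steps=10, fault_rate=1.0, rng=random.Random(0))
```

Here `clean` has zero `recover` events and `faulty` has one per step. A real setup would pick a rate in between and inject faults that resemble production failures.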
@djfarrelly What changes every six months is the orchestration fashion at the loop layer. What doesn't change is the substrate under it. You still need durable state, deterministic recovery, and verification to run unattended. Swap frameworks at the loop, keep the substrate stable.
DeepSeek is hiring an "Agent Harness" researcher. Possibly the first role with that title anywhere. The field is finally naming the thing.
The harness is where durable state meets reality, and reality fails differently than your tests.
I hit this last week. A migration looked idempotent and got silently swallowed in prod. One backend aborts the whole transaction after the first failed statement, the other keeps going. Tests were green on the forgiving one. Prod ran on the strict one.
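The post doesn't name the two backends. As one illustration of the forgiving behavior, SQLite keeps a transaction open after a failed statement, so a migration that swallows per-statement errors still commits everything else. PostgreSQL is an example of the strict behavior: after the first error, later statements in the same transaction are rejected until it is rolled back. A minimal sketch with a hypothetical `run_migration` helper:

```python
import sqlite3


def run_migration(conn, statements):
    """Run statements in one transaction, swallowing per-statement errors."""
    errors = []
    conn.execute("BEGIN")
    for sql in statements:
        try:
            conn.execute(sql)
        except sqlite3.Error as exc:
            errors.append(str(exc))  # the "looks idempotent" pattern: ignore, move on
    conn.execute("COMMIT")
    return errors


# isolation_level=None gives explicit control over BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
errors = run_migration(conn, [
    "INSERT INTO t VALUES (1)",
    "INSERT INTO t VALUES (1)",  # fails: duplicate primary key
    "INSERT INTO t VALUES (2)",  # still applied on SQLite
])
rows = [r[0] for r in conn.execute("SELECT id FROM t ORDER BY id")]
```

On SQLite this leaves rows 1 and 2 committed with one swallowed error, so a test suite on this backend stays green. The same pattern on a strict backend loses the whole transaction.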
@steipete This is the right shape. What decides whether it survives unattended is what happens when one of those 5-min steps goes wrong: does a bad write kill the thread, or is there a checkpoint + safe resume? That recovery layer is the hard part, not the loop.
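A minimal sketch of checkpoint + safe resume, assuming steps are plain callables and state fits in a JSON file. `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` switch are all hypothetical names for illustration. The checkpoint is written with a temp file plus an atomic rename, so a crash mid-write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename: old or new checkpoint, never half of one


def load_checkpoint(path):
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path) as f:
        return json.load(f)


def run(path, steps, fail_at=None):
    """Run steps in order, checkpointing after each; resume from the last checkpoint."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        if i == fail_at:
            raise RuntimeError(f"step {i} failed")  # simulated bad step
        state["results"].append(steps[i]())
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state["results"]


calls = []


def step(name):
    def _run():
        calls.append(name)
        return name
    return _run


path = os.path.join(tempfile.mkdtemp(), "run.json")
steps = [step("a"), step("b"), step("c")]
try:
    run(path, steps, fail_at=1)  # first attempt dies at step 1
except RuntimeError:
    pass
results = run(path, steps)  # second attempt resumes at step 1
```

Step "a" runs once across both attempts. The sketch still assumes each step is safe to re-run from its start, so a step with external side effects needs its own protection against double-firing.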
@dawnsongtweets The "job-ready" gap isn't single-task skill; Fable 5 keeps closing that. It's reliability across a long horizon: state that survives, recovery from a bad step, verification you can trust. A benchmark that runs full jobs will show the model improved but the runtime didn't.