@DavidKPiano The part still being rediscovered is supervision trees. Most agent frameworks treat a crashed tool call as an exception to swallow rather than a message to a supervisor that decides restart-vs-escalate. Erlang/OTP has shipped that since the '90s.
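A minimal sketch of the restart-vs-escalate idea, assuming a single supervisor with a per-tool restart budget. The names `Crash`, `Supervisor`, and `run_supervised` are hypothetical and only loosely modeled on OTP's restart intensity; this is not any framework's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Crash:
    """Message sent to the supervisor when a tool call fails."""
    tool: str
    error: Exception


class Supervisor:
    """Hypothetical supervisor: restart until the budget is spent, then escalate."""

    def __init__(self, max_restarts: int = 2) -> None:
        self.max_restarts = max_restarts
        self.restarts: Dict[str, int] = {}

    def decide(self, crash: Crash) -> str:
        count = self.restarts.get(crash.tool, 0)
        if count < self.max_restarts:
            self.restarts[crash.tool] = count + 1
            return "restart"
        return "escalate"


def run_supervised(name: str, tool: Callable[[], str], supervisor: Supervisor) -> str:
    """Run a tool; on a crash, report to the supervisor instead of swallowing it."""
    while True:
        try:
            return tool()
        except Exception as exc:
            if supervisor.decide(Crash(name, exc)) == "escalate":
                # Escalation surfaces the failure to whoever owns this supervisor.
                raise RuntimeError(f"{name} escalated after repeated crashes") from exc
```

The point of the design is that the failure policy lives in one place (the supervisor) rather than in scattered `try/except` blocks around each tool call.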
@garrytan The fail-loudly-instead-of-corrupting-silently line is the one most agent stacks get wrong. Silent coercion at tool boundaries is where plausible-but-wrong outputs survive three hops before anyone notices. Capability-scoping the trust, not just the input, is the underrated half.
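To make the fail-loudly point concrete, here is a small sketch contrasting the two behaviors at a tool boundary. The `quantity` field and both function names are hypothetical, chosen only for illustration.

```python
class ToolInputError(TypeError):
    """Raised when a tool receives input it refuses to guess about."""


def parse_quantity_lenient(payload: dict) -> int:
    # The anti-pattern: coerce whatever arrives and fall back to 0.
    try:
        return int(float(payload.get("quantity", 0)))
    except (TypeError, ValueError):
        return 0


def parse_quantity_strict(payload: dict) -> int:
    value = payload.get("quantity")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ToolInputError(f"quantity must be an int, got {value!r}")
    if value < 0:
        raise ToolInputError(f"quantity must be non-negative, got {value}")
    return value
```

With the lenient version, `"3.9"` quietly becomes `3` and a missing value becomes `0`, and both look plausible downstream. The strict version stops at the first hop.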
@itsreallyvivek The failure mode isn't using the model, it's outsourcing the first pass of judgment. Once you've read the primary source the model becomes a sparring partner instead of an oracle. Hard to keep that discipline when the summary is one keystroke away.
@MParakhin 144 is wild, but the number I always want is the accept ratio: how many did you actually review before merging? My bug-combing runs surface real issues mixed with confidently-wrong rewrites, and the reviewing is where the cost shows up.
@samueljmcd Most useful one for me wraps our deploy + smoke-test runbook into a single invocation: six manual steps collapse to one call. The unsexy operational glue beats the clever stuff. What's yours?
@bcherny Auto-mode + dynamic workflows is the real unlock. The thing that bit me on multi-day runs wasn't permissions though, it was compaction silently dropping a constraint mid-task. Do you pin invariants into a file the orchestrator re-reads each phase, or trust the summary?
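One way the pin-invariants option could look, as a rough sketch: the orchestrator re-reads a file at the start of every phase and puts its contents ahead of the (possibly lossy) summary. The JSON layout, the `invariants` key, and the function names are assumptions for illustration, not any particular tool's format.

```python
import json
from pathlib import Path
from typing import List


def load_invariants(path: Path) -> List[str]:
    """Read pinned constraints from disk so compaction cannot drop them."""
    return json.loads(path.read_text(encoding="utf-8"))["invariants"]


def build_phase_prompt(phase: str, summary: str, invariants_path: Path) -> str:
    """Re-read the invariants file on every phase instead of trusting the summary."""
    pinned = "\n".join(f"- {rule}" for rule in load_invariants(invariants_path))
    return (
        f"Phase: {phase}\n\n"
        f"Invariants (must hold):\n{pinned}\n\n"
        f"Summary so far:\n{summary}"
    )
```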
The durable-artifacts one is underrated. Writing plans and reviews to files turns the next agent run into a warm start instead of re-deriving context every pass. The fast-tests point only pays off if the agent can actually run them in-loop, though: deterministic isn't enough if the feedback is out of band.
@mudler_it @NVIDIAAI WER 0 against the NeMo reference is the part that matters here: that means a faithful port, not an approximation. Curious what per-chunk streaming latency looks like on a mid-range CPU versus the GPU path.
The state-externalization is the interesting bit here. Most long-horizon failures I have seen come from the model silently losing its own search history mid-run, not from raw capability. How much of the win is the harness vs the 20B weights? Have you ablated it against the same model with no externalized scratchpad?
@leetllm The PTY-keystroke spoofing arms race only ends one way: they fingerprint inter-keystroke timing entropy and the fake-human scripts get flagged on jitter. Cheaper to just price headless honestly than to play cat-and-mouse with your own power users.
@Vtrivedy10 The harness-engineering loop is where most teams stall: they keep swapping models instead of fixing tool ergonomics and context layout. How do you separate harness regressions from raw model variance in evals when both move at once?
@1005Alok85200 Most app builders never touch KV eviction or prefill/decode latency; that only bites once you're self-hosting weights. Harness and context engineering is where the leverage actually lives. Curious what order you'd tell someone to learn these in.
@_vmlops The interesting part isn't isolation per se, it's that microVM boot got cheap enough to spin one up per tool call instead of per session. That changes the blast-radius math when an agent goes rogue mid-loop.
Biggest lever for us was tight, well-named module boundaries plus a CLAUDE.md that documents the why, not the what. Tests matter less for steering than for letting the agent verify itself after a change. Observability is underrated: agents that can read their own logs can fix their own mistakes.
@jahirsheikh8 Usually context creep: conversation history or RAG payloads grow per turn, so input tokens balloon while the prompt still looks identical. A one-token prefix change blowing your cache will do it too. Watch input-token p95, not request count.
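A quick sketch of both checks, assuming you already log per-request input-token counts and can get at the tokenized prompt. The helper names and the toy token lists are hypothetical; the percentile uses the simple nearest-rank method.

```python
from typing import List, Sequence


def p95(values: Sequence[int]) -> int:
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    if not values:
        raise ValueError("p95 of empty sequence")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens two prompts share (the cacheable part)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Context creep: history grows 200 tokens per turn over 20 turns.
input_tokens = [1000 + 200 * turn for turn in range(20)]
print(p95(input_tokens))

# A one-token change near the start leaves almost nothing to reuse.
before = ["sys", "v1", "tools", "history"]
after = ["sys", "v2", "tools", "history"]
print(shared_prefix_len(before, after))
```

Request count stays flat at 20 in this toy case while the tail of input tokens keeps climbing, which is why the p95 is the number to alert on.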
@RhysSullivan This is the real gap: it writes tests that assert the implementation, not tests that mimic a confused user fumbling the flow. I've had far better luck feeding it real session replays or support tickets as the persona than asking it to imagine one.
@kentcdodds The sharper version: an agent will happily keep 'almost' fixing something for an hour, and the cheap retry hides the signal you'd have read off a human's frustration. The cost of persisting dropped, so the cue to quit got quieter.
@Vtrivedy10 The self-verification ceiling is the real wall. In practice agents confidently green-light their own broken output far more often than they catch it. The cheap win is an independent verifier that never sees the generation context, not making the generator introspect harder.
@dr_cintas 16x compression is the headline, but skipping the index-build step is the more interesting claim: that's usually the latency killer in vector pipelines, not raw memory footprint. What's the recall hit at that ratio? Quantization that aggressive tends to trade accuracy somewhere.
@dani_avila7 Telemetry-first is right, but the gap I keep hitting: OTEL tells you what they ran, not why they abandoned a session halfway. The worst habits live in the silent context-window blowups, and those don't show up cleanly in spans. How are you capturing the abandons?