@omarsar0 this matches what i see running my own rag stack. the "best" context strategy flips with reuse rate. high reuse, cache aggressively. low reuse, the cache bookkeeping costs more than it saves. picking per-deployment instead of a global default is the right call.
@mattshumer_ the axis i'd watch: open weights closing the gap on reasoning, not raw scale. i run qwen and deepseek locally on a GB10 and they're already frontier-ish on math and code. the "different" is going to be where the value sits, not who ships the biggest model.
@vllm_project@NVIDIAAI day-0 support sounds boring until you're the one on weird hardware. i run vllm on a GB10 spark (sm_121) and spent a week chasing broken wheels before the docker nightly worked. the single openai-compatible api across modalities is what makes new stuff usable locally on day one.
@hwchase17 evaluator design is the hard part for long-horizon agents. the trap: scoring the final answer when the trajectory got there by luck. building a research agent now, and the scorer rewarding a run a human nudged is the failure i keep landing on. outcome evals hide a broken process.
@kimmonismus@AndrewCurran_ the tight feedback loop is why coding compounds fastest. tests and compile give you a verifier most domains don't. but "solve coding solves everything" skips the gap where there's no oracle to grade against. no unit test for "is this hypothesis good".
I have a problem. Karpathy released a 630-line autonomous research script called autoresearch in March. I built a version for my own coding queue, wrote a blog about it, and now every time I see a leaderboard I have to point my loop at it. This is how I ended up doing an ADMET challenge on a weekend.
For the non-pharma readers: ADMET is the part of drug discovery that asks "will this molecule survive contact with a human body." PXR is a sensor inside human cells. When a drug flips it on, the body responds by making more of an enzyme that chews up other drugs passing through. Run a three-drug oncology combination, have one of them flip PXR, and you can lose half the exposure of the other two without knowing. Combination trials live or die on this.
The OpenADMET PXR Induction blind challenge: 211 teams, 513 blinded molecules, leaderboard. I entered on a whim. I had never trained a molecular property model. I had never heard of Chemprop.
Six days later the loop closed at rank ~40 out of 211. Top 19%. Crazy right? Top 20%! In some random domain. This is the part nobody should care about.
The part to care about: I never picked an architecture. I never chose a loss function. Every model call was made by Claude inside the loop. I was reviewing summaries.
What I built was the scaffolding it ran inside: a queue of ideas, a hard gate that let nothing through unless it was honestly better, and a step where the system attacked its own claims before I ever saw them. I decided what counted as evidence. Claude did the rest. That's closer to running a research group than using a coding assistant.
It paid for itself in one catch. A submission that looked like a top-25 breakthrough turned out to be the model cheating itself, scoring high by peeking at the answers it was being graded on. The gate killed it. Shipping it would have dropped me to rank 87.
The model is fine. The system is what I'm taking with me. Full code, weights, methodology report on HuggingFace, Apache 2.0.
API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited
...is crushing my soul this morning. I finally got around to trying the /workflow, and I get the sense that maybe me and everyone else running 100+ subagents might be cooking the @AnthropicAI servers a bit. @ClaudeDevs
`~/.claude/projects/` is full of months of Claude Code work that `grep` can't find semantically. So I built mneme: a local LanceDB index over the JSONL with four search modes (vector, fts, hybrid, rerank), wired into Claude Code as a skill. On a real corpus of 9,145 sessions and 125,062 chunks, rerank hits Recall@5 0.926 and MRR 0.864 against 27 hand-curated queries. Right session is the top hit 81.5% of the time.
Runs locally. No daemon, no cloud round trip. First index pulls model weights and chews through your history once. After that, incremental.
The eval harness ships with it. I tested bge-m3 (four times the params of nomic) overnight against the live numbers. It tied rerank recall but lost MRR by 0.018; vector-only it lost MRR by 0.151. The bigger embedder bought nothing at the top of the ladder and was worse everywhere else. Without the harness I'd have switched on instinct.
Twenty minutes to stand up. Repo's public. README walks the install, the Claude Code skill is two files, and the eval reproduces on your own corpus.
cc @claudeai@nomic_ai@lancedb
Anthropic shipped three things the same day and pretended they were separate.
Opus 4.8. A new Claude Code primitive called Dynamic Workflows that writes its own orchestration script and fans out to dozens of subagents with verifiers built in. And a 3x cut to Fast pricing, from $30/$150 per million tokens down to $10/$50.
They're one change. The unit of agentic work just moved from one careful model call to dozens of verified ones, and the price finally allows it.
The orchestration-first pattern has been possible for a year. LangGraph did it. CrewAI did it. Nobody ran it as a default because fifty parallel agents on the old Fast pricing burned a tank of compute to produce what a senior engineer could have written by hand. At a third of the cost, the math flips. Run it three times from different angles and trust the intersection.
The thing nobody is pricing correctly yet is the verifier. A fan-out that finds fifty plausible bugs and forwards all fifty is worse than the single pass it replaced, because now a human triages noise instead of reading code. Anthropic's own examples lean on adversarial verification for a reason. Most builders do not have that discipline yet, and the tooling for it is thin.
So the bottleneck moves. The question stops being "is the model good enough yet." It becomes "who is verifying."
Wrote up the full breakdown, including what's oversold, on Run Data Run.
Anthropic shipped Opus 4.8 + Dynamic Workflows today.
How I absorb new Claude versions without breaking everything I've built on top:
1. Runtime model alias stays unversioned. Every skill and agent in my setup uses `model: opus`. Claude Code resolves that to whatever's current default. The moment 4.8 flipped, my fleet was on it. No edits.
2. Prompting rules file IS version-pinned. `~/.claude/rules/opus-4-8.md` documents the model's quirks: 4.7's literalism, fewer-subagent bias, adaptive-only thinking. Those traits change between versions. If I named the file `opus-latest.md`, stale guidance would silently attach to a new model I hadn't audited.
3. Upgrade checkpoint: rename the file. That one rename forces me to re-read the changelog, audit which sections still apply, rewrite anything 4.7-only. Today that was: new High default on https://t.co/RRVSHTk5fX, fast mode 3x cheaper, the ultracode setting for auto-fanning to Dynamic Workflows.
4. ultracode stays OFF as default. Workflows can fan out to hundreds of subagents. They also burn quota fast. Opt-in per session, not always-on.
Runtime alias unpinned for ease. Rules doc pinned for discipline. Different jobs, different defaults.
cc @AnthropicAI@bcherny@_catwu@alexalbert__@jarredsumner (750k lines, Zig to Rust, 11 days. The Bun port is the Dynamic Workflows headline.)
@GoogleDeepMind the bar for 'AI for science' is whether it survives contact with a real clinical or wet-lab workflow, not the demo. been running Gemini grounding on genomics lit this month, solid for the triage step. what's the eval suite behind these tools?