Justin H. Johnson

@BioInfo

F500 AI exec who still ships. 50 projects in 18 months. Writing the book "Builder-Leader: The AI Exoskeleton That Crosses the Gap"

Washington, DC

Joined April 2009

720 Following

4.8K Followers

3.5K Posts

Justin H. Johnson

@BioInfo

3 days ago

@omarsar0 built a rag platform on the dgx and ran into the same thing, wrote it up: https://t.co/Dml6BysjMP

Justin H. Johnson

@BioInfo

3 days ago

@omarsar0 this matches what i see running my own rag stack. the "best" context strategy flips with reuse rate. high reuse, cache aggressively. low reuse, the cache bookkeeping costs more than it saves. picking per-deployment instead of a global default is the right call.
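A rough way to see why the answer flips with reuse rate. This is only a sketch: the linear cost model, the function names, and every number are illustrative assumptions, not measurements from either stack.

```python
def cache_net_saving(reuse_rate: float, recompute_cost: float, bookkeeping_cost: float) -> float:
    """Expected saving per request from caching context (assumed linear model).

    reuse_rate: fraction of requests that hit the cache, between 0 and 1.
    recompute_cost: cost of rebuilding the context when there is no cache.
    bookkeeping_cost: overhead of maintaining the cache, paid on every request.
    """
    if not 0.0 <= reuse_rate <= 1.0:
        raise ValueError("reuse_rate must be between 0 and 1")
    return reuse_rate * recompute_cost - bookkeeping_cost


def pick_strategy(reuse_rate: float, recompute_cost: float, bookkeeping_cost: float) -> str:
    """Choose per deployment instead of using one global default."""
    saving = cache_net_saving(reuse_rate, recompute_cost, bookkeeping_cost)
    return "cache" if saving > 0 else "no-cache"


print(pick_strategy(0.8, 10.0, 2.0))  # high reuse -> cache
print(pick_strategy(0.1, 10.0, 2.0))  # low reuse -> no-cache
```

Under this model the break-even reuse rate is bookkeeping_cost / recompute_cost, which is why the same stack can want opposite defaults in two deployments.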

Justin H. Johnson

@BioInfo

3 days ago

@mattshumer_ made the longer argument here, the model's not the frontier anymore: https://t.co/wCb1QDuP4w

Justin H. Johnson

@BioInfo

3 days ago

@mattshumer_ the axis i'd watch: open weights closing the gap on reasoning, not raw scale. i run qwen and deepseek locally on a GB10 and they're already frontier-ish on math and code. the "different" is going to be where the value sits, not who ships the biggest model.

188

Justin H. Johnson

@BioInfo

3 days ago

@vllm_project @NVIDIAAI did a week-one writeup on the spark stack, it fights you before it helps: https://t.co/H0bmchiZbz

Justin H. Johnson

@BioInfo

3 days ago

@vllm_project @NVIDIAAI day-0 support sounds boring until you're the one on weird hardware. i run vllm on a GB10 spark (sm_121) and spent a week chasing broken wheels before the docker nightly worked. the single openai-compatible api across modalities is what makes new stuff usable locally on day one.

171

Justin H. Johnson

@BioInfo

3 days ago

@hwchase17 did a writeup on this after a 21k-run eval sweep, if useful: https://t.co/umODQVz6Wl

Justin H. Johnson

@BioInfo

3 days ago

@hwchase17 evaluator design is the hard part for long-horizon agents. the trap: scoring the final answer when the trajectory got there by luck. building a research agent now, and the scorer rewarding a run a human nudged is the failure i keep landing on. outcome evals hide a broken process.
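A minimal sketch of the difference between an outcome-only score and one that looks at the trajectory. The `Step` and `Run` types, the `actor` field, and the scoring rule are all hypothetical, made up here for illustration; they are not the scorer from the research agent.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    actor: str       # "agent" or "human" (hypothetical labels)
    verified: bool   # whether an independent check on this step passed


@dataclass
class Run:
    final_correct: bool
    steps: list[Step] = field(default_factory=list)


def outcome_score(run: Run) -> float:
    """Scores only the final answer, so a lucky or nudged run looks perfect."""
    return 1.0 if run.final_correct else 0.0


def process_aware_score(run: Run) -> float:
    """Zero credit if a human intervened; otherwise weight the outcome by verified steps."""
    if not run.steps or any(step.actor == "human" for step in run.steps):
        return 0.0
    verified_fraction = sum(step.verified for step in run.steps) / len(run.steps)
    return outcome_score(run) * verified_fraction


nudged = Run(final_correct=True, steps=[Step("agent", True), Step("human", True)])
print(outcome_score(nudged), process_aware_score(nudged))  # 1.0 0.0
```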

Justin H. Johnson

@BioInfo

3 days ago

@kimmonismus @AndrewCurran_ the tight feedback loop is why coding compounds fastest. tests and compile give you a verifier most domains don't. but "solve coding solves everything" skips the gap where there's no oracle to grade against. no unit test for "is this hypothesis good".

Justin H. Johnson

@BioInfo

3 days ago

Full writeup: https://t.co/GXwjwEsIgV

Justin H. Johnson

@BioInfo

3 days ago

I have a problem. Karpathy released a 630-line autonomous research script called autoresearch in March. I built a version for my own coding queue, wrote a blog about it, and now every time I see a leaderboard I have to point my loop at it. This is how I ended up doing an ADMET challenge on a weekend.

For the non-pharma readers: ADMET is the part of drug discovery that asks "will this molecule survive contact with a human body." PXR is a sensor inside human cells. When a drug flips it on, the body responds by making more of an enzyme that chews up other drugs passing through. Run a three-drug oncology combination, have one of them flip PXR, and you can lose half the exposure of the other two without knowing. Combination trials live or die on this.

The OpenADMET PXR Induction blind challenge: 211 teams, 513 blinded molecules, leaderboard. I entered on a whim. I had never trained a molecular property model. I had never heard of Chemprop. Six days later the loop closed at rank ~40 out of 211. Top 19%.

Crazy right? Top 20%! In some random domain. This is the part nobody should care about.

The part to care about: I never picked an architecture. I never chose a loss function. Every model call was made by Claude inside the loop. I was reviewing summaries. What I built was the scaffolding it ran inside: a queue of ideas, a hard gate that let nothing through unless it was honestly better, and a step where the system attacked its own claims before I ever saw them. I decided what counted as evidence. Claude did the rest. That's closer to running a research group than using a coding assistant.

It paid for itself in one catch. A submission that looked like a top-25 breakthrough turned out to be the model cheating itself, scoring high by peeking at the answers it was being graded on. The gate killed it. Shipping it would have dropped me to rank 87.

The model is fine. The system is what I'm taking with me. Full code, weights, methodology report on HuggingFace, Apache 2.0.
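A sketch of what a gate like that could look like. The real gate is in the linked writeup; this version is an assumption built only from the description above: reject anything that overlaps the data it is graded on, then reject anything that does not beat the current best. The function name, the ID-based leakage check, and the higher-is-better score are all hypothetical.

```python
def gate(candidate_score, best_score, train_ids, eval_ids, min_gain=0.0):
    """Return (accepted, reason). Assumes a higher score is better."""
    # Leakage check first: a score earned by seeing the graded items is not evidence.
    leaked = set(train_ids) & set(eval_ids)
    if leaked:
        return False, f"leakage: {len(leaked)} eval items seen in training"
    # Only then ask whether the candidate improves on the current best by a real margin.
    if candidate_score <= best_score + min_gain:
        return False, "no honest improvement"
    return True, "accepted"


# A candidate that looks like a breakthrough but trained on a graded molecule.
print(gate(0.90, 0.70, train_ids=["m1", "m2"], eval_ids=["m2", "m3"]))
```

Ordering matters in this sketch: the leakage check runs before the score comparison, so a high number never gets the chance to argue for itself.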

475

Justin H. Johnson

@BioInfo

3 days ago

https://t.co/GXwjwEsIgV

Justin H. Johnson

@BioInfo

5 days ago

API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited ...is crushing my soul this morning. I finally got around to trying the /workflow, and I get the sense that maybe me and everyone else running 100+ subagents might be cooking the @AnthropicAI servers a bit. @ClaudeDevs

Justin H. Johnson

@BioInfo

6 days ago

https://t.co/zEeBDPQVfb

Justin H. Johnson

@BioInfo

6 days ago

`~/.claude/projects/` is full of months of Claude Code work that `grep` can't find semantically. So I built mneme: a local LanceDB index over the JSONL with four search modes (vector, fts, hybrid, rerank), wired into Claude Code as a skill. On a real corpus of 9,145 sessions and 125,062 chunks, rerank hits Recall@5 0.926 and MRR 0.864 against 27 hand-curated queries. Right session is the top hit 81.5% of the time.

Runs locally. No daemon, no cloud round trip. First index pulls model weights and chews through your history once. After that, incremental.

The eval harness ships with it. I tested bge-m3 (four times the params of nomic) overnight against the live numbers. It tied rerank recall but lost MRR by 0.018; vector-only it lost MRR by 0.151. The bigger embedder bought nothing at the top of the ladder and was worse everywhere else. Without the harness I'd have switched on instinct.

Twenty minutes to stand up. Repo's public. README walks the install, the Claude Code skill is two files, and the eval reproduces on your own corpus.

cc @claudeai @nomic_ai @lancedb
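For readers who have not computed these metrics before, a minimal sketch of Recall@k and MRR. It assumes exactly one correct session per query, which is a guess based on the "right session" phrasing, and it is not the harness that ships with mneme. With 27 queries, the reported figures are consistent with 25 queries landing in the top five (25/27 ≈ 0.926) and 22 at rank one (22/27 ≈ 0.815).

```python
def recall_at_k(ranked, relevant, k=5):
    """Fraction of queries whose correct session id appears in the top k results."""
    if not relevant:
        raise ValueError("need at least one query")
    hits = sum(1 for results, target in zip(ranked, relevant) if target in results[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(ranked, relevant):
    """Average of 1/rank of the correct session; a query that misses contributes 0."""
    if not relevant:
        raise ValueError("need at least one query")
    total = 0.0
    for results, target in zip(ranked, relevant):
        if target in results:
            total += 1.0 / (results.index(target) + 1)
    return total / len(relevant)


# Three toy queries: hit at rank 1, hit at rank 3, and a miss.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant = ["a", "z", "m"]
print(recall_at_k(ranked, relevant, k=2))      # only the first query hits in the top 2
print(mean_reciprocal_rank(ranked, relevant))  # (1 + 1/3 + 0) / 3
```

Recall@5 only asks whether the right session is somewhere in the top five, while MRR rewards it being near the top, which is how two embedders can tie on one and split on the other.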

Justin H. Johnson

@BioInfo

6 days ago

Full post: https://t.co/F1SxUCdvTx

Justin H. Johnson

@BioInfo

6 days ago

Anthropic shipped three things the same day and pretended they were separate.

Opus 4.8. A new Claude Code primitive called Dynamic Workflows that writes its own orchestration script and fans out to dozens of subagents with verifiers built in. And a 3x cut to Fast pricing, from $30/$150 per million tokens down to $10/$50.

They're one change. The unit of agentic work just moved from one careful model call to dozens of verified ones, and the price finally allows it.

The orchestration-first pattern has been possible for a year. LangGraph did it. CrewAI did it. Nobody ran it as a default because fifty parallel agents on the old Fast pricing burned a tank of compute to produce what a senior engineer could have written by hand. At a third of the cost, the math flips. Run it three times from different angles and trust the intersection.

The thing nobody is pricing correctly yet is the verifier. A fan-out that finds fifty plausible bugs and forwards all fifty is worse than the single pass it replaced, because now a human triages noise instead of reading code. Anthropic's own examples lean on adversarial verification for a reason. Most builders do not have that discipline yet, and the tooling for it is thin.

So the bottleneck moves. The question stops being "is the model good enough yet." It becomes "who is verifying."

Wrote up the full breakdown, including what's oversold, on Run Data Run.
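"Trust the intersection" can be sketched in a few lines. This assumes each run's findings have already been normalized to comparable keys, which is the hard part in practice and is not solved here; the function name and the `min_votes` parameter are hypothetical.

```python
from collections import Counter


def agreed_findings(runs, min_votes=None):
    """Keep findings reported by at least min_votes independent runs.

    runs: one collection of finding keys per run.
    min_votes: defaults to len(runs), i.e. a strict intersection.
    """
    if min_votes is None:
        min_votes = len(runs)
    votes = Counter()
    for findings in runs:
        votes.update(set(findings))  # each run votes at most once per finding
    return sorted(key for key, count in votes.items() if count >= min_votes)


runs = [
    {"bug-1", "bug-2", "bug-3"},
    {"bug-2", "bug-3"},
    {"bug-3", "bug-4"},
]
print(agreed_findings(runs))               # ['bug-3']
print(agreed_findings(runs, min_votes=2))  # ['bug-2', 'bug-3']
```

Agreement across runs filters noise but does not verify anything on its own: three runs can share the same blind spot, so this sits in front of a verifier, not in place of one.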

113

Justin H. Johnson

@BioInfo

7 days ago

Opus 4.8: https://t.co/lrSadi66wU Dynamic Workflows: https://t.co/EHea5kN3Lf

Justin H. Johnson

@BioInfo

7 days ago

Anthropic shipped Opus 4.8 + Dynamic Workflows today. How I absorb new Claude versions without breaking everything I've built on top:

1. Runtime model alias stays unversioned. Every skill and agent in my setup uses `model: opus`. Claude Code resolves that to whatever's current default. The moment 4.8 flipped, my fleet was on it. No edits.

2. Prompting rules file IS version-pinned. `~/.claude/rules/opus-4-8.md` documents the model's quirks: 4.7's literalism, fewer-subagent bias, adaptive-only thinking. Those traits change between versions. If I named the file `opus-latest.md`, stale guidance would silently attach to a new model I hadn't audited.

3. Upgrade checkpoint: rename the file. That one rename forces me to re-read the changelog, audit which sections still apply, rewrite anything 4.7-only. Today that was: new High default on https://t.co/RRVSHTk5fX, fast mode 3x cheaper, the ultracode setting for auto-fanning to Dynamic Workflows.

4. ultracode stays OFF as default. Workflows can fan out to hundreds of subagents. They also burn quota fast. Opt-in per session, not always-on.

Runtime alias unpinned for ease. Rules doc pinned for discipline. Different jobs, different defaults.

cc @AnthropicAI @bcherny @_catwu @alexalbert__ @jarredsumner (750k lines, Zig to Rust, 11 days. The Bun port is the Dynamic Workflows headline.)
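The rename checkpoint could be backed by a small check, sketched below. Only the `opus-4-8.md` naming pattern comes from the post; the helper names are hypothetical, and how the current model version is discovered is deliberately left to the caller because the post does not say.

```python
from pathlib import Path


def rules_filename(family: str, version: str) -> str:
    """Map a model family and version to the pinned rules filename, e.g. opus-4-8.md."""
    return f"{family}-{version.replace('.', '-')}.md"


def needs_audit(rules_dir: Path, family: str, current_version: str) -> bool:
    """True when no rules file is pinned to the version currently in use."""
    return not (rules_dir / rules_filename(family, current_version)).is_file()


# The version string is supplied by hand here; nothing in this sketch queries Claude Code.
rules_dir = Path.home() / ".claude" / "rules"
if needs_audit(rules_dir, "opus", "4.8"):
    print("No rules file pinned to opus 4.8: re-read the changelog, then rename.")
```

The check fails closed: a missing file means the guidance has not been audited for this version, which is the same signal the manual rename is meant to produce.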

150

Justin H. Johnson

@BioInfo

8 days ago

@GoogleDeepMind the bar for 'AI for science' is whether it survives contact with a real clinical or wet-lab workflow, not the demo. been running Gemini grounding on genomics lit this month, solid for the triage step. what's the eval suite behind these tools?

204
