really well written blog on the emerging shift from optimization-limited to rollout-limited frontier RL. what stood out to me was the observation that as reasoning trajectories become longer and more heterogeneous GPU utilization and learner-generator synchronization increasingly dominate training efficiency. one underexplored question is whether capability jumps themselves can be detected through staleness statistics. if old trajectories suddenly become much less useful for learning that may indicate the model has discovered a qualitatively new reasoning strategy. in that sense replay-buffer value decay could become an observable signal of phase transitions in capability acquisition. if you're into async-RL/large-scale training systems this is a really worthwhile read.
The open-source protein ML space just got a massive upgrade. Phenomenal work by @anindyadeeps and @try_litefold on dropping the biggest protein data collection on Hugging Face
open sourcing Marlin-2B 🐟
a tiny VLM to extract structured information from videos
Marlin is finetuned for two questions devs want to ask in their videos: what is happening, and when?
Best open model in its weight class, competitive with Gemini-2.5-flash at only 2B params 🧵
Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.
The bitter lesson in 26 words:
Don’t be distracted by human knowledge, as AI has been historically.
Instead focus on methods for creating knowledge that scale with computation, like search and learning.
🚨Excited to announce Agent-BRACE!
LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL.
Key takeaways:
1️⃣ 🎯The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history.
2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length.
3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic belief decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode.
👇🧵
@thinkymachines released interaction models! ☄️
➡️ decoder-only transformer running over a single interleaved (in_k, out_k) stream of 200ms micro-turns across audio, video, and text.
Encoder-free early fusion: dMel intensity-binned mel filterbanks + light embed for audio, 40×40 patches through an hMLP stem for video, token embed for text; all collapsed into a bag of embeddings per 200ms chunk and co-trained from scratch.
➡️ Audio out is a flow-matching head over mel frames. No Whisper, no separate TTS, no VAD, no turn boundaries, no dialog manager.
276B/12B-active interaction model stays in real-time; an async background model handles tool use and long-horizon reasoning over shared context, with partial results merged back into the live stream at moments appropriate to user state.
KAME and MoshiRAG did this pattern for retrieval; TML generalizes it.
Streaming sessions (upstreamed to SGLang) ship each 200ms chunk as a separate request and append to a persistent in-GPU KV sequence, eliminating per-turn realloc and metadata recompute.
MoE kernels use gather+gemv instead of grouped GEMM for decode shapes. Bitwise trainer-sampler alignment via batch-invariant kernels from their Sep 2025 work: NVLS deterministic all-reduce on Blackwell, Split-KV attention with consistent accumulation order between prefill and decode (SM-aligned 4096-token left splits), under 5% e2e overhead.
Moshi hits 200ms but is audio-only. Qwen3-Omni is ~234ms but uses a separate AuT encoder + sliding-window DiT, not encoder-free. AURA does visual proactivity but wraps ASR/TTS around a VideoLLM (text-out, half-duplex). gpt-realtime-2 and Gemini Live still rely on VAD turn detection.
Amazing talk by @DrJimFan -- exciting times for Physical AGI!
TL;DR
VLA architecture is parameter-misallocated toward language and should be replaced by World Action Models
→ pretrained video diffusion models that jointly predict future world states and robot actions, instantiated by Dream Zero (a 14B model running real-time control at 7Hz with 2× generalization gains over VLAs, @SeonghyeonYe ).
https://t.co/c9GEg7JAjw
His central data claim is that egocentric human video is the FSD-equivalent ambient data flywheel for robotics, and EgoScale (@ruijie_zheng12) demonstrates a near-perfect log-linear scaling law (ℒ(N) = a − b log N, R² = 0.9983) between 1K and 20K hours of pretraining data and downstream dexterity performance.
https://t.co/79gbH6tV0X
His central environment claim is that classical physics simulators will be replaced by neural simulators, and Dream Dojo (@ShenyuanGao) demonstrates this with 44K hours of human video pretraining, 10 FPS real-time interaction, and Pearson r = 0.995 policy-evaluation fidelity.
https://t.co/F1XigZg2fE
Significant gaps in the talk:
- it does not address runtime semantics (skill installation, behavior consistency, run-update separation)
- it does not address the model-exploitation failure mode of training policies against learned simulators or learned rewards.
My running notes w/ Opus 4.7 👇
https://t.co/lAnuorsUC6
@NVIDIAAI
Notes on Robotics' End Game: Nvidia's Jim Fan!
- @sidgraph and Opus 4.7
https://t.co/LT7eFh29Lo
TL;DR
Vision-Language-Action architecture is parameter-misallocated toward language and should be replaced by World Action Models
→ pretrained video diffusion models that jointly predict future world states and robot actions, instantiated by Dream Zero (a 14B model running real-time control at 7Hz with 2× generalization gains over VLAs).
His central data claim is that egocentric human video is the FSD-equivalent ambient data flywheel for robotics, and EgoScale demonstrates a near-perfect log-linear scaling law (ℒ(N) = a − b log N, R² = 0.9983) between 1K and 20K hours of pretraining data and downstream dexterity performance.
His central environment claim is that classical physics simulators will be replaced by neural simulators, and Dream Dojo demonstrates this with 44K hours of human video pretraining, 10 FPS real-time interaction, and Pearson r = 0.995 policy-evaluation fidelity.
The framework is strongest on data (genuinely empirically grounded), partial on architecture (Dream Zero shows the substrate works but is GPT-2-stage, not GPT-3-stage), and weakest on the predictive claims about timeline and the "VLA is dead" rhetorical framing.
Three significant gaps in the talk: it does not address runtime semantics (skill installation, behavior consistency, run-update separation), it does not address safety, and it does not address the model-exploitation failure mode of training policies against learned simulators or learned rewards.
https://t.co/lAnuorsUC6
Amazing talk by @DrJimFan -- exciting times for Physical AGI!
TL;DR
VLA architecture is parameter-misallocated toward language and should be replaced by World Action Models
→ pretrained video diffusion models that jointly predict future world states and robot actions, instantiated by Dream Zero (a 14B model running real-time control at 7Hz with 2× generalization gains over VLAs, @SeonghyeonYe ).
https://t.co/c9GEg7JAjw
His central data claim is that egocentric human video is the FSD-equivalent ambient data flywheel for robotics, and EgoScale (@ruijie_zheng12) demonstrates a near-perfect log-linear scaling law (ℒ(N) = a − b log N, R² = 0.9983) between 1K and 20K hours of pretraining data and downstream dexterity performance.
https://t.co/79gbH6tV0X
His central environment claim is that classical physics simulators will be replaced by neural simulators, and Dream Dojo (@ShenyuanGao) demonstrates this with 44K hours of human video pretraining, 10 FPS real-time interaction, and Pearson r = 0.995 policy-evaluation fidelity.
https://t.co/F1XigZg2fE
Significant gaps in the talk:
- it does not address runtime semantics (skill installation, behavior consistency, run-update separation)
- it does not address the model-exploitation failure mode of training policies against learned simulators or learned rewards.
My running notes w/ Opus 4.7 👇
https://t.co/lAnuorsUC6
@NVIDIAAI
I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to my last year's Sequoia AI Ascent talk, "Physical Turing Test". I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy homework ;)
And stay till the end, more easter eggs and predictions for your polymarket!
00:30 DGX-1 origin story at OpenAI, I was there in 2016 signing with Jensen and Elon. Heading to the Computer History Museum!
01:42 The Great Parallel
03:31 Robotics, the Endgame
03:39 Why VLAs fall short
04:32 Video world models as the 2nd pretraining paradigm
06:09 World Action Models (WAM)
07:46 Strategies for robot data collection and the FSD equivalent to physical data flywheel for robot manipulation
11:06 EgoScale and the Dexterity Scaling Law we discovered recently
14:00 Physical RL: bridging the last mile
15:39 DreamDojo: an end-to-end neural physics engine for scaling RL in silico
17:00 Civilizational Technology Tree and my predictions for the near future. Spoiler: it's closer than you think.
Thanks to my friends at Sequoia for inviting me back to AI Ascent this year! I had a blast! Last year's talk is attached in the thread if you missed it.
Vibe Coding is not vibing? Agents perform <60% on SWE-WebDevBench 👀.
Code-level benchmarks (HumanEval, SWE-bench, FeatBench) take the specification as given and grade the patch. Vibe coding inverts this: the user gives natural-language intent, the platform must do PM, engineering, and ops in one pipeline.
We evaluate across three orthogonal axes:
➡️ Interaction Mode (ACR=App Creation Request, vs AMR=App Modification Request),
➡️Agency Angle (PM × Engineering × Ops),
➡️Complexity Tier (T4 multi-role SaaS, T5 AI-native).
Introducing SWE-WebDevBench: a comprehensive eval framework to assess AI coding platforms as virtual software development agencies, covering not just the middle step of coding, but the entire software lifecycle: Requirements gathering, planning, deployment and change management.
@QwikBuild, @emergentlabs, @Lovable, @v0
see https://t.co/RwqmCQgKeV 🧵
We’re hosting an invitation only gathering of researchers primarily in AI (but not limited to) over dinner in Indiranagar BLR!
Just thoughtful conversations over good food, about seminal papers, emerging fields, research of your group/ recent papers you’ve published 🫶
If you are interested/ doing research in:
- Coding Agents
- RL (Synthetic RL envs, Rl post training, Alignment, On policy distillation)
- World Models
- Neural Computers / Self adaptive neural systems
We would love to have you over dinner!
PS: Amazing venue, chill vibes, good food and most intellect people from BLR all together
Hosted by Me and @thesisofsarthak from @Basethesislabs <3!
Register through link below👇
https://t.co/QlshnK1jDt
We’re hosting an invitation only gathering of researchers primarily in AI (but not limited to) over dinner in Indiranagar BLR!
Just thoughtful conversations over good food, about seminal papers, emerging fields, research of your group/ recent papers you’ve published 🫶
If you are interested/ doing research in:
- Coding Agents
- RL (Synthetic RL envs, Rl post training, Alignment, On policy distillation)
- World Models
- Neural Computers / Self adaptive neural systems
We would love to have you over dinner!
PS: Amazing venue, chill vibes, good food and most intellect people from BLR all together
Hosted by Me and @thesisofsarthak from @Basethesislabs <3!
Register through link below👇
https://t.co/QlshnK1jDt