Sid @sidgraph - Twitter Profile

really well written blog on the emerging shift from optimization-limited to rollout-limited frontier RL. what stood out to me was the observation that as reasoning trajectories become longer and more heterogeneous GPU utilization and learner-generator synchronization increasingly dominate training efficiency. one underexplored question is whether capability jumps themselves can be detected through staleness statistics. if old trajectories suddenly become much less useful for learning that may indicate the model has discovered a qualitatively new reasoning strategy. in that sense replay-buffer value decay could become an observable signal of phase transitions in capability acquisition. if you're into async-RL/large-scale training systems this is a really worthwhile read.

2

297

27

346

43K

sidgraph retweeted

Radhika Sharma @Radhika00586574

10 days ago

The open-source protein ML space just got a massive upgrade. Phenomenal work by @anindyadeeps and @try_litefold on dropping the biggest protein data collection on Hugging Face

1

10

5

3

2K

Who to follow

anshuman

@athleticKoder

maximizing shareholder value; mostly ml here;

Professor in CS@IIT Delhi. Interested in Machine learning for graphs. Hobbies: Cricket, fitness, Traveling.

Sid

@sidgraph

18 days ago

Check this out 👇

Shubham Sharma

@HappyyPablo

18 days ago

open sourcing Marlin-2B 🐟 a tiny VLM to extract structured information from videos Marlin is finetuned for two questions devs want to ask in their videos: what is happening, and when? Best open model in its weight class, competitive with Gemini-2.5-flash at only 2B params 🧵

135

5K

523

5K

305K

0

6

0

1

440

Sid

@sidgraph

18 days ago

Damn — its soo over for open ai 🫡

Andrej Karpathy

@karpathy

18 days ago

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

8K

150K

11K

14K

27M

0

2

0

256

Sid

@sidgraph

19 days ago

@nileshtrivedi yes i mean it was very obvious because this exists; but amazing work anyways :) https://t.co/269gan7RtS

0

1

23

sidgraph retweeted

Richard Sutton

@RichardSSutton

19 days ago

The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.

136

7K

977

3K

572K

Sid

@sidgraph

23 days ago

@Kautukkundan 🔥🔥🚀🚀

0

142

sidgraph retweeted

Joykirat

@joykiratsingh

24 days ago

🚨Excited to announce Agent-BRACE! LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL. Key takeaways: 1️⃣ 🎯The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history. 2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length. 3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic belief decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode. 👇🧵

joykiratsingh's tweet photo. 🚨Excited to announce Agent-BRACE!

LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL.

Key takeaways:

1️⃣ 🎯The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history.

2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length.

3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic belief decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode.

👇🧵

2

67

39

23

16K

Sid

@sidgraph

25 days ago

@thinkymachines released interaction models! ☄️ ➡️ decoder-only transformer running over a single interleaved (in_k, out_k) stream of 200ms micro-turns across audio, video, and text. Encoder-free early fusion: dMel intensity-binned mel filterbanks + light embed for audio, 40×40 patches through an hMLP stem for video, token embed for text; all collapsed into a bag of embeddings per 200ms chunk and co-trained from scratch. ➡️ Audio out is a flow-matching head over mel frames. No Whisper, no separate TTS, no VAD, no turn boundaries, no dialog manager. 276B/12B-active interaction model stays in real-time; an async background model handles tool use and long-horizon reasoning over shared context, with partial results merged back into the live stream at moments appropriate to user state. KAME and MoshiRAG did this pattern for retrieval; TML generalizes it. Streaming sessions (upstreamed to SGLang) ship each 200ms chunk as a separate request and append to a persistent in-GPU KV sequence, eliminating per-turn realloc and metadata recompute. MoE kernels use gather+gemv instead of grouped GEMM for decode shapes. Bitwise trainer-sampler alignment via batch-invariant kernels from their Sep 2025 work: NVLS deterministic all-reduce on Blackwell, Split-KV attention with consistent accumulation order between prefill and decode (SM-aligned 4096-token left splits), under 5% e2e overhead. Moshi hits 200ms but is audio-only. Qwen3-Omni is ~234ms but uses a separate AuT encoder + sliding-window DiT, not encoder-free. AURA does visual proactivity but wraps ASR/TTS around a VideoLLM (text-out, half-duplex). gpt-realtime-2 and Gemini Live still rely on VAD turn detection.

sidgraph's tweet photo. @thinkymachines released interaction models! ☄️

➡️ decoder-only transformer running over a single interleaved (in_k, out_k) stream of 200ms micro-turns across audio, video, and text.

Encoder-free early fusion: dMel intensity-binned mel filterbanks + light embed for audio, 40×40 patches through an hMLP stem for video, token embed for text; all collapsed into a bag of embeddings per 200ms chunk and co-trained from scratch.

➡️ Audio out is a flow-matching head over mel frames. No Whisper, no separate TTS, no VAD, no turn boundaries, no dialog manager.
276B/12B-active interaction model stays in real-time; an async background model handles tool use and long-horizon reasoning over shared context, with partial results merged back into the live stream at moments appropriate to user state.

KAME and MoshiRAG did this pattern for retrieval; TML generalizes it.
Streaming sessions (upstreamed to SGLang) ship each 200ms chunk as a separate request and append to a persistent in-GPU KV sequence, eliminating per-turn realloc and metadata recompute.

MoE kernels use gather+gemv instead of grouped GEMM for decode shapes. Bitwise trainer-sampler alignment via batch-invariant kernels from their Sep 2025 work: NVLS deterministic all-reduce on Blackwell, Split-KV attention with consistent accumulation order between prefill and decode (SM-aligned 4096-token left splits), under 5% e2e overhead.

Moshi hits 200ms but is audio-only. Qwen3-Omni is ~234ms but uses a separate AuT encoder + sliding-window DiT, not encoder-free. AURA does visual proactivity but wraps ASR/TTS around a VideoLLM (text-out, half-duplex). gpt-realtime-2 and Gemini Live still rely on VAD turn detection.

Thinking Machines

@thinkymachines

26 days ago

The technical report includes our motivation, early evaluation results, and technical approach. https://t.co/AFJZ5kH7Ku

11

388

21

151

93K

0

3

0

3

296

sidgraph retweeted

Sid

@sidgraph

27 days ago

Amazing talk by @DrJimFan -- exciting times for Physical AGI! TL;DR VLA architecture is parameter-misallocated toward language and should be replaced by World Action Models → pretrained video diffusion models that jointly predict future world states and robot actions, instantiated by Dream Zero (a 14B model running real-time control at 7Hz with 2× generalization gains over VLAs, @SeonghyeonYe ). https://t.co/c9GEg7JAjw His central data claim is that egocentric human video is the FSD-equivalent ambient data flywheel for robotics, and EgoScale (@ruijie_zheng12) demonstrates a near-perfect log-linear scaling law (ℒ(N) = a − b log N, R² = 0.9983) between 1K and 20K hours of pretraining data and downstream dexterity performance. https://t.co/79gbH6tV0X His central environment claim is that classical physics simulators will be replaced by neural simulators, and Dream Dojo (@ShenyuanGao) demonstrates this with 44K hours of human video pretraining, 10 FPS real-time interaction, and Pearson r = 0.995 policy-evaluation fidelity. https://t.co/F1XigZg2fE Significant gaps in the talk: - it does not address runtime semantics (skill installation, behavior consistency, run-update separation) - it does not address the model-exploitation failure mode of training policies against learned simulators or learned rewards. My running notes w/ Opus 4.7 👇 https://t.co/lAnuorsUC6 @NVIDIAAI

1

12

2

13

1K

Sid

@sidgraph

27 days ago

Notes on Robotics' End Game: Nvidia's Jim Fan! - @sidgraph and Opus 4.7 https://t.co/LT7eFh29Lo TL;DR Vision-Language-Action architecture is parameter-misallocated toward language and should be replaced by World Action Models → pretrained video diffusion models that jointly predict future world states and robot actions, instantiated by Dream Zero (a 14B model running real-time control at 7Hz with 2× generalization gains over VLAs). His central data claim is that egocentric human video is the FSD-equivalent ambient data flywheel for robotics, and EgoScale demonstrates a near-perfect log-linear scaling law (ℒ(N) = a − b log N, R² = 0.9983) between 1K and 20K hours of pretraining data and downstream dexterity performance. His central environment claim is that classical physics simulators will be replaced by neural simulators, and Dream Dojo demonstrates this with 44K hours of human video pretraining, 10 FPS real-time interaction, and Pearson r = 0.995 policy-evaluation fidelity. The framework is strongest on data (genuinely empirically grounded), partial on architecture (Dream Zero shows the substrate works but is GPT-2-stage, not GPT-3-stage), and weakest on the predictive claims about timeline and the "VLA is dead" rhetorical framing. Three significant gaps in the talk: it does not address runtime semantics (skill installation, behavior consistency, run-update separation), it does not address safety, and it does not address the model-exploitation failure mode of training policies against learned simulators or learned rewards. https://t.co/lAnuorsUC6

sidgraph's tweet photo. Notes on Robotics' End Game: Nvidia's Jim Fan!
- @sidgraph and Opus 4.7

https://t.co/LT7eFh29Lo

TL;DR
Vision-Language-Action architecture is parameter-misallocated toward language and should be replaced by World Action Models
→ pretrained video diffusion models that jointly predict future world states and robot actions, instantiated by Dream Zero (a 14B model running real-time control at 7Hz with 2× generalization gains over VLAs).

His central data claim is that egocentric human video is the FSD-equivalent ambient data flywheel for robotics, and EgoScale demonstrates a near-perfect log-linear scaling law (ℒ(N) = a − b log N, R² = 0.9983) between 1K and 20K hours of pretraining data and downstream dexterity performance.

His central environment claim is that classical physics simulators will be replaced by neural simulators, and Dream Dojo demonstrates this with 44K hours of human video pretraining, 10 FPS real-time interaction, and Pearson r = 0.995 policy-evaluation fidelity.

The framework is strongest on data (genuinely empirically grounded), partial on architecture (Dream Zero shows the substrate works but is GPT-2-stage, not GPT-3-stage), and weakest on the predictive claims about timeline and the "VLA is dead" rhetorical framing.

Three significant gaps in the talk: it does not address runtime semantics (skill installation, behavior consistency, run-update separation), it does not address safety, and it does not address the model-exploitation failure mode of training policies against learned simulators or learned rewards.

https://t.co/lAnuorsUC6

0

13

0

6

408

Sid

@sidgraph

27 days ago

https://t.co/fut0b1xdwn

0

1

99

Sid

@sidgraph

27 days ago

Amazing talk by @DrJimFan -- exciting times for Physical AGI! TL;DR VLA architecture is parameter-misallocated toward language and should be replaced by World Action Models → pretrained video diffusion models that jointly predict future world states and robot actions, instantiated by Dream Zero (a 14B model running real-time control at 7Hz with 2× generalization gains over VLAs, @SeonghyeonYe ). https://t.co/c9GEg7JAjw His central data claim is that egocentric human video is the FSD-equivalent ambient data flywheel for robotics, and EgoScale (@ruijie_zheng12) demonstrates a near-perfect log-linear scaling law (ℒ(N) = a − b log N, R² = 0.9983) between 1K and 20K hours of pretraining data and downstream dexterity performance. https://t.co/79gbH6tV0X His central environment claim is that classical physics simulators will be replaced by neural simulators, and Dream Dojo (@ShenyuanGao) demonstrates this with 44K hours of human video pretraining, 10 FPS real-time interaction, and Pearson r = 0.995 policy-evaluation fidelity. https://t.co/F1XigZg2fE Significant gaps in the talk: - it does not address runtime semantics (skill installation, behavior consistency, run-update separation) - it does not address the model-exploitation failure mode of training policies against learned simulators or learned rewards. My running notes w/ Opus 4.7 👇 https://t.co/lAnuorsUC6 @NVIDIAAI

Jim Fan

@DrJimFan

29 days ago

I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to my last year's Sequoia AI Ascent talk, "Physical Turing Test". I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy homework ;) And stay till the end, more easter eggs and predictions for your polymarket! 00:30 DGX-1 origin story at OpenAI, I was there in 2016 signing with Jensen and Elon. Heading to the Computer History Museum! 01:42 The Great Parallel 03:31 Robotics, the Endgame 03:39 Why VLAs fall short 04:32 Video world models as the 2nd pretraining paradigm 06:09 World Action Models (WAM) 07:46 Strategies for robot data collection and the FSD equivalent to physical data flywheel for robot manipulation 11:06 EgoScale and the Dexterity Scaling Law we discovered recently 14:00 Physical RL: bridging the last mile 15:39 DreamDojo: an end-to-end neural physics engine for scaling RL in silico 17:00 Civilizational Technology Tree and my predictions for the near future. Spoiler: it's closer than you think. Thanks to my friends at Sequoia for inviting me back to AI Ascent this year! I had a blast! Last year's talk is attached in the thread if you missed it.

160

3K

545

4K

559K

1

12

2

13

1K

Sid

@sidgraph

28 days ago

@yacineMTB check out SWE-WebDevBench🥵 https://t.co/6ptEG9AgL5

Sid

@sidgraph

29 days ago

Vibe Coding is not vibing? Agents perform <60% on SWE-WebDevBench 👀. Code-level benchmarks (HumanEval, SWE-bench, FeatBench) take the specification as given and grade the patch. Vibe coding inverts this: the user gives natural-language intent, the platform must do PM, engineering, and ops in one pipeline. We evaluate across three orthogonal axes: ➡️ Interaction Mode (ACR=App Creation Request, vs AMR=App Modification Request), ➡️Agency Angle (PM × Engineering × Ops), ➡️Complexity Tier (T4 multi-role SaaS, T5 AI-native). Introducing SWE-WebDevBench: a comprehensive eval framework to assess AI coding platforms as virtual software development agencies, covering not just the middle step of coding, but the entire software lifecycle: Requirements gathering, planning, deployment and change management. @QwikBuild, @emergentlabs, @Lovable, @v0 see https://t.co/RwqmCQgKeV 🧵

sidgraph's tweet photo. Vibe Coding is not vibing? Agents perform <60% on SWE-WebDevBench 👀.

Code-level benchmarks (HumanEval, SWE-bench, FeatBench) take the specification as given and grade the patch. Vibe coding inverts this: the user gives natural-language intent, the platform must do PM, engineering, and ops in one pipeline.

We evaluate across three orthogonal axes:
➡️ Interaction Mode (ACR=App Creation Request, vs AMR=App Modification Request),
➡️Agency Angle (PM × Engineering × Ops),
➡️Complexity Tier (T4 multi-role SaaS, T5 AI-native).

Introducing SWE-WebDevBench: a comprehensive eval framework to assess AI coding platforms as virtual software development agencies, covering not just the middle step of coding, but the entire software lifecycle: Requirements gathering, planning, deployment and change management.

@QwikBuild, @emergentlabs, @Lovable, @v0
see https://t.co/RwqmCQgKeV 🧵

1

9

3

2

3K

0

1

0

978

Sid

@sidgraph

28 days ago

@GauravSarkar99 No worries, see you next time :)

0

2

0

38

Sid

@sidgraph

28 days ago

Annnddd its a house full 😄🥳 its soo fun to bring best researchers of blr together — conversations are sickk cool

Sid

@sidgraph

about 1 month ago

We’re hosting an invitation only gathering of researchers primarily in AI (but not limited to) over dinner in Indiranagar BLR! Just thoughtful conversations over good food, about seminal papers, emerging fields, research of your group/ recent papers you’ve published 🫶 If you are interested/ doing research in: - Coding Agents - RL (Synthetic RL envs, Rl post training, Alignment, On policy distillation) - World Models - Neural Computers / Self adaptive neural systems We would love to have you over dinner! PS: Amazing venue, chill vibes, good food and most intellect people from BLR all together Hosted by Me and @thesisofsarthak from @Basethesislabs <3! Register through link below👇 https://t.co/QlshnK1jDt

7

60

3

23

18K

5

47

1

7

3K

Sid

@sidgraph

28 days ago

@freshlimesofa yes :)

0

1

0

55

Sid

@sidgraph

28 days ago

Amazing weather to discuss science and computing 🫶

Sid

@sidgraph

about 1 month ago

We’re hosting an invitation only gathering of researchers primarily in AI (but not limited to) over dinner in Indiranagar BLR! Just thoughtful conversations over good food, about seminal papers, emerging fields, research of your group/ recent papers you’ve published 🫶 If you are interested/ doing research in: - Coding Agents - RL (Synthetic RL envs, Rl post training, Alignment, On policy distillation) - World Models - Neural Computers / Self adaptive neural systems We would love to have you over dinner! PS: Amazing venue, chill vibes, good food and most intellect people from BLR all together Hosted by Me and @thesisofsarthak from @Basethesislabs <3! Register through link below👇 https://t.co/QlshnK1jDt

7

60

3

23

18K

4

110

1

15

8K

Sid

@sidgraph

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users