Senthilkumar Gopal

10 days ago

https://t.co/JqjsQ7BnQI

10 days ago

vLLM's PegaFlow and Dynamo's KVBM are converging on the same bet: external KV cache as a standalone Rust service over a connector boundary. The interesting design choice between them: does the inference engine own the prefix index, or does the storage layer?
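
To make the ownership question concrete, here is a minimal sketch of the two options. It is not how PegaFlow or KVBM are implemented; `KVStore`, `EngineIndex`, `block_hashes`, and the block size are all hypothetical names and values chosen for illustration. Blocks are identified by chained hashes, so a matching hash implies a matching prefix.

```python
import hashlib
from typing import Dict, List, Sequence, Set

BLOCK_SIZE = 4  # tokens per KV block; arbitrary value for illustration


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block chained to its predecessor's hash."""
    hashes: List[str] = []
    parent = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        text = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(text.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


class KVStore:
    """Hypothetical external store mapping block hash to KV bytes."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, block_hash: str, kv: bytes) -> None:
        self._blocks[block_hash] = kv

    def longest_prefix(self, hashes: Sequence[str]) -> int:
        """Storage-owned index: the store reports how many leading blocks it holds."""
        count = 0
        for h in hashes:
            if h not in self._blocks:
                break
            count += 1
        return count


class EngineIndex:
    """Engine-owned index: the engine remembers only what it offloaded itself."""

    def __init__(self, store: KVStore) -> None:
        self.store = store
        self.known: Set[str] = set()

    def offload(self, hashes: Sequence[str], kvs: Sequence[bytes]) -> None:
        for h, kv in zip(hashes, kvs):
            self.store.put(h, kv)
            self.known.add(h)

    def longest_prefix(self, hashes: Sequence[str]) -> int:
        """Answered locally, with no round trip to the store."""
        count = 0
        for h in hashes:
            if h not in self.known:
                break
            count += 1
        return count


if __name__ == "__main__":
    prompt_a = list(range(12))                    # three full blocks
    prompt_b = list(range(8)) + [99, 98, 97, 96]  # shares the first two blocks

    store = KVStore()
    engine_1, engine_2 = EngineIndex(store), EngineIndex(store)
    engine_1.offload(block_hashes(prompt_a), [b"kv"] * 3)

    print(engine_1.longest_prefix(block_hashes(prompt_b)))  # local lookup
    print(store.longest_prefix(block_hashes(prompt_b)))     # store lookup
    print(engine_2.longest_prefix(block_hashes(prompt_b)))  # second engine's local view
```

In this toy version the engine-owned index answers without a network hop but cannot see blocks written by another engine, while the storage-owned index sees everything at the cost of a lookup against the store.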

11 days ago

@AlicanKiraz0 I wonder how much more throughput you could push this to if you swapped it for Dynamo and disagg serving.

14 days ago

And this will work with NVIDIA Dynamo seamlessly 😊

vLLM

@vllm_project

15 days ago

A vLLM MoE deployment's DP/EP topology used to be locked in at launch: scaling or swapping config meant a full restart, with in-flight traffic dropped. Elastic Expert Parallelism changes that. One API call resizes a live deployment:

curl -X POST localhost:8000/scale_elastic_ep \
  -d '{"new_data_parallel_size": 16}'

Under the hood: standby comm groups span the target topology, EPLB redistributes experts across the new EP group, and weights are transferred directly between GPUs over NVIDIA NVLink/RDMA. The same runtime reconfiguration path is what fault-tolerant serving needs: evict failed ranks, redistribute their experts, bring replacements back, no restart.

Thanks to @NVIDIAAI, Sky Computing, @anyscalecompute, @RedHat_AI, and the community. 📖 https://t.co/bHmyFNZPEg
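
The curl call quoted in the post above can also be issued from a script. The sketch below only builds the request; the endpoint path, port, and payload key come from the post, while the helper name `build_scale_request`, the explicit `Content-Type` header, and the input check are assumptions added for illustration. The response format is not described in the post, so none is assumed here.

```python
import json
import urllib.request


def build_scale_request(
    new_dp_size: int, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) the POST request quoted in the post."""
    if new_dp_size < 1:
        # Assumption: a data-parallel size must be a positive integer.
        raise ValueError("new_data_parallel_size must be a positive integer")
    payload = json.dumps({"new_data_parallel_size": new_dp_size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/scale_elastic_ep",
        data=payload,
        # Assumption: the quoted curl command sets no Content-Type;
        # JSON is declared explicitly here.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_scale_request(16)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To send it against a running server (the timeout is an arbitrary choice):
    # with urllib.request.urlopen(req, timeout=600) as resp:
    #     print(resp.status, resp.read().decode("utf-8"))
```

Keeping construction separate from sending makes it easy to inspect the request before pointing it at a live deployment.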

18 days ago

@trq212 how often do you use AskUserQuestion 😊... Curious whether clarifying up front or after the fact works better for the spec plan

Thariq

@trq212

18 days ago

okay this is going kinda viral and tbh my original text was kind of messy, so here's a second pass with the help of Claude:

Implement <SPEC>. As you work, maintain a running implementation-notes.html file that captures anything I should know about how the implementation diverges from or interprets the spec, including:
- Design decisions: choices you made where the spec was ambiguous
- Deviations: places where you intentionally departed from the spec, and why
- Tradeoffs: alternatives you considered and why you picked what you did
- Open questions: anything you'd want me to confirm or revise

18 days ago

You still need to do this within the original world size 😊, so arbitrary grow/shrink might still be pending. But it's a great API for fault tolerance 😉

Matej Sirovatka

@m_sirovatka

19 days ago

TIL that NCCL has ncclCommGrow/Shrink?? How did it take me so long to find this??

sengopal retweeted

Richard Sutton

@RichardSSutton

18 days ago

The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.

18 days ago

When I run empirical experiments, I ask CC to review the decisions made, track evidence-supported analysis, give itself a reward, and cite the particular technique for future analysis where relevant 😊

18 days ago

I have a variation of this for analyzing papers: write up the intuition gained, the thinking behind why the method works or does not work for <problem>, the evidence to support it, and the exact citation from the paper that backs the decision.

Thariq

@trq212

18 days ago

a prompt I've been using a lot recently: implement <SPEC> and while you do, keep a running implementation-notes.html file (or markdown) with decisions you had to make that weren't in the spec, things you had to change, tradeoffs you had to make, or anything else I should know

sengopal retweeted

22 days ago

Interestingly, long context attention actually enables more opportunities to stream weights into HBM, decreasing the memory requirements for weights stored in HBM at any given time, which pairs well with managing the larger KV cache. Check out our work on this:

25 days ago

Was waiting for something like this to show up. It's high time 😊

Claude

@claudeai

25 days ago

New in Claude Code: agent view. One list of all your sessions, available today as a research preview.

sengopal retweeted

27 days ago

@himanshustwts Another reason to use Dynamo 😊 BTW the blog cited is a follow-up to this blog: https://t.co/8oKtKZPyWX

30 days ago

Does this translate the same way to the Unsloth LoRA and QLoRA recipes as well?

Unsloth AI

@UnslothAI

about 1 month ago

We collaborated with NVIDIA to teach you how we made LLM training ~25% faster! 🚀

Learn how 3 optimizations help your home GPU train models faster:
1. Packed-sequence metadata caching
2. Double-buffered checkpoint reloads
3. Faster MoE routing

Guide: https://t.co/nwvVfNC8XE

30 days ago

♥️ that Dynamo is the fulcrum around which agents run their whole session. Incredible!

about 1 month ago

Check out this awesome article by some incredible NVIDIA peers about workload differences for agents! https://t.co/JxXvHWku7A

about 1 month ago

Let's go!!

about 1 month ago

Some awesome work by the SGLang and NVIDIA teams to drive GB200 performance forward!

sengopal retweeted

Aran Komatsuzaki

@arankomatsuzaki

about 1 month ago

This feels like confusing a serving-runtime problem for a chip-startup opportunity. Agents do change inference patterns: loops, tool calls, branching, long context, KV reuse, burstiness. But most of that is an inference systems problem: scheduling, routing, KV-cache management, etc. Think Dynamo. By the time a new chip co tapes out + builds a compiler stack + wins cloud distribution, NVIDIA/AMD will likely have baked the obvious hardware-level optimizations into existing platforms.
