Vikranth Srivatsa @vikranth22446 - Twitter Profile

30 days ago

In today's LLM serving, a single system handles requests with different demands. We built a multi-tier-SLO LLM serving system that treats Tensor Parallelism as a runtime knob instead of a fixed setting. Result: 5.3x better SLO-compliant goodput than SoTA. https://t.co/6EwcA0cJYH

2

31

8

29

7K

vikranth22446 retweeted

Hao AI Lab

@haoailab

6 months ago

🔥 CAD: Efficient Long-context Language Model Training by Core Attention Disaggregation

Repo: https://t.co/QdNk8iXy6c
Blog: https://t.co/O5xRrl22UJ

Training a long-context LLM can suffer from severe workload imbalance caused by core attention, the softmax(QK^T)V part. Core attention disaggregation (CAD) fundamentally eliminates this workload imbalance by disaggregating core attention from the rest of the model.

4

251

49

195

90K

vikranth22446 retweeted

Hao AI Lab

@haoailab

7 months ago

🔥 New Blog: “Disaggregated Inference: 18 Months Later”

18 months in LLM inference feels like a new Moore’s Law cycle – but this time not just 2x per year:

💸 Serving cost ↓10–100x
🚀 Throughput ↑10x
⚡ Latency ↓5x

A big reason? Disaggregated Inference. From DistServe, our early research system on prefill-decode disaggregation, to today’s production frameworks, disaggregation has become the backbone of modern LLM serving.

So what is disaggregated inference? Why does the LLM inference community love it? And how far have we come? As the inventors of this technique, we take a look back – 18 months later – at how the idea reshaped the landscape and what comes next.

🔗 Read the full story: https://t.co/Kh7e6xq0Gx

7

173

48

141

40K

vikranth22446 retweeted

Yiying Zhang

@yiying__zhang

almost 2 years ago

WukLab's new study reveals CPU scheduling overhead can dominate LLM inference time, up to 50% in systems like vLLM! Scheduling overhead can no longer be ignored as model forwarding speeds increase and more scheduling tasks get added. #LLM #vLLM #SGLang Read https://t.co/6gVkdTZWkz https://t.co/eBRr3Btw1L

3

55

12

24

6K

vikranth22446 retweeted

Yiying Zhang

@yiying__zhang

about 2 years ago

Today, LLMs are constantly being augmented with tools, agents, models, RAG, etc. We built InferCept [ICML'24], the first serving framework designed for augmented LLMs. InferCept sustains a 1.6x-2x higher serving load than SOTA LLM serving systems. #AugLLM https://t.co/KvkRWAS7Z8

1

29

2

12

3K

vikranth22446 retweeted

Yiying Zhang

@yiying__zhang

about 2 years ago

LLM prompts are getting longer and increasingly shared with agents, tools, documents, etc. We introduce Preble, the first distributed LLM serving system targeting long and shared prompts. Preble reduces latency by 1.5-14.5x over SOTA serving systems. #LLM https://t.co/CNn3qIH7ui

2

25

5

8

4K
