kwatra @kwatra - Twitter Profile

kwatra retweeted

5 months ago

Long-context inference is hitting a wall. 🛑 As context grows, Attention becomes the villain. Why?
• Decode: Attention scales linearly (O(N)), while the rest of the model stays constant (O(1)).
• Prefill: Attention explodes quadratically (O(N²)).
Can we do better? (1/9) https://t.co/bgqnKqYUG8

1

4

1

0

574
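The scaling claim in the long-context thread above can be illustrated with a rough operation count. This is a minimal sketch, not the thread authors' analysis: it assumes plain dense attention for a single head, ignores causal masking and softmax cost, and uses a hypothetical single dense layer as a stand-in for "the rest of the model". All function names are made up for illustration.

```python
def decode_attention_flops(context_len: int, head_dim: int) -> int:
    """Approximate FLOPs for one decode step of one attention head.

    One new query is scored against context_len cached keys (2*N*d FLOPs)
    and the resulting weights are applied to context_len values (2*N*d).
    """
    return 4 * context_len * head_dim


def prefill_attention_flops(prompt_len: int, head_dim: int) -> int:
    """Approximate FLOPs for prefill of one attention head.

    Every one of the N prompt tokens attends to N keys and values
    (causal masking is ignored here, so this is an upper bound).
    """
    return 4 * prompt_len * prompt_len * head_dim


def dense_layer_flops_per_token(hidden_dim: int) -> int:
    """Hypothetical stand-in for non-attention work: one hidden x hidden matmul.

    The cost per token does not depend on how long the context is.
    """
    return 2 * hidden_dim * hidden_dim


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(
            n,
            decode_attention_flops(n, head_dim=128),
            prefill_attention_flops(n, head_dim=128),
            dense_layer_flops_per_token(hidden_dim=4096),
        )
```

Doubling the context doubles the per-step decode cost and quadruples the prefill cost, while the dense-layer cost per token is unchanged.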

kwatra retweeted

Ram Ramjee @ramaramjee

11 months ago

Evaluation of LLM serving systems is tricky because several factors influence performance (prefill length, decode length, parallelization) and there are multiple metrics we care about (throughput, ttft, tpot/tbt). We identify common pitfalls and a checklist to avoid them.

0

7

2

0

515
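The metrics named in the tweet above (ttft, tpot, tbt) can be computed from per-token timestamps. The sketch below uses commonly seen definitions, which are an assumption here and may differ from the ones in the linked checklist: TTFT is the delay from request arrival to the first output token, TBT is the gap between consecutive output tokens, and TPOT is the average time per output token after the first. Timestamps are integer milliseconds and the function names are hypothetical.

```python
from typing import List


def ttft_ms(arrival_ms: int, token_times_ms: List[int]) -> int:
    """Time to first token: first output timestamp minus request arrival."""
    return token_times_ms[0] - arrival_ms


def tbt_ms(token_times_ms: List[int]) -> List[int]:
    """Time between tokens: gap between each pair of consecutive outputs."""
    return [later - earlier
            for earlier, later in zip(token_times_ms, token_times_ms[1:])]


def tpot_ms(token_times_ms: List[int]) -> float:
    """Time per output token: mean gap over all tokens after the first."""
    if len(token_times_ms) < 2:
        return 0.0
    span = token_times_ms[-1] - token_times_ms[0]
    return span / (len(token_times_ms) - 1)


if __name__ == "__main__":
    # Hypothetical request: arrives at t=0, emits four tokens, stalls before the last.
    times = [500, 600, 700, 1500]
    print(ttft_ms(0, times), tbt_ms(times), tpot_ms(times))
```

In the hypothetical trace, the average (TPOT) smooths over the 800 ms stall that the per-gap view (TBT) exposes, which is one reason a single metric is rarely enough.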

kwatra retweeted

Amey Agrawal @agrawalamey12

11 months ago

The bitter lesson of AI infra: The hardest part about building faster LLM inference systems is not designing the systems, but rather it is evaluating if the system is actually faster! 🤔 This graph from a recent top systems venue paper about long-context serving shows average normalized input token latency for a trace with both short and 100K+ token requests. System X looks like a clear win: lower normalized latency and higher request rates. But normalized metrics can obscure the actual user experience: at those rates, long inputs see >2hr delays to the first token! Let’s do the math!🧮


1

23

10

5

2K
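How an averaged, normalized latency can hide a multi-hour wait, as the tweet above argues, can be shown with a small worked example. The numbers below are invented for illustration and are not taken from the paper or the thread: a trace of 99 short requests (1,000 input tokens, 1 s to first token) and one long request (100,000 input tokens, 2 h to first token). "Normalized latency" is assumed here to mean latency divided by input tokens.

```python
from typing import List, Tuple


def normalized_latency_ms(latency_ms: int, input_tokens: int) -> float:
    """Latency per input token, in milliseconds (assumed definition)."""
    return latency_ms / input_tokens


def mean_normalized_latency_ms(requests: List[Tuple[int, int]]) -> float:
    """Average of per-request normalized latencies.

    Each request is (latency_ms, input_tokens).
    """
    values = [normalized_latency_ms(lat, tok) for lat, tok in requests]
    return sum(values) / len(values)


if __name__ == "__main__":
    two_hours_ms = 2 * 60 * 60 * 1000           # 7,200,000 ms
    short = (1_000, 1_000)                       # hypothetical: 1 s, 1K tokens
    long_request = (two_hours_ms, 100_000)       # hypothetical: 2 h, 100K tokens
    trace = [short] * 99 + [long_request]
    print(normalized_latency_ms(*long_request))  # per-token latency of the long request
    print(mean_normalized_latency_ms(trace))     # average across the whole trace
```

Under these made-up numbers the long request shows 72 ms per token and the trace average is under 2 ms per token, even though one user waited two hours for a first token.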

kwatra retweeted

Raja @_raja_gond

12 months ago

We have released the source code and benchmarks of TokenWeave. TokenWeave speeds up distributed LLM inference via compute–communication overlap and fused AllReduce, RMSNorm, and residual addition. Code: https://t.co/jJtRirY0IC Paper: https://t.co/n0SDVdjyQn Try it out!

1

6

2

3

3K


kwatra @kwatra

about 1 year ago

Dive into the details → https://t.co/tNJCVcsJTz. Code on the way—stay tuned! (10/10)

0

4

0

142

kwatra @kwatra

about 1 year ago

TokenWeave – Efficient Compute-Communication Overlap for Distributed LLM Inference. Why? Even with high-speed NVLink on H100 DGX, communication overhead for distributed LLM inference can be > 20%! Can we recover this overhead? (1/10) https://t.co/4D8bGNDGnz

1

18

6

4

1K

kwatra @kwatra

about 1 year ago

Small token batches enable chunked-prefill schedulers (e.g., Sarathi). (9/10)

1

3

0

144

kwatra retweeted

Abhinav Dutta @abhinavdutta555

almost 2 years ago

🚨 Are LLM compression methods (𝘲𝘶𝘢𝘯𝘵𝘪𝘻𝘢𝘵𝘪𝘰𝘯, 𝘱𝘳𝘶𝘯𝘪𝘯𝘨, 𝘦𝘢𝘳𝘭𝘺 𝘦𝘹𝘪𝘵) too good to be true and are existing eval metrics sufficient? We've looked into it in our latest research at @MSFTResearch 🧵 (1/n) https://t.co/aW6cGMvTPv

2

20

7

16

5K

kwatra retweeted

main @main_horse

almost 2 years ago

[MSFT] Accuracy is Not All You Need https://t.co/2atYvOW9cc
In comparing quantized/pruned/sparsified vs 16-bit models:
* observes drastic flipping in correct<->wrong answer pairs, even with otherwise good accuracy
* proposes replacing eval accuracy w/ either KL-Divergence or flips
* explains this phenomenon as a consequence of the difference in eval Top Margin for correct vs wrong answers

8

165

20

101

14K
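The point in the summary above, that accuracy can stay flat while many individual answers flip, can be illustrated with a toy computation. This is a minimal sketch under assumptions: "flips" is taken to mean the fraction of questions whose correctness differs between the baseline and the compressed model, the KL divergence is the standard discrete form, and the function names and sample data are hypothetical rather than taken from the paper.

```python
import math
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def flip_rate(baseline: Sequence[bool], compressed: Sequence[bool]) -> float:
    """Fraction of questions whose correctness changed in either direction."""
    if len(baseline) != len(compressed):
        raise ValueError("both models must be scored on the same questions")
    flips = sum(b != c for b, c in zip(baseline, compressed))
    return flips / len(baseline)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical per-question correctness for two models on four questions.
    baseline = [True, True, False, False]
    compressed = [True, False, True, False]
    print(accuracy(baseline), accuracy(compressed), flip_rate(baseline, compressed))
```

In the toy data both models score the same accuracy, yet half of the answers changed, which is the kind of difference an accuracy-only comparison would miss.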

kwatra retweeted

Amey Agrawal @agrawalamey12

almost 2 years ago

🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 📊 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠💬 https://t.co/Q02Fj0IUKa #LLM #AI #Benchmark

2

34

15

13

7K

kwatra retweeted

Amey Agrawal @agrawalamey12

almost 2 years ago

Did you ever feel that @chatgpt is done generating your response and then suddenly a burst of tokens shows up? This happens when the serving system is prioritizing someone else’s request before generating your response. But why? Well, to reduce cost. 🧵

1

26

5

12

7K

kwatra retweeted

Amey Agrawal @agrawalamey12

almost 3 years ago

Ever wondered why @OpenAI charges 2x the price for output tokens compared to input? Turns out that an output token can take up to 200x more compute time than an input token. Why? We explored this phenomenon during my internship at @MSFTResearch. 🧵

7

377

46

321

78K

kwatra @kwatra

over 17 years ago

something

0

2

0
