Nitin Kedia

@nitinkedi

CS PhD Student at @UTAustin | ex @MSFTResearch @zetasuite @IITGuwahati | Systems for ML

Joined May 2023

74 Following

18 Followers

20 Posts

nitinkedi retweeted

Pratyush Kumar

@pratykumar

4 months ago

Drop 13/14: The 30B and 105B models, benchmarks, and HF links will all come. But today it is a drop about people. About how our team of just 15 folks gave it their all to do what many doubted was doable - i.e. train usefully large, globally competitive models from scratch in India. This team of 15 has now firmly launched @sarvam into its second innings. Yes, we can! @_mohit_singla @anand_404 @kediaharshit9 @AashaySachdeva @sumanthd17 @ArpitDwivedi100 @HarveenChadha @rkal4 @sushil_khyalia @ManavSinghal157 @sohampetkar Missing in the picture: @selfawareatom @AnnaUpreti Anand @MeghMakwan33973 Utkarsh

198

727

315

310K

nitinkedi retweeted

kwatra @kwatra

about 1 year ago

TokenWeave – Efficient Compute-Communication Overlap for Distributed LLM Inference. Why? Even with high-speed NVLink on H100 DGX, communication overhead for distributed LLM inference can be > 20%! Can we recover this overhead? (1/10) https://t.co/4D8bGNDGnz

nitinkedi retweeted

Vima Gupta @vima_gupta

over 1 year ago

1/7 🧵 MoEs: A tale of expectation vs reality

Marketing: "Only compute the expert parameters you need!"
Reality: Batch 16 requests → ALL experts activate
At serving time (vLLM/TGI), arithmetic intensity:
AI ≈ (num_tokens * top_k) / total_experts
In simpler terms: Your decode arithmetic intensity scales inversely with expert count 🤔

#MoE #LLMs #ChatGPT #Claude #vllm #AI #ML
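
A rough sketch of the arithmetic in the post above, not code from the thread. The function names are made up for illustration, and `expected_active_experts` assumes each token picks its `top_k` experts uniformly at random and independently of other tokens, which real routers only approximate.

```python
def arithmetic_intensity(num_tokens: int, top_k: int, total_experts: int) -> float:
    """The post's approximation: average number of tokens each expert processes."""
    return num_tokens * top_k / total_experts


def expected_active_experts(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Expected count of experts hit by at least one token.

    Assumption (not from the post): uniform, independent routing, so a given
    expert is skipped by one token with probability 1 - top_k / total_experts.
    """
    p_idle = (1 - top_k / total_experts) ** num_tokens
    return total_experts * (1 - p_idle)


if __name__ == "__main__":
    # Hypothetical configurations: batch of 16 decode tokens, top-2 routing.
    for total_experts in (8, 16, 64):
        ai = arithmetic_intensity(16, 2, total_experts)
        active = expected_active_experts(16, 2, total_experts)
        print(f"experts={total_experts:3d}  AI~{ai:5.2f}  expected active={active:6.2f}")
```

With the batch size and `top_k` held fixed, raising `total_experts` lowers the tokens-per-expert figure proportionally, which is the inverse scaling the post describes.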

nitinkedi retweeted

Amey Agrawal @agrawalamey12

over 1 year ago

@Google has silently but surely developed an edge over @OpenAI. Long context processing seems to be the key to Google's AI strategy. NotebookLM is a prime example of what long context processing can unlock. In our latest paper, we talk about how systems can be built to support multi-million context length matching Google's capabilities. In case you missed the paper, here is the NotebookLM generated podcast! Podcast: https://t.co/pnQfxV914S Arxiv: https://t.co/i75NEz7h5F

835

Nitin Kedia @nitinkedi

almost 2 years ago

Are you getting the performance you paid for from your LLM provider? Benchmark it using Metron. It is one of our biggest learnings from working on LLM inference for the last year at @MSFTResearch and @gtcomputing, when we shipped Chunked Prefill at OSDI'24 and Vidur at @MLSysConf.

Amey Agrawal @agrawalamey12

almost 2 years ago

🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 📊 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠💬 https://t.co/Q02Fj0IUKa #LLM #AI #Benchmark

Nitin Kedia @nitinkedi

almost 2 years ago

Excited to present Sarathi-Serve at OSDI'24. Learn how chunked prefills make your LLM chat buttery. @usenix @ChatGPTapp @vllm_project

Amey Agrawal @agrawalamey12

almost 2 years ago

Did you ever feel that @chatgpt is done generating your response and then suddenly a burst of tokens shows up? This happens when the serving system is prioritizing someone else's request before generating your response. But why? Well, to reduce cost. 🧵

123

nitinkedi retweeted

fly51fly @fly51fly

about 2 years ago

[LG] Vidur: A Large-Scale Simulation Framework For LLM Inference
https://t.co/c3dKpGLDue
- This paper presents Vidur, a high fidelity and easily extensible simulator for large language model (LLM) inference, along with a benchmark and search suite.
- Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates end-to-end inference performance for different workloads.
- It estimates metrics like latency, throughput, model FLOPs utilization, memory utilization, etc. with high accuracy.
- Vidur addresses challenges unique to simulating LLM inference like finer time granularity, varying iteration times, and cascading errors.
- It uses insights like architectural uniformity of LLMs, operator triaging, and automated profiling for parallelism strategies to achieve fidelity.
- Vidur-Search uses Vidur to automatically identify optimal cost-effective deployment configurations meeting performance constraints.

Nitin Kedia @nitinkedi

about 2 years ago

Made with 🩷 from @gtcomputing and @MSFTResearch India AI Infra Team. Folks behind: @agrawalamey12, @jayashree2912, Ashish, @nipunkw, Bhargav, @ramaramjee and @alsched.

Nitin Kedia @nitinkedi

about 2 years ago

We at @MSFTResearch and @GeorgiaTech believe that running LLMs shouldn't be so expensive 💵 So, we built a tool 🛠️ that helps you run them more cheaply. Introducing Vidur👳🏽, the first LLM Inference System simulator. #mlsys #vllm #llm #llama #gpt

541

Nitin Kedia @nitinkedi

about 2 years ago

Vidur is a tool 🛠️. Use it how you want! We need your contributions to add more devices (@AMD GPUs, anyone?) and more models and architectures (go MoE @mistalai). Code: https://t.co/64diEczBR2 (n/n)

nitinkedi retweeted

Amey Agrawal @agrawalamey12

about 2 years ago

1/ LLM inference systems are like high-performance engines ⚙️—complex, powerful, and full of intricate settings. Efficiently deploying them to maximize GPU performance is a challenge typically tackled by experts at orgs like @OpenAI and @AIatMeta 🚀. 🧵
