Jaydev

Verified account

@JaydevTonde

Learning LLM Inference, Building own LLM Inference server(tokn), Senior Data Scientist

Pune, Maharashtra India

Joined September 2023

984 Following

429 Followers

523 Posts

Pinned Tweet

3 months ago

I have been writing a small series on LLM inference with @vllm_project that can be a practical starting point for people trying to understand this space. Along with the explanations, I also ran benchmarks on realistic workloads across different GPUs and datasets to evaluate how these techniques perform in practice.

It covers:
- Major speculative decoding techniques
- Major quantization methods
- Distributed inference: DP / PP / TP
- Expert Parallelism and mixed parallel setups
- Practical optimization techniques like prefix caching, KV cache, and disaggregated prefill/decode

My goal was to explain how these techniques work and where they help, so it is easier to choose the right approach for a given workload. This series is useful not only for people getting into LLM serving, but also for engineers who are already serving LLMs and want to optimize inference, improve throughput, reduce latency, or evaluate the right serving strategy.


about 5 hours ago

Fine-tuning Donut after a long time, for document image classification. https://t.co/77Ei8sZBYq


JaydevTonde retweeted

about 13 hours ago

I get the hype around GLM 5.2 now. It’s really good. I’m trying to deploy it across our team for general coding use cases. Will share more details soon, along with suggestions on how to get it running for your own teams in the cheapest and most efficient way possible.


about 17 hours ago

@sakurayukiai Yes, I checked: we use all-to-all for MoE.


about 18 hours ago

DeepEP: a library released by DeepSeek that keeps the CPU out of MoE communication and does all the work on the GPU. We generally use NCCL primitives like all-reduce and all-gather for multi-GPU MoE inference. https://t.co/iUYs0SE1Fj
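To make the MoE communication pattern concrete, here is a minimal CPU-only sketch of an all-to-all dispatch and combine. It is an illustration of the pattern, not DeepEP's or NCCL's API: the function names, the top-1 routing, and the contiguous expert-to-rank layout are all assumptions made for this example.

```python
def all_to_all(send):
    """send[src][dst] is what rank src sends to rank dst; returns recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def moe_layer(tokens_per_rank, expert_of, experts, experts_per_rank):
    """Hypothetical top-1 MoE layer with experts sharded contiguously across ranks."""
    n = len(tokens_per_rank)

    # Dispatch: bucket every token by the rank that owns its routed expert.
    send = [[[] for _ in range(n)] for _ in range(n)]
    for src, tokens in enumerate(tokens_per_rank):
        for pos, tok in enumerate(tokens):
            expert_id = expert_of(tok)
            send[src][expert_id // experts_per_rank].append((pos, tok, expert_id))
    recv = all_to_all(send)

    # Each rank runs its local experts on the tokens it received.
    back = [
        [[(pos, experts[e](tok)) for pos, tok, e in recv[dst][src]] for src in range(n)]
        for dst in range(n)
    ]

    # Combine: a second all-to-all returns results to the rank that owns the token.
    returned = all_to_all(back)
    outputs = []
    for rank in range(n):
        results = dict(item for chunk in returned[rank] for item in chunk)
        outputs.append([results[pos] for pos in range(len(tokens_per_rank[rank]))])
    return outputs
```

In a real deployment the two `all_to_all` calls are the communication steps that run between GPUs; everything else here is ordinary per-rank compute.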


about 18 hours ago

Link: https://t.co/pqgyvHNdNk


1 day ago

Implementing distributed inference (tensor parallelism, TP) in tokn. This is how the matrix multiplications for the MLP layers work. Two major things to note:

1. We split W_gate and UP_proj along the column dimension, which is column parallel, and
2. We split Down_proj along the row dimension, which is row parallel.

After this we do an all-reduce so that both GPUs hold the same output.
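The split described above can be checked with a small pure-Python sketch. This is not tokn's code: the SiLU-gated MLP form, the list-of-rows matrices, and every function name are assumptions for illustration, and the all-reduce is simulated by summing each rank's partial output.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    return v / (1.0 + math.exp(-v))


def mlp(x, w_gate, w_up, w_down):
    """Gated MLP (assumed form): (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    hidden = [[silu(g) * u for g, u in zip(gr, ur)] for gr, ur in zip(gate, up)]
    return matmul(hidden, w_down)


def split_columns(w, world_size):
    """Column parallel: each rank keeps a slice of the output columns."""
    step = len(w[0]) // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]


def split_rows(w, world_size):
    """Row parallel: each rank keeps a slice of the input rows."""
    step = len(w) // world_size
    return [w[r * step:(r + 1) * step] for r in range(world_size)]


def all_reduce_sum(partials):
    """Stand-in for all-reduce: element-wise sum of every rank's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tp_mlp(x, w_gate, w_up, w_down, world_size=2):
    gates = split_columns(w_gate, world_size)
    ups = split_columns(w_up, world_size)
    downs = split_rows(w_down, world_size)
    # Each "GPU" runs the full MLP on its own shard, with no communication.
    partials = [mlp(x, gates[r], ups[r], downs[r]) for r in range(world_size)]
    # One all-reduce at the end gives every rank the complete output.
    return all_reduce_sum(partials)
```

The design point is that the activation is element-wise over the intermediate dimension, so column-sharding the first two matrices and row-sharding the last one needs only a single all-reduce per MLP.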


1 day ago

@vllm_project @Baidu_Inc Any benchmarking done for this?


2 days ago

CUDA Freeform Board https://t.co/wAFAzWXEkz

5 days ago

Started 1.41× slower than vLLM.

Added continuous batching -> still behind.
Added torch.compile -> somehow got worse.
Added CUDA graphs -> 452 vs 460 tok/s.
Nearly identical.

Kernel launch overhead is the real bottleneck at decode time, not the model. CUDA graphs fix that.

10 days ago

https://t.co/3uZh75Aojr


5 days ago

@mr_r0b0t Thanks, will check


6 days ago

vLLM does not support FlashInfer kernels by default for serving models quantized to Blackwell dtypes. We can use Marlin kernels as a fallback.

1. FP8: Selected CutlassFp8BlockScaledMMKernel for CompressedTensorsW8A8Fp8
2. MXFP4: Using MarlinMxFp4LinearKernel for MXFP4 GEMM
3. MXFP8: Using FlashInferCutlassMxfp8LinearKernel for MXFP8 GEMM
4. NVFP4: Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM


8 days ago

Qwen3-8B is taking 24 mins to quantize to NVFP4. I am using:

Framework: LLM Compressor
Calibration Dataset: Wikitext-2
Quantization Algorithm: AWQ


10 days ago

Based on my recent reading, it looks like we have full hardware support for MXFP8 and MXFP4, but software support, such as GPU kernels, is still lacking.


10 days ago

https://t.co/3uZh75Aojr


11 days ago

CUDA graphs are working now. I did a very minimal implementation with three major things:

1. Initialize buffers for input_ids, positions, etc.
2. Capture CUDA graphs for multiple batch sizes.
3. Condition the forward pass: for prefill use a simple pass and for decode use graph.replay().

Will upload code snippets with explanations soon.
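One detail behind capturing graphs for multiple batch sizes: a captured graph has fixed shapes, so a decode batch has to be padded up to one of the captured sizes before replay. Below is a small sketch of that bucket selection only. It is not tokn's code; the capture sizes, the pad id, and the function names are hypothetical, and the actual buffer copy and graph.replay() call are left out.

```python
import bisect

# Hypothetical batch sizes for which a graph was captured, kept sorted.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]


def pick_graph_size(batch_size, capture_sizes=CAPTURE_SIZES):
    """Return the smallest captured size that fits, or None to fall back to eager."""
    i = bisect.bisect_left(capture_sizes, batch_size)
    return capture_sizes[i] if i < len(capture_sizes) else None


def pad_batch(input_ids, positions, size, pad_id=0):
    """Pad the decode inputs to the captured size; padded rows are discarded later."""
    pad = size - len(input_ids)
    return input_ids + [pad_id] * pad, positions + [0] * pad


def plan_decode_step(input_ids, positions):
    """Decide how a decode step would run: replay a graph or use the eager path."""
    size = pick_graph_size(len(input_ids))
    if size is None:
        return "eager", input_ids, positions
    ids, pos = pad_batch(input_ids, positions, size)
    # In a real server: copy ids/pos into the static buffers, then graph.replay().
    return "graph", ids, pos
```

For example, a decode batch of 3 sequences would replay the graph captured for batch size 4 and keep only the first 3 outputs.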


11 days ago

If you want the cheapest GPU for small runs that require flash attention, go for the A30. It is the cheapest of them all, and since it is an Ampere GPU, it supports flash attention.


12 days ago

Explored what makes Blackwell quantization techniques different from AWQ, GPTQ, etc.

It features:
1. High precision scale encoding
2. Two-level micro-block scaling strategy
2.1 Using FP8 type scaling factor
2.2 Using FP32 type scaling factor

Also, the Blackwell GPU series has FP4 Tensor Cores, which is the reason we can’t use these types of quantization on A100 or H100.
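The two-level scaling idea can be sketched in plain Python. This is a simplified illustration, not a real kernel or library API: it assumes an NVFP4-style layout (16-element blocks, FP4 E2M1 values, one FP8-range scale per block, one FP32 scale per tensor), and it keeps the block scale as an ordinary float instead of rounding it to FP8.

```python
# Magnitudes representable by a 4-bit E2M1 float.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0


def round_to_fp4(v):
    """Round to the nearest E2M1 magnitude, keeping the sign."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(v)))
    return -mag if v < 0 else mag


def quantize(values, block_size=16):
    amax = max(abs(v) for v in values) or 1.0
    # Level 2: one per-tensor scale, chosen so every block scale fits the FP8 range.
    tensor_scale = amax / (FP4_MAX * FP8_E4M3_MAX)
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        block_amax = max(abs(v) for v in block)
        # Level 1: one scale per micro-block (hardware stores it as FP8; a float here).
        block_scale = (block_amax / FP4_MAX) / tensor_scale if block_amax else 1.0
        step = block_scale * tensor_scale
        blocks.append((block_scale, [round_to_fp4(v / step) for v in block]))
    return tensor_scale, blocks


def dequantize(tensor_scale, blocks):
    return [code * scale * tensor_scale for scale, codes in blocks for code in codes]
```

Because each block carries its own scale, a block of small values is not crushed to zero by a large outlier elsewhere in the tensor, which is the main benefit over a single per-tensor scale.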


13 days ago

My next few runs on NVIDIA RTX PRO 6000 Blackwell GPUs 😈


JaydevTonde retweeted

14 days ago

Your GPUs shouldn't get paid to sit idle. JarvisLabs Serverless is now in beta. Turn any open model into an OpenAI-compatible endpoint with a single command. A request comes in, a GPU spins up on its own. Traffic stops, it scales back to zero. You're billed for GPU time only while it's serving, never for idle GPUs. We currently support vLLM, SGLang and Ollama. Live in beta today.
