shubham @bluequbit - Twitter Profile

I have been using browser harness by @browseruse for the last 2 months and slowly but surely it is getting better at my workflows. I really like the pattern of having domain skills so that next time the harness is used for the same or a similar task, it can be done faster and with fewer tokens. I have heavily customised the harness with custom scripts. The best use case I found is collab-use. I made this since I like using Colab for training and fine-tuning models for free on the TPUs. But waiting on the platform is a pain as they don't have any API. The Colab skills started with a 300-step browser-harness workflow and now I have reduced it down to 70 steps for reliably using Colab for any general task. Thanks for building this self-healing harness!! @mamagnus00 @sauravpanda

1

2

0

3

703

shubham

@bluequbit

6 days ago

@ClementDelangue You don't know how big of a problem you have solved for me! Thanks for this!

0

2

0

392

shubham

@bluequbit

6 days ago

@SSGalagali Keep going. Your future self will thank you

1

2

0

41

shubham

@bluequbit

7 days ago

@bycloudai @ClementDelangue Definitely. I am trying to adapt this methodology for increasing RL rollouts with individual NN blocks to see if it makes a difference. A lot of things need to change though, right from GRPO to infra and model alignment.

0

1

0

477

shubham

@bluequbit

7 days ago

Recently when I was trying to increase the rollout throughput of my RL fine-tuning pipeline, I noticed that the GPU stayed idle for long periods of time instead of actually serving the LLM.

After profiling, I realised there are several cold-start issues when you try to serve a model on vLLM (inference engine).

The two largest contributors to cold start are vLLM inspect and torch.compile.

vLLM inspect: This inference engine supports a lot of different architectures and models: dense, MoE, multi-modal, speculative decoding, BF16, FP8, etc.

In order to create a reasonable execution plan it has to inspect all the layers of the model it is running: layer shapes, attention heads, rotary embeddings, hidden dimensions, KV structure.

vLLM precomputes KV cache sizes, block allocation strategy, paged attention metadata, and batching scheduler limits.

For large models like gpt-oss-120b this becomes substantial.

Next is torch.compile: PyTorch compiles model graphs into optimized kernels. Most of the time it is pretty hard to beat these kernels on performance (although if you are good at GPU programming, you can beat them).

But in order to generate these optimised kernels, torch takes substantial time as it has to observe tensor shapes, control flow, and operator patterns to generate stable graphs. These graphs are then used by the compiler to fuse matmuls, layernorm, activations and attention ops into fewer kernels. This is obviously expensive.

My next goal is to reduce this time as much as possible, perhaps with techniques like CUDA checkpointing and snapshots.

Will update with progress.
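To make the "precomputes KV cache sizes, block allocation strategy" step more concrete, here is a rough sketch of the arithmetic involved for a standard attention layout. It is not vLLM's actual code: the `AttentionShape` class, both helper functions, the model dimensions, the 20 GiB of free memory and the 16-token block size are all assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical model dimensions, not taken from any real checkpoint."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int  # 2 for BF16/FP16, 1 for FP8


def kv_bytes_per_token(shape: AttentionShape) -> int:
    # Each layer stores one key vector and one value vector per KV head,
    # hence the leading factor of 2.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.dtype_bytes)


def num_kv_blocks(shape: AttentionShape, free_bytes: int,
                  block_size_tokens: int = 16) -> int:
    # Paged attention hands out the cache in fixed-size blocks of tokens.
    # block_size_tokens=16 is an assumed value for this sketch.
    bytes_per_block = kv_bytes_per_token(shape) * block_size_tokens
    return free_bytes // bytes_per_block


# Assumed example: 32 layers, 8 KV heads of dimension 128, BF16 cache,
# and 20 GiB of GPU memory left over after loading the weights.
shape = AttentionShape(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
per_token = kv_bytes_per_token(shape)
blocks = num_kv_blocks(shape, free_bytes=20 * 1024**3)

print(f"KV cache per token: {per_token} bytes")        # 131072 bytes (128 KiB)
print(f"Blocks that fit:    {blocks}")                 # 10240
print(f"Token capacity:     {blocks * 16}")            # 163840
```

The formula itself is cheap to evaluate. A real engine also has to load the model and measure how much memory is actually free before it can plug numbers in, so this sketch only shows what is being computed, not where the startup time goes.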


0

2

0

1

104

shubham

@bluequbit

7 days ago

@malliktwts why would I write code by hand? Especially when even local models are good at writing reasonable code?

1

0

84

shubham

@bluequbit

9 days ago

@arnavmehta007 DM closed.

1

0

57

shubham

@bluequbit

9 days ago

Great advice!!

will depue

@willdepue

10 days ago

academics are unprepared for the coming world where much scientific progress is majorly a function of inference compute. whether OpenAI points the Eye of Stargate at your particular field will decide its acceleration. talent will leach away into the labs. it's already begun

78

2K

84

411

608K

0

1

0

104

shubham

@bluequbit

11 days ago

@tejasybhakta @Winterrose Use @twentycrm

0

1

0

13

shubham

@bluequbit

13 days ago

Great read if you are trying to carry over the principles of writing good GEMM kernels to other architectures

steve

@gpusteve

14 days ago

we recently optimized qwen3.5-397b-a17b to be the fastest deployment publicly hosted. and the crazy thing: we did it by writing CUSTOM KERNELS for AMD MI355x. 🍿 see our post below outlining how we optimized kernels to achieve SOTA performance.