Seroze

@stringray3

- Obsession is far greater than Discipline

Joined May 2018

2.7K Following

115 Followers

3.2K Posts

stringray3 retweeted

romit @romitjain_

about 1 month ago

Sharing some learnings from attending MLSys '26. There were a lot of interesting papers presented on distributed training and inference. Overall, I could capture the following themes:
1. Distributed training has a lot of knobs, which are really tough to manage and tune. A ton of work is being done to make this easier to manage.
2. As training gets larger, reliability matters more. It was not surprising to see many industry talks focus on training reliability.
3. Ultra-long context lengths are getting a huge mindshare for both training and inference.
4. Heterogeneous compute (multi-region, multi-accelerator) is on the rise and is probably the next frontier of inference optimization.
5. Distributed inference still needs better auto-tuning for finding the best configs at large scale.
6. KV cache optimization, attention optimization, and quantization were already on the radar, so the number of papers on these topics was not a surprise.
7. IMO, from a skills perspective, the best thing to learn is GPU communication and networking. Learn everything around inter- and intra-rack communication, NCCL, and UCCL. Lots of improvements in the coming years will come from optimizing communication between GPUs via better kernels and frameworks.
For a list of interesting papers and their summaries: https://t.co/IY5p8PSeVQ

292

322

15K

stringray3 retweeted

yuwen lu

@yuwen_lu_

30 days ago

Watched half of it. Damn, how come nobody ever told me RL is this fun

675

893

141K

stringray3 retweeted

Chuck Tang

@j316chuck

30 days ago

Had such a blast working with @erictang000, @charlie_ruan, @sumanthhegde, and @pcmoritz on enabling multi-LoRA RL training in SkyRL! We observed ~3x higher experiment throughput compared to running experiments in the traditional single-tenant fashion. One of my favorite parts of this collaboration is that all this code is open source so you can play with it yourself :) Here's the technical deep dive 🧵

179

141

30K

stringray3 retweeted

Vuk Rosić 武克

@VukRosic99

29 days ago

I broke my record on the LLM I'm training by switching from constant learning rate to warm-up + decay (scheduled) learning rate.

Learning Rate Scheduling - Beginner Tutorial + Record LLM Speedrun

Full video - https://t.co/6UTQAsRxYt

At the start of training, the weights are far from optimal, so we use a high learning rate to make large updates and learn quickly. Later, as the model approaches a good solution, we reduce the learning rate to make smaller, more precise adjustments and avoid overshooting the optimum.

Speedrun our LLM - https://t.co/IaVf9nrX6m

Become AI researcher: https://t.co/pCxMtk6N9V (funds our compute)
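
A minimal sketch of what a warm-up + decay schedule can look like. The tweet does not say which shapes were used, so this assumes a linear warm-up followed by cosine decay; the function name `lr_at_step` and all default values are hypothetical, not taken from the linked speedrun.

```python
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate for a given step (assumed shape: linear warm-up, cosine decay)."""
    if step < warmup_steps:
        # Ramp linearly from a small value up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # After the schedule ends, stay at the floor.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

In a training loop, the returned value would be written into the optimizer's learning rate before each update. The short warm-up ramps up to the high rate, and the decay phase provides the smaller, more precise late-stage updates the tweet describes.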

388

stringray3 retweeted

Jaydev

@JaydevTonde

29 days ago

One more checkmark and I feel more excited about the upcoming work in building my LLM inference server.

So far, it feels great implementing core techniques like separate prefill and decode, KV cache, prefill caching, etc. The upcoming things are more interesting: Torch Compile, CUDA Graphs, SD, Quantization, and Distributed inference.

Since I know these theoretically, implementing them one by one will be fun.

I recently completed my study on prefix caching, which involves block hash-based and radix tree-based approaches. I have also run some benchmarks with vLLM and SGLang and will make them public soon.
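
A toy sketch of the block hash-based flavor of prefix caching mentioned above. It is not taken from the author's server, vLLM, or SGLang; `block_hashes`, `PrefixCache`, and the block size are hypothetical names and values chosen for illustration. Each full block of tokens is hashed together with its parent block's hash, so a matching hash implies the whole prefix up to that block matches.

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Chain-hash each full block with its parent's hash (partial tail block is skipped)."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # block hash -> stand-in for a cached KV block

    def insert(self, token_ids):
        """Register every full block of this sequence as cached."""
        for i, h in enumerate(block_hashes(token_ids, self.block_size)):
            self.blocks.setdefault(h, f"kv-block-{i}")

    def match_prefix(self, token_ids):
        """Number of leading tokens whose blocks are already cached."""
        matched = 0
        for h in block_hashes(token_ids, self.block_size):
            if h not in self.blocks:
                break
            matched += self.block_size
        return matched
```

A new request would call `match_prefix` first and only run prefill for the tokens past the matched length. A radix tree-based design instead stores token sequences in a tree and walks it to find the longest shared prefix.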

602

stringray3 retweeted

Deep-ML

@real_deep_ml

about 1 month ago

We just launched a new project that teaches you how to build Flash Attention with CUDA, step by step.

By the end, you’ll have a working Flash Attention kernel built from the ground up.

The project covers:
-CUDA primitives warm-up
-Matrix operations
-Naive attention baseline
-Online softmax math
-Tiled attention building blocks
-Fused Flash Attention kernel
-Causal Flash Attention

It will be open to everyone for the first 2 weeks, then it will become part of our premium projects.
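
For context on the "online softmax math" item, here is a small plain-Python sketch of the standard online softmax recurrence. It is not taken from the Deep-ML project, and the function name is hypothetical. The idea is to keep a running maximum and a running normalizer in a single pass, rescaling the normalizer whenever the maximum changes.

```python
import math


def online_softmax(xs):
    """Softmax using a single pass to build the running max and normalizer."""
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

Because the normalizer can be updated incrementally, attention scores can be processed tile by tile without materializing the full row first, which is what a fused kernel relies on.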

108

47K

Seroze @stringray3

about 1 month ago

https://t.co/vYOaCzYaTQ working on learning cuda

stringray3 retweeted

Jim Huang @jserv

about 2 months ago

I released MazuNIX on Mazu’s birthday. Unlike many educational operating systems that avoid SMP and real-time (RT) topics, Mazu delivers SMP, multicore RT scheduling, and practical POSIX Threads support.
Full source code is available: https://t.co/Ak8p7Kr9qr https://t.co/El4g3Tk7Cq

Seroze @stringray3

about 2 months ago

https://t.co/bEO4mavwOT B-tree SQL engine implementation in Python

Seroze @stringray3

about 2 months ago

https://t.co/S6w3p0quZN

stringray3 retweeted

andy

@1a1n1d1y

3 months ago

https://t.co/aXVn6fQtLj

825

862

108K

stringray3 retweeted

Gabriele Berton

@gabriberton

3 months ago

I have joined @GoogleDeepMind!

I'll be training VLMs

And I'll still keep posting about latest developments on AI, Computer Vision and LLMs

So no more posts on PyTorch tricks. I might post about JAX. Stay tuned... https://t.co/rDzXWje8As

122

486

146K

stringray3 retweeted

0xSero

@0xSero

3 months ago

In 72 hours I got over 100k of value:
1. Lambda gave me 5000$ in compute credits
2. Nvidia offered me 8x H100s on the cloud (20$/h), idk for how long, but assuming 2 weeks that'd be 5000$~
3. TNG technology offered me 2 weeks of B200s, which is something like 12000$ in compute
4. A kind person offered me 100k in GCP credits (enough to train a 27B if you do it right)
5. Framework offered to mail me a desktop computer
6. We got 14,000$ in donations, which will go to buying 2x RTX Pro 6000s (bringing me up to 384GB VRAM)
7. I got over 6M impressions, which based on my RPM would be 1500$ over my 500$~ usual per pay period
8. I have gained 17,000~ followers, more than doubling my follower count
9. 17 subscribers on X + 700 on youtube.

The total value of all this is at minimum 50,000$~ and closer to 150,000$ if I leverage it all.

---------------------

What I'll be doing with all this:

Eric is an incredibly driven researcher I have been bouncing ideas off of over the last month. He and I have been tackling the idea of getting massive models to fit on relatively cheap memory. The idea is taking advantage of different forms of memory, in combination with expert saliency scoring, to offload specific expert groupings to different memory tiers.

For the MoEs I've tested over my entire AI session history, about 37.5% of the model is responsible for 95% of token routing. So we can offload 62.5% of an LLM onto SSD/NVMe/CPU/cheap VRAM; this should theoretically add minimal latency if we can select the right experts.

We can combine this with paged swapping to further accelerate the prompt processing. If done right, we are looking at very very decent performance for massive unquantised & unpruned LLMs. You can get DeepSeek-v3.2-speciale at full intelligence with decent tokens/s as long as you have enough VRAM to host the core 20-40% of the model and enough RAM or SSD to host the rest.

Add quantisation to the mix and you can basically have decent speeds and intelligence with just 5-10% of the model's size in VRAM (+ you need some for context). The funds will be used to push this to its limits.

---------------------

There's also tons of research showing that you can quantise a model drastically, then distill from the original BF16 or make a LoRA to mostly align it back to the original. This will be added to the pipeline too.

---------------------

All this will be built out here: https://t.co/rHQUFdGfy4
You will be able to take any MoE and shove it in here, and compress it down with only 24GB and enough RAM/NVMe. It'll be slow as hell but it will work with little tinkering.

---------------------

Lastly I will be looking into either a full training run from scratch, or just post-training on an open AMERICAN base model:
- a research model
- an openclaw/nanoclaw/hermes model
- a browser-use model
To prove that this can be done.

---------------------

I will be bad at all of it, and doubt I will get beyond the best small models from 6 months ago, but I want to prove it's no boogeyman impossible task to everyone who says otherwise.

---------------------

By the end of the year:
1. I will have 1 model I trained in some capacity be in the top 5 at either pinchbench, browseruse, or research.
2. My github will have a master repo which combines all my work into reusable generalised scripts to help you do the same.
3. The largest public comparative dataset for all MoE quantisations, prunes, benchmarks, costs, hardware requirements.

---------------------

A lot of this will be led by Eric, who I will tag in the next post. I want to say thank you to everyone who has supported me. I have gotten a lot of comments stating:
1. I'm crazy, stupid, or both
2. I'm wasting my time, no one cares about this
3. This is not a real issue

I believe the amount of interest and support I've received says it all. https://t.co/aSLDkVhawQ
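
The "37.5% of the model is responsible for 95% of token routing" figure above is a coverage measurement, and one way it could be computed is sketched below. This assumes you have a per-expert count of routed tokens; the function name `experts_for_coverage` and the input format are hypothetical and not taken from the linked repo.

```python
def experts_for_coverage(route_counts, target=0.95):
    """Most-used experts whose routed tokens together reach `target` of all routed tokens.

    route_counts: dict mapping expert id -> number of tokens routed to it (assumed input).
    """
    total = sum(route_counts.values())
    if total == 0:
        return []
    kept, covered = [], 0
    # Greedily take experts from most to least used until the target share is reached.
    for expert, count in sorted(route_counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target * total:
            break
        kept.append(expert)
        covered += count
    return kept
```

Under this sketch, the kept experts would stay in fast VRAM and the remainder would be candidates for offloading to slower tiers. `len(kept) / len(route_counts)` gives the resident fraction of experts, which only approximates the fraction of model size if experts are equally sized.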