introducing cloudy, a platform where ai researchers can describe an experiment that they want to run, and cloudy handles everything else - generating code, the gpu & storage infrastructure, retries, and logs. more details below:
i’m building something new for teams that want to automate 100s of tiny research experiments on gpus (think @karpathy’s auto research). if you want to try out an early version, reach out to me!
for people who run experiments on gpus: i built a cli tool that gives coding agents access to cloudy's infra (~2x cheaper than other sandbox / serverless clouds).
with this cli, you can ask claude to:
"finetune kimi k2 on 64 h100s & to save money test your finetune on 8 h100s"
introducing simple-llm: a ~950 line, powerful & extensible inference engine that performs on par with vllm. enjoy :)
performance (gpt-oss-120b, on an h100):
- batch=1: 135 tok/s (vllm: 138)
- batch=64: 4,041 tok/s (vllm: 3,846)
https://t.co/x0aLi4rcAP
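a quick way to read the numbers above, as a sketch: the snippet only restates the four throughput figures from this post and divides simple-llm's number by vllm's. the name `relative_throughput` and the `RESULTS` layout are made up for illustration, not part of simple-llm.

```python
# throughput figures copied from the post above (tokens per second)
RESULTS = {
    1: {"simple_llm": 135.0, "vllm": 138.0},
    64: {"simple_llm": 4041.0, "vllm": 3846.0},
}


def relative_throughput(ours: float, baseline: float) -> float:
    """Return ours / baseline; values above 1.0 mean faster than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return ours / baseline


if __name__ == "__main__":
    for batch, r in RESULTS.items():
        ratio = relative_throughput(r["simple_llm"], r["vllm"])
        print(f"batch={batch}: {ratio:.3f}x of vllm")
```

on these figures that is roughly 2% behind vllm at batch=1 and roughly 5% ahead at batch=64.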
THE FINAL OFFSHORING
If you try to look to the past to understand the future of automation, the future is bleak.
On one hand, if we don’t go forward with robotics, then we’ll likely enter a stagnant, Malthusian hellscape of a world.
But on the other hand, if we do go forward with robotics, then society might not survive the transition intact.
If these are the two default futures, then it's no wonder so many people are pessimistic. Is there an option to just have neither of these, please?
Yes, actually. But it won’t be easy. It won’t be simple. And it will require reading a lot of footnotes.
saturday night random idea: i'm vibe reimplementing vllm into a single & simple inference file + a folder with custom kernels.
rules:
- allowed libs: torch, numpy, flash-attn
- must match vllm's gpt-oss-120b inference speed
- can optimise for the model and hardware (h100)
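to check a rule like "must match vllm's inference speed", you need a tokens-per-second number measured the same way for both engines. below is a minimal sketch of such a timer, under assumptions: `measure_tps` and the `generate` callable are hypothetical stand-ins (not vllm's or this project's api), and `generate` is assumed to return one list of new token ids per prompt.

```python
import time
from typing import Callable, List, Sequence


def measure_tps(generate: Callable[[Sequence[str]], List[List[int]]],
                prompts: Sequence[str],
                warmup: int = 1,
                clock: Callable[[], float] = time.perf_counter) -> float:
    """Return generated tokens per second for one timed call to `generate`.

    `generate` is a hypothetical stand-in for an engine's batch generate
    call: it takes prompts and returns one list of new token ids per prompt.
    """
    for _ in range(warmup):
        generate(prompts)  # untimed warmup run (compilation, caches, etc.)
    start = clock()
    outputs = generate(prompts)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    total_tokens = sum(len(tokens) for tokens in outputs)
    return total_tokens / elapsed
```

on a gpu, `generate` should block until the work has actually finished (for example by calling `torch.cuda.synchronize()` before returning), otherwise the timer stops early and the number looks better than it is.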
Introducing "public volumes" on Cloudy - with this update, open-source projects can be shared as reproducible sandboxes that can be forked and mounted on GPU instances within seconds.
Share, fork & mount 100TB+ sandboxes seamlessly with Cloudy. Here is a demo:
i miss working on projects like these... we are working on something interesting @cloudysoftwares that will hopefully incentivise more projects like these on the timeline :')
Announcing Cloudy, a platform that seamlessly handles your training infrastructure.
You can rent a single H100 or a cluster of 1000 H100s, manage petabyte-scale storage volumes & seamlessly go from running experiments to managing large scale training runs on a single interface.
today, i'm shutting down @aaaaaaaaaaorg. although i deeply believe in a10's mission, the only right way to build a10 would be to stay open, focus on education & stay non-profit. it will make a comeback someday, self-funded. until then, i'm focused on making fuck you money.
i'm working on an open-source repo that teaches people llm inference optimisations. rn the fastest files on the repo are at 6.5-7k tps, as compared to vllm's 10k for the same batch size. i added a c++ implementation today. feel free to try it out (wip):
https://t.co/Sb4Nq3HDS7
i'm working on a repository that implements increasingly complex llm inference optimizations in zero-abstraction, single-file implementations.
to start with, i put up 4 files on the repository that implement llama3.2 1b at 15tps, 60tps, 180tps and 1385tps on my 4090. the project will end when i am close to vllm's performance (7218 tps on my testset). then, i will do the same thing again with multiple gpus and implement parallel inference (maybe on a larger model). once i'm done with that, i'll write a blog / make a video explaining it all.
i have a few more working optimisations and a few buggy ones that i'll fix and push to the repository this week. i wanted to get the ball rolling & set up the repository for you guys to check out. feedback/thoughts are welcome :)
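for a sense of scale, here is a small sketch that turns the throughput numbers listed above into per-file speedups and the gap left to the vllm figure. the helper names `step_speedups` and `remaining_gap` are hypothetical; only the five numbers come from the post.

```python
from typing import List

# single-file throughputs from the post above (tokens per second, llama3.2 1b on a 4090)
FILE_TPS = [15.0, 60.0, 180.0, 1385.0]
VLLM_TPS = 7218.0  # vllm on the same testset, per the post


def step_speedups(tps: List[float]) -> List[float]:
    """Speedup of each file over the one before it."""
    return [later / earlier for earlier, later in zip(tps, tps[1:])]


def remaining_gap(current: float, target: float) -> float:
    """How many times faster `current` must get to reach `target`."""
    return target / current


if __name__ == "__main__":
    print([round(s, 2) for s in step_speedups(FILE_TPS)])
    print(round(remaining_gap(FILE_TPS[-1], VLLM_TPS), 2))
```

that works out to 4x, 3x and about 7.7x per step, with roughly 5.2x still to go before the fastest file reaches the vllm number.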
i love writing!! it gives me an excuse to learn topics and is my small way of giving back to the community (inspired by @karpathy's videos). my past blogs:
1. a blog on lcms (10k readers)
2. llama3 from scratch (14k github stars, ~100k readers)
3. an rl guide (20k readers)
for my next blog post -- i am working on a comprehensive guide on data parallelism. the guide will explain relevant parallelization tricks, training on thousands of h100s, networking within gpus and nodes, some relevant cuda concepts & a bunch of pytorch code :) eta ~2 months.
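the core idea of data parallelism fits in a few lines, so here is a toy sketch (not taken from the upcoming guide): every worker holds the same weight, computes a gradient on its own slice of the batch, and the gradients are averaged before the update. everything here is an assumption for illustration: a one-parameter model `y_hat = w * x`, workers simulated in a plain loop, and `all_reduce_mean` standing in for a real all-reduce over the network.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def local_gradient(w: float, shard: Sequence[Sample]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: Sequence[float]) -> float:
    """Stand-in for an all-reduce that averages one value per worker."""
    return sum(values) / len(values)


def data_parallel_step(w: float, batch: Sequence[Sample],
                       num_workers: int, lr: float) -> float:
    """One SGD step where each simulated worker sees an equal slice of the batch."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must divide evenly across workers")
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # on real hardware these gradients are computed in parallel, one per gpu
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)
```

with equal-sized shards, the averaged gradient equals the full-batch gradient, so the update is the same whether you use 1 worker or many. in pytorch, this role is played by `torch.nn.parallel.DistributedDataParallel`.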
with deepseek's r1 release, now is the best time to learn about reinforcement learning
@naklecha already dropped a bomb guide covering almost everything someone needs to get started with RL. crazy thing is it covers both theory and code examples
notes on my reinforcement learning for rocket league side quest so far:
i’m in awe that we live in a time where i can train a reinforcement learning model to play rocket league and, at the same time, test community-made rocket league models, with both processes running in parallel & locally on my hardware.
in the game window, i’m playing against a reinforcement learning ppo model that’s running locally. and in the vscode window, i’m training my own reinforcement learning model (just a dummy example script for now).
also, the rocket league botting ecosystem has incredible tooling where you can test out different reinforcement learning ideas against other algorithms + it has great documentation for linux and windows. it’s honestly a beautiful hidden gem of a rabbit hole :)
my new ml side quest -- i'm going to build rocket league bots, using reinforcement learning. my goal is to train a bot (in 1v1s) that is better than ~95% of rocket league's player base (diamond/champ rank).