introducing cloudy, a platform where ai researchers describe the experiment they want to run, and cloudy handles everything else: generating code, gpu & storage infrastructure, retries, and logs. more details below:
i took a break from socials for a few weeks. i’ve been lurking, and some of y’all are shipping really cool products / blogs (vv inspiring)!!
right now, i’m tinkering around with gpus and related ideas + i have a lot more clarity on what needs to be done with cloudy :)
something cool i shipped this weekend at https://t.co/8UpnMDSZb5 - you can now mount or unmount storage volumes on live gpu instances (WITHOUT RESTARTING THE INSTANCE).
for people who run experiments on gpus: i built a cli tool that gives coding agents access to cloudy's infra (~2x cheaper than other sandbox / serverless clouds).
with this cli, you can ask claude to:
"finetune kimi k2 on 64 h100s & to save money test your finetune on 8 h100s"
we are currently expanding our h100 & h200 capacity (including 3.2Tb/s infiniband clusters with kubernetes). in the meantime, you should give https://t.co/FWb9jsazEH a try, you will be pleasantly surprised (i promise)!!
exciting updates soon :)
also, i've been thinking a lot about the dx for multinode training. atm, the options you have are:
- torchrun or deepspeed, where you either maintain some janky k8s config or alt-tab between ssh terminals and run commands by hand
- or you use a random cloud's library, give up control of your cluster & forever stay at the mercy of that cloud provider's library + high prices
i am building a 3rd option that feels easier to use, scales up/down seamlessly, and gives your team full control of your cluster.
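for context on the first option above, here is a rough sketch of what the manual torchrun route looks like: every node needs its own command, with its own `--node_rank`. the helper name, hostnames, port, and `train.py` are all placeholders i made up for illustration, and none of this is cloudy's api.

```python
# hypothetical helper: builds the torchrun command each node has to run by hand.
# hostnames, port, and script name are placeholders, not anything cloudy-specific.
import shlex


def torchrun_commands(hosts, gpus_per_node, script, master_port=29500):
    """Return {host: command} for a static multinode torchrun launch."""
    master_addr = hosts[0]  # assumption: the first host acts as the rendezvous master
    commands = {}
    for node_rank, host in enumerate(hosts):
        args = [
            "torchrun",
            f"--nnodes={len(hosts)}",
            f"--nproc_per_node={gpus_per_node}",
            f"--node_rank={node_rank}",  # differs on every node
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            script,
        ]
        commands[host] = shlex.join(args)
    return commands


if __name__ == "__main__":
    # two nodes with 8 gpus each; you would paste each line into that node's ssh session
    for host, cmd in torchrun_commands(["node-0", "node-1"], 8, "train.py").items():
        print(f"[{host}] {cmd}")
```

with 8 nodes that is 8 ssh sessions and 8 slightly different commands, which is the part that gets painful.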
atm, i’m working on adding multinode, reserved instances & a one-click benchmark product for llm, image & video models. you will be able to spin up a training job and run benchmarks on your checkpoints seamlessly & in parallel, with no additional setup or uploads to huggingface.
Introducing "public volumes" on Cloudy - with this update, open-source projects can be shared as reproducible sandboxes that can be forked and mounted on GPU instances within seconds.
Share, fork & mount 100TB+ sandboxes seamlessly with Cloudy. Here is a demo:
i miss working on projects like these... we are working on something interesting @cloudysoftwares that will hopefully incentivise more projects like these on the timeline :')
i’m all in on @cloudysoftwares. i’ve got some really exciting stuff coming out soon & i’m not saying this lightly - we are just getting started.
and as always, i’m genuinely so grateful for all the support i’ve received in the last few months, love y’all. it’s game time!!!