Runhouse

@runhouse_

📦Kubetorch🔥: Fast, debuggable, fault-tolerant ML on shared compute.

Joined October 2022

11 Following

359 Followers

84 Posts

runhouse_ retweeted

PyTorch

@PyTorch

6 months ago

Runhouse and CuraVoice were recognized in the 2025 PyTorch Startup Showcase, with @DonnyGreenberg presenting for @runhouse_ and Shrey Modi presenting for @CuraVoice. Their work reflects the depth of innovation across AI infrastructure and applied AI in the PyTorch ecosystem.

Thank you to our Startup Showcase MC @chappyasel (@_ai_collective) and our judges Irving Hsu (@MayfieldFund), Radhika Malik (Dell Technologies Capital), Kevin Crosby (GitHub), @denise_teng25 (@GradientVC), and @simontiu (@VertexVUS) for their thoughtful evaluation and support.

💡 Learn more about the 2025 Showcase winners and their work: https://t.co/dR120wWEV6

#PyTorchCon #PyTorch #OpenSourceAI #AIInfrastructure

runhouse_ retweeted

PyTorch

@PyTorch

8 months ago

The Startup Showcase is happening now at PyTorch Conference 2025, where AI startups are pitching live to top VCs.

Startups presenting: Graphsignal, Runhouse, NimbleEdge, XOR, Snapshot AI, Node One, BrightOnLABS, Okahu AI, CuraVoice, Seer Systems, KnowNext, and Backfield AI.

Judges: Chappy Asel (The GenAI Collective), Irving Hsu (Mayfield), Radhika Malik (Dell Technologies Capital), Kevin Crosby (GitHub), Denise Teng, and Simon Tiu.

Sponsored by Dell Technologies Capital and Mayfield.

Stay tuned for more!

🔗 Learn more: https://t.co/L5C8AAi34O

#PyTorchCon #PyTorch #StartupShowcase

10K

runhouse_ retweeted

Donny Greenberg @DonnyGreenberg

8 months ago

“Programmatic control over ML compute is the new world. Now that I've seen it I can never go back to Slurm. It's hard to point to just one thing that's better, it's 100 things.” Announcing Kubetorch, fixing ML and RL development on Kubernetes: https://t.co/803n2KZp28

542

Runhouse

@runhouse_

10 months ago

Kubetorch makes RL easy, part 1: Launch code sandboxes (extremely easily!) as part of #RL training done with #TRL by @huggingface. KT makes it extremely simple to set up any agent for execution on separate compute (& separate images) for your RL loop.

266


Runhouse

@runhouse_

10 months ago

How can you write distributed training jobs that fail gracefully and recover intelligently? In this example, we show how Runhouse's Kubetorch gives programmatic control over training. When the model OOMs, the error propagates back to the driver program, where you retain full control over what happens next. This allows for smart, dynamic adjustment (like reducing batch size or relaunching on larger compute) without needing to restart the whole job. You can see the code in our GitHub repo: https://t.co/ycELZdWd93 #MLInfrastructure #PyTorch #MLOps

423
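The recovery pattern described in the tweet above can be sketched in plain Python. This is a minimal illustration, not Kubetorch code: `SimulatedOOM`, `train`, and `train_with_backoff` are hypothetical names, and the out-of-memory condition is faked with a simple threshold. The only idea taken from the tweet is that the error reaches the driver, which then decides what to do next (here, halving the batch size).

```python
class SimulatedOOM(RuntimeError):
    """Hypothetical stand-in for an out-of-memory error raised by a remote worker."""


def train(batch_size: int, memory_limit: int = 64) -> str:
    """Hypothetical training call; a real one would run on remote compute."""
    if batch_size > memory_limit:
        raise SimulatedOOM(f"batch_size={batch_size} does not fit in memory")
    return f"trained with batch_size={batch_size}"


def train_with_backoff(batch_size: int, min_batch_size: int = 1):
    """Retry from the driver, halving the batch size after each OOM."""
    attempts = []
    while batch_size >= min_batch_size:
        attempts.append(batch_size)
        try:
            return train(batch_size), attempts
        except SimulatedOOM:
            # The exception surfaced in the driver, so the job itself
            # does not need to be restarted; only the call is retried.
            batch_size //= 2
    raise RuntimeError("could not fit even the minimum batch size")


result, attempts = train_with_backoff(256)
print(result, attempts)
```

The same `except` branch is where a driver could instead relaunch on larger compute, the other option the tweet mentions.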

Runhouse

@runhouse_

10 months ago

With Kubetorch, you can run a PDB debugger in both #distributed training and #inference applications while they run at production scale on #Kubernetes. This was built for #ML teams who struggle to debug production-like workloads today.

734

Runhouse

@runhouse_

11 months ago

📉 Kubetorch users report 98% fewer prod failures. Why? Because unlike CLI-based infra (@kubeflow, #slurm, @raydistributed, etc.) that crumbles on errors, Kubetorch:

✅Propagates remote exceptions in pure Python
✅Avoids cascading failures
✅Supports multiple interactive calls

"Restart from checkpoint" is stone age

241

Runhouse

@runhouse_

11 months ago

Are you looking for a drama-free way to run your Airflow pipelines? In a new example below, we show how 📦Kubetorch🔥 cleanly separates the concerns of development from orchestration, so you never need to get stuck push-and-pray debugging in Airflow again.

You can iterate and debug your application in local Python with Kubetorch's live dispatch to cloud hardware, and then simply run that Python code as-is in Airflow with PythonOperator in CI or production. The Airflow node will run your code identically, on identical cloud hardware, as you ran and debugged it locally, so you'll never need to rely on Airflow as your debugger again. If you need to debug something that's breaking in production, or simply experiment with a change to a production pipeline, you can just edit and run that same Python code locally. It's a totally closed research-to-production-to-research loop.

This example demonstrates how to use Kubetorch to train a PyTorch model on a remote GPU, and then schedule that training for recurring automation with Airflow. If you already have a bunch of Airflow pipelines, you'll see that it takes minimal code changes to restore the development and debugging experience in any existing pipeline.

It's not just Airflow; the same structure can be used to improve development velocity, research-to-production, and fault tolerance with any pipeline orchestrator (e.g. Argo, Dagster, Prefect, Flyte) without requiring any direct integration.

Take a look and get in touch if you'd like to try it out yourself! https://t.co/3xp6dNpr1O

346
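The "develop locally, then run the same Python in Airflow" structure from the tweet above might look roughly like the sketch below. It assumes Airflow 2.4 or later for the `schedule` argument. `train_model` and `build_dag` are hypothetical names, the training body is a toy gradient-descent placeholder rather than a PyTorch job, and the Kubetorch dispatch step is left out because its API is not shown in the source.

```python
def train_model(epochs: int = 200, lr: float = 0.05) -> float:
    """Placeholder training step: fit w in y = w * x by gradient descent."""
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    w = 0.0
    for _ in range(epochs):
        # Mean gradient of the squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def build_dag():
    """Wrap the same callable in an Airflow DAG (needs Airflow installed)."""
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="train_model_daily",  # hypothetical DAG name
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # The task runs exactly the function that was debugged locally.
        PythonOperator(
            task_id="train",
            python_callable=train_model,
            op_kwargs={"epochs": 200},
        )
    return dag


if __name__ == "__main__":
    # Local iteration: call the function directly, no orchestrator involved.
    print(train_model())
```

Because the orchestrator only holds a reference to the callable, the same function could be passed to another scheduler's Python task without changing its body.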

Runhouse

@runhouse_

11 months ago

📦Kubetorch🔥 Inference with vLLM

Wouldn't you know it, if you build a system which is extremely good at deploying and scaling Python applications on Kubernetes, it works well for inference. Here's an example showing how Kubetorch's decorator API makes developing, deploying, scaling, and composing inference applications dead-easy.

If you think about it, basically all the opinionated little subcomponents (training, notebooks, inference, data processing, etc.) of SageMaker, Kubeflow, Vertex, etc. are doing exactly the same thing - running Python on cloud compute. So if you invert that approach, and focus on developing, deploying, and scaling arbitrary Python on Kubernetes, you can support the full ML lifecycle with elegant, unopinionated, cloud-agnostic APIs. That's 📦Kubetorch🔥.

364

Runhouse

@runhouse_

11 months ago

Hello 📦Kubetorch🔥

This demo: Running distributed training on Kubernetes with *1-2 second debugging and iteration loops*, and no magic CLI incantations, DSLs, YAML, or Dockerfiles. Just point any Python code to the compute you want to run it on and go.

(TensorFlow:PyTorch::Kubeflow:Kubetorch, get it?)

More demos coming soon:
* Online inference
* Training fault tolerance
* Live pdb debugging
* ...

-> Reply below what you'd like to see

304

Runhouse

@runhouse_

over 1 year ago

TL;DR: We're currently in the "Hadoop era" of machine learning - companies are struggling with infrastructure complexity just like they did with big data a decade ago. The next wave of innovation will focus on making ML infrastructure accessible and scalable. Teams should be able to launch LLM fine-tuning across a dozen GPUs or train an image recognition model at high velocity, without struggling with the hardware required to do so.

613

Runhouse

@runhouse_

over 1 year ago

.@DonnyGreenberg was at @NYSE's Cyber + AI Innovators Summit with @furrier to discuss the future of ML infrastructure and how to support rapid experimentation and scale for #genAI and #LLM use cases. Watch here: https://t.co/Ea4hAvNKnw

Runhouse

@runhouse_

over 1 year ago

Hear a replay of our presentation at the @raydistributed Summit about how Runhouse enables easy adoption of #Ray without disrupting your existing infrastructure while giving you fault-tolerant, flexible, and cost-efficient execution. https://t.co/jBEXYNmbVW

runhouse_ retweeted

Donny Greenberg @DonnyGreenberg

over 1 year ago

We've observed an alarming line of discussion questioning whether companies should invest in the capabilities to train their own models, with "all ML moving to LLMs" as the alternative. The premise is that if you squint, you could imagine how a sufficiently powerful LLM could answer the type of question a custom ML model predicts today (Will this user like this movie? Is this transaction fraud?). Aside from the age-old trap of "let's hope the model is magic this time," we see a few reasons to be highly suspicious of this excuse to pump the brakes on ML investment:

1) Definitionally, ML is *mobilizing your own data for the improvement of product,* and the large tech companies whose successes solidified the "data is the new oil" mantra are investing in training first-party models as intensely as ever. "Mobilizing first-party data" has a fundamentally different history of industry-redefining value creation compared to "building AI apps." Considering that the business value in Fortune 500 enterprise today from proprietary models (ML) so dramatically dwarfs that of off-the-shelf models (AI), it's profoundly risky to speculatively shift investment and focus 100% from ML to AI at this juncture.

2) The LLM providers, and now the cloud providers too (Gemini, Bedrock, Azure-OpenAI), are highly incentivized to push this message. It would be incredibly convenient if we all stopped investing in training models and instead irreversibly relied on their model-training expertise as a service going forward. We've already seen what happens when a small few massively out-invest others in mobilizing first-party data: the ad ecosystem. Imagine if Facebook had relied on Google for ad technology rather than mobilizing their data directly, or if the ad agencies had matched Google's pace of investment to mobilize theirs. To this day, the largest companies on the planet can only minimally utilize their proprietary user data in ad targeting (through narrow APIs), maximizing their dependence on the proprietary targeting intelligence within their ad platform.

3) Yes, the infra for ML is hard, but it's getting easier. If we all stopped doing large-scale BI because Hadoop was hard, we'd never have reached the OLAP promised land. @runhouse_ is like Snowflake for ML training, so if you're investing in ML and hitting the infra walls, come give us a try. Read on below.

runhouse_ retweeted

Hetz Ventures @HetzVentures

over 1 year ago

We're excited for the return of #DevOpsDaysTLV (Oct 30-31) which means:
- 4 portcos taking over our booth @Tabnine @velocitytechhq @trywilco @runhouse_
- Hetz hosting you at the Happy Hour on Day 1

Time to grab a spot and join @DevOpsDaysTLV for the best #dev event of the year!

377

runhouse_ retweeted

Priyanka Somrah

@psomrah

over 1 year ago

Fav talk at the #RaySummit2024 has to go to the ✨ @runhouse_ team ✨ with @DonnyGreenberg @mattkandler taking the stage to share how they’ve built https://t.co/Si4PgTV7P8 to foster collaborative Ray development at scale

The team’s also doing a super fun drink sesh at 6:30 PT @Black Hammer Brewing Co in SoMa -> if you’re an ML Eng at the #RaySummit today, come through and talk shop 🍺🍻

fin/

runhouse_ retweeted

ray

@raydistributed

almost 2 years ago

🚨#RaySummit speaker session alert!🚨 Join @runhouse_ as they reveal a new system that transforms Ray from single-player per cluster to multiplayer 🚀 Discover how they're overcoming cluster restrictions and introducing a collaborative UI, making AI resources accessible—like a Google Drive for AI 🙌 Don't miss it, sign up today: https://t.co/VTdE4cbXB9

runhouse_ retweeted

Work-Bench @Work_Bench

almost 2 years ago

Join our ML Engineering Happy Hour hosted by @psomrah @DonnyGreenberg & the @runhouse_ team 🍻 We're gathering our NYC technical community of practitioners across the ML Engineering ecosystem to come grab a drink, mingle & kick off the Fall 🍁🍂 https://t.co/9zmbuNU8TY

763

runhouse_ retweeted

Priyanka Somrah

@psomrah

almost 2 years ago

📢 Calling all Machine Learning Engineering practitioners & builders! @Work_Bench & @runhouse_ 🏃🏡 are kicking off the Fall season 🍂🍁 with a happy hour on Tue, Sep 10th! 🍻 Join us for drinks and snacks and come nerd out with fellow ML engineers in NYC. We have folks from @DoorDash @Etsy @flatironhealth @glean & MORE joining us already 🪩💃 RSVP now to grab a spot! ➡️ https://t.co/Bs4vJ8kqXy ⬅️

647

Runhouse

@runhouse_

almost 2 years ago

If you love the interactivity and visualizations of notebooks but want the determinism and power of ML pipelines, you should be using @marimo_io with Runhouse!

marimo

@marimo_io

almost 2 years ago

From local development to cloud compute: you can now run functions in a marimo notebook on remote clusters!

We unlocked this capability thanks to @runhouse_, which builds on @skypilot_org to make it seamless to run remote functions. https://t.co/s515dKsGnx

879
