Runhouse and Curavoice were recognized in the 2025 PyTorch Startup Showcase, with @DonnyGreenberg presenting for @runhouse_ and Shrey Modi presenting for @CuraVoice. Their work reflects the depth of innovation across AI infrastructure and applied AI in the PyTorch ecosystem.
Thank you to our Startup Showcase MC @chappyasel (@_ai_collective) and our judges Irving Hsu (@MayfieldFund), Radhika Malik (Dell Technologies Capital), Kevin Crosby (GitHub), @denise_teng25 (@GradientVC), and @simontiu (@VertexVUS) for their thoughtful evaluation and support.
💡 Learn more about the 2025 Showcase winners and their work: https://t.co/dR120wWEV6
#PyTorchCon #PyTorch #OpenSourceAI #AIInfrastructure
The Startup Showcase is happening now at PyTorch Conference 2025, where AI startups are pitching live to top VCs.
Startups presenting: Graphsignal, Runhouse, NimbleEdge, XOR, Snapshot AI, Node One, BrightOnLABS, Okahu AI, CuraVoice, Seer Systems, KnowNext, and Backfield AI.
Judges: Chappy Asel (The GenAI Collective), Irving Hsu (Mayfield), Radhika Malik (Dell Technologies Capital), Kevin Crosby (GitHub), Denise Teng, and Simon Tiu.
Sponsored by Dell Technologies Capital and Mayfield.
Stay tuned for more!
🔗 Learn more: https://t.co/L5C8AAi34O
#PyTorchCon #PyTorch #StartupShowcase
“Programmatic control over ML compute is the new world. Now that I've seen it I can never go back to Slurm. It's hard to point to just one thing that's better, it's 100 things.”
Announcing Kubetorch, fixing ML and RL development on Kubernetes: https://t.co/803n2KZp28
Kubetorch makes RL easy, part 1: Launch code sandboxes (extremely easily!) as part of #RL training done with #TRL by @huggingface. KT makes it extremely simple to setup any agent for execution on separate compute (& separate images) for your RL loop.
How can you write distributed training jobs that fail gracefully and recover intelligently?
In this example, we show how Runhouse's Kubetorch gives programmatic control over training. When the model OOMs, the error propagates back to the driver program, where you retain full control over what happens next. This allows for smart, dynamic adjustment (like reducing batch size or relaunching on larger compute) without needing to restart the whole job.
You can see the code in our GitHub repo: https://t.co/ycELZdWd93
#MLInfrastructure #PyTorch #MLOps
With Kubetorch, you can run a PDB debugger in both #distributed training and #inference applications while they run at production scale on #Kubernetes. This was built for #ML teams who struggle to debug production-like workloads today.
📉 Kubetorch users report 98% fewer prod failures.
Why? Because unlike CLI-based infra (@kubeflow, #slurm, @raydistributed , etc) that crumble on errors, Kubetorch:
✅Propagates remote exceptions in pure Python
✅Avoids cascading failures
✅Supports multiple interactive calls
"Restart from checkpoint" is stone age
Are you looking for a drama-free way to run your Airflow pipelines?
In a new example below, we show how 📦Kubetorch🔥 cleanly separates the concerns of development from orchestration, so you never need to get stuck push-and-pray debugging in Airflow again. You can iterate and debug your application in local Python with Kubetorch's live dispatch to cloud hardware, and then simply run that Python code as-is in Airflow with PythonOperator in CI or production. The Airflow node will run your code identically, on identical cloud hardware, as you ran and debugged it locally, so you'll never need to rely on Airflow as your debugger again.
If you need to debug something that's breaking in production, or simply experiment with a change to a production pipeline, you can just edit and run that same Python code locally. It's a totally closed research-to-production-to-research loop.
This example demonstrates how to use Kubetorch to train a PyTorch model on a remote GPU, and then schedule that training for recurring automation with Airflow. If you already have a bunch of Airflow pipelines, you'll see that it takes minimal code changes to restore the development and debugging experience in any existing pipeline.
It's not just Airflow; the same structure can be used to improve development velocity, research-to-production, and fault tolerance with any pipeline orchestrator (e.g. Argo, Dagster, Prefect, Flyte) without requiring any direct integration.
Take a look and get in touch if you'd like to try it out yourself!
https://t.co/3xp6dNpr1O
📦Kubetorch🔥 Inference with vLLM
Wouldn't you know it, if you build a system which is extremely good at deploying and scaling Python applications on Kubernetes, it works well for inference. Here's an example showing how Kubetorch's decorator API makes developing, deploying, scaling, and composing inference applications dead-easy.
If you think about it, basically all the opinionated little subcomponents (training, notebooks, inference, data processing, etc.) of SageMaker, Kubeflow, Vertex, etc. are doing exactly the same thing - running Python on cloud compute. So if you invert that approach, and focus on developing, deploying, and scaling arbitrary Python on Kubernetes, you can support the full ML lifecycle with elegant, unopinonated, cloud-agnostic APIs. That's 📦Kubetorch🔥 .
Hello 📦Kubetorch🔥
This demo: Running a distributed training on Kubernetes with *1-2 second debugging and iteration loops*, and no magic CLI incantations, DSLs, YAML, or dockerfiles. Just point any Python code to the compute you want to run it on and go.
(Tensorflow:PyTorch::Kubeflow:Kubetorch, get it?)
More demos coming soon:
* Online inference
* Training fault tolerance
* Live pdb debugging
* ... -> Reply below what you'd like to see
tldr; We're currently in the "Hadoop era" of machine learning - companies are struggling with infrastructure complexity just like they did with big data a decade ago. The next wave of innovation will focus on making ML infrastructure accessible and scalable. Teams should be able to launch an LLM fine-tuning over a dozen GPUs or train an image recognition model at high velocity, without struggling with the hardware required to do so.
.@DonnyGreenberg was at @NYSE 's Cyber + AI Innovators Summit with @furrier to discuss the future of ML infrastructure and how to support rapid experimentation and scale for #genAI and #LLM use cases.
Watch here: https://t.co/Ea4hAvNKnw
Hear a replay of our presentation at the @raydistributed Summit about how Runhouse enables easy adoption of #Ray without disrupting your existing infrastructure while giving you fault-tolerant, flexible, and cost-efficient execution.
https://t.co/jBEXYNmbVW
We've observed an alarming line of discussion questioning whether companies should invest in the capabilities to train their own models, with "all ML moving to LLMs" as the alternative. The premise is that if you squint, you could imagine how a sufficiently powerful LLM could answer the type of question a custom ML model predicts today (Will this user like this movie? Is this transaction fraud?). Aside from the age old trap of "let's hope the model is magic this time," we see a few reasons to be highly suspicious of this excuse to pump the breaks on ML investment:
1) Definitionally, ML is *mobilizing your own data for the improvement of product,* and the large tech companies whose successes solidified the "data is the new oil" mantra are investing in training first-party models as intensely as ever. "Mobilizing first-party data" has a fundamentally different history of industry-redefining value creation compared to "building AI apps." Considering that the business value in Fortune500 enterprise today from proprietary models (ML) so dramatically dwarfs that of off-the-shelf models (AI), it's profoundly risky to speculatively shift investment and focus 100% from ML to AI at this juncture.
2) The LLM providers, and now the cloud providers too (Gemini, Bedrock, Azure-OpenAI), are highly incentivized to push this message. It would be incredibly convenient if we all stopped investing in training models and instead irreversibly relied on their model-training expertise as a service going forward. We've already seen what happens when a small few massively out-invest others in mobilizing first-party data: the ad ecosystem. Imagine if Facebook had relied on Google for ad technology rather than mobilizing their data directly, or if the ad agencies had matched Google's pace of investment to mobilize theirs. To this day, the largest companies on the planet can only minimally utilize their proprietary user data in ad targeting (through narrow APIs), maximizing their dependence on the proprietary targeting intelligence within their ad platform.
3) Yes, the infra for ML is hard, but it's getting easier. If we all stopped doing large-scale BI because Hadoop was hard, we'd never have reached the OLAP promised land. @runhouse_ is like Snowflake for ML training, so if you're investing in ML and hitting the infra walls, come give us a try.
Read on below.
Fav talk at the #RaySummit2024 has to go to the ✨ @runhouse_ team ✨ with @DonnyGreenberg@mattkandler taking the stage to share how they’ve built https://t.co/Si4PgTV7P8 to foster collaborative Ray development at scale
The team’s also doing a super fun drink sesh at 6:30 PT @Black Hammer Brewing Co in SoMa —> if you’re an ML Eng at the #RaySummit today, come through and talk shop 🍺🍻
fin/
🚨#RaySummit speaker session alert!🚨 Join @runhouse_ as they reveal a new system that transforms Ray from single-player per cluster to multiplayer 🚀
Discover how they're overcoming cluster restrictions and introducing a collaborative UI, making AI resources accessible—like a Google Drive for AI 🙌
Don't miss it, sign up today: https://t.co/VTdE4cbXB9
Join our ML Engineering Happy Hour hosted by @psomrah@DonnyGreenberg & the @runhouse_ team 🍻
We're gathering our NYC technical community of practitioners across the ML Engineering ecosystem to come grab a drink, mingle & kick off the Fall 🍁🍂
https://t.co/9zmbuNU8TY
📢 Calling all Machine Learning Engineering practitioners & builders!
@Work_Bench & @runhouse_ 🏃🏡 are kicking off the Fall season 🍂🍁 with a happy hour on Tue, Sep 10th!
🍻 Join us for drinks and snacks and come nerd out with fellow ML engineers in NYC.
We have folks from @DoorDash@Etsy@flatironhealth@glean & MORE joining us already 🪩💃
RSVP now to grab a spot!
➡️ https://t.co/Bs4vJ8kqXy ⬅️
If you love the interactivity and visualizations of notebooks but want the determinism and power of ML pipelines, you should be using @marimo_io with Runhouse!
From local development to cloud compute: you can now run functions in a marimo notebook on remote clusters!
We unlocked this capability thanks to @runhouse_, which builds on @skypilot_org to make it seamless to run remote functions.