As someone working in inference rn,
It all depends a lot on how "urgent" the new model is and how "non-urgent" the old one is. Plus, how much spare capacity you have. Plus, if you were to wait for the current request(s) to finish, how long you would have to wait.
For the first: would you rather prioritize an old request that another GPU can accommodate, or a new request that doesn't have any place to go? (i.e. code 429 (too many requests) -> bad customer experience)
Usually it's the latter.
For the latter as well, if there's already a "warm" GPU that you can use to quickly serve up the new model, there's no reason to interrupt a request and switch something there. On the other hand, if a new GPU first has to be warmed up with model images etc., you need to check if the new request is okay waiting that long.
The above case is still usually too slow compared to the model-switching options already available. E.g. vLLM offers a sleep mode (https://t.co/TeyC1Df1Cs) where you can start up a new model in seconds instead of setting up a new thing on a new GPU from scratch.
For the second (spare capacity): if you have some extra GPUs, you might already have some GPUs for every model you support. In this case, it's all about committing to the SLA you've already promised. If you predict that the current set of GPUs can't finish a new request within time t, the only options are to send back a 429 (rate limit) or quickly set this new model up on another (old model) GPU.
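A rough sketch of the placement decision described so far, in Python. Everything here is hypothetical: the `Estimates` fields, the `place_request` name, and the preference order (try options that don't interrupt anyone first, then model switching, then 429) are assumptions for illustration, not how any particular serving stack does it. The ETA numbers would have to come from your own profiling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    USE_WARM_GPU = "use_warm_gpu"    # a GPU already serving this model
    COLD_START = "cold_start"        # warm up a fresh GPU from scratch
    SWITCH_MODEL = "switch_model"    # swap the model onto an (old model) GPU
    REJECT_429 = "reject_429"        # nothing can meet the SLA


@dataclass
class Estimates:
    # Predicted seconds until the new request would finish on each path.
    # None means that path is not available at all. (Hypothetical fields.)
    warm_gpu_s: Optional[float]
    cold_start_s: Optional[float]
    switch_model_s: Optional[float]


def place_request(est: Estimates, sla_s: float) -> Action:
    """Pick the first option whose predicted finish time fits the SLA."""
    # Assumed preference: non-interrupting options before switching.
    candidates = [
        (Action.USE_WARM_GPU, est.warm_gpu_s),
        (Action.COLD_START, est.cold_start_s),
        (Action.SWITCH_MODEL, est.switch_model_s),
    ]
    for action, eta in candidates:
        if eta is not None and eta <= sla_s:
            return action
    return Action.REJECT_429


# Usage: no warm GPU, cold start too slow, switching fits a 10 s SLA.
decision = place_request(Estimates(None, 90.0, 8.0), sla_s=10.0)
```

With these made-up numbers the cold start blows the SLA, so the request lands on the model-switch path rather than getting a 429.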
For the third (how long do we wait?), unless you do some crazy good request profiling, it's kinda impossible to know how long all the requests running on a GPU would take to finish. At the end of the day, patience runs out and you end up interrupting some requests, if not all (equivalent to ripping off the band-aid).
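One way to picture the "patience runs out" behaviour is a drain with a deadline: give in-flight requests a fixed budget, then cancel whatever is left. This is a toy sketch using `asyncio`; `drain_then_interrupt` and `_fake_request` are made-up names, and the sleeps stand in for real generation work.

```python
import asyncio


async def drain_then_interrupt(tasks, patience_s):
    """Wait up to patience_s for in-flight requests, then cancel the rest."""
    if not tasks:
        return set(), set()
    done, pending = await asyncio.wait(tasks, timeout=patience_s)
    for task in pending:
        task.cancel()  # rip off the band-aid
    # Let the cancellations settle before handing the GPU over.
    await asyncio.gather(*pending, return_exceptions=True)
    return done, pending


async def _fake_request(duration_s):
    # Stand-in for a request that takes duration_s seconds to finish.
    await asyncio.sleep(duration_s)
    return duration_s


async def demo():
    tasks = {asyncio.create_task(_fake_request(d)) for d in (0.01, 0.02, 5.0)}
    done, interrupted = await drain_then_interrupt(tasks, patience_s=0.2)
    return len(done), len(interrupted)
```

Running `asyncio.run(demo())` lets the two short requests finish and interrupts the long one once the 0.2 s budget is spent.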
Bonus braindump: In the grand scheme of things, the delay in an interrupted request might be acceptable, esp. if KV cache offload is taken care of and the request is moved to another GPU (along with its KV cache). (Surprisingly?) In some cases it's more efficient to move KV -> CPU memory -> another GPU instead of evicting it entirely. (https://t.co/Cg9vRjadeE)
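A back-of-the-envelope way to see why moving the KV cache can beat evicting it: compare the time to copy it over two hops (GPU -> CPU, CPU -> other GPU) against the time to recompute it with a fresh prefill. The model shape, the 25 GB/s link speed, and the 5,000 tokens/s prefill rate below are illustrative assumptions, not measurements, and the helper names are made up.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Size of keys + values for one sequence (the factor 2 is K and V)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def transfer_seconds(num_bytes, link_bytes_per_s, hops=2):
    """Time to copy the cache over `hops` sequential links of equal speed."""
    return hops * num_bytes / link_bytes_per_s


def recompute_seconds(tokens, prefill_tokens_per_s):
    """Time to rebuild the cache by running prefill over the context again."""
    return tokens / prefill_tokens_per_s


# Assumed setup: 32 layers, 8 KV heads, head_dim 128, fp16, 8,000-token context.
size = kv_cache_bytes(tokens=8000, layers=32, kv_heads=8, head_dim=128)
move_s = transfer_seconds(size, link_bytes_per_s=25e9)        # assumed 25 GB/s
redo_s = recompute_seconds(8000, prefill_tokens_per_s=5000)   # assumed rate
```

Under these assumptions the cache is about 1 GB, so `move_s` comes out well under a tenth of a second while `redo_s` is 1.6 s. The comparison flips if the link is slow or contended, or if the context is short.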
I don't see how model routers / orchestrating multiple models is a viable business model.
If most inference is multi-turn agents, routing to a new model in the middle, and paying for the full context that would otherwise be KV-cached, seems quite wasteful.
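To put a rough number on that waste, here is a toy count of prefill tokens for a multi-turn session. It assumes a perfect prefix cache while the session stays on one model, and a full re-prefill of the history whenever the session is routed to a different model. `prefill_tokens` and the token counts are hypothetical.

```python
from typing import Collection, Sequence


def prefill_tokens(new_tokens_per_turn: Sequence[int],
                   switch_at_turns: Collection[int] = ()) -> int:
    """Total tokens that must be prefilled over a multi-turn session.

    new_tokens_per_turn[i] is how many tokens turn i adds to the context.
    switch_at_turns holds 0-based turn indices where a different model
    takes over and therefore starts with no KV cache.
    """
    total = 0
    context = 0
    for turn, new in enumerate(new_tokens_per_turn):
        if turn in switch_at_turns:
            total += context  # re-prefill the whole history on the new model
        total += new          # new tokens are always prefilled
        context += new
    return total


turns = [1000] * 10                      # assumed: 10 turns, 1,000 tokens each
cached = prefill_tokens(turns)           # stays on one model
switched = prefill_tokens(turns, {5})    # routed to a new model at turn 5
```

Here `cached` is 10,000 tokens and `switched` is 15,000: a single mid-session switch adds 50% more prefill, and the penalty grows the later the switch happens.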
🐍 #EuroPython2026 sponsorships are now OPEN..
… and we’ve already sold 80% of our Gold tier packages!
Early bird offer: sign up before 31st March and get 10% off.
📩 [email protected]
On that note, I'm officially a co-lead of the Sponsorship Team this year!
HMU if you want to sponsor the largest Python conference of Europe (in Poland, July 2026) and meet the best Python people! 🐍🏰🇵🇱
“EuroPython is a welcoming community [...] Just stay open, and your environment will do the rest.”
Read our latest interview with @deutranium, member of the Sponsorship team at EuroPython 2025
https://t.co/cDwuwI9u6D
#europython #conference #volunteers #python #contributors
HELIX CLOUD IS FINALLY IN OPEN BETA!!! ☁️🌪️
The team have been working tirelessly to bring you a robust, scale-to-zero graph/vector db.
Finally available via our website 👇🏻🥳
New #phd position in association #football (#soccer) analytics opening in @sn_ethz at @ETH_en, because https://t.co/txUdHRhgpb is about to finish. This is in addition to the @ETH_AI_Center #fellowship announced recently.
Job ad: https://t.co/lIJzLYoJwm
The @sn_ethz is looking for a motivated PhD student interested in combining social network research with analysis of digital trace data in the domain of music tastes and cultural consumption! (to work with @c_stadtfeld & Xinwei Xu) #socialnetwork #networkscience @SocNetAnalysts
WHAT: @deutranium's #MSThesisDefense "Sampling cohesive communities in unbounded networks"
1100hrs IST, 24th June @iiit_hyderabad
📜 Full thesis: https://t.co/glMjnzFtDe
#ProfGiri #Student30
K's paper - ICWSM 2024: Tight Sampling in Unbounded Networks https://t.co/8clljw3uKo
Students at @iiit_hyderabad, supposed to be one of the top engineering colleges in the country, are in an ongoing health crisis caused by appalling mismanagement and negligence that has been going on for well over a year. 🧵👇 on mass typhoid outbreaks, food poisoning, underreporting..
Excited to announce our preprint!
We develop a symbolic system for IMO Geometry that can rival Silver Medalists. Combined with AlphaGeometry, it outperforms IMO Gold Medalists in Geometry for the first time 🏅
https://t.co/TG418YGL2j