GPU shortage and GPU underutilization exist at the same time. Some nodes are overloaded. Others are waiting for work! Identical prompts recompute prefill on different replicas, for no reason.
Owning GPUs and actually using them are two different problems. New post on solving the second one!
https://t.co/F33s58eufA
#AI #GPU #Cloud #AIRCLOUD #MLOps #Ray
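One way to keep identical prompts from recomputing prefill on different replicas is to route by prompt prefix, so repeats land where the KV cache is already warm. A minimal sketch, assuming each replica keeps its own prefix cache; `pick_replica`, `queue_depth`, `prefix_chars`, and `max_queue` are hypothetical names for illustration, not the scheduler described in the post:

```python
import hashlib


def pick_replica(prompt, queue_depth, prefix_chars=256, max_queue=8):
    """Pick a replica name for a prompt.

    queue_depth maps replica name -> number of requests currently waiting.
    All names and thresholds here are illustrative assumptions.
    """
    replicas = sorted(queue_depth)  # stable order so the hash maps consistently
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    preferred = replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

    # Prefer the replica that likely holds this prefix in cache...
    if queue_depth[preferred] < max_queue:
        return preferred
    # ...but fall back to the least-loaded replica if it is overloaded.
    return min(replicas, key=lambda name: queue_depth[name])
```

The fallback trades a cache miss for lower queueing delay, which is the overloaded-versus-idle tension the post points at.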
Finally got my hands on the big one.
Qwen3.5-122B-A10B: 122 billion parameters. Too big for any single consumer GPU.
So I rented four of each consumer card (5090, 4090, 3090)... and then one professional card to see if brute force even matters.
- 1x RTX PRO 6000 (96GB): 101.4 tok/s
- 4x 5090 (128GB): 87.0 tok/s
- 4x 4090 (96GB): 25.1 tok/s
- 4x 3090 (96GB): 20.8 tok/s
One single $8,500 card beat four RTX 5090s.
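To put the gap in numbers, here is a small sketch that normalizes the results above against the single professional card and per physical card. The throughput figures are copied from the list; the helper names are made up for illustration:

```python
# (number of cards, tokens per second) copied from the benchmark list above.
results = {
    "1x RTX PRO 6000": (1, 101.4),
    "4x 5090": (4, 87.0),
    "4x 4090": (4, 25.1),
    "4x 3090": (4, 20.8),
}


def relative_speed(name, baseline="1x RTX PRO 6000"):
    """Throughput of a setup as a fraction of the baseline setup."""
    return results[name][1] / results[baseline][1]


def per_card(name):
    """Tokens per second divided by the number of cards in the setup."""
    cards, tok_s = results[name]
    return tok_s / cards


for name in results:
    print(f"{name}: {relative_speed(name):.2f}x of the PRO 6000, "
          f"{per_card(name):.1f} tok/s per card")
```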
@johniosifov This matches what we see at AIEEV running Air Cloud for inference workloads. Teams that combine routing and caching cut tokens 60-80%. Teams that also move off hyperscaler GPUs cut the remaining bill another 60%. The two stack better than most CFOs realize.
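Why the two stack: the second cut applies to whatever bill is left after the first. A rough sketch, assuming the bill scales linearly with tokens (an assumption, not something the post states); `combined_saving` is a hypothetical helper:

```python
def combined_saving(token_cut, gpu_cut):
    """Total fraction saved when gpu_cut applies to the bill left after token_cut."""
    remaining = (1 - token_cut) * (1 - gpu_cut)
    return 1 - remaining


# The 60-80% token cut from the post, followed by a 60% cut on the remainder.
for token_cut in (0.60, 0.80):
    print(f"token cut {token_cut:.0%} -> total saving {combined_saving(token_cut, 0.60):.0%}")
```

Under that assumption the combined reduction lands at roughly 84% to 92% of the original bill.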
No local setup. No driver conflicts. Just SSH in and start coding with Claude on a GPU server.
Deploy a container. Inject your SSH key. Install Claude Code.
That's it. Claude was running GPU benchmarks for me within minutes.
If you're tired of environment hell, this is worth a try.
AirCloud April Update is live: faster AI workloads, more reliable ops, and smarter GPU utilization.
Highlights:
- Enhanced container ops
- Air API GA
- RP support
- Intelligent scheduler
The biggest signal in the open model infrastructure market this week was "the operational cost of long context." 512K-1M context windows. Cached-input pricing. Hosted KV cache. The competition is shifting from model names to "how cheaply and reliably can you handle long inputs."
The framework:
* Experimenting with long context → Serverless API
* Growing usage → GPU rental
* Large, predictable demand → Dedicated infrastructure
The long-context era doesn't demand the biggest GPU. It demands the cost structure that fits your workload.
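The framework above can be read as a simple decision rule. A sketch only: the two boolean inputs are a hypothetical simplification, and the post gives no numeric thresholds, so none are invented here:

```python
def suggest_option(experimenting: bool, demand_large_and_predictable: bool) -> str:
    """Map a workload stage to one of the three options in the framework."""
    if experimenting:
        # Still trying long context out: pay per request.
        return "Serverless API"
    if demand_large_and_predictable:
        # Steady, sizeable load: fixed capacity can pay off.
        return "Dedicated infrastructure"
    # Usage is growing but not yet large and predictable.
    return "GPU rental"


print(suggest_option(experimenting=False, demand_large_and_predictable=False))
```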
I used to assume buying more GPUs meant being ready for AI. Turns out most teams are only using ~5% of what they pay for.
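What ~5% means for the bill: the cost of each GPU-hour that does real work is the sticker price divided by utilization. A back-of-the-envelope sketch; the hourly price below is a made-up placeholder, not a quoted rate:

```python
def cost_per_useful_gpu_hour(hourly_price, utilization):
    """Effective price of one GPU-hour that actually does work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


HYPOTHETICAL_PRICE = 2.00  # placeholder hourly price, for illustration only
print(cost_per_useful_gpu_hour(HYPOTHETICAL_PRICE, 0.05))
```

At 5% utilization every useful hour costs 20 times the sticker price, whatever that price is.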
If you're provisioning GPUs for "peak" and letting them sit idle, you're burning money (and power) for nothing. We dug into how whole-GPU allocation, over-provisioning, and lack of cross-team visibility create this waste, and how a shift from ownership to access fixes it.
Want to stop paying for idle GPUs and actually get the performance your models need? Read how distributed GPU infrastructure, MIG partitioning, and serverless allocation can cut waste and scale inference for LLMs and diffusion models.