CloudRift is the OS for sovereign AI deployments. We give data center operators and enterprises a single control plane to manage GPU fleets, launch customer workloads, and serve LLM inference, all within their own security perimeter.
We post:
• Benchmarks (matmul kernels, fp16/fp32, TMA, warp specialization)
• Availability as new GPUs come online
• Commentary on the GPU market and sovereign AI
• Engineering write-ups from @ditrifonov and team
https://t.co/R0dgYrPgom
Building a matmul kernel for Blackwell, we benchmarked two common bits of GPU optimization advice and found they don't survive modern ptxas:
Manually vectorizing to LDG.128. 0% delta.
Sinking vs hoisting loads. Same.
Writeup: https://t.co/kQVFE0cGns
#CUDA
We built an fp16 matmul kernel that hits 105% of cuBLAS HGEMM on the RTX 5090. cuBLAS still ships an Ampere-era kernel.
No TMA, no warp specialization. @ditrifonov rebuilt it for Blackwell.
The writeup walks through every pass.
https://t.co/kQVFE0ded0
#CUDA #GPU
Most GPU VMs come configured for general workloads.
Our team benchmarked what host-level tuning actually changes: memory bandwidth up to 7x on #H200, #NCCL up to +144% on PRO 6000. On the wrong config, #NUMA exposure cuts NCCL by 57%.
https://t.co/y9d7VYjKT9
Close to half of planned US data center builds this year are projected to be delayed or canceled.
The causes are power infrastructure and China-sourced parts, with transformer lead times now up to five years.
https://t.co/oCUu82rsGb
#DataCenters #AIinfrastructure
61% of Western European CIOs now prioritize local cloud providers over US hyperscalers. With the EU AI Act fully applicable on August 2, regional GPU capacity is shifting from a preference to a procurement requirement.
https://t.co/HvNw5kOOl3
#SovereignAI #EUAIAct
Training models or serving inference on AMD GPUs?
We’ve refreshed the AMD accelerator example in the dstack docs, covering on-prem fleets, cloud GPU provisioning, dev environments, training jobs, and production-grade inference.
https://t.co/WffI8cKY7t
How do you search 24,000 matmul configurations without burning days of GPU time? @ditrifonov's autotuner samples around 207 of them in ~67 seconds with Monte Carlo tree search.
Check out part 3 of the writeup:
https://t.co/xyGDinQwUO
@triton_lang #MLcompilers #CUDA
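For readers unfamiliar with the technique, here is a minimal sketch of Monte Carlo tree search over a kernel configuration space. It is not the autotuner from the writeup: the parameter names in `SPACE`, the `synthetic_cost` function, and the reward definition are all assumptions made for illustration, and the toy space has 432 configurations rather than 24,000. A real tuner would compile and time a kernel where `synthetic_cost` is called.

```python
import math
import random

# Hypothetical search space; names and values are illustrative only.
SPACE = [
    ("tile_m", [32, 64, 128, 256]),
    ("tile_n", [32, 64, 128, 256]),
    ("tile_k", [16, 32, 64]),
    ("stages", [2, 3, 4]),
    ("warps", [2, 4, 8]),
]

def synthetic_cost(config):
    """Stand-in for a measured kernel runtime (lower is better)."""
    target = (128, 128, 32, 3, 4)  # arbitrary optimum for the toy cost
    return 1.0 + sum(abs(math.log2(c / t)) for c, t in zip(config, target))

class Node:
    def __init__(self, prefix):
        self.prefix = prefix      # values chosen so far, one per parameter
        self.children = {}        # parameter value -> child Node
        self.visits = 0
        self.total_reward = 0.0

def untried_values(node):
    """Values of the next parameter that have no child node yet."""
    values = SPACE[len(node.prefix)][1]
    return [v for v in values if v not in node.children]

def ucb_child(node, c=1.4):
    """Pick the child maximizing the UCB1 score."""
    def score(child):
        mean = child.total_reward / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=score)

def mcts_autotune(cost_fn, budget, seed=0):
    """Evaluate `budget` configurations and return (best_config, best_cost)."""
    rng = random.Random(seed)
    root = Node(())
    best_config, best_cost = None, math.inf
    for _ in range(budget):
        node, path = root, [root]
        # Selection: descend while the node is fully expanded and not terminal.
        while len(node.prefix) < len(SPACE) and not untried_values(node):
            node = ucb_child(node)
            path.append(node)
        # Expansion: add one untried child unless the node is terminal.
        if len(node.prefix) < len(SPACE):
            value = rng.choice(untried_values(node))
            child = Node(node.prefix + (value,))
            node.children[value] = child
            node = child
            path.append(node)
        # Rollout: fill the remaining parameters uniformly at random.
        rest = tuple(rng.choice(vals) for _, vals in SPACE[len(node.prefix):])
        config = node.prefix + rest
        cost = cost_fn(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
        # Backpropagation: reward in (0, 1], higher for cheaper configs.
        reward = 1.0 / cost
        for visited in path:
            visited.visits += 1
            visited.total_reward += reward
    return best_config, best_cost

if __name__ == "__main__":
    config, cost = mcts_autotune(synthetic_cost, budget=60)
    print(dict(zip((name for name, _ in SPACE), config)), cost)
```

The budget counts cost evaluations, so the search touches only a fraction of the space; the tree statistics steer later samples toward parameter prefixes that produced cheaper configurations earlier.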
Check out Part 3 of @ditrifonov's series on building a GPU compiler from scratch:
He added autotuning via Monte Carlo tree search, moving the geomean from 0.87x to 0.96x of PyTorch eager.
32 of 84 kernels now beat PyTorch's hand-tuned code.
https://t.co/0MBdQDUHFQ
#MLcompilers @PyTorch
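A geometric mean is a common way to summarize per-kernel speedup ratios like these, because a 2x win and a 0.5x loss cancel out to 1.0x. A minimal sketch, assuming made-up speedup values rather than the 84-kernel results from the writeup; `geomean` and `count_wins` are illustrative helper names:

```python
import math

def geomean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be non-empty and strictly positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def count_wins(ratios):
    """Number of kernels strictly faster than the baseline (ratio > 1.0)."""
    return sum(1 for r in ratios if r > 1.0)

# Hypothetical speedups relative to a baseline; not measured data.
speedups = [0.5, 2.0, 1.0, 1.25, 0.8]

if __name__ == "__main__":
    print(f"geomean: {geomean(speedups):.2f}x, wins: {count_wins(speedups)}")
```

A geomean below 1.0x can coexist with many individual wins, which is why the post reports both numbers.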
@AMD Instinct #MI350X in our benchmarks:
2.6x faster FP16 matmul throughput than H200.
Memory bandwidth: 241 GB/s on default libvirt, 813 GB/s tuned. Full results in the post:
https://t.co/y9d7VYjd3B
#AMDInstinct #ROCm
If you've ever wished you could read PyTorch's compiler end to end, here's the closest thing:
Dmitry built a working ML compiler in about 8,000 lines of Python that's faster than PyTorch eager on average and up to 4.7x faster on small kernels like reductions and k/v projections.
https://t.co/mvVWdb26o5
@PyTorch #MLcompilers #PyTorch
NUMA exposure raises GPU VM memory bandwidth by 3-7x.
But on H200, cross-node NCCL collectives lost 57% of bandwidth when GPUs spanned different NUMA nodes.
A real trade-off:
https://t.co/y9d7VYjKT9
@nvidia #NCCL #NUMA #GPUcloud
288 GB HBM3e per accelerator changes the #inference deployment math.
Workloads that need 2x or 4x #H100 with tensor parallelism collapse onto a single #MI350X. Fewer failure modes, no cross-GPU latency.
https://t.co/QqAOpQcG37
@AMD #AMDinstinct
#Llama 3 70B in FP16 weighs ~140 GB. A single @AMD #MI350X (288 GB HBM3e) fits it with room for KV cache and long context.
On #H100 (80 GB), the same model requires tensor parallelism across two GPUs.
https://t.co/QqAOpQcG37
#amdinstinct
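The sizing arithmetic in this post can be checked with a short sketch. It counts weights only (parameters times bytes per parameter) and uses decimal gigabytes. The function names are illustrative, and KV cache, activations, and runtime overhead are deliberately left out, so real deployments need more headroom than this shows.

```python
import math

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for model weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Fewest GPUs whose combined memory holds the weights (weights only)."""
    return math.ceil(weights_gb / gpu_memory_gb)

# 70 billion parameters at 2 bytes each (FP16).
llama3_70b_fp16_gb = weight_memory_gb(70e9, 2)

if __name__ == "__main__":
    for name, capacity_gb in [("H100", 80), ("MI350X", 288)]:
        gpus = min_gpus_for_weights(llama3_70b_fp16_gb, capacity_gb)
        spare = gpus * capacity_gb - llama3_70b_fp16_gb
        print(f"{name}: {gpus} GPU(s), {spare:.0f} GB left after weights")
```

The memory left after weights is what remains for KV cache and context, which is where the single-accelerator case has the most room.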
Available now on CloudRift as on-demand VM rentals:
$3.65/hr for an @AMD Instinct #MI350X. 288 GB VRAM, HBM3e, 8 TB/s, no minimum commitment. No waitlist.
https://t.co/QqAOpQcG37
#AMDInstinct #LLMinference #ROCm
@ditrifonov's ML compiler, benchmarked on a full transformer block at FP32, #RTX5090.
Geomean 1.11x over @PyTorch eager and 1.20x over torch.compile. Small k/v projections reach 4.7x.
Large matmuls at seq=512 regress where register pressure dominates.
#GPU #CUDA #PyTorch #MLSys
https://t.co/mvVWdb26o5
https://t.co/KlPX0PFytN, a CloudRift AI Grant recipient, trains models that generate ligands for drug discovery.
They've since won an Ignite grant from @PavaCenter and started wet-lab work at @HopkinsMedicine to test the model's predictions.
#AIDrugDiscovery #AIforScience