CloudRift

Verified account

@CloudRiftAI

The Operating System for Sovereign AI Deployments

Mountain View, CA

Joined March 2024

39 Following

76 Followers

88 Posts

about 12 hours ago

CloudRift is the OS for sovereign AI deployments. We give data center operators and enterprises a single control plane to manage GPU fleets, launch customer workloads, and serve LLM inference, all within their own security perimeter.

We post:
• Benchmarks (matmul kernels, fp16/fp32, TMA, warp specialization)
• Availability as new GPUs come online
• Commentary on the GPU market and sovereign AI
• Engineering write-ups from @ditrifonov and team

https://t.co/R0dgYrPgom

1 day ago

Building a matmul kernel for Blackwell, we benchmarked two common bits of GPU optimization advice and found they don't survive modern ptxas:
• Manually vectorizing to LDG.128: 0% delta.
• Sinking vs. hoisting loads: same result.
Writeup: https://t.co/kQVFE0cGns #CUDA

2 days ago

We built an fp16 matmul kernel that hits 105% of cuBLAS HGEMM on the RTX 5090. cuBLAS still ships an Ampere-era kernel. No TMA, no warp specialization. @ditrifonov rebuilt it for Blackwell. The writeup walks through every pass. https://t.co/kQVFE0ded0 #CUDA #GPU

6 days ago

Most GPU VMs come configured for general workloads. Our team benchmarked what host-level tuning actually changes: memory bandwidth up to 7x on #H200, #NCCL up to +144% on PRO 6000. On the wrong config, #NUMA exposure cuts NCCL bandwidth by 57%. https://t.co/y9d7VYjKT9

CloudRiftAI retweeted

ElevenLabs @ElevenLabs

8 days ago

Introducing Dubbing v2, our revolutionary new dubbing model. For the first time, the emotion and performance of the original content is carried over into every language.

8 days ago

Close to half of planned US data center builds this year are projected to be delayed or canceled. The causes are power infrastructure and China-sourced parts, with transformer lead times now up to five years. https://t.co/oCUu82rsGb #DataCenters #AIinfrastructure

8 days ago

61% of Western European CIOs now prioritize local cloud providers over US hyperscalers. With the EU AI Act fully applicable on August 2, regional GPU capacity is shifting from a preference to a procurement requirement. https://t.co/HvNw5kOOl3 #SovereignAI #EUAIAct

CloudRiftAI retweeted

dstack @dstackai

9 days ago

Training models or serving inference on AMD GPUs? We’ve refreshed the AMD accelerator example in the dstack docs, covering on-prem fleets, cloud GPU provisioning, dev environments, training jobs, and production-grade inference. https://t.co/WffI8cKY7t

9 days ago

How do you search 24,000 matmul configurations without burning days of GPU time? @ditrifonov's autotuner samples around 207 of them in ~67 seconds with Monte Carlo tree search. Check out part 3 of the writeup: https://t.co/xyGDinQwUO @triton_lang #MLcompilers #CUDA

9 days ago

Check out Part 3 of @ditrifonov's series on building a GPU compiler from scratch: He added autotuning via Monte Carlo tree search, moving the geomean from 0.87x to 0.96x of PyTorch eager. 32 of 84 kernels now beat PyTorch's hand-tuned code. https://t.co/0MBdQDUHFQ #MLcompilers @PyTorch
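For readers unfamiliar with the metric, here is a minimal sketch of how a geomean of per-kernel speedups is typically computed. The speedup values below are hypothetical and are not the series' data, and the write-up's exact benchmarking method is not shown in this post.

```python
import math

def geomean(ratios):
    """Geometric mean of positive speedup ratios (1.0 = parity with the baseline)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-kernel speedups relative to a baseline (illustration only).
speedups = [0.5, 2.0, 1.0, 1.0]

gm = geomean(speedups)                   # a 2x win and a 2x loss cancel out
am = sum(speedups) / len(speedups)       # arithmetic mean, for comparison
wins = sum(r > 1.0 for r in speedups)    # kernels that beat the baseline

print(f"geomean={gm:.3f} arithmetic={am:.3f} wins={wins}/{len(speedups)}")
```

The geometric mean is the usual choice for ratios because a 2x speedup and a 2x slowdown cancel to parity, whereas the arithmetic mean of the same data would suggest a net gain.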

13 days ago

@AMD Instinct #MI350X in our benchmarks: 2.6x faster FP16 matmul throughput than H200. Memory bandwidth: 241 GB/s on default libvirt, 813 GB/s tuned. Full results in the post: https://t.co/y9d7VYjd3B #AMDInstinct #ROCm

14 days ago

If you've ever wished you could read PyTorch's compiler end to end, here's the closest thing: Dmitry built a working ML compiler in about 8,000 lines of Python that's faster than PyTorch eager on average and up to 4.7x faster on small kernels like reductions and k/v projections. https://t.co/mvVWdb26o5 @PyTorch #MLcompilers #PyTorch

14 days ago

NUMA exposure raises GPU VM memory bandwidth by 3-7x. But on H200, cross-node NCCL collectives lost 57% of bandwidth when GPUs spanned different NUMA nodes. A real trade-off: https://t.co/y9d7VYjKT9 @nvidia #NCCL #NUMA #GPUcloud

18 days ago

288 GB HBM3e per accelerator changes the #inference deployment math. Workloads that need 2x or 4x #H100 with tensor parallelism collapse onto a single #MI350X. Fewer failure modes, no cross-GPU latency. https://t.co/QqAOpQcG37 @AMD #AMDinstinct

20 days ago

#Llama 3 70B in FP16 weighs ~140 GB. A single @AMD #MI350X (288 GB HBM3e) fits it with room for KV cache and long context. On #H100 (80 GB), the same model requires tensor parallelism across two GPUs. https://t.co/QqAOpQcG37 #amdinstinct
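The sizing arithmetic behind this post can be sketched as follows. This is a rough estimate under stated assumptions: it counts model weights only (KV cache and activations need additional memory), uses decimal gigabytes, and the helper names `weights_gb` and `min_gpus` are hypothetical.

```python
import math

def weights_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def min_gpus(model_gb: float, gpu_gb: float) -> int:
    """Fewest GPUs whose combined memory can hold the weights."""
    return math.ceil(model_gb / gpu_gb)

llama_gb = weights_gb(70, 2)             # FP16 stores 2 bytes per parameter

print(llama_gb)                          # 140.0
print(min_gpus(llama_gb, 80))            # 2 (80 GB per H100)
print(min_gpus(llama_gb, 288))           # 1 (288 GB per MI350X)
print(288 - llama_gb)                    # 148.0 GB left on one MI350X
```

The headroom figure is what remains for KV cache and long context on a single 288 GB accelerator after the weights are loaded.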

21 days ago

Available now on CloudRift as on-demand VM rentals: $3.65/hr for an @AMD Instinct #MI350X. 288 GB VRAM, HBM3e, 8 TB/s, no minimum commitment. No waitlist. https://t.co/QqAOpQcG37 #AMDInstinct #LLMinference #ROCm

22 days ago

@ditrifonov's ML compiler, benchmarked on a full transformer block at FP32, #RTX5090. Geomean 1.11x over @PyTorch eager and 1.20x over torch.compile. Small k/v projections reach 4.7x. Large matmuls at seq=512 regress where register pressure dominates. #GPU #CUDA #PyTorch #MLSys https://t.co/mvVWdb26o5

22 days ago

https://t.co/KlPX0PFytN, a CloudRift AI Grant recipient, trains models that generate ligands for drug discovery. They've since won an Ignite grant from @PavaCenter and started wet-lab work at @HopkinsMedicine to test the model's predictions. #AIDrugDiscovery #AIforScience
