Nerds taming the green dragon with SCALE, our framework for compiling CUDA codebases for AMD GPUs, with support for more accelerated platforms coming soon.
Cross-architecture from a single codebase is exactly why we built SCALE. Thrilled to see @AtlasInference getting this running! More performance optimizations for both @AMD and @nvidia are on the way.
https://t.co/hQpeCzwg2B
Atlas Inference is running Qwen3.6-27B on AMD Strix Halo ๐ฅณ
Using @SpectralCom's SCALE ROCm backend, our CUDA kernels compile and run on RDNAโ๏ธ
Cross-architecture inference from ONE codebase ๐ฃ๏ธ
Thank you @AIatAMD for the gift ๐
POC โ excited to keep tuning performanceโก๏ธ
@AtlasInference@AIatAMD This is just the beginning! Huge shoutout to the Atlas team for this POC. We're working hard to unlock even more performance on both @AMD and @nvidia. Let's keep pushing.
https://t.co/hQpeCzwg2B
๐ฆ๐ฝ๐ฒ๐ฐ๐๐ฟ๐ฎ๐น ๐๐ผ๐บ๐ฝ๐๐๐ฒ ๐ถ๐ ๐ป๐ผ๐ ๐ฝ๐ฎ๐ฟ๐ ๐ผ๐ณ ๐๐ต๐ฒ ๐ก๐ฉ๐๐๐๐ ๐๐ป๐ฐ๐ฒ๐ฝ๐๐ถ๐ผ๐ป ๐ฝ๐ฟ๐ผ๐ด๐ฟ๐ฎ๐บ. ๐ฉ
Inception is @nvidia's program for AI startups - a membership that gives access to technical resources, preferred pricing on NVIDIA hardware and software, and exposure to a global network of investors and partners.
CUDA is the de-facto standard for AI developers, and weโre honored to play our part in growing the ecosystem.
And on NVIDIA's B300 (CUDA 13), SCALE lands within a whisker of nvcc on its home turf: ๐ผ๐ป ๐ฝ๐ฎ๐ฟ ๐ผ๐ป ๐ฎ๐๐ฒ๐ฟ๐ฎ๐ด๐ฒ, ๐๐ฝ ๐๐ผ ๐ต% ๐ณ๐ฎ๐๐๐ฒ๐ฟ on individual workloads.
That's native CUDA tooling, matched and occasionally beaten, by a third-party compiler.
Everyone says CUDA can't target TPUs.
What they mean is nobody has written the compiler that raises CUDA code to something a systolic backend can consume.
Those are very different sentences.
Full post โ Part 3 of why @SpectralCom exists: https://t.co/IAhMNaMqD4
@ChrisKitching17, our CTO and co-founder of Spectral Compute, recently presented at the ๐ก๐๐ฅ ๐ฃ๐ฒ๐ฟ๐ณ๐๐ฎ๐ฏ ๐ฆ๐ฒ๐บ๐ถ๐ป๐ฎ๐ฟ hosted by NHR@FAU. The recording is now online: https://t.co/9rD0gDVLla
The Q&A at the end is worth the watch on its own.
Attendees asked about:
โณ Whether optimized deep learning kernels flow through the same pipeline
โณ Support for Intel GPUs and emerging hardware
โณ How SCALE compares to NVIDIA's own compiler when targeting NVIDIA
โณ Adding custom accelerators (RISC-V came up) and emulating missing functionality
โณ Code-level transformations to improve occupancy on AMD
โณ Plans for OpenACC, CUDA Fortran, and OpenMP
๐จ๐ฝ ๐๐ผ ๐ฎ๐ฑ.๐ณร ๐ณ๐ฎ๐๐๐ฒ๐ฟ. Unmodified CUDA. AMD silicon.
Thanks to the @tensorwave team for benchmarking SCALE on MI355X and publishing the numbers.
Port to AMD used to mean a rewrite. Now it means a recompile.
Spectral Compute (@SpectralCom) used TensorWaveโs AMD-native infrastructure to benchmark CUDA portability and performance on @AMD Instinctโข MI355X GPUs.
See how they did it - https://t.co/CIGuIPGTEy
Our compiler toolchain is free for research and evaluation. If you maintain a CUDA project and want to finally close that thread, come say hi: https://t.co/sWKIICCvaj
Sort the issues of almost any popular open-source CUDA project by most commented, and you'll inevitably find the exact same unresolved thread: 'Any chance of AMD support?'"
For most maintainers, that thread stays open indefinitely. Porting a complex CUDA codebase is a part-time job, and volunteer contributors rarely have one to spare.
๐ฆ๐๐๐๐ ๐๐๐ฟ๐ป๐ ๐๐ต๐ฎ๐ ๐ถ๐๐๐๐ฒ ๐ถ๐ป๐๐ผ ๐ฎ ๐ฟ๐ฒ๐ฐ๐ผ๐บ๐ฝ๐ถ๐น๐ฒ.
๐ฆ๐๐๐๐ ๐ญ.๐ณ ๐ถ๐ ๐ผ๐๐.
Multi-architecture AMD builds. Kokkos support. Managed memory over 100ร faster on HPC workloads. More PyTorch coverage. OpenGL interop. FFmpeg on AMD.
A lot in one release - full details in the blog post:
https://t.co/WYkVniRPWg
Obvious take: if agents can write native code for any GPU, who needs a portable CUDA toolchain? Just point the model at each GPU. I think that's exactly backwards โ and wrote up why. https://t.co/wzODqYZFqE
@AMD@SpectralCom we're looking forward to using SCALE for the CUDAโROCm path :)
We want to bring this same concept of having hyper-optimized kernels built for a certain model architecture and its corresponding quantization, adapted per chip.
That kind of structured feedback is what agents need to reason about a codebase reliably. Modern CUDA tooling is a prerequisite for agentic workflows in CUDA development.
โข ๐๐ฒ๐ฎ๐ฟ๐ป ๐บ๐ผ๐ฟ๐ฒ ๐ฎ๐ฏ๐ผ๐๐ ๐ฆ๐๐๐๐: https://t.co/ACBNwFnYSw
โข ๐๐ผ๐ถ๐ป ๐ผ๐๐ฟ ๐๐ถ๐๐ฐ๐ผ๐ฟ๐ฑ ๐ฐ๐ผ๐บ๐บ๐๐ป๐ถ๐๐: https://t.co/dfRZHTUhcD
If your team is integrating AI coding agents into your CUDA development workflow this year, those agents need tooling that gives them something useful to read.