π open sourced metalBLAS, hand-tuned Metal matmul kernels for Apple Silicon, callable from PyTorch on mps.
Matches/beats MPS Graph (torch) matmuls on bf16/fp16, 2-3x faster on fp32 (TF32-relaxed) across the bench suite on M5 Pro.
Next step is to upstream this to PyTorch!
https://t.co/EMGdZaagXP
At first this looks confusing but once you see the recognizable functions it gets easier.
For example you can see a big block of scaled_dot_product_attention.
What's interesting is what happens after that, are there any GPU<->CPU syncs or not and which kernels take the most time, idea is to find the ops that you recognize from your model's forward and then reading this becomes easy
@baggiponte Agree on it being second-class citizen. AFAIK there is no official roadmap, but it should become much better/faster in next 2 releases (2.13/2.14)
Shipped specialized SDPA kernels for PyTorch MPS, up to 16x faster than the previous MPSGraph path π
Metal kernels for both decode (q_len=1) and prefill (long causal)
- Decode, 16k ctx, D=128: **1.42 β 0.087 ms (16.3x)
- Prefill, 4k seq, D=96: **99.6 β 18.8 ms (5.3x)
This marks the end of my first week at @huggingface! I'm joining as a founding engineer on HF's PyTorch team.
My first project: safetensors on Mac is up to 3x fasterπ
Parallel reads straight into MPS unified memory, no CPU staging.
MB Pro M5 Pro
- Cold 16 GB: **2.97 β 8.23 GB/s** (2.8Γ)
- Warm 3 GB: **10.3 β 26.6 GB/s** (2.6Γ)
@mohitwt_ I think fusing means not having an extra kernel launch. Doing:
GEMM -> inplace op -> GEMM
isn't fusion. GEMM still writes the output back to global memory, inplace still needs to read each element from global memory and write it back