Isalia20 @is36e - Twitter Profile

Isalia20

@Is36E

2 days ago

Flex attention is now available on @PyTorch MPS nightlies! https://t.co/hKvqfFmhg5


Isalia20

@Is36E

6 days ago

🚀 open sourced metalBLAS, hand-tuned Metal matmul kernels for Apple Silicon, callable from PyTorch on mps.

Matches/beats MPS Graph (torch) matmuls on bf16/fp16, 2-3x faster on fp32 (TF32-relaxed) across the bench suite on M5 Pro.

Next step is to upstream this to PyTorch!
https://t.co/EMGdZaagXP


Isalia20

@Is36E

26 days ago

At first this looks confusing, but once you spot the recognizable functions it gets easier. For example, you can see a big block of scaled_dot_product_attention. What's interesting is what happens after that: are there any GPU<->CPU syncs, and which kernels take the most time? The idea is to find the ops you recognize from your model's forward pass; after that, reading this becomes easy.
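A minimal sketch of the "which kernels take the most time" step, assuming the tweet refers to a profiler trace in the Chrome trace-event JSON format (which `torch.profiler` can export). The helper name `top_ops` and the sample events are made up for illustration; they are not data from the tweet.

```python
from collections import defaultdict

def top_ops(trace_events, k=3):
    """Sum durations (microseconds) per event name and return the k largest."""
    totals = defaultdict(float)
    for ev in trace_events:
        # "X" marks a complete event with a duration in the trace-event format.
        if ev.get("ph") == "X":
            totals[ev["name"]] += ev.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Made-up events for illustration only; not measurements from the tweet.
sample_events = [
    {"name": "scaled_dot_product_attention", "ph": "X", "dur": 900},
    {"name": "linear", "ph": "X", "dur": 400},
    {"name": "scaled_dot_product_attention", "ph": "X", "dur": 850},
    {"name": "item", "ph": "X", "dur": 1200},  # hypothetical sync point
    {"name": "process_name", "ph": "M"},       # metadata event, ignored
]

for name, total_us in top_ops(sample_events):
    print(f"{name}: {total_us:.0f} us")
```

Sorting by total time puts the familiar ops at the top, which is one way to find a starting point before looking at what surrounds them in the timeline.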


Isalia20

@Is36E

27 days ago

@baggiponte Agree on it being a second-class citizen. AFAIK there is no official roadmap, but it should become much better/faster in the next 2 releases (2.13/2.14)


Isalia20

@Is36E

27 days ago

Shipped specialized SDPA kernels for PyTorch MPS, up to 16x faster than the previous MPSGraph path 🚀

Metal kernels for both decode (q_len=1) and prefill (long causal)

- Decode, 16k ctx, D=128: **1.42 → 0.087 ms** (16.3x)
- Prefill, 4k seq, D=96: **99.6 → 18.8 ms** (5.3x)
https://t.co/Gkmch8q2xY
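For readers unfamiliar with the decode case, here is a plain-Python reference of scaled dot-product attention for a single query vector (q_len=1) over a cached context. It only shows the math that such a kernel computes; it is not the Metal kernel from the tweet, and the function name `decode_attention` is an assumption made for this sketch.

```python
import math

def decode_attention(q, keys, values):
    """Attention for one query vector (q_len=1) over a cached context.

    q: list of D floats; keys, values: lists of context vectors of D floats.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    weights = [e / denom for e in exps]
    d = len(values[0])
    # Weighted sum of the value vectors, one output element per dimension.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = decode_attention([0.0, 0.0], keys, values)
print(out)  # a zero query weights both context entries equally: [5.0, 5.0]
```

With q_len=1 the work is one pass over the keys and one over the values, so it scales with context length rather than with its square, which is why decode and prefill are usually treated as separate cases.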


Isalia20

@Is36E

about 1 month ago

@0xkeenz yes, please let me know if any op is slower than CPU on MPS and I'll look into it


Isalia20

@Is36E

about 1 month ago

This marks the end of my first week at @huggingface! I'm joining as a founding engineer on HF's PyTorch team.

My first project: safetensors on Mac is up to 3x faster🚀

Parallel reads straight into MPS unified memory, no CPU staging.

MB Pro M5 Pro
- Cold 16 GB: **2.97 → 8.23 GB/s** (2.8×)
- Warm 3 GB: **10.3 → 26.6 GB/s** (2.6×)
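The general idea of parallel reads into one destination buffer can be sketched with the standard library alone. This is an illustration under stated assumptions, not the safetensors implementation: a `bytearray` stands in for the destination memory, and the function name `parallel_read` and the worker count are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def parallel_read(path, num_workers=4):
    """Read a file into one preallocated buffer using several threads."""
    size = os.path.getsize(path)
    buf = bytearray(size)  # stand-in for the destination memory
    view = memoryview(buf)
    chunk = max(1, -(-size // num_workers))  # ceiling division

    def read_chunk(start):
        end = min(start + chunk, size)
        # Unbuffered file object so readinto writes directly into the slice.
        with open(path, "rb", buffering=0) as f:
            f.seek(start)
            pos = start
            while pos < end:
                n = f.readinto(view[pos:end])
                if not n:
                    break
                pos += n

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(read_chunk, range(0, size, chunk)))
    return buf

# Round-trip check on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    data = bytes(range(256)) * 64
    tmp.write(data)
result = parallel_read(tmp.name)
os.remove(tmp.name)
print(bytes(result) == data)
```

Each worker writes into its own slice of the final buffer, so no intermediate copy is assembled afterwards; that is the part of the design the tweet's "no CPU staging" remark points at.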


Isalia20

@Is36E

about 1 month ago

@francoisfleuret Also worth mentioning that getting good GPU utilization/power usage with pipeline parallelism is quite tricky


Isalia20

@Is36E

about 2 months ago

Pesky bug killing performance on @PyTorch MPS. Can you spot it?


Isalia20

@Is36E

about 2 months ago

@josephjojoe @awnihannun Did you torch.compile the MPS one?


Isalia20

@Is36E

2 months ago

@mohitwt_ I think fusing means not having an extra kernel launch. Doing GEMM -> in-place op -> GEMM isn't fusion. The GEMM still writes its output back to global memory, and the in-place op still needs to read each element from global memory and write it back
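The memory-traffic argument can be made concrete with a rough count. This is a simplified model that ignores caches and the second GEMM's own reads; the function name and the tensor shape are assumptions chosen for illustration.

```python
def elementwise_traffic_bytes(m, n, bytes_per_elem, fused):
    """Global-memory bytes moved for the intermediate M x N tensor.

    Unfused: the GEMM writes it, then the in-place op reads and rewrites it.
    Fused: the op is applied before the GEMM stores its result, so one write.
    """
    tensor_bytes = m * n * bytes_per_elem
    if fused:
        return tensor_bytes      # single write from the GEMM epilogue
    return 3 * tensor_bytes      # GEMM write + op read + op write

# Hypothetical 4096 x 4096 intermediate in a 2-byte dtype (fp16/bf16).
unfused = elementwise_traffic_bytes(4096, 4096, 2, fused=False)
fused = elementwise_traffic_bytes(4096, 4096, 2, fused=True)
print(unfused // fused)  # 3
```

Under this model the in-place op saves an allocation but not the round trip through global memory, which matches the distinction drawn in the reply.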
