Had the pleasure of spending some time with @reinerpope and @ysmulki to film this interview! With help from @jtvhk and @usercondition. Thank you @dankuntz for lending your living room and of course Mick!
I chatted with @ysmulki about MatX, chip design and where silicon designed for LLMs is headed
(8:17) Tightly coupling SRAM and HBM on one chip
(14:03) More MoE FLOPS, smaller KV cache load
(16:08) Numerics: from 32-bit to 4-bit
(19:02) Targeting both training and inference
(22:14) Chip timelines
(27:15) Logic and memory scarcity
(29:42) Compute costs
(32:07) Latency: from 20ms to 1ms as the new table stakes
(40:50) Programming the chip
(43:00) Starting MatX
(47:11) Codesign without seeing the models
(51:57) Interconnect design
(55:44) Performance modeling philosophy
(1:07:02) Prefill vs. decode
(1:13:47) What's next
.@dankuntz just stopped by Gizmodo to show us the Starboy
Fun lil toy if you like fun lil toys
You can flip the finger to it and the Starboy's eyes will get angry lol