56,000+ tokens/sec at just 80 MHz. 🤯
I burned a full Transformer with KV cache into a custom chip. Designed gate by gate as a 100% digital integrated circuit. Prototyped on an FPGA. (No GPU. No CPU.)
Just pure digital silicon running @karpathy microGPT, spelling out names on a tiny LCD.
This is GateGPT 👇
Very cool project, and seriously, try those boards! 🙌 It's been a thrill to take my 15-year-old Virtex-5 and build this. @taalas_inc went further (taped out on TSMC 6nm), but the point is you can prototype the same idea with tools you already have. Their "single-transistor multiplier" is almost certainly Multiply-Select-Add: products baked as Mask ROM, one NMOS pass-gate to select. On an FPGA that's just a single LUT.
My inspiration came a few days ago at GTC Taipei, this exact moment 👇 Jensen talking about agents for chip design. That's when I went and dusted off the old board.
https://t.co/VhjOGkodM8
Thank you, truly. The spark came 12 days ago from this #NVIDIAGTC talk on agents accelerating chip design. I dusted off my old Virtex-5 and, with Claude + the digital-design principles I teach, I had to build it. Inspired by @taalas_inc too. Next: OpenLane → ASIC. Beyond CPUs, GPUs, even TPUs.
@taalas_inc went further and taped it out on TSMC 6nm. The point here is showing you can prototype the same core idea with tools you already have. And their "single-transistor multiplier"? Almost certainly Multiply-Select-Add: burn all possible products into silicon as Mask ROM, use one NMOS pass-gate to select the right one. On an FPGA you get the same thing with a single LUT. Same idea, different substrate.
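To make the Multiply-Select-Add guess concrete, here is a minimal Python sketch of the idea: every product of a fixed weight is precomputed once, and the "multiply" at run time is only a table select followed by an add. The 4-bit unsigned activation width, the example weights, and the names bake_products and msa_dot are all assumptions for illustration; this is not the actual Taalas or GateGPT circuit.

```python
# Sketch of Multiply-Select-Add: products are baked ahead of time,
# so the run-time "multiply" is just a lookup (select) plus an add.
# Assumption: activations are unsigned 4-bit values (0..15).

def bake_products(weight: int, act_bits: int = 4) -> tuple:
    """Precompute weight * x for every possible activation value x."""
    return tuple(weight * x for x in range(1 << act_bits))

def msa_dot(tables, activations) -> int:
    """Dot product where each multiply is replaced by a table select."""
    acc = 0
    for table, x in zip(tables, activations):
        acc += table[x]  # select the pre-baked product, then accumulate
    return acc

# Hypothetical fixed weights; in hardware these would be frozen in ROM or LUT contents.
weights = [3, -5, 7, 2]
tables = [bake_products(w) for w in weights]

acts = [1, 15, 0, 9]
result = msa_dot(tables, acts)

# The table-based result matches an ordinary multiply-accumulate.
assert result == sum(w * x for w, x in zip(weights, acts))
```

The table grows as 2^act_bits entries per weight, which is why this trick only pays off for narrow activations and weights that never change.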
No GPU, no CPU - a full Transformer with KV cache as RTL on a Virtex-5 FPGA. microGPT at ~56k tokens/s, fully open-source. Thought you'd appreciate this, @reach_vb 🙂
How it hits 56,000+ tok/s
No monolithic FSM.
A microcode ROM sequences modular datapath actuators - matvec, attention, RMSNorm, exp, sampler - over one true dual-port scratchpad that also holds the persistent KV cache.
1 block · 24-dim · 4 heads · ctx 16 · Q5.11, bit-exact to Python.
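For readers new to Q5.11, a rough Python sketch of what a fixed-point reference model can look like. It assumes Q5.11 means a signed 16-bit value with 11 fractional bits, with round-to-nearest on conversion, truncating shifts after multiplies, and saturation on overflow. Those conventions and the helper names (to_q, q_mul, q_matvec) are assumptions, not taken from the GateGPT source.

```python
# Sketch of Q5.11 fixed-point arithmetic in plain Python integers.
# Assumptions: signed 16-bit container, 11 fractional bits, saturating overflow.

FRAC_BITS = 11
SCALE = 1 << FRAC_BITS                       # 2048 steps per unit
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1     # assumed 16-bit signed range

def saturate(v: int) -> int:
    """Clamp to the assumed 16-bit signed range."""
    return max(Q_MIN, min(Q_MAX, v))

def to_q(x: float) -> int:
    """Convert a float to Q5.11 (round to nearest, then saturate)."""
    return saturate(round(x * SCALE))

def from_q(q: int) -> float:
    """Convert Q5.11 back to a float."""
    return q / SCALE

def q_mul(a: int, b: int) -> int:
    """Multiply two Q5.11 values; >> is an arithmetic shift in Python."""
    return saturate((a * b) >> FRAC_BITS)

def q_matvec(W, x):
    """Matrix-vector product: accumulate at full width, shift once per row."""
    return [saturate(sum(w * xi for w, xi in zip(row, x)) >> FRAC_BITS)
            for row in W]

y = q_matvec([[to_q(0.5), to_q(0.25)]], [to_q(2.0), to_q(4.0)])
```

Because everything is integer arithmetic, a model like this can be compared bit for bit against RTL simulation output, which is what makes a "bit-exact to Python" claim checkable.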
The whole Transformer fits in 23% of the LUTs and a single Block RAM (activations + KV cache live there).
But it uses 62 of the 64 DSPs (96%). The multipliers are the wall. This is the actual place-and-route on the Virtex-5.
@anemll The work in (https://t.co/AmndylYROv) is interesting, but do you think the CoreML framework still provides better benefits? In my case, CPU-only performs better than CPU+ANE, so it seems Apple AMX is good enough, and moving data to IOSurface introduces a high overhead.
Wow, excellent! It would be great if we could define a public repo with that skill, so we can contribute to bringing more models to MLX. Last year I converted this one: https://t.co/fEVbdphALQ
fortunately it was straightforward, but I understand that sometimes more elaborate handling is required.
Running VibeThinker-1.5B on iPhone.
~1.5GB RAM usage, reasoning behavior comparable to GPT-OSS-20B.
This is where edge AI is heading.
https://t.co/P8zqml0iJR