Fabio Guzman

Verified account

@FGuzmanAI

On-device ML Engineer | 🤖Passionate about reverse-engineering neural nets | 🚀Optimizing large models for the edge 💻📱

Colombia

Joined May 2011

545 Following

1.1K Followers

341 Posts

Pinned Tweet

about 21 hours ago

56,000+ tokens/sec at just 80 MHz. 🤯 I burned a full Transformer with KV cache into a custom chip. Designed gate by gate as a 100% digital integrated circuit. Prototyped on an FPGA. (No GPU. No CPU.) Just pure digital silicon running @karpathy microGPT, spelling out names on a tiny LCD. This is GateGPT 👇

98

2K

232

1K

161K
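
A quick back-of-the-envelope check on the two headline numbers: at 80 MHz and 56,000 tokens/s, each token gets roughly 1,400 clock cycles. The sketch below only divides the figures quoted in the tweet; the function name is illustrative and not part of GateGPT.

```python
def cycles_per_token(clock_hz: float, tokens_per_sec: float) -> float:
    """Average number of clock cycles available per generated token."""
    return clock_hz / tokens_per_sec

# Figures quoted in the pinned tweet: 80 MHz clock, 56,000 tokens/s.
budget = cycles_per_token(80e6, 56_000)
print(f"{budget:.0f} cycles per token")  # 1429 cycles per token
```

Since the tweet says "56,000+", this is an upper bound on the cycle budget.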

about 11 hours ago

Very cool project, and seriously, try those boards! 🙌 It's been a thrill to take my 15-year-old Virtex-5 and build this. @taalas_inc went further (taped out on TSMC 6nm), but the point is you can prototype the same idea with tools you already have. Their "single-transistor multiplier" is almost certainly Multiply-Select-Add: products baked as Mask ROM, one NMOS pass-gate to select. On an FPGA that's just a single LUT. My inspiration came a few days ago at GTC Taipei, this exact moment 👇 Jensen talking about agents for chip design. That's when I went and dusted off the old board.

0

2

0

0

42

about 11 hours ago

https://t.co/VhjOGkodM8 Thank you, truly. The spark came 12 days ago from this #NVIDIAGTC talk on agents accelerating chip design. I dusted off my old Virtex-5 and, with Claude + the digital-design principles I teach, I had to build it. Inspired by @taalas_inc too. Next: OpenLane → ASIC. Beyond CPUs, GPUs, even TPUs.

1

3

0

1

307

about 11 hours ago

@taalas_inc went further and taped it out on TSMC 6nm. The point here is showing you can prototype the same core idea with tools you already have. And their "single-transistor multiplier"? Almost certainly Multiply-Select-Add: burn all possible products into silicon as Mask ROM, use one NMOS pass-gate to select the right one. On an FPGA you get the same thing with a single LUT. Same idea, different substrate.

0

37

2

9

2K
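
A minimal Python sketch of the Multiply-Select-Add idea described in the tweet above, assuming 4-bit unsigned activations and small fixed integer weights (both are illustrative choices, not taken from the Taalas design or from GateGPT). For each constant weight, every possible product is computed once, which plays the role of the Mask ROM or LUT. At run time the activation only selects an entry, and the selected entries are added.

```python
ACT_BITS = 4  # assumed activation width; illustrative only

def build_product_table(weight: int, act_bits: int = ACT_BITS) -> list:
    """Precompute weight * a for every possible activation a (the 'ROM')."""
    return [weight * a for a in range(1 << act_bits)]

def msa_dot(tables: list, activations: list) -> int:
    """Dot product with no run-time multiply: select one entry per table, then add."""
    return sum(table[a] for table, a in zip(tables, activations))

weights = [3, -5, 7, 2]                      # hypothetical fixed weights
tables = [build_product_table(w) for w in weights]

acts = [1, 15, 0, 9]                         # 4-bit activations
result = msa_dot(tables, acts)
reference = sum(w * a for w, a in zip(weights, acts))
print(result, reference)                     # -54 -54
```

Each table costs 2^ACT_BITS entries per weight, so the trick suits narrow activations and weights that never change after fabrication or synthesis.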

about 20 hours ago

No GPU, no CPU - a full Transformer with KV cache as RTL on a Virtex-5 FPGA. microGPT at ~56k tokens/s, fully open-source. Thought you'd appreciate this, @reach_vb 🙂

1

85

9

41

5K

about 20 hours ago

Built a full Transformer with KV cache in RTL, prototyped on a Virtex-5 FPGA: microGPT at ~56k tokens/s. Thought this might interest you, @pcuenq.

0

33

2

4

2K

about 21 hours ago

How it hits 56,000+ tok/s:

No monolithic FSM. A microcode ROM sequences modular datapath actuators - matvec, attention, RMSNorm, exp, sampler - over one true dual-port scratchpad that also holds the persistent KV cache.

1 block · 24-dim · 4 heads · ctx 16 · Q5.11, bit-exact to Python.

7

60

1

18

6K
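
Q5.11 presumably denotes a 16-bit signed fixed-point word with 11 fractional bits. The sketch below assumes that layout, floor rounding on multiply, and saturation on overflow. The tweet does not state the rounding or overflow rules, so those choices and all helper names are assumptions rather than GateGPT's actual fixed-point spec.

```python
FRAC_BITS = 11                               # fractional bits in Q5.11
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1     # assumed 16-bit signed range

def saturate(x: int) -> int:
    """Clamp an integer to the 16-bit signed range."""
    return max(Q_MIN, min(Q_MAX, x))

def to_q(value: float) -> int:
    """Quantize a float to a Q5.11 integer (round to nearest, then saturate)."""
    return saturate(round(value * (1 << FRAC_BITS)))

def from_q(q: int) -> float:
    """Convert a Q5.11 integer back to a float."""
    return q / (1 << FRAC_BITS)

def q_mul(a: int, b: int) -> int:
    """Multiply two Q5.11 values; >> on Python ints is an arithmetic (floor) shift."""
    return saturate((a * b) >> FRAC_BITS)

x, w = to_q(1.5), to_q(-0.25)
print(from_q(q_mul(x, w)))                   # -0.375
```

Because every operation is plain integer arithmetic, a Python model in this style can be compared word for word against RTL simulation output, which is what "bit-exact" implies.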

about 21 hours ago

The whole Transformer fits in 23% of the LUTs and a single Block RAM (activations + KV cache live there).

But it pins the DSPs: 62 of 64 used (96%). The multipliers are the wall. This is the actual place-and-route on the Virtex-5: https://t.co/JRxaGo5hwt

3

75

3

17

7K
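
As a rough plausibility check on the single-Block-RAM claim, the KV cache for the configuration quoted in the thread (1 block, 24-dim, context 16) is small. The sketch assumes one key vector and one value vector of 24 words per position, 16-bit words for Q5.11, and a 36 Kb block RAM used as 2048 × 16-bit data words. GateGPT's actual memory map may differ.

```python
LAYERS, DIM, CTX = 1, 24, 16    # from the thread
WORD_BITS = 16                  # assumed word size for Q5.11
BRAM_WORDS = 2048               # assumed: 36 Kb block RAM as 2048 x 16-bit data words

kv_words = 2 * LAYERS * CTX * DIM        # keys + values for every context position
kv_bits = kv_words * WORD_BITS
free_words = BRAM_WORDS - kv_words       # left over for activations and scratch

print(kv_words, kv_bits, free_words)     # 768 12288 1280
```

Under these assumptions the cache takes well under half of one block RAM, which is consistent with activations sharing the same scratchpad.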

about 21 hours ago

Code (RTL, fixed-point spec, microcode ISA, weights): https://t.co/Mra3z3Qs2l

1

3

0

4

203

2 days ago

@rudrank @rxwei @LouisDhauwe This is awesome 🤩

0

1

0

0

61

3 months ago

@anemll 🥳 We’re looking forward to the NE benchmarks and its energy efficiency

0

1

0

0

47

3 months ago

@anemll Fantastic 🤩

1

2

0

0

297

3 months ago

@anemll The work in (https://t.co/AmndylYROv) is interesting, but do you think the CoreML framework still provides better benefits? In my case, CPU-only performs better than CPU+ANE, so it seems Apple AMX is good enough, and moving data to IOSurface introduces a high overhead.

3 months ago

I moved all the logic to the CPU, and the performance improved by almost 2× (42 tok/s), running on 100% CPU and 0% ANE.

0

4

1

0

386

2

0

0

3

327

3 months ago

@sach1n @DamiDina @maderix Cool, which model are you training?

0

0

0

0

1K

4 months ago

Wow, excellent! It would be great if we could set up a public repo with that skill, so we can contribute to bringing more models to MLX. Last year I converted this one: https://t.co/fEVbdphALQ. Fortunately it was straightforward, but I understand that sometimes more elaborate handling is required.

7 months ago

Running VibeThinker-1.5B on iPhone. ~1.5GB RAM usage, reasoning behavior comparable to GPT-OSS-20B. This is where edge AI is heading. https://t.co/P8zqml0iJR

5

165

15

98

10K

0

2

0

0

156

4 months ago

@elliotarledge Is the diffusion model single-step?

0

0

0

0

465

4 months ago

@Prince_Canuma Great, Prince - which URL hosts the MLX weights?

1

0

0

0

45
