Building a compiler + HLS framework to turn @__tinygrad__ kernels into VHDL, and synthesize the perfect FPGA for a given compute graph.
Tinygrad UOps -> KernelIR (my custom IR) -> Amaranth hardware modules
@predict_addict @andrewgwils Because I’m a CS major and consciously chose not to do math and physics, knowing it would probably have been the better choice. Hindsight is 20/20
To settle the "buy GPUs vs rent" debate for side projects: once you buy that RTX PRO 6000, you will procrastinate and let it idle. If you're renting it for $2/hr, you will be more productive working on your side project than you will at your day job.
@datavorous_ An agent orchestration harness with a CEO, engineers, and validator agents enabling devs to push 150kloc/day? Seems like the right next step if this kid wants to stop shipping weekend side projects and make a real impact on the world!
Turns out adding 0 helps :)
Today we’re introducing Ternary Bonsai 🌳, a family of end-to-end 1.58-bit language models in 8B, 4B, and 1.7B sizes.
Ternary Bonsai 8B is within 5% of Qwen 3 8B at 9x lower memory.
Still tiny. Noticeably smarter
The european mind (me) cannot comprehend that you would see an $18 flight ticket and your first instinct is buy all of them for $3400
Well played, well played
Renting H100s from runpod to write tinygrad bounties like a medieval peasant paying a tithe to his feudal lord for a meager plot of compute. I toil day and night, hoping my bounty harvest is enough to win the respect of the king and avoid starvation
It's nice that we could get Bonsai-family support so quickly, but this is a bit disingenuous. I have never contributed to tinygrad, so I am not in a position to critique this. However, this implementation unpacks the 1-bit weights to float16 and runs the computations in float16 instead of running custom kernels on the packed weights, which nullifies a lot of the benefits of the Bonsai architecture.
Q1_0 is a packed 1-bit format: for each block of 128 weights, you store 16 bytes of bits and 2 bytes for a shared fp16 scale.
So 128 weights take 18 bytes total. If you unpack those same 128 weights into float16, that becomes 256 bytes (about 14x larger).
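To make the byte counts above concrete, here is a minimal sketch of one such block in plain Python. Only the layout (16 bytes of bits plus a 2-byte fp16 scale per 128 weights) comes from the description above. The bit order, the sign mapping (1 means +scale, 0 means -scale), the scale choice (mean absolute value), and the names `pack_block` and `unpack_block` are all assumptions for illustration, not the actual Q1_0 spec.

```python
import struct

BLOCK = 128  # weights per block, per the description above

def pack_block(weights):
    """Pack 128 floats into 16 bytes of bits plus a 2-byte fp16 scale."""
    assert len(weights) == BLOCK
    # Assumption: the shared scale is the mean absolute weight.
    scale = sum(abs(w) for w in weights) / BLOCK
    bits = bytearray(BLOCK // 8)
    for i, w in enumerate(weights):
        if w >= 0:  # assumption: bit 1 encodes +scale, bit 0 encodes -scale
            bits[i // 8] |= 1 << (i % 8)
    return bytes(bits) + struct.pack("<e", scale)  # "e" is IEEE half precision

def unpack_block(buf):
    """Expand one packed block back into 128 per-weight values."""
    (scale,) = struct.unpack("<e", buf[16:18])
    return [scale if (buf[i // 8] >> (i % 8)) & 1 else -scale
            for i in range(BLOCK)]

weights = [(-1) ** i * 0.5 for i in range(BLOCK)]
packed = pack_block(weights)
fp16_bytes = BLOCK * 2  # the same 128 weights stored as float16
print(len(packed), fp16_bytes, round(fp16_bytes / len(packed), 1))
# prints: 18 256 14.2
```

The point of a packed kernel is to never materialize the output of `unpack_block` in memory, only to expand bits on the fly inside the matmul.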
This is basically unpacking the "bit-based llm" in normal float16 and running calculations that way.
My understanding is that llama.cpp’s Bonsai support keeps the weights in the quantized Q1_0 representation and uses kernels that operate on that packed format, which is the whole point.
Again, I do not mean this as a shot at the implementation itself. Getting support working this quickly is genuinely cool. I might also be misunderstanding some parts of this, hopefully not too much, but would love to be corrected.
Just merged an external PR for Bonsai-8B support (1 bit LLM). Because tinygrad has the correct abstractions, it was 5 lines. https://t.co/BLljWDANgq https://t.co/GlXWqPbYg5
Yea, just replied as well, I was totally misunderstanding. For some reason I thought that the ggml-to-tensor loading wouldn't be fused with the rest of the code (not sure why I would think that), so the scheduler would only see the multiplication with the float16 d, and separately see the rest of the model. The memory usage doesn't lie