Cool competition! `examples/torch_cuda_kernel.py` shows how to use tinygrad (with BEAM=2). The bitter lesson always wins in the end, any tricks people find should be added to our search.
Launching a new kernel competition: Linear Algebra Kernels For The Age Of Research.
First problem: batched QR decomposition on B200. Old math, modern hardware.
Prize: Rare swag and hangout in SF
Last fall, we shared our deep dive on FA4 internals.
But we didn't stop at grokking the kernel.
Since then, we've been developing improvements for inference performance and upstreaming them.
This blog post explains those contributions.
https://t.co/xzDNHdq3Zw
Last fall, we shared our deep dive on FA4 internals.
But we didn't stop at grokking the kernel.
Since then, we've been developing improvements for inference performance and upstreaming them.
This blog post explains those contributions.
https://t.co/xzDNHdq3Zw
It's very cool that Apple shipped a 20B parameter on-device.
You can't put 20B parameters in RAM at any reasonable precision. To make it work they are using pretty exotic architecture by today's standards.
A small model predicts from the query (or prompt) which experts to load from Nand into RAM. The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts (instead of switching the experts for every token).
Super detailed tech report for MAI-Thinking-1, with a ton of info on all stages of the pipeline. I'm surprised so much of this info is released :)
Super long thread on my notes:
Introducing Claude Cairn 🪨
It is a Claude Code skill for carrying your thinking across sessions.
Say you explored a few different projects, ideas, or features in one session. Drop a checkpoint and you can restore or continue any of them in a later conversation with the relevant context already in place.
It came out of my own exploration sprints: in one Claude Code session I'd chase a few unrelated projects, ideas, and features at once, then waste time later untangling one thread from the rest to continue it elsewhere.
Glad to see this -- renderers are a foundational component of the LLM stack. Renderers map between tokens and messages, which are invariant to tokenizer and formatting details. Most APIs, datasets, and RL environments are defined in terms of messages.
Getting the details wrong leads to train-test mismatches, caching inefficiencies, and prompt injection vulnerabilities. We included a renderers module in Tinker Cookbook, but it makes sense as a standalone library.
new blog post out!
almost a month since I picked up robotics and one of the first things I dug into was inverse RL and behaviour cloning. it covers the bedrock of the field and the key concepts that shaped with with some fun interactive widgets to play around with.
Link below ⬇️