I gave a talk at the GPU MODE workshop last week on llm.c:
- the origin story of llm.c
- being naked in the world without PyTorch and having to re-invent Array, Autograd, Device, Dtype, Compile, Distributed
- how to port a PyTorch layer to 1) explicit PyTorch
- and then to 2) write the backward pass
- 3) port forward & backward pass to C
- 4) string all the layers together
- achieving one file of C with no dependencies that compiles and runs ~instantly, where all memory is pre-planned and allocated a single time, fully deterministic, portable code that can run on a potato or a von Neumann probe
- how most of llm.c was built at 1am-7am on a water villa porch in the Maldives and why this is the recommended way to develop software
- convert all of it to run in CUDA on GPU in fp32
- port matmul to cuBLAS
- port attention to cuDNN flash-attention
- introduce bfloat16 mixed precision
- introduce many more optimizations and features like kernel fusions, Packed128, stochastic rounding, full determinism
- add multi-GPU training, NCCL, sharded optimizer
- add multi-node with MPI or file system or socket
- reproduce GPT-2 (1.6B) on one 8XH100 node in 24 hours for $672 in llm.c, achieving (at the time) 29% less memory, 19% faster training than PyTorch nightly, and much faster compile & run
- how open source development attracts Avengers from the internet
- port to training Llama 3 imminent (branch exists)
- many other notable forks
- last thought: how software abstractions like Python/PyTorch and everything else really exist only because humans are finite in knowledge, IQ and attention, and how with increasing AI capability LLMs may export custom binaries like llm.c for any application directly, tearing apart and refactoring all abstractions as needed.
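To make steps 1) through 3) above concrete, here is a minimal sketch of the idea for a single layer: write the forward pass explicitly, derive the backward pass by hand, and check it numerically before porting anything to C. LayerNorm is an assumed example (the post does not say which layer the talk used), plain Python lists stand in for tensors, and the function names are hypothetical.

```python
import math

def layernorm_forward(x, w, b, eps=1e-5):
    """Explicit forward pass for one vector: normalize, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    rstd = 1.0 / math.sqrt(var + eps)            # reciprocal standard deviation
    norm = [(xi - mean) * rstd for xi in x]
    out = [ni * wi + bi for ni, wi, bi in zip(norm, w, b)]
    cache = (norm, rstd)                         # saved for the backward pass
    return out, cache

def layernorm_backward(dout, w, cache):
    """Hand-derived backward pass: gradients w.r.t. x, w and b."""
    norm, rstd = cache
    n = len(dout)
    dw = [di * ni for di, ni in zip(dout, norm)]
    db = list(dout)
    dnorm = [di * wi for di, wi in zip(dout, w)]
    dnorm_mean = sum(dnorm) / n
    dnorm_norm_mean = sum(dn * ni for dn, ni in zip(dnorm, norm)) / n
    dx = [(dn - dnorm_mean - ni * dnorm_norm_mean) * rstd
          for dn, ni in zip(dnorm, norm)]
    return dx, dw, db

def numeric_dx(x, w, b, dout, h=1e-6):
    """Central finite differences of loss = sum(out * dout) with respect to x."""
    def loss(xs):
        out, _ = layernorm_forward(xs, w, b)
        return sum(o * d for o, d in zip(out, dout))
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        grads.append((loss(xp) - loss(xm)) / (2 * h))
    return grads

if __name__ == "__main__":
    x = [0.5, -1.2, 3.0, 0.7]
    w = [1.0, 0.5, -2.0, 1.5]
    b = [0.1, 0.0, -0.3, 0.2]
    dout = [0.3, -0.7, 1.1, 0.2]                 # pretend upstream gradient
    out, cache = layernorm_forward(x, w, b)
    dx, dw, db = layernorm_backward(dout, w, cache)
    err = max(abs(a - n) for a, n in zip(dx, numeric_dx(x, w, b, dout)))
    print("max |analytic - numeric| for dx:", err)
```

Once the two functions agree with the numeric check, each loop maps almost line for line onto a C `for` loop over a flat float array, which is roughly what step 3) amounts to.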
More links in reply
23+ years of listening to #SomaFM. Those guys got me through a lot of marathon #hacking sessions 😉. Still going strong. Even with all the choices we have today for music listening, I always find myself going back to the original.
https://t.co/mzHC0blZV0
Programming is changing so fast... I'm trying VS Code Cursor + Sonnet 3.5 instead of GitHub Copilot again and I think it's now a net win. Just empirically, over the last few days most of my "programming" is now writing English (prompting and then reviewing and editing the generated diffs), and doing a bit of "half-coding" where you write the first chunk of the code you'd like, maybe comment it a bit so the LLM knows what the plan is, and then tab tab tab through completions. Sometimes you get a 100-line diff to your code that nails it, which could have taken 10+ minutes before.
I still don't think I've gotten sufficiently used to all the features. It's a bit like learning to code all over again, but I basically can't imagine going back to "unassisted" coding at this point, which was the only possibility just ~3 years ago.
Think you can tell if a social media account is a bot? What about as AI gets better?
A new paper—co-authored with researchers from ~20 orgs, & my OpenAI teammates Zoë Hitzig and David Schnurr—asks this question: What are AI-proof ways to tell who’s real online? (1/n)
I've been using Linux for 30 years. For a long time I used xterm as my terminal in Linux. I have never ever gone through and read its man page 🤣. Has anyone read the whole thing? #Linux
It's here. The problem is you need about half a million dollars of GPUs to run a full model :). Excited to play with it anyway. #LLaMA3 #AI #ML #Meta
https://t.co/6ftuxrXm9d
https://t.co/6JjRdZl05A
This is really a 'WOW' paper. 🤯
Claims that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales, and that by utilizing an optimized kernel during inference, the model's memory consumption can be reduced by more than 10× compared to unoptimized models. 🤯
'Scalable MatMul-free Language Modeling'
Concludes that it is possible to create the first scalable MatMul-free LLM that achieves performance on par with state-of-the-art Transformers at billion-parameter scales.
📌 The proposed MatMul-free LLM replaces MatMul operations in dense layers with ternary accumulations using weights constrained to {-1, 0, +1}. This reduces computational cost and memory utilization while preserving network expressiveness.
📌 To remove MatMul from self-attention, the Gated Recurrent Unit (GRU) is optimized to rely solely on element-wise products, creating the MatMul-free Linear GRU (MLGRU) token mixer. The MLGRU simplifies the GRU by removing hidden-state related weights, enabling parallel computation, and replacing remaining weights with ternary matrices.
📌 For MatMul-free channel mixing, the Gated Linear Unit (GLU) is adapted to use BitLinear layers with ternary weights, eliminating expensive MatMuls while maintaining effectiveness in mixing information across channels.
📌 The paper introduces a hardware-efficient fused BitLinear layer that optimizes RMSNorm and BitLinear operations. By fusing these operations and utilizing shared memory, training speed improves by 25.6% and memory consumption reduces by 61% over an unoptimized baseline.
📌 Experimental results show that the MatMul-free LLM achieves competitive performance compared to Transformer++ baselines on downstream tasks, with the performance gap narrowing as model size increases. The scaling law projections suggest MatMul-free LLM can outperform Transformer++ in efficiency and potentially in loss when scaled up.
📌 A custom FPGA accelerator is built to exploit the lightweight operations of the MatMul-free LLM. The accelerator processes billion-parameter scale models at 13W beyond human-readable throughput, demonstrating the potential for brain-like efficiency in future lightweight LLMs.
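The ternary-accumulation point above can be illustrated with a small sketch. It assumes the weights are already constrained to {-1, 0, +1} (the paper's quantization step and its fused BitLinear kernel are not reproduced here), and the function names are hypothetical. With such weights, each output is just a running sum of added and subtracted inputs, so no multiplications are needed.

```python
def ternary_linear(x, w_ternary):
    """Compute y[j] = sum_i x[i] * W[i][j] for W in {-1, 0, +1} using only add/subtract."""
    n_out = len(w_ternary[0])
    y = [0.0] * n_out
    for i, xi in enumerate(x):
        row = w_ternary[i]
        for j in range(n_out):
            if row[j] == 1:
                y[j] += xi        # +1 weight: accumulate the input
            elif row[j] == -1:
                y[j] -= xi        # -1 weight: subtract the input
            # a 0 weight contributes nothing and is skipped
    return y

def dense_linear(x, w):
    """Reference implementation with ordinary multiply-accumulate."""
    n_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_out)]

if __name__ == "__main__":
    x = [0.5, -2.0, 3.0]
    W = [[1, 0, -1],
         [-1, 1, 0],
         [0, -1, 1]]
    print(ternary_linear(x, W))   # same values as dense_linear(x, W)
    print(dense_linear(x, W))
```

This only shows the arithmetic identity; the efficiency claims in the thread depend on hardware-level kernels that a pure-Python loop does not capture.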
@BitValentine They do not at all yet. Obviously, just reading that article and looking at the generated image of just a tiny portion of human brain you can see the complexity of it. Artificial Neural Networks are nowhere near that. However, we can learn from it and design accordingly :).
I don't think we're there yet with artificial neural networks :). #AI #ArtificialIntelligence #ML #MachineLearning
“The word ‘fragment’ is ironic,” Lichtman says. “A terabyte is, for most people, gigantic, yet a fragment of a human brain—just a miniscule, teeny-weeny little bit of human brain—is still thousands of terabytes.”
https://t.co/iNrX8udYAp
What your brain cells look like when you are learning something new and forming a new memory. This is what consciousness looks like but at a much larger scale. You are a neural network, and AI neural networks operate on principles similar to those of humans.
Stepping into the Matrix ... of neural networks, literally.
PyTorch never fails to fascinate! It's a really cool visualization tool for matrix, attention, parallelism, and more. The best education comes from the most intuitive delivery.
This is a multilayer perceptron with data-parallel partitioning. Interactive demos: https://t.co/PUKQWOSr03
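As a rough sketch of what data-parallel partitioning means for an MLP, assuming it refers to splitting the batch dimension across workers (the linked demo may partition differently): each hypothetical worker holds a full copy of the weights, processes its own slice of the batch, and the outputs are concatenated. All names below are illustrative and nothing here uses PyTorch.

```python
def linear(batch, W, b):
    """Apply y = x @ W + b to every row x of the batch."""
    return [[sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
             for j in range(len(b))] for x in batch]

def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]

def mlp_forward(batch, params):
    """Two-layer perceptron: linear -> ReLU -> linear."""
    W1, b1, W2, b2 = params
    return linear(relu(linear(batch, W1, b1)), W2, b2)

def split_batch(batch, num_workers):
    """Split rows into contiguous shards, one per (hypothetical) worker."""
    size = max(1, -(-len(batch) // num_workers))   # ceiling division
    return [batch[i:i + size] for i in range(0, len(batch), size)]

def data_parallel_forward(batch, params, num_workers):
    """Each worker runs the same weights on its shard; outputs are concatenated."""
    out = []
    for shard in split_batch(batch, num_workers):
        out.extend(mlp_forward(shard, params))     # would run concurrently on real devices
    return out

if __name__ == "__main__":
    params = ([[1.0, -1.0], [0.5, 2.0]], [0.0, 0.1],   # W1, b1
              [[1.0], [-1.0]], [0.2])                  # W2, b2
    batch = [[1.0, 2.0], [-1.0, 0.5], [3.0, -2.0], [0.0, 0.0]]
    full = mlp_forward(batch, params)
    sharded = data_parallel_forward(batch, params, num_workers=2)
    print(full == sharded)
```

Because rows of the batch are independent in the forward pass, the sharded result matches the full-batch result; in training, the per-shard gradients would additionally need to be averaged across workers.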