Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: https://t.co/CDSQ8HpZoc
The @karpathy interview
0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self driving took so long
1:57:08 – Future of education
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, Spotify, etc. Enjoy!
Excited to release new repo: nanochat!
(it's among the most unhinged I've written).
Unlike my earlier similar repo nanoGPT, which only covered pretraining, nanochat is a minimal, from-scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script, and as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI.
It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use.
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Run efficient inference on the model in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox), talk to it over CLI or ChatGPT-like WebUI.
- Write a single markdown report card, summarizing and gamifying the whole thing.
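The RL step in the list above names "GRPO". As a minimal sketch of the group-relative advantage idea usually associated with that name (not nanochat's actual code, which may use a simplified variant), each sampled answer to the same prompt is scored against the other answers in its group. The function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper: advantage_i = (r_i - mean) / (std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one GSM8K-style question: 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every sample gets the same reward, there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(advantages, flat)
```

Correct answers get a positive advantage and wrong ones a negative advantage, and no separate value network is needed to produce the baseline.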
Even for as low as ~$100 in cost (~4 hours on an 8xH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems and answer simple questions. About 12 hours of training surpasses the GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth 30 model trained for 24 hours (this is about equal to FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into 40s on MMLU and 70s on ARC-Easy, 20s on GSM8K, etc.
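As a rough sanity check on the budgets quoted above, the ~$1000 / ~41.6 hour pair implies an hourly rate for the node, which can then be applied to the other durations. This is a back-of-the-envelope sketch: the rate is inferred from the post's own numbers and is an assumption, not a quoted cloud price, and the labels are just shorthand for the milestones mentioned.

```python
GPUS_PER_NODE = 8  # the post refers to an 8-GPU H100 node

def implied_node_rate(total_cost_usd, hours):
    """Dollars per hour for the whole node implied by a (cost, hours) pair."""
    return total_cost_usd / hours

def training_cost(hours, node_rate_usd_per_hour):
    """Estimated cost of a run of the given length at a fixed hourly rate."""
    return hours * node_rate_usd_per_hour

# Inferred, not quoted: roughly $24/hour for the node, roughly $3/hour per GPU.
node_rate = implied_node_rate(1000.0, 41.6)
gpu_rate = node_rate / GPUS_PER_NODE

milestones = [
    ("speedrun", 4.0),
    ("GPT-2 CORE", 12.0),
    ("depth 30", 24.0),
    ("~$1000 tier", 41.6),
]
for label, hours in milestones:
    cost = training_cost(hours, node_rate)
    print(f"{label:>12}: {hours:5.1f} h -> ~${cost:.0f}")
```

At that inferred rate the 4-hour run comes out just under $100, which is consistent with the "~$100" figure.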
My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved.
Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
https://t.co/fYV4FPi71m
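For readers new to LoRA, here is a minimal pure-Python sketch of the low-rank update it trains, y = x Wᵀ + (alpha / r) · x Aᵀ Bᵀ, with W frozen and only A and B learned. It is not taken from the linked post; the dimensions, `alpha`, and the helper names are assumptions chosen for illustration. B starts at zero, as in the original LoRA formulation, so the adapted layer initially matches the base layer.

```python
import random

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain list-of-lists matrix product a @ b."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def lora_forward(x, W, A, B, alpha):
    """Base output plus the scaled low-rank correction."""
    r = len(A)                      # rank = number of rows of A
    scale = alpha / r
    base = matmul(x, transpose(W))                            # batch x d_out
    low_rank = matmul(matmul(x, transpose(A)), transpose(B))  # batch x d_out
    return [[b + scale * l for b, l in zip(brow, lrow)]
            for brow, lrow in zip(base, low_rank)]

random.seed(0)
d_in, d_out, r, alpha = 16, 16, 2, 4.0
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(3)]

y_base = matmul(x, transpose(W))
y_lora = lora_forward(x, W, A, B, alpha)

full_params = d_out * d_in        # parameters touched by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters trained by LoRA
print(full_params, lora_params)
```

Even at this toy size the adapter trains a quarter of the parameters of the full matrix; the gap widens as the layer grows and the rank stays small.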
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”
We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly.
The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains.
https://t.co/lrJioBmpbT
Introducing Alterego: the world’s first near-telepathic wearable that enables silent communication at the speed of thought.
Alterego makes AI an extension of the human mind.
We’ve made several breakthroughs since our work started at MIT.
We’re announcing those today.
I can finally map @NBA players' positions from the camera perspective onto the court map
it's still a bit shaky... I'll smooth it out later
it's time to detect shooting motions and mark the shot location!
some of the code has already been migrated to: https://t.co/VK0RQFWud1
excited to finally share on arxiv what we've known for a while now:
All Embedding Models Learn The Same Thing
embeddings from different models are SO similar that we can map between them based on structure alone. without *any* paired data
feels like magic, but it's real:🧵
Introducing Deep Research for arXiv
Ask questions like 'What are the latest breakthroughs in RL fine-tuning?' and get comprehensive lit reviews with trending papers automatically included
Turn hours of literature searches into seconds with AI-powered research context ⚡
The @gwern interview.
0:00:00 – Anonymity
0:01:09 – Automating Steve Jobs
0:04:38 – Isaac Newton's theory of progress
0:06:36 – Grand theory of intelligence
0:10:39 – Seeing scaling early
0:21:04 – AGI Timelines
0:22:54 – What to do in remaining 3 years until AGI
0:26:29 – Influencing the shoggoth with writing
0:30:50 – Human vs artificial intelligence
0:33:52 – Rabbit holes
0:38:48 – Hearing impairment
0:43:00 – Wikipedia editing
0:47:43 – Gwern dot net
0:50:20 – Counterfactual careers
0:54:30 – Borges & literature
1:01:32 – Gwern's process
1:19:17 – Gwern's finances
1:25:05 – Random
🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to-date.
Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike.
More details and examples of what Movie Gen can do ➡️ https://t.co/M19x2ndwnr
🛠️ Movie Gen models and capabilities
Movie Gen Video: 30B parameter transformer model that can generate high-quality and high-definition images and videos from a single text prompt.
Movie Gen Audio: A 13B parameter transformer model that can take a video input along with optional text prompts for controllability to generate high-fidelity audio synced to the video. It can generate ambient sound, instrumental background music and foley sound — delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.
Precise video editing: Using a generated or existing video and accompanying text instructions as an input it can perform localized edits such as adding, removing or replacing elements — or global changes like background or style changes.
Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement in video.
We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.
i made a repo, it's very naive as i wasn't planning on releasing this when i started. This does not have the new sampler yet, but i will add it once it's stable. It has both the jax and pytorch implementations. If y'all want to make it better, submit PRs.
https://t.co/1sc6fFWgf1
A student reached out asking for advice on research directions in optimization, so I wrote a long response with pointers to interesting papers. I thought it'd be worth sharing it here too:
1. Adaptive optimization.
There has been a lot going on in the last year; below are some papers I personally found interesting.
First of all, this paper by Li and Lan on Nesterov's acceleration of adaptive gradient descent:
https://t.co/D6hykeK2tw
Check Corollary 1 for a simple description of their method. There is one thing I don't like about it: the amount by which we can increase the stepsize at each iteration decreases as t grows. That being said, I don't know if this restriction can be lifted, and perhaps it's the best thing we can get.
Yura Malitsky and I also did some work on adaptive gradient descent, making the stepsizes a bit larger, roughly sqrt(2) improvement over our previous result:
https://t.co/exhFgbjChk
We still don't know if that's the best we can do or if a tighter analysis can give us better methods.
I should also mention that there is more push in the literature on Polyak stepsize, see for instance these two papers:
https://t.co/8tKRReEpx2 (a stepsize very similar to Polyak)
https://t.co/kZhWGqI1sE (Polyak stepsize with momentum)
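For context on the Polyak stepsize mentioned above, here is a minimal sketch of the classical rule, step = (f(x) − f*) / ‖∇f(x)‖², on a toy quadratic. It is the textbook version, not the variants from the linked papers; the objective, the starting point, and the assumption that the optimal value f* is known are all chosen for illustration.

```python
CURV = [1.0, 10.0]  # toy objective: f(x) = 0.5 * (1 * x0^2 + 10 * x1^2), f* = 0

def f(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURV, x))

def grad(x):
    return [c * xi for c, xi in zip(CURV, x)]

def polyak_gd(x, f_star, steps):
    """Gradient descent where each stepsize is set from the current optimality gap."""
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:      # already at a stationary point
            break
        step = (f(x) - f_star) / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

x0 = [1.0, 1.0]
x_final = polyak_gd(x0, f_star=0.0, steps=500)
print(f(x0), f(x_final))
```

No Lipschitz constant or tuned learning rate appears anywhere; the price is needing f*, which is what many of the recent variants try to relax.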
2. Adagrad-like methods still can be studied, I believe it's an underexplored direction. I wish there were more papers on studying the importance of coordinate-wise stepsizes. One paper on the topic I really liked is this study of when Adam is more useful than SGD:
https://t.co/sF5Abi08h5
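To make the point about coordinate-wise stepsizes concrete, here is a small sketch comparing plain gradient descent with diagonal Adagrad on a badly scaled quadratic. The objective, learning rates, and step count are assumptions picked for the illustration and are not taken from the linked paper.

```python
import math

CURV = [1.0, 100.0]  # toy objective: f(x) = 0.5 * (1 * x0^2 + 100 * x1^2)

def grad(x):
    return [c * xi for c, xi in zip(CURV, x)]

def gradient_descent(x, lr, steps):
    """One shared stepsize for every coordinate."""
    for _ in range(steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    return x

def adagrad(x, lr, steps, eps=1e-8):
    """Diagonal Adagrad: each coordinate is scaled by its own gradient history."""
    accum = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        x = [xi - lr * gi / (math.sqrt(a) + eps)
             for xi, gi, a in zip(x, g, accum)]
    return x

# The shared stepsize must respect the steepest direction (1 / 100),
# so the flat coordinate barely moves.
x_gd = gradient_descent([1.0, 1.0], lr=1.0 / max(CURV), steps=50)
x_ada = adagrad([1.0, 1.0], lr=0.5, steps=50)
print(x_gd, x_ada)
```

On this quadratic the per-coordinate normalization cancels the curvature, so both Adagrad coordinates follow nearly the same trajectory, while gradient descent is still far from the optimum along the flat direction.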
There is also some research on new practical methods, for instance, acceleration of DoG is interesting:
https://t.co/VMOdfbL95Z
And I also enjoyed reading this paper by Rodomanov et al. on line-search-inspired stochastic methods:
https://t.co/85uLHGZErQ
3. I also like the direction of getting better assumptions for optimization theory and studying the implications. A good example is the gradient clipping literature:
https://t.co/NMTXzFJScs ((L₀, L₁)-smoothness)
https://t.co/dG8xIoTFPN (same revisited)
https://t.co/goKclD80WG (on heavy-tailed noise)
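As a small illustration of why clipping matters when curvature grows with the gradient, here is a sketch comparing plain and clipped gradient descent on f(x) = x⁴. The objective, learning rate, and clipping threshold are assumptions chosen for the example and do not come from the linked papers.

```python
def grad(x):
    """Gradient of the toy objective f(x) = x**4."""
    return 4.0 * x * x * x

def gd_step(x, lr):
    return x - lr * grad(x)

def clipped_gd_step(x, lr, clip):
    """Rescale the gradient so its magnitude never exceeds `clip`."""
    g = grad(x)
    scale = min(1.0, clip / abs(g)) if g != 0.0 else 1.0
    return x - lr * scale * g

x_plain = 10.0
x_clipped = 10.0

# Plain gradient descent overshoots further on every step from this start.
for _ in range(3):
    x_plain = gd_step(x_plain, lr=0.1)

# Clipped steps move at most lr * clip, then behave like plain GD near 0.
for _ in range(500):
    x_clipped = clipped_gd_step(x_clipped, lr=0.1, clip=1.0)

print(x_plain, x_clipped)
```

With the same learning rate, the unclipped iterate blows up within three steps while the clipped one walks steadily toward the minimizer.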
We need to bridge optimization assumptions with what we know about neural networks, so read about properties of neural networks themselves like this:
https://t.co/6M8l1avBOJ (on scales of layers and how their type affects Lipschitz constants)
4. These days, people are using deep networks of all scales for their tasks, and they have discovered a lot of tricks that haven't been studied thoroughly in optimization literature: quantization, Straight-Through Estimator (https://t.co/7UK2gsojhm), low-rank techniques such as LoRA, learning-rate warm-up, etc. You should expose yourself to those tricks to get a better understanding of what the current theory is lacking.
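Of the tricks listed above, the Straight-Through Estimator is perhaps the easiest to show in a few lines. The sketch below trains a single weight that is rounded to an integer in the forward pass; the true derivative of rounding is zero almost everywhere, so the backward pass simply pretends it is 1. The scalar setup, the data point, and the function name are assumptions for illustration.

```python
def ste_train(w, x, y, lr, steps):
    """Fit round(w) * x to y using a straight-through gradient for round()."""
    for _ in range(steps):
        q = round(w)            # forward pass: quantize to the nearest integer
        err = q * x - y
        grad_q = 2.0 * err * x  # d(loss)/dq for loss = (q * x - y) ** 2
        grad_w = grad_q * 1.0   # STE: treat dq/dw as 1 instead of 0
        w -= lr * grad_w        # update the latent full-precision weight
    return w

w = ste_train(0.2, x=1.0, y=3.0, lr=0.1, steps=20)
print(w, round(w))
```

The latent weight keeps moving until its quantized value fits the target, which is behavior that standard smooth-optimization analyses do not directly cover.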
If you're considering choosing optimization as the topic for your PhD, here are some extra thoughts. Right now there is less activity than about 5 years ago, most of the low-hanging fruit seems to have been taken, and the remaining questions seem quite challenging. So if you're looking for a field where it is easy to get publications, it might not be perfect. However, it's still a good field to produce meaningful theory. It's also important who you would work with, i.e. if you can find a good advisor, that often affects one's satisfaction to a larger degree than the topic itself, so make your decision carefully.
As my last word of advice, I definitely encourage testing new methods on neural networks (and preferably not on CIFAR10/CIFAR100, because they give misleading results), at least something like nanoGPT (https://t.co/NTk9KAAqd4). When I was a PhD student, I did a lot of theoretical research testing my methods on logistic regression and that was useful to understand the theory, but I also had the wrong impression about what works and what doesn't because of that. If you can, do both, understand the theory as much as you can, but also learn its limits and failure modes.
📣 Introducing Llama 3.2: Lightweight models for edge devices, vision models and more!
What’s new?
• Llama 3.2 1B & 3B models deliver state-of-the-art capabilities for their class for several on-device use cases — with support for @Arm, @MediaTek & @Qualcomm on day one.
• Llama 3.2 11B & 90B vision models deliver performance competitive with leading closed models — and can be used as drop-in replacements for Llama 3.1 8B & 70B.
• New Llama Guard models to support multimodal use cases and edge deployments.
• The first official distro of Llama Stack simplifies and supercharges the way developers & enterprises can build around Llama to support agentic applications and more.
Details in the full announcement ➡️ https://t.co/1bnEeLY9qf
Download Llama 3.2 models ➡️ https://t.co/DZoTQvESbG
These models are available to download now directly from Meta and @HuggingFace — and will be available across offerings from 25+ partners that are rolling out starting today, including @accenture, @awscloud, @AMD, @azure, @Databricks, @Dell, @Deloitte, @FireworksAI_HQ, @GoogleCloud, @GroqInc, @IBMwatsonx, @Infosys, @Intel, @kaggle, @NVIDIA, @OracleCloud, @PwC, @scale_AI, @snowflakeDB, @togethercompute and more.
With Llama 3.2 we’re making it possible to run Llama in even more places, with even more flexible capabilities. We’ve said it before and we’ll say it again: open source AI is how we ensure that these innovations reflect the global community they’re built for and benefit everyone. We’re continuing our drive to make open source the standard with Llama 3.2.