Whenever you hear whoever share opinions about whether machine has become consciousness and pretend to be scientific, please remember what Lord Kelvin said about what it takes for something to be scientific at all: “When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of the meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.”
There is a nice sequence of papers that tell this story of why. Will write more about this when I have the time, but tldr is 1) faithfulness is hard to satisfy with next-token objective, 2) high statistical regularity of code makes codegen models highly reliable. The combination of these two broad stroke ideas resulted in proliferation of methods like this (for e.g.) https://t.co/84mnoq4mf7
Static math benchmarks saturate. We built one that doesn't.
Announcing MathDuels, the first self-play math benchmark.
Every frontier LLM writes problems for the others, and is graded on the ones written for it. As models improve, so does the benchmark.
We found widespread cheating on popular agent benchmarks, affecting 28+ submissions across 9 benchmarks and thousands of agent runs.
Surprisingly, the top 3 submissions on Terminal-Bench 2 are all cheating!
Here's what we found 🧵
Friends, followers, and strangers on X,
I recently got excited about mapping the jagged frontier of AI models in math and other STEM areas. With my colleague Prof. Rob Ghrist @prof_g , I founded Rabdos AI https://t.co/7fbOZgoCz1 to create original, research-level problems at scale to advance frontier AI capabilities.
In the coming weeks, we'll be sharing exciting technical updates through @Rabdos_AI. Please give us a follow if you're interested in model evaluation, AI for math/science, autoformalization in Lean, robustness of world models, and much more.
Excited to be at NeurIPS this week presenting my recent work with @NeelayV! Find us at 4:30pm at Exhibit Hall C,D,E poster #3717!
Come by to see how LLMs struggle to use code for hard reasoning tasks, and how per-instance program synthesis (PIPS) fixes it.
(1/n)
How to start a deep learning project?
We use a remarkably streamlined step-by-step process to set up deep learning projects. At the same time, people who are new to deep learning tend to always make the same (avoidable) mistakes.
Check out the thread below! 🧵
Announcing our NeurIPS paper: Once Upon an Input: Reasoning via Per-Instance Program Synthesis (PIPS)
📝: https://t.co/u6MMoecnp0
Why do LLMs (and LLM agents) still struggle on hard reasoning problems which should be solvable by writing and executing code?
We find that the biggest problem with LLM generated “programs” for reasoning is that they don’t compute anything, they just hardcode the answer!
PIPS fixes this by 1️⃣ abstracting the input into symbols, 2️⃣ generating code that maps symbols to the answer, and 3️⃣ refining the code with structural feedback.
🧵👇
In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x:
max_θ 𝔼ₓ h(Prob(correct ∣ x; θ))
for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
Finally had a chance to listen through this pod with Sutton, which was interesting and amusing.
As background, Sutton's "The Bitter Lesson" has become a bit of biblical text in frontier LLM circles. Researchers routinely talk about and ask whether this or that approach or idea is sufficiently "bitter lesson pilled" (meaning arranged so that it benefits from added computation for free) as a proxy for whether it's going to work or worth even pursuing. The underlying assumption being that LLMs are of course highly "bitter lesson pilled" indeed, just look at LLM scaling laws where if you put compute on the x-axis, number go up and to the right. So it's amusing to see that Sutton, the author of the post, is not so sure that LLMs are "bitter lesson pilled" at all. They are trained on giant datasets of fundamentally human data, which is both 1) human generated and 2) finite. What do you do when you run out? How do you prevent a human bias? So there you have it, bitter lesson pilled LLM researchers taken down by the author of the bitter lesson - rough!
In some sense, Dwarkesh (who represents the LLM researchers viewpoint in the pod) and Sutton are slightly speaking past each other because Sutton has a very different architecture in mind and LLMs break a lot of its principles. He calls himself a "classicist" and evokes the original concept of Alan Turing of building a "child machine" - a system capable of learning through experience by dynamically interacting with the world. There's no giant pretraining stage of imitating internet webpages. There's also no supervised finetuning, which he points out is absent in the animal kingdom (it's a subtle point but Sutton is right in the strong sense: animals may of course observe demonstrations, but their actions are not directly forced/"teleoperated" by other animals). Another important note he makes is that even if you just treat pretraining as an initialization of a prior before you finetune with reinforcement learning, Sutton sees the approach as tainted with human bias and fundamentally off course, a bit like when AlphaZero (which has never seen human games of Go) beats AlphaGo (which initializes from them). In Sutton's world view, all there is is an interaction with a world via reinforcement learning, where the reward functions are partially environment specific, but also intrinsically motivated, e.g. "fun", "curiosity", and related to the quality of the prediction in your world model. And the agent is always learning at test time by default, it's not trained once and then deployed thereafter. Overall, Sutton is a lot more interested in what we have common with the animal kingdom instead of what differentiates us. "If we understood a squirrel, we'd be almost done".
As for my take...
First, I should say that I think Sutton was a great guest for the pod and I like that the AI field maintains entropy of thought and that not everyone is exploiting the next local iteration LLMs. AI has gone through too many discrete transitions of the dominant approach to lose that. And I also think that his criticism of LLMs as not bitter lesson pilled is not inadequate. Frontier LLMs are now highly complex artifacts with a lot of humanness involved at all the stages - the foundation (the pretraining data) is all human text, the finetuning data is human and curated, the reinforcement learning environment mixture is tuned by human engineers. We do not in fact have an actual, single, clean, actually bitter lesson pilled, "turn the crank" algorithm that you could unleash upon the world and see it learn automatically from experience alone.
Does such an algorithm even exist? Finding it would of course be a huge AI breakthrough. Two "example proofs" are commonly offered to argue that such a thing is possible. The first example is the success of AlphaZero learning to play Go completely from scratch with no human supervision whatsoever. But the game of Go is clearly such a simple, closed, environment that it's difficult to see the analogous formulation in the messiness of reality. I love Go, but algorithmically and categorically, it is essentially a harder version of tic tac toe. The second example is that of animals, like squirrels. And here, personally, I am also quite hesitant whether it's appropriate because animals arise by a very different computational process and via different constraints than what we have practically available to us in the industry. Animal brains are nowhere near the blank slate they appear to be at birth. First, a lot of what is commonly attributed to "learning" is imo a lot more "maturation". And second, even that which clearly is "learning" and not maturation is a lot more "finetuning" on top of something clearly powerful and preexisting. Example. A baby zebra is born and within a few dozen minutes it can run around the savannah and follow its mother. This is a highly complex sensory-motor task and there is no way in my mind that this is achieved from scratch, tabula rasa. The brains of animals and the billions of parameters within have a powerful initialization encoded in the ATCGs of their DNA, trained via the "outer loop" optimization in the course of evolution. If the baby zebra spasmed its muscles around at random as a reinforcement learning policy would have you do at initialization, it wouldn't get very far at all. Similarly, our AIs now also have neural networks with billions of parameters. These parameters need their own rich, high information density supervision signal. We are not going to re-run evolution. But we do have mountains of internet documents. Yes it is basically supervised learning that is ~absent in the animal kingdom. But it is a way to practically gather enough soft constraints over billions of parameters, to try to get to a point where you're not starting from scratch. TLDR: Pretraining is our crappy evolution. It is one candidate solution to the cold start problem, to be followed later by finetuning on tasks that look more correct, e.g. within the reinforcement learning framework, as state of the art frontier LLM labs now do pervasively.
I still think it is worth to be inspired by animals. I think there are multiple powerful ideas that LLM agents are algorithmically missing that can still be adapted from animal intelligence. And I still think the bitter lesson is correct, but I see it more as something platonic to pursue, not necessarily to reach, in our real world and practically speaking. And I say both of these with double digit percent uncertainty and cheer the work of those who disagree, especially those a lot more ambitious bitter lesson wise.
So that brings us to where we are. Stated plainly, today's frontier LLM research is not about building animals. It is about summoning ghosts. You can think of ghosts as a fundamentally different kind of point in the space of possible intelligences. They are muddled by humanity. Thoroughly engineered by it. They are these imperfect replicas, a kind of statistical distillation of humanity's documents with some sprinkle on top. They are not platonically bitter lesson pilled, but they are perhaps "practically" bitter lesson pilled, at least compared to a lot of what came before. It seems possibly to me that over time, we can further finetune our ghosts more and more in the direction of animals; That it's not so much a fundamental incompatibility but a matter of initialization in the intelligence space. But it's also quite possible that they diverge even further and end up permanently different, un-animal-like, but still incredibly helpful and properly world-altering. It's possible that ghosts:animals :: planes:birds.
Anyway, in summary, overall and actionably, I think this pod is solid "real talk" from Sutton to the frontier LLM researchers, who might be gear shifted a little too much in the exploit mode. Probably we are still not sufficiently bitter lesson pilled and there is a very good chance of more powerful ideas and paradigms, other than exhaustive benchbuilding and benchmaxxing. And animals might be a good source of inspiration. Intrinsic motivation, fun, curiosity, empowerment, multi-agent self-play, culture. Use your imagination.
How do we navigate a growing collection of post-trained LLMs?
In Delta Activations: A Representation for Finetuned LLMs, we propose a compact embedding that encodes the post-training signal.
Try the interactive model navigator 👉 https://t.co/EhxzGihVu5
Our latest book on the mathematical principles of deep learning and intelligence has been released publicly at: https://t.co/ihPBCkI3x5 It also comes with a customized Chatbot that helps readers study and a Chinese version translated mainly by AI. This is an open-source project.
Congratulations to Dr. Ziyang Li (@_ziyang_) on defending his dissertation today! Titled "Neurosymbolic Programming in Scallop: Design, Implementation, and Applications", this dissertation proposed Scallop, a unified programming system for combining the otherwise complementary worlds of deep learning and symbolic reasoning.
Ziyang's work demonstrated that neurosymbolic solutions achieve superior accuracy, interpretability, and data efficiency compared to their purely neural or symbolic counterparts, in applications as diverse as natural language reasoning, image and video scene graph generation, program vulnerability detection, and RNA folding.
Ziyang will join the computer science department at Johns Hopkins University as a tenure-track assistant professor in the summer of 2025. He will be an amazing collaborator, mentor, and educator to those lucky enough to cross paths with him. I certainly was and I will miss him dearly!
My deepest gratitude to thesis committee members Rajeev Alur, Chris Callison-Burch, Armando Solar-Lezama, and Val Tannen, as well as to Ziyang's many collaborators over many years.
You can learn more about Scallop from Ziyang's dissertation at https://t.co/gP3hjg1NOR and try it out using the tutorials at https://t.co/P2wdJ6StMj.
🧠 Foundation models are reshaping reasoning. Do we still need specialized neuro-symbolic (NeSy) training, or can clever prompting now suffice?
Our new position paper argues the road to generalizable NeSy should be paved with foundation models. 🔗 https://t.co/6DFvhYs11m
(🧵1/9)
Happy New Year! The lectures of all 11 guest speakers in my LLM course from Fall 2024 are now available.
Thanks to all the speakers:
- Hanjun Dai (@hanjundai), Thang Luong (@lmthang), and Denny Zhou (@denny_zhou) from Google DeepMind;
- Jason Wei (@_jasonwei), Hyung Won Chung (@hwchung27), and Yann Dubois (@yanndubs) from OpenAI;
- Kai Sheng Tai (@kaishengtai) and Aakanksha Chowdhery (@achowdhery) from Meta;
- Ram Sriharsha from Pinecone;
- Vijay Krishnan (@krishnanvijay) from Turing; and
- Guangxuan Xiao (@Guangxuan_Xiao) from MIT.
Links to the talk videos are in follow-up comments.
Yes, I've made this point many times.
The beginning of a sigmoid looks like an exponential.
Not only can we "never be fully certain that what we are observing isn't in fact following a logistic trend before the inflection point", we can always be fully certain that *every* *single* *exponential* *trend* eventually passes an inflection point and saturates into a sigmoid.
Continuing an exponential trend beyond that inflection point requires a paradigm shift.
No physical process can grow indefinitely.
There are always friction terms in the dynamics equation that eventually become dominant (energy consumption, heat dissipation, quantum effects, thermal fluctuations, communication bandwidth, mass/energy density....).
Even processes that *appear* exponential on a long time scale are actually a succession of sigmoids, in which each new sigmoid is caused by a paradigm shift.
A good example is Moore's Law. It is saturating right now. But the exponential progress of the last 7 decades is due to a succession of technological paradigm shifts that weren't pre-ordained.
Each paradigm behaved like a sigmoid. Each new sigmoid overtook the previous one. The envelope turned out to be exponential.
We haven't seen similar paradigm shifts in, say, airplane speed or space travel.
Technological paradigm shifts require scientific breakthroughs.