A CNN kernel is like a neuron with shared weights. A dense neuron looks for a fixed pattern in the entire image, while a CNN kernel looks for a small pattern and scans for it everywhere in the image.
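To make the contrast concrete, here is a minimal NumPy sketch, not taken from the original write-up. The toy image, the 3x3 vertical-line kernel, and the function names `dense_neuron` and `conv_kernel_scan` are all assumptions made for illustration. The dense neuron has one weight per pixel and gives one response; the kernel reuses the same nine weights at every position.

```python
import numpy as np

def dense_neuron(image, weights):
    """One dense neuron: one weight per pixel, one response for the whole image."""
    return float(np.sum(image * weights))

def conv_kernel_scan(image, kernel):
    """Slide one small kernel over every valid position (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            response[i, j] = np.sum(patch * kernel)  # same weights at every position
    return response

# Hypothetical toy data: a 6x6 image with a vertical bar in column 4.
image = np.zeros((6, 6))
image[:, 4] = 1.0

# A 3x3 vertical-line detector (illustrative weights, not learned).
kernel = np.array([[-1.0, 2.0, -1.0]] * 3)

# A dense neuron whose fixed template expects the bar in column 1 instead.
dense_weights = np.zeros((6, 6))
dense_weights[:, 1] = 1.0

dense_response = dense_neuron(image, dense_weights)
conv_response = conv_kernel_scan(image, kernel)
best_position = np.unravel_index(np.argmax(conv_response), conv_response.shape)
```

In this toy setup the dense template misses the bar because it sits in the wrong column, while the scanned kernel responds most strongly wherever the bar actually is.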
2/2
Does it think of a 0 as a circle? A 1 as a vertical line? An 8 as two vertically stacked zeroes?
Check out the full technical write-up here:
https://t.co/lJaNYLm3XZ
1/n Over the past few days I performed brain surgery on a neural network.
More specifically, I wanted to reverse-engineer a neural network trained on MNIST, a handwritten digit recognition dataset.
"What is a neural network’s idea of a digit?"
We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.
It's true that IQ gaps create challenges in communication, but the real issue is how upset you get when you have to slow down, notice the other person's thought process, and speak to that. If that bothers you then the safe assumption is you just use conversation as a way to feel validated or self-righteous.
Once you hit about a 20-point IQ gap, communication starts to completely break down.
It's not that the lower IQ person is "stupid" (although that can often be the case) or the higher one is arrogant, it's that you're literally operating on different systems.
A 20-point difference (roughly 1.3 standard deviations) means:
Vocabulary and abstraction levels diverge sharply. What feels like crystal clear logic to one side sounds like vague, pretentious word salad to the other. Jokes land flat. Metaphors get taken literally. Complex cause-and-effect chains get simplified into "this good, that bad."
Different time horizons and pattern recognition. One person thinks in months or years and sees systems, the other is locked into days or immediate rewards. Trying to explain second-order effects feels like speaking another language.
Also, processing speed and working memory gaps. The higher IQ person is already three steps ahead, getting impatient. The lower IQ person feels talked down to or overwhelmed.
Both walk away frustrated.
Both have wasted each other's time.
Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.
New blackboard lecture w @ericjang11
He walks through how to build AlphaGo from scratch, but with modern AI tools.
Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.
Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.
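A rough sketch of that contrast in NumPy, under loud assumptions: the search output is summarized as per-move visit counts (as in AlphaGo Zero-style training), the trajectory has three moves and four possible actions, and every number below is made up. The function names are hypothetical, and this is not the code from the lecture.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reinforce_logit_grads(logits, actions, episode_return):
    """Ascent direction for episode_return * sum_t log p(a_t), w.r.t. the logits."""
    probs = softmax(logits)
    one_hot = np.eye(logits.shape[1])[actions]
    # Every step is scaled by the same scalar return.
    return episode_return * (one_hot - probs)

def search_target_logit_grads(logits, visit_counts):
    """Ascent direction for -cross_entropy(search_policy, softmax(logits))."""
    probs = softmax(logits)
    search_policy = visit_counts / visit_counts.sum(axis=1, keepdims=True)
    # Every step gets its own target distribution.
    return search_policy - probs

# Hypothetical trajectory: 3 moves, 4 actions, uniform initial policy.
logits = np.zeros((3, 4))
actions = np.array([0, 1, 2])   # moves actually played
episode_return = 1.0            # the game was won

# Made-up search statistics: at move 1 the search prefers action 2,
# disagreeing with the move that was actually played (action 1).
visit_counts = np.array([
    [90.0, 5.0, 3.0, 2.0],
    [5.0, 5.0, 85.0, 5.0],
    [2.0, 3.0, 90.0, 5.0],
])

pg_grads = reinforce_logit_grads(logits, actions, episode_return)
search_grads = search_target_logit_grads(logits, visit_counts)
```

With a single scalar return, the policy gradient pushes all three played moves up by the same amount, including the questionable move 1. The search target pushes move 1 down and the preferred alternative up, which is the per-move signal described above.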
Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). It's relevant to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.
Timestamps:
0:00:00 – Basics of Go
0:08:06 – Monte Carlo Tree Search
0:31:53 – What the neural network does
1:00:22 – Self-play
1:25:27 – Alternative RL approaches
1:45:36 – Why doesn’t MCTS work for LLMs
2:00:58 – Off-policy training
2:11:51 – RL is even more information inefficient than you thought
2:22:05 – Automated AI researchers
okay wow, I'm stunned.
codex just generated all this. I just gave it the audio file, and it generated all the images and the final video, with a good rhythm to the image changes
If Hantavirus mutated into a global threat, it would unleash AI + biotech unlike anything we've ever seen.
> genome sequenced and public in 4 hours
> AlphaFold maps every protein target
> AI screens 10,000 drugs in 24 hrs
> 50 vaccine candidates designed simultaneously
> AI-designed antibodies in days
> risk of death computed instantly
> decentralized trials launch globally
> enroll from home
> 20 countries manufacturing at once
> first doses in three weeks
> real-time dose characterization
> your genome + biomarkers determine your protocol
> variant map updates every hour
No one would wait for governments.
@bnielson01 The theorem assumes that the same algorithm has to work across all problems, but if we remove that assumption and allow the entity to specialize its algorithm depending on the problem, it can become really capable in a wide variety of domains.
Been thinking a lot about continual learning and I feel we probably have it backwards.
Most formulations care about reducing “catastrophic forgetting” on previously learned tasks when you learn new tasks, but what matters in the real world is speed of adaptation to new tasks.
It’s irrelevant if, as adults, we can solve grade 10 math exams; what matters is if we have learned good representations that are composable such that we can adapt to new tasks with minimal training. You’ve trained well if you can re-learn grade 10 math quickly as an adult, not that you can solve it out of the box.
So we should be measuring performance of AI systems on future expected distributions of tasks, not the distribution encountered in the past.
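One way to sketch that kind of measurement, as a toy under stated assumptions: the "task" is linear regression, the new task is a small shift of an old one, and the metric is the number of gradient steps needed to reach a target error. The helper `steps_to_adapt`, the weights, and the thresholds are all hypothetical choices for illustration, not an established benchmark.

```python
import numpy as np

def steps_to_adapt(w_init, X, y, lr=0.1, target_mse=1e-3, max_steps=10_000):
    """Count gradient steps until mean squared error on (X, y) drops below target_mse."""
    w = w_init.copy()
    for step in range(max_steps + 1):
        residual = X @ w - y
        if np.mean(residual ** 2) < target_mse:
            return step
        w -= lr * (2.0 / len(y)) * (X.T @ residual)
    return max_steps

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Hypothetical old task and a nearby new task (a small shift in the true weights).
w_old_task = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
w_new_task = w_old_task + 0.05 * np.array([1.0, 1.0, -1.0, 1.0, -1.0])
y_new = X @ w_new_task

# "Out of the box" error on the new task, using the old solution unchanged.
zero_shot_mse = float(np.mean((X @ w_old_task - y_new) ** 2))

# Adaptation speed: steps needed from the old solution vs. from scratch.
steps_from_prior = steps_to_adapt(w_old_task, X, y_new)
steps_from_scratch = steps_to_adapt(np.zeros(5), X, y_new)
```

Here the old solution does not meet the target on the new task out of the box, but it gets there in fewer steps than a fresh start, which is the quantity this framing says to care about.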