In Sept 2024, o1 surprised many purists who thought inference-time scaling for LLMs would come through MCTS.
What if a connection exists, just implicit? What does it imply?
New post: "Squint enough and RLing CoT reasoners is approximable as Monte Carlo Tree Search policy learning." 🧵
7/ Looking beyond this paper: scaling compute against a fixed, limited pool of data will need new primitives. Searching over a population of models is a different problem than standard gradient descent training and we've barely scratched the surface. We hope q0 pushes people toward crazy ideas in multi-epoch training and scaling compute in general!!
1/ Now that we're running out of data, how do you optimally scale multi-epoch pretraining to hundreds of epochs?
Our first paper from Q! q0 trains a population of models, instead of a single model that saturates fast, reaching a dramatically lower loss at *every* epoch budget.
w/ @bishmdl76 @akshayvegesna @ShmuelBerman
@soldni regularization is BACK i suppose. dropout 0.15 is quite large and i don't think anyone else uses dropout in the big 26. also rather high std for init these days but you can't go wrong with a good old 0.02. also why depth scale output proj when you have sandwich norm??
This paper empirically ~verifies the section of my first Zipfian grokking blog post where I hypothesize about how capacity competition dynamics extrapolate from the grokking to language pretraining case
Cool work from the authors! :)
q: "why don't Sora-like models learn compositional physics understanding or do ICL like how language models learn compositional semantics?"
a: every attempt to date heavily leaks information from the future. some even bake it into the bottleneck design without realizing (!!!)
Rule changes for the SpaceX $SPCX IPO:
Index providers waived the profitability requirement and cut the seasoning window from 90 days to 5.
This forces over $30 trillion in passive 401k and retirement money to buy SpaceX at IPO valuations.
Bloomberg Intelligence estimates S&P 500 funds must absorb 19% of SpaceX's float within 6 months.
Russell 1000 and Nasdaq 100 funds will absorb 24%.
The rules built to protect passive investors:
1. S&P 500 has required 12 months of trading and 4 quarters of GAAP profitability since 2002. Both waived.
2. Nasdaq cut its inclusion window from 90 trading days to 15.
3. FTSE Russell cut its window to 5.
All three benchmarks are now structured to buy SpaceX at IPO pricing.
The following animation conveys the intuition: when a 1-neuron model tries to learn two tasks, the frequent task updates suppress the infrequent task updates. The 2-neuron model can dedicate a neuron to the infrequent task once the frequent one is fully learned.
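A minimal sketch of that intuition, under assumptions that are not from the original thread: a tiny linear model with `num_neurons` hidden units, two tasks given as one-hot inputs that must be copied to the output, a frequent task seen 90% of the time, and plain gradient descent on the frequency-weighted squared error. The function name `train`, the init values, the learning rate, and the step count are all hypothetical choices for illustration.

```python
def train(num_neurons, p_frequent=0.9, lr=0.1, steps=2000):
    """Toy linear net: output = sum_j v[j] * dot(w[j], x).

    Task 0 (frequent) maps input [1, 0] to target [1, 0].
    Task 1 (infrequent) maps input [0, 1] to target [0, 1].
    Returns the unweighted squared error of each task after training.
    """
    p = [p_frequent, 1.0 - p_frequent]  # how often each task is seen
    # Hypothetical small init that treats both tasks symmetrically.
    init = [[0.1, 0.1], [0.1, -0.1]]
    w = [row[:] for row in init[:num_neurons]]  # input weights per neuron
    v = [row[:] for row in init[:num_neurons]]  # output weights per neuron

    def outputs():
        # m[r][i]: output coordinate r when the input is task i's one-hot vector
        return [[sum(v[j][r] * w[j][i] for j in range(num_neurons))
                 for i in range(2)] for r in range(2)]

    def target(r, i):
        return 1.0 if r == i else 0.0

    for _ in range(steps):
        m = outputs()
        # Gradient of the frequency-weighted squared error w.r.t. m:
        # the frequent task's residual is scaled by 0.9, the rare one by 0.1.
        g = [[2.0 * p[i] * (m[r][i] - target(r, i)) for i in range(2)]
             for r in range(2)]
        for j in range(num_neurons):
            grad_v = [sum(g[r][i] * w[j][i] for i in range(2)) for r in range(2)]
            grad_w = [sum(g[r][i] * v[j][r] for r in range(2)) for i in range(2)]
            for k in range(2):
                v[j][k] -= lr * grad_v[k]
                w[j][k] -= lr * grad_w[k]

    m = outputs()
    return [sum((m[r][i] - target(r, i)) ** 2 for r in range(2))
            for i in range(2)]


if __name__ == "__main__":
    for n in (1, 2):
        frequent_err, infrequent_err = train(n)
        print(f"{n} neuron(s): frequent={frequent_err:.4f} "
              f"infrequent={infrequent_err:.4f}")
```

Under these assumptions the 1-neuron run should end with the frequent task fit and the infrequent task's error stuck near 1, while the 2-neuron run fits both. Because the toy is linear, it only illustrates the capacity effect; it does not show a specific neuron specializing.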
a quick way to force oneself into thinking about a thing is maintaining a list of words about that thing and just staring at it
something something required circuits activate from high cosine similarity
I was thinking about it again recently, Google Allo was really ahead on the idea of chatting with Google Assistant or @'ing in conversations to build out this Agent/AI UX we have now
my favorite interp researcher can identify neurons responsible for any behavior and provide steering vectors for them
her name is backprop and her steering vectors are just gradients
@mschoening and I are starting a podcast where we nerd out about human-AI collaboration and malleable software.
In this episode: is HTML actually better than Markdown? and an alternative to Software Factories...
Watch on YT: https://t.co/O2DwUTWm4o
Artificial intelligences do not undergo experiences, do not possess a body, do not feel joy or pain, do not mature through relationships, and do not know from within what love, work, friendship or responsibility mean. Nor do they have a moral conscience, since they do not judge good and evil, grasp the ultimate meaning of situations, or bear responsibility for consequences. They may imitate or even simulate, but they do not understand what they produce, for they lack the affective, relational, and spiritual perspective through which human beings grow in wisdom. #MagnificaHumanitas
Yes, AIs are going to do all or almost all of the pure theory, but tbh humans probably finished most of the pure theory that it's possible for humans to do by the end of the 20th century. Yes there has been some recent theory progress but let's be honest, most is of marginal economic value at best. There's probably lots of useful pure theory left to do in this universe, but it's probably not the kind of stuff that can be intuited by a single human, explained to a grad student, and written down in a textbook. AI will do all that stuff.
1. people underestimate how hard this problem is
2. universal issue. IGCSE billed ~Rs 40k for exams - still many papers leaked
3. change is much harder than running things as is. migration to OSM requires competence++++
4. with privatization, public sector => competence----