To build machines that can program themselves to do any task, you need to measure how well they learn to do things they were never programmed to do.
Today we release a new benchmark that's the first attempt to do this for any real-world task.
interesting how relativity means travelling to larger and larger scales of the universe will require mastering tinier and tinier scales of autonomy/self-replication
i would argue that the nature of the ARC challenge is that incremental improvements mean you are doing it wrong. a true solution should move from not working to near 100% in one go
Collating a thread of examples of biological intelligence generalising to out-of-distribution tasks from just a few samples.
1. Pro basketball player mentally composes models of the hoop and the windmill to simulate the right trajectory for the throw, in just two tries.
the language server protocol is severely underrated/underutilized technology
an instrument to collaborate live with a machine via shared syntax & context in realtime
a linear gain in 'intelligence' (or more accurately, benchmark performance) for an exponential increase in resources
fancy betting on a curve that has explicit diminishing returns
The correct learning regime will display the exact opposite curve
src : https://t.co/QoBViISmv4
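a minimal sketch of what that curve looks like, assuming a log-linear law of the form score = A + B * log10(compute). `A`, `B`, `score` and `compute_needed` are made-up names and constants for illustration only, not fitted to any real benchmark or lab's results:

```python
import math

# Hypothetical constants, chosen only for illustration (not real benchmark data).
A = 20.0  # assumed score at 1 unit of compute
B = 5.0   # assumed points gained per 10x increase in compute


def score(compute: float) -> float:
    """Log-linear law: score grows linearly in log10(compute)."""
    return A + B * math.log10(compute)


def compute_needed(target: float) -> float:
    """Inverse of the law: compute grows exponentially in the target score."""
    return 10 ** ((target - A) / B)


if __name__ == "__main__":
    # Every 10x step in compute buys the same fixed number of points (B).
    for exponent in range(5):
        c = 10.0 ** exponent
        print(f"compute={c:>8.0f}  score={score(c):.1f}")

    # Every fixed step in score costs 10x more compute than the previous one.
    for target in (20.0, 25.0, 30.0, 35.0):
        print(f"target={target:.0f}  compute_needed={compute_needed(target):.0f}")
```

under these assumptions, each extra `B` points costs ten times the compute of the previous `B` points, which is the diminishing-returns shape described above.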
a deep learning network : rigid, dense fully-connected, ordered, non-local updates, needs a lot of data/energy to learn anything new
a biological brain : flexible, sparse, local updates, messy, can adapt to a new task from a few samples and on less than a few watts
https://t.co/9Wgz74Ktr5
per "saturate in months" prediction : LLM RL post-training follows the same log-linear laws as pre-training, which means a similar theoretical wall (reached faster)
(in graphs : xAI 10x’d the amount of compute used on RL for only a marginal perf improvement)
first principles :
> reasoning model perf is downstream of base model perf
> if base model perf plateaus, reasoning model perf is a few months from plateauing
> unless there's a way to auto-induce shorter CoTs
> unlikely - penalizing CoT length leads to memorization in practice
every day we wake up and pretend the world we live in is completely normal while the thing on the left routinely wraps & dissolves itself to turn into the thing on the right