>Claims auto AI research
>Look under the hood, it’s just smart hyperparam sweeps in the LLM paradigm
Sure, that’s valuable, and all respectable AI researchers already do it in some capacity. But real auto research is discovering new arch/algo innovations; everything else is playing around.
@YiMaTweets Academic incentives are just as corrupted, which contributes to a high noise-to-signal ratio. The pace of unreviewed and poorly conducted science is astounding. Many papers are not replicable. Too many focus on exploitation rather than exploration.
Most of JAX’s advantage is erased if you are a cracked engineer. The only reason to use JAX is some level of convenience (functional transforms? out-of-the-box compilation?) and maybe building on TPUs. No serious AI researcher or engineer just accepts vanilla PyTorch’s performance disadvantage: we profile, fuse ops, write custom kernels and optimize the fuck out of the implementation we want to ship.
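For anyone unsure what "fuse ops" means, here is a toy sketch. It is a pure-Python stand-in, not real kernel code: the function names are made up for illustration, and in practice the fusion is done by a compiler or a hand-written GPU kernel. The point is just that the fused version makes one pass and skips the intermediate buffers.

```python
def scale_bias_relu_unfused(xs, scale, bias):
    # Pass 1: multiply, materializing a full intermediate buffer
    scaled = [x * scale for x in xs]
    # Pass 2: add bias, materializing a second intermediate buffer
    shifted = [s + bias for s in scaled]
    # Pass 3: ReLU, writing the output
    return [max(s, 0.0) for s in shifted]


def scale_bias_relu_fused(xs, scale, bias):
    # Single pass: each element is read once and written once,
    # with no intermediate buffers in between
    return [max(x * scale + bias, 0.0) for x in xs]


xs = [-2.0, -0.5, 0.0, 1.5, 3.0]
# Both versions compute the same thing; only the memory traffic differs
assert scale_bias_relu_unfused(xs, 2.0, 1.0) == scale_bias_relu_fused(xs, 2.0, 1.0)
```

On real hardware the win comes from memory bandwidth: three elementwise kernels read and write the tensor three times, one fused kernel does it once.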
Thanks for the thoughtful reply,
I just revisited their pub and it does seem, to your point, that their new platform now deliberately plates excitatory and inhibitory neurons together (~80/20 E/I ratio). In this case they’ve moved to the deeper active-inference principle, where the training rule is now structured, predictable stim (i.e. low surprise = positive reward) versus unstructured, random stim (i.e. high surprise = negative reward). They tune the stim params (frequency, pattern, amplitude, which electrodes, burst length, ...) so the reward signal still drives the right plasticity.
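A rough sketch of that feedback rule as I read it, not their actual protocol. The electrode count, pulse count, and timing ranges below are made-up placeholders, and `feedback_stimulus` is a hypothetical helper:

```python
import random

N_ELECTRODES = 8         # assumed number of feedback electrodes
FIXED_INTERVAL_MS = 10.0  # assumed inter-pulse interval for structured stim


def feedback_stimulus(success, rng, n_pulses=8):
    """Return a list of (electrode, delay_ms) pulses.

    Success -> the same structured, predictable pattern every time (low surprise).
    Failure -> random electrodes and random timing (high surprise).
    """
    if success:
        return [(i % N_ELECTRODES, FIXED_INTERVAL_MS) for i in range(n_pulses)]
    return [
        (rng.randrange(N_ELECTRODES), rng.uniform(1.0, 50.0))
        for _ in range(n_pulses)
    ]


# The "reward" pattern is identical no matter the RNG state
assert feedback_stimulus(True, random.Random(0)) == feedback_stimulus(True, random.Random(1))
```

There is no scalar reward anywhere in this loop. The only thing that changes between good and bad outcomes is how predictable the incoming stimulation is.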
Maybe training isn’t saturated yet and more can be squeezed out of these archs for now. Even so, if we want to train or serve 100T+ params or the equivalent, we can’t rely on the current paradigm; we need novel archs and training methods. Very soon we will see models competitive with the current frontier that don’t need billions of dollars of compute or even trillions of tokens of data.
Addressing some comments and DMs:
What I mean is that the objective is to replicate human performance and action efficiency. You can solve ARC-AGI-3 tasks in many different ways, through LLM agent harnesses or by brute forcing with classic RL. But these do not meet my criteria, and I don't find these solutions interesting for solving the fundamental problem/gap in AI right now. The agent (however architected or pre-trained, even on the public ARC-AGI-3 tasks or the train tasks from previous challenges) cannot rely on these tricks when evaluated on a new, previously unseen task/game (they won't even work that well if the ARC-AGI-3 held-out tasks are TRULY unique).
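To make "action efficiency" concrete, here is one way you could score it. This is my own illustrative definition, not the official ARC-AGI-3 scoring formula, and the function names and numbers are hypothetical:

```python
def action_efficiency(agent_actions, human_actions, solved):
    """Ratio of human to agent action counts on one unseen game, capped at 1.0.

    Unsolved games score 0.0 regardless of how few actions were taken.
    """
    if not solved or agent_actions <= 0:
        return 0.0
    return min(1.0, human_actions / agent_actions)


def mean_efficiency(results):
    """Average efficiency over (agent_actions, human_actions, solved) tuples."""
    if not results:
        return 0.0
    return sum(action_efficiency(a, h, s) for a, h, s in results) / len(results)


# Made-up numbers: twice the human action count, under the human count, unsolved
held_out = [(100, 50, True), (40, 50, True), (10, 50, False)]
print(mean_efficiency(held_out))
```

Under a metric like this, brute-force exploration gets punished even when it eventually solves the game, which is the whole point.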
The thing about ARC is that you get a human baseline for difficult tasks in a contained env, which also points to the missing capabilities we need, like online learning and reasoning. What matters is how you frame the problem. I've observed a few ppl playing the games (with varying amounts of "pretraining") and the benchmark is honestly great at making the challenge of replicating these capabilities achievable/digestible (IF you approach it in the right way).
Knowing the interface shouldn't matter; it's something held constant. Not knowing the rules, and having those rules be difficult, unique, and dependent on exploration + reasoning + online learning, is high enough complexity to get at what matters.
Classic RL models will likely fail on unseen tasks if the games are different enough (which is the premise of this benchmark; if that assumption fails, then the whole benchmark fails). It's like taking a policy trained on a bunch of Atari games and testing it on a totally new game with the same interface (MANY papers during the RL golden era explored this): it may get lucky, but it will likely not perform well or at human level.
If your ARC-AGI-3 solution is to just run classic RL on the games themselves (e.g. the tufa attempt), or if you're hardcoding anything (e.g. agentica), then you clearly missed the point of the benchmark. The score doesn't matter if the process is not true to the qualities we want from the next paradigm. A human doesn't need prior instruction or that many attempts to solve each game.
@leothecurious they both address shortcomings in current models (from diff perspectives), which imo points to the same missing fundamental quality. but yea ARC is cleaner/more focused