@bnafOg @elonmusk How should we interpret a result that is materially above the current reported range, but:
- is fully ARC-evaluated (private/semi-private sets),
- comes from a single model (low millions of parameters),
- and doesn’t rely on task-specific scaffolding?
@cb_doge @bridgebench How does Grok perform on ARC-AGI-3? This is the only legitimate test of REAL reasoning in novel environments. Can you match our score of 15.70%?
@elonmusk Hi Elon, best wishes from Athens, Greece!
This is my first-ever post!!
I’m curious whether you are familiar with the ARC-AGI-3 test. I have emailed you at [email protected]