Have you ever asked your model to grade itself? 🤖✅
In my new post you’ll learn:
• The 4-layer QA stack (safety filter → string checks → LLM judge → human audit)
• When a “second LLM” boosts quality (chatbots, RAG, tool-using agents) https://t.co/2yU5pNUJqJ
Throwing on a small $BTC short here 🤏. Know the odds: ETFs keep gulping coins, dominance still leaning north, and a clean close > 112K torches the trade. Tight stop, need ETF flows to flip red & alts to pop fast. Risk on, eyes wide open #Bitcoin #Bitcoin2025
It’s more than just training models.
In my latest post, I walk you through how to take an ML idea all the way to production — step by step, with real examples on Azure and AWS.
If you’re ready to truly own the pipeline, this one’s for you.
https://t.co/PaG6W0yPDP
From the GPT-4.5 System Card:
"GPT-4.5 is not a frontier model, but it is OpenAI's largest LLM, improving on GPT-4's computational efficiency by more than 10x."
It offers:
— increased world knowledge
— improved writing ability
— refined personality
2-7% lift over GPT-4o on SWE-Bench
I got early access to ChatGPT Operator.
It's OpenAI's new AI agent that autonomously takes action across the web on your behalf.
The 9 most impressive use cases I’ve tried (videos sped up):
1. Ordering dinner ingredients based on a picture and a recipe
o3 is really special and everyone will need to update their intuition about what AI can/cannot do.
while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI
semiprivate v1 scores:
* GPT-2 (2019): 0%
* GPT-3 (2020): 0%
* GPT-4 (2023): 2%
* GPT-4o (2024): 5%
* o1-preview (2024): 21%
* o1 high (2024): 32%
* o1 Pro (2024): ~50%
* o3 tuned low (2024): 76%
* o3 tuned high (2024): 87%
given i put in the original $1M @arcprize, i'd like to re-affirm my previous commitment. we will keep running the grand prize competition until an efficient 85% solution is open sourced.
but our ambitions are greater! ARC Prize found its mission this year -- to be an enduring north star towards AGI.
the ARC benchmark design principle is to be easy for humans, hard for AI, and so long as there remain things in that category, there is more work to do for AGI.
there are >100 tasks from the v1 family unsolved by o3 even on the high compute config which is very curious.
successors to o3 will need to reckon with efficiency. i expect this to become a major focus for the field. for context, o3 high used 172x more compute than o3 low which itself used 100-1000x more compute than the grand prize competition target.
we also started work on v2 in earnest this summer (v2 is in the same grid domain as v1) and will launch it alongside ARC Prize 2025. early testing is promising even against o3 high compute. but the goal for v2 is not to make an adversarial benchmark; rather, it is to be interesting and high signal towards AGI.
we also want AGI benchmarks that can endure many years. i do not expect v2 will. and so we've also started turning attention to v3 which will be very different. i'm excited to work with OpenAI and other labs on designing v3.
given it's almost the end of the year, i'm in the mood for reflection.
as anyone who has spent time with the ARC dataset can tell you, there is something special about it. and even more so about a system that can fully beat it. we are seeing glimpses of that system with the o-series.
i mean it when i say these are early days. i believe o3 is the alexnet moment for program synthesis. we now have concrete evidence that deep-learning guided program search works.
we are staring up another mountain that, from my vantage point, looks as tall and important as deep learning for AGI.
many things have surprised me this year, including o3. but the biggest surprise has been the increasing response to ARC Prize.
i've been surveying AI researchers about ARC for years. before ARC Prize launched in June, only one in ten had heard of it.
now it's objectively the spear tip benchmark, being used by spear tip labs, to demonstrate progress on the spear tip of AGI -- the most important technology in human history.
@fchollet deserves recognition for designing such an incredible benchmark.
i'm continually grateful for the opportunity to steward attention towards AGI with ARC Prize and we'll be back in 2025!