There are only two honest metrics when it comes to benchmarking intelligence: novelty and efficiency.
You don't need intelligence to solve a known problem (only memory). And you don't need intelligence to solve a problem via brute force. But to solve a novel problem efficiently, intelligence is the only way.
Product launches usually paint an ambitious vision of what they want to achieve.
The launch of Vibes paints the vision of people (and kids!) glued to their phones, scrolling through AI slop (infused with ads eventually, obviously).
What a terrible future.
I hope it never happens.
Yes.
Writing is not a second thing that happens after thinking. The act of writing is an act of thinking. Writing *is* thinking.
Students, academics, and anyone else who outsources their writing to LLMs will find their screens full of words and their minds emptied of thought.
We just added @OpenAI's powerful new o3 and o4-mini agents to this graph. The results are striking.
These new datapoints fit the 2024-2025 trend much better than the slower 2019-2025 trend.
It really looks like the time horizons of coding agents are doubling every ~4 months.
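To make the ~4-month doubling concrete, here is a minimal sketch of what steady exponential growth would imply. The function names and the 1-hour starting horizon are illustrative assumptions, not figures from the graph, and the projection only holds if the trend continues unchanged.

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 4.0) -> float:
    """Project a task time horizon forward, assuming it doubles every `doubling_months`."""
    return current_hours * 2 ** (months_ahead / doubling_months)


def annual_growth_factor(doubling_months: float = 4.0) -> float:
    """How many times larger the horizon gets in 12 months under steady doubling."""
    return 2 ** (12 / doubling_months)


if __name__ == "__main__":
    # Hypothetical starting point: an agent that can handle 1-hour tasks today.
    for months in (4, 8, 12, 24):
        print(months, "months ->", projected_horizon(1.0, months), "hours")
    print("growth per year:", annual_growth_factor())
```

Under this assumption, a 4-month doubling time works out to three doublings, or an 8x increase, per year.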
I wish more AI lab leaders would spell out a vision for the world, one that is clear about what they think life will actually be like for humans living in a world of AGI.
Faster science & productivity, good - but what is the experience of a day in the life in the world they want?
Downloaded this app last night. And was not prepared for how jaw-dropping it is. You genuinely feel like you're flying. Anywhere in the world. And if you go low enough, you can view immersive Street View. AR/VR apps done well are like downloading superpowers.
I just downloaded FLY – Explore the Earth on Vision Pro, and I can’t stop using it!
You’re in a little aircraft, soaring over any location. Simply lean in the direction you want to go. It’s immersive, even with Google Maps graphics. Highly recommend!
It's totally insane that humans have figured out physics and chemistry to the point where you can do a bunch of math, then launch a giant hunk of metal and fuel to space, then have it come back and land (...get caught!) exactly where your math said it would.
That, plus the hardware / software / real-time algorithm systems to perfectly nestle up to the tower, autonomously.
Just an astonishing level of precision. Incredible work, @SpaceX.
The 3D artists at the weather channel deserve a raise for this insane visual
Now watch this, and then realize forecasts are now predicting up to 15 ft of storm surge in certain areas on the western coast of Florida
With Web + Work + Pages, you can now ideate with AI and collaborate with other people.
It’s just magical.
You can learn more here: https://t.co/N0iE3Rv21I
OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there are only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to the latter.
1. You don't need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well in benchmarks like trivia QA. It is possible to factor out reasoning from knowledge, i.e. a small "reasoning core" that knows how to call tools like browser and code verifier. Pre-training compute may be decreased.
2. A huge amount of compute is shifted to serving inference instead of pre/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process is a well-studied problem like AlphaGo's Monte Carlo tree search (MCTS).
3. OpenAI must have figured out the inference scaling law a long time ago, which academia is just recently discovering. Two papers came out on arXiv a week apart last month:
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. find that DeepSeek-Coder increases from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. find that PaLM 2-S beats a 14x larger model on MATH with test-time search.
4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild, how do you decide when to stop searching? What's the reward function? Success criterion? When to call tools like code interpreter in the loop? How to factor in the compute cost of those CPU processes? Their research post didn't share much.
5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples, which contain both positive and negative rewards.
This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo’s value network — used to evaluate quality of each board position — improves as MCTS generates more and more refined training data.
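The repeated-sampling and flywheel ideas above can be sketched on a toy problem. This is not OpenAI's method or the setup from either paper: `propose` is a hypothetical stand-in for one model sample, `verify` is a hypothetical stand-in for a checker such as unit tests, and the task (guess an integer square root) is made up for illustration.

```python
import random


def propose(rng: random.Random, low: int, high: int) -> int:
    """Hypothetical stand-in for one model sample: a random candidate answer."""
    return rng.randint(low, high)


def verify(candidate: int, target: int) -> bool:
    """Hypothetical stand-in verifier: does the candidate square to the target?"""
    return candidate * candidate == target


def solve_with_samples(target: int, k: int, seed: int):
    """Draw up to k samples, stopping at the first verified answer.

    Returns (solved, trace); trace is a list of (candidate, reward) pairs,
    so it holds both negative (0) and positive (1) examples.
    """
    rng = random.Random(seed)
    trace = []
    for _ in range(k):
        candidate = propose(rng, 0, 20)
        reward = 1 if verify(candidate, target) else 0
        trace.append((candidate, reward))
        if reward:
            return True, trace
    return False, trace


def coverage(targets, k: int) -> float:
    """Fraction of problems solved by at least one of k samples."""
    solved = sum(solve_with_samples(t, k, seed=i)[0] for i, t in enumerate(targets))
    return solved / len(targets)


if __name__ == "__main__":
    targets = [n * n for n in range(1, 11)]  # toy problems with known answers
    for k in (1, 10, 100):
        print(f"k={k}: coverage={coverage(targets, k):.2f}")
```

Because each problem reuses its own seed, the first k samples are the same for any larger budget, so coverage can only stay flat or rise as k grows. The traces from solved problems are the kind of reward-labeled data item 5 describes feeding back into training.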
This app is *amazing.* Just being able to teleport to any location with Street View is enough, but having all your photos layered on top is so emotional. Exciting to see more mixed reality apps getting built!
this is officially one of my favorite new apps: Sceno on Apple Vision Pro!
i couldn’t stop smiling while reliving memories exactly where they happened.
you can browse past photos in immersive panoramas and explore new locations. i highly recommend it!