Your Codex activity now has a home, and an easier way to share it.
Codex profiles show your activity graph, streaks, lifetime tokens, peak daily tokens, and top features like plugins and /fast mode.
Private by default. Share a card when you want to.
@QiaochuYuan there's something quite weird with how 4.8 has learned to 'push back' that seems related to this, too. like deliberate strawman counterarguments that are chosen to be easy to knock down, playing fake-high within low specifically to give User the chance to get a reversal and win
opus 4.8 has +1.11% gain, +12 almost-resolved tasks but is ~4.2x slower and ~5.2x more expensive compared to gpt-5.5.
in other words, would you hire the software engineer who can deliver code with more edge cases covered but takes far longer and costs far more?
there are many good decisions in the benchmark design that I learned about through implementing programbench on our site.
check out the results and let us know your thoughts!
Anthropic just dropped another powerhouse model, Opus 4.8, and it’s the new SOTA on the Vals Index (70.2%) and Vals Multimodal (70.7%). Full results below.
@OfirPress curious to know your thoughts on this plot https://t.co/dVabvvak7k. would introducing compaction/truncation help with smaller context models?
Qwen 3.7 Max is Alibaba's latest reasoning model, ranking 5th on the Vals Index with a score of 57.3%. We ran it across our full benchmark suite. Full results below.
Pitch us a benchmark or eval technique. We'll fund you to build it.
We're opening applications for the Vals Fellowship. 3–6 months working on the hardest open problems in AI evaluation, with the resources to actually solve them.
What you get:
- Unlimited API credits + budget capacity for GPUs and human data
- Vals’ evaluation infrastructure
- $1,000–2,500 / week stipend
- A network of evals researchers across frontier labs and academia
Location: Both remote and in-person (SF) applications will be considered.
To succeed on twitter you have to understand that everything is a matter of life and death. You don’t just post a picture of your sandwich and say “this was a good sandwich.” You post it and say “never kill yourself.” The sandwich represents life.
Anthropic co-founder Chris Olah was invited to speak at today's presentation of Pope Leo XIV's encyclical "Magnifica humanitas."
Read the full text of his remarks: https://t.co/CoBfkVOVcy
After nearly 18 years I can stop working on Model S and X. We put so much love into these products, but will continue to pour that into the future products. Thanks to everyone who believed in and supported these cars through the years. We strived for the best and will never stop. Saying goodbye to something great and making room for something even greater!
Something I noticed when I visited China was public schools always started their days off with a run.
A school in Naperville, Illinois, did an experiment on this and called it "Zero Hour".
Before school, students would hit the gym at 7am and push their heart rates to 80% of their max. Then they went on to class.
The result? Reading scores doubled. Math scores jumped 20x.
On an international test, Naperville 8th graders finished 1st in science (beating Singapore) and 6th in math globally.
Some of my entrepreneur clients swear by doing cardio in the morning. They say it keeps their brain sharp. I don't disagree.
Cardio isn't just for your heart. It's brain fuel.
Exhaust the body to sharpen the mind.
Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946.
For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids.
An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better.
This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.
New blackboard lecture w @ericjang11
He walks through how to build AlphaGo from scratch, but with modern AI tools.
Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.
Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.
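A rough way to see the contrast, as a toy sketch rather than anything from the lecture: in naive policy gradient every token gets scaled by the same trajectory-level advantage, while an AlphaGo Zero-style setup turns the MCTS visit counts at each move into its own target distribution. The function names, visit counts, and policy values here are all made up for illustration.

```python
import math

def reinforce_token_weights(num_tokens, reward, baseline=0.0):
    """Naive policy gradient: every token's log-prob gradient is scaled
    by the same trajectory-level advantage, so no token is singled out."""
    advantage = reward - baseline
    return [advantage] * num_tokens

def mcts_policy_target(visit_counts, temperature=1.0):
    """AlphaGo Zero-style target: normalize visit counts (raised to
    1/temperature) into a distribution over moves at one position."""
    scaled = [c ** (1.0 / temperature) for c in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]

def cross_entropy(target, policy):
    """Per-move loss that pulls the policy toward the search target."""
    return -sum(t * math.log(p) for t, p in zip(target, policy) if t > 0)

# One scalar signal spread evenly over a 5-token trajectory.
weights = reinforce_token_weights(num_tokens=5, reward=1.0)

# One dense target for a single move with three candidate actions
# (hypothetical visit counts and policy probabilities).
target = mcts_policy_target([10, 30, 60])
loss = cross_entropy(target, [0.2, 0.3, 0.5])
```

The first function has nothing to say about which token mattered; the second gives a full distribution to train against at every move, which is the sense in which search sidesteps credit assignment.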
Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative for all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.
Timestamps:
0:00:00 – Basics of Go
0:08:06 – Monte Carlo Tree Search
0:31:53 – What the neural network does
1:00:22 – Self-play
1:25:27 – Alternative RL approaches
1:45:36 – Why doesn’t MCTS work for LLMs
2:00:58 – Off-policy training
2:11:51 – RL is even more information inefficient than you thought
2:22:05 – Automated AI researchers
Still incredible that the DeepMind documentary has footage of the exact moment Demis is told that AlphaFold can “easily” predict all known (1-2B) protein sequences “in a month” and he says to do it.
Then, it shows the moment AlphaFold is released to the world.
new: Codex can now drive tabs in Chrome, working in background tabs alongside you. Get the new Chrome extension today.
also: we shipped a ton of performance improvements in the app. should feel a lot better.
happy codexing!