I disproved a 50-year-old math conjecture with AI.
With GPT-5.5 Pro I found counterexamples to the Beneš Conjecture as well as the related Shuffle-Exchange Conjecture in network theory.
I've joined @OpenAI to work on Codex
@ajambrosino and team have built a very good app! It's the first coding agent GUI that got me out of the terminal
Excited to help make it even better, especially as it goes beyond software engineers
Also delighted to get to work with old friends @gpeal8 and @tarstarr again
Introducing web-scale /monitor.
An always-on search that watches the web & pings you or your agent the moment something comes online.
Previously, /monitor only worked for single pages or websites. Now you can use that power on the entire web.
Available today.
You can embed this model practically anywhere - like a Chrome extension
This is Transformers.js + Rampart for real-time PII removal in your browser (personal email still blurred)
You can see I'm toggling fields - the model is removing them from text almost instantly: zip, state, city.
Since the model is local, data never leaves my device.
Rampart doesn't hide API keys, but regex can handle those
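To make the regex point concrete, here is a minimal sketch of a key-redaction pass that could run alongside the model. The patterns and the `redact_api_keys` helper are illustrative assumptions, not part of Rampart or Transformers.js, and real key formats vary by provider.

```python
import re

# Illustrative patterns only (assumptions): real key formats differ by
# provider and change over time, so treat this list as a starting point.
API_KEY_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),   # assumed "sk-" prefixed secret
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),      # AWS-style access key ID
    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),   # GitHub-style personal token
]


def redact_api_keys(text: str, placeholder: str = "[REDACTED_KEY]") -> str:
    """Replace anything that looks like an API key with a placeholder."""
    for pattern in API_KEY_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    sample = "export TOKEN=sk-" + "a" * 24 + " # do not share"
    print(redact_api_keys(sample))
```

Running the regex pass after the model keeps the two concerns separate: the model handles fuzzy PII such as names and addresses, while fixed-format secrets are caught by pattern.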
Aside from obvious enterprise implications, Rampart will be huge for privacy-forward individuals
I'm working on getting this live on the Chrome Web Store. The smol model era is here - next I want to see a native Mac app 👀
Thanks to @xenovacom & @huggingface for the amazing libraries and @ndstudio for the awesome model
We're opening the waitlist for our Monetization Gateway, which will allow you to charge for any web page, dataset, API, or MCP tool behind Cloudflare. The charges will settle in stablecoins over the x402 open protocol. https://t.co/pvICtEIixj
Under President Trump’s leadership the United States is the undisputed winner in the AI race.
My gratitude to companies across industries who continue to work closely with the White House to implement the President’s EO: “Promoting Advanced AI Innovation and Security.” This includes excellent work around advanced model access and guardrail testing and security. The government and private sector have worked together in a way we have never seen before and this foundation of America First is unprecedented.
Our shared priority remains: get the best tech deployed as quickly and safely as possible.
Finally releasing Fable 5 and Mythos 5 for use. Should be available tomorrow!
“Anthropic has taken steps in close coordination with the U.S. government to address the risks associated with Claude Mythos 5 and Claude Fable 5. Among other things, Anthropic has agreed to proactively detect and address security risks associated with the models; to work diligently with the U.S. government on protocols and standards and releases for Mythos, Fable, and future models; and to inform the U.S. government of any malicious activity.”
“In light of these actions and commitments, as well as the Bureau of Industry and Security's evaluation of the diversion risks now presented by Claude Mythos 5 and Claude Fable 5, the controls in the June 12 letter are withdrawn.”
noticed that this bench was run with safeguards turned off. on a production version it scores zero. same situation as a third-party eval finding that fable rejects 99.5% of prompts while anthropic posts self-reported scores achieved on a non-production config. this is unacceptable, and we should reject such deceitful evals.
After 18 months of writing, coding, and experimenting, Build a Reasoning Model (From Scratch) is finally out!
My first copies just arrived! 📚
440 full-color pages. Inference scaling, reinforcement learning, and distillation from scratch.
what’s a little funny about the “GPT weak on frontend” discourse is that everything we ship in the codex app gets adopted by the entire industry within days or weeks, pixel for pixel
We’re introducing GeneBench-Pro, a research-level benchmark for a harder kind of AI progress: how well agents can navigate messy biological data, choose the right analysis path, and make judgment calls that real computational research depends on.
https://t.co/AsilnnSxnE
⚙️ We debugged a year’s worth of crashes in our data infrastructure and found one issue in the hardware and another that had gone unnoticed in open-source code for 18 years.
Here’s how we tracked them down:
https://t.co/5c13Knw69o
Excited to share that my paper "Size Doesn't Matter: Cosine-Scored Sparse Autoencoders" got accepted as a spotlight at ICML!
We propose cosine sparse autoencoders (SAEs), which have:
- 14.6% better top-1 sparse probing accuracy
- ~3x more features discovered
- matched FVE and interpretability
- a minimal recipe change
SAEs detect features via inner product, so a feature's activation scales with both its directional alignment and the input's norm
But sublayer normalization discards magnitude entirely, which means the encoder detects a quantity the model does not read!
A learned scalar parameter is free to recover inner product scoring but doesn't, showing that 74% of magnitude is noise
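A toy sketch of the norm-sensitivity argument, assuming a single encoder direction `w` and a ReLU activation. This is not the paper's implementation, and the function names are made up for illustration. It only shows that rescaling the input rescales an inner-product score but leaves a cosine score essentially unchanged.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def inner_product_score(x, w, b=0.0):
    """Standard SAE-style feature score: ReLU(w . x + b)."""
    return max(0.0, dot(w, x) + b)


def cosine_score(x, w, b=0.0, eps=1e-8):
    """Cosine-scored variant: ReLU(cos(w, x) + b); eps avoids division by zero."""
    cos = dot(w, x) / (norm(w) * norm(x) + eps)
    return max(0.0, cos + b)


if __name__ == "__main__":
    x = [3.0, 4.0]
    w = [1.0, 0.0]
    x_scaled = [10 * xi for xi in x]  # same direction, 10x the norm

    print(inner_product_score(x, w), inner_product_score(x_scaled, w))
    print(cosine_score(x, w), cosine_score(x_scaled, w))
```

The first pair of scores differs by the same factor as the input norm, while the second pair agrees up to the small `eps` term, which is the directional-only behavior described above.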
GitHub and paper below 👇
Codex usage limits will be fully reset again in the next hour and we will credit one additional reset into your bank for your own usage over the next 24 hours.
We investigated reports that Codex usage was being consumed faster than expected. There wasn't one central issue, but a few smaller problems compounded for some users.
Here's what we found and changed:
- Actual usage: Auto-review had become more proactive, another change was triggering more subagent work, and background suggestions could run twice or retry too frequently after failures. We reverted the changes and fixed suggestion scheduling, duplicate generation, and retry behavior. This should reduce unnecessary background token consumption while preserving the work users explicitly request.
- Usage reporting: Auto-review was incorrectly appearing as GPT‑5.4 usage, and failed or rate-limited requests were still shown as turns. Auto-review now appears as its own category, and only successful requests count toward the turn graphs. Rate-limited requests were never charged, but they were being displayed incorrectly.
- Immediate relief: We reset usage limits while rolling out the fixes, then shipped hotfixes across the CLI, desktop app, and usage backend.
- What to expect: New usage data should be clearer and actual consumption should be lower. Historical charts may still show auto-review under GPT‑5.4 because older turn data was not relabeled. Features that intentionally perform more work, such as /goal, subagents, and higher reasoning levels, will still naturally use more capacity.
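The reporting rule described in the list above (auto-review as its own category, only successful requests counted as turns) can be sketched roughly as follows. The record fields `status`, `model`, and `auto_review`, and the `count_turns` helper, are hypothetical names chosen for illustration, not the actual Codex usage backend schema.

```python
from collections import Counter


def count_turns(requests):
    """Count successful requests per display category.

    Each request is a dict with hypothetical keys:
      status      -- "ok", "failed", or "rate_limited"
      model       -- model label used for normal turns
      auto_review -- True if the request came from auto-review
    """
    counts = Counter()
    for req in requests:
        if req["status"] != "ok":
            # Failed or rate-limited requests are not shown as turns.
            continue
        # Auto-review gets its own category instead of the model's label.
        category = "auto-review" if req.get("auto_review") else req["model"]
        counts[category] += 1
    return counts


if __name__ == "__main__":
    log = [
        {"status": "ok", "model": "gpt-5.4"},
        {"status": "ok", "model": "gpt-5.4", "auto_review": True},
        {"status": "rate_limited", "model": "gpt-5.4"},
        {"status": "failed", "model": "gpt-5.4"},
    ]
    print(dict(count_turns(log)))
```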
All fixes are now deployed, and we've added more detailed monitoring so we can detect background-usage regressions sooner. We'll continue watching the results closely.
Thank you for building and doing all sorts of things with Codex.