every job will turn into explaining your intentions to ai
explaining what you want to ai is surpringly time consuming, coders already spend 80% of their time doing it, and this will be true for everyone
OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new way to build on Amazon Bedrock with OpenAI through the security, compliance, and governance workflows they already use.
This is also the beginning of a broader expansion of OpenAI capabilities on AWS, including future availability for cybersecurity capabilities like Daybreak.
https://t.co/vMws0YU6Q3
GPT-5.4 Pro just solved Erdős Problem #1196, a 60-year-old conjecture from Erdős, Sárközy, and Szemerédi on primitive sets.
One shot. ~80 minutes of reasoning.
What makes this different from other AI math results (from what I understand): The world's top expert on this problem, Jared Lichtman (who proved the original Erdős Primitive Set Conjecture during his PhD), worked on #1196 for 7 years alongside Fields Medal-level collaborators; this was not attention-starved.
The AI rejected the standard approach every mathematician had used since Erdős' 1935 paper, a switch from analysis to probability, and instead stayed purely analytic using von Mangoldt weights. Human aesthetic convention had made that path invisible.
Lichtman's analogy: like AI discovering a new chess opening that grandmasters overlooked because of convention.
Terry Tao suspects the trick could simplify the broader theory of prime factorization anatomy, not just solve one conjecture.
Lichtman calls it possibly the first AI "Book proof" for an Erdős problem.
GPT-5.4 pro is the math goat.
So true. Many non-technical people I know have tried AI (usually ChatGPT) 1 - 1.5 years ago, it didn't went really well, and since then they dismiss it (with the occasional not-so-important query about something). And the true power of the current crop of models is unleashed in certain domains only.
Judging by my tl there is a growing gap in understanding of AI capability.
The first issue I think is around recency and tier of use. I think a lot of people tried the free tier of ChatGPT somewhere last year and allowed it to inform their views on AI a little too much. This is a group of reactions laughing at various quirks of the models, hallucinations, etc. Yes I also saw the viral videos of OpenAI's Advanced Voice mode fumbling simple queries like "should I drive or walk to the carwash". The thing is that these free and old/deprecated models don't reflect the capability in the latest round of state of the art agentic models of this year, especially OpenAI Codex and Claude Code.
But that brings me to the second issue. Even if people paid $200/month to use the state of the art models, a lot of the capabilities are relatively "peaky" in highly technical areas. Typical queries around search, writing, advice, etc. are *not* the domain that has made the most noticeable and dramatic strides in capability. Partly, this is due to the technical details of reinforcement learning and its use of verifiable rewards. But partly, it's also because these use cases are not sufficiently prioritized by the companies in their hillclimbing because they don't lead to as much $$$ value. The goldmines are elsewhere, and the focus comes along.
So that brings me to the second group of people, who *both* 1) pay for and use the state of the art frontier agentic models (OpenAI Codex / Claude Code) and 2) do so professionally in technical domains like programming, math and research. This group of people is subject to the highest amount of "AI Psychosis" because the recent improvements in these domains as of this year have been nothing short of staggering. When you hand a computer terminal to one of these models, you can now watch them melt programming problems that you'd normally expect to take days/weeks of work. It's this second group of people that assigns a much greater gravity to the capabilities, their slope, and various cyber-related repercussions.
TLDR the people in these two groups are speaking past each other. It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram's reels and *at the same time*, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems. This part really works and has made dramatic strides because 2 properties: 1) these domains offer explicit reward functions that are verifiable meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed yes or no, in contrast to writing, which is much harder to explicitly judge), but also 2) they are a lot more valuable in b2b settings, meaning that the biggest fraction of the team is focused on improving them. So here we are.
Working on a new project and for one day I did more than normally possible for 3 weeks.
Using agents for coding literally feels like unlocking God mode.
🚨 CRITICAL: Active supply chain attack on axios -- one of npm's most depended-on packages.
The latest [email protected] now pulls in [email protected], a package that did not exist before today. This is a live compromise.
This is textbook supply chain installer malware. axios has 100M+ weekly downloads. Every npm install pulling the latest version is potentially compromised right now.
Socket AI analysis confirms this is malware. plain-crypto-js is an obfuscated dropper/loader that:
• Deobfuscates embedded payloads and operational strings at runtime
• Dynamically loads fs, os, and execSync to evade static analysis
• Executes decoded shell commands
• Stages and copies payload files into OS temp and Windows ProgramData directories
• Deletes and renames artifacts post-execution to destroy forensic evidence
If you use axios, pin your version immediately and audit your lockfiles. Do not upgrade.
Three weeks ago there were rumors that one of the labs had completed its largest ever successful training run, and that the model that emerged from it performed far above both internal expectations and what people assumed the scaling laws would predict. At the time these were only rumors, and no lab was attached to them. But in light of what we now know about Mythos, they look more credible, and the lab was probably Anthropic.
Around the same time there were also rumors that one of the frontier labs had made an architectural breakthrough. If you are in enough group chats, you hear claims like this constantly, and most turn out to be nothing. But if Anthropic found that training above a certain scale, or in a certain way at that scale, produces capabilities that sit far above the prior trendline, then that is an architectural breakthrough.
I think the leaked blog post was real, but still a draft. Mythos and Capybara were both candidate names for the new tier, though Mythos may now have enough mindshare that they end up keeping it. The specific rumor in early March was that the run produced a model roughly twice as performant as expected. That remains unconfirmed. What is confirmed is that Anthropic told Fortune the new model is a 'step change,' a sudden 2x would certainly fit the definition.
We will find out in April how much of this is true. My own view is that the broad shape of this is correct even if some of the numbers are wrong. And if it is substantially accurate, then it also casts OpenAI's recent restructuring in a new light. If very large training runs are about to become essential to staying in the game, then a lot of their recent decisions, like dropping Sora, make even more sense strategically.
For the public, this would mean the best models in the world are about to become much more expensive to serve, and therefore much more expensive to use. That will put pressure on rate limits, pricing, and subscription plans that are already subsidized to some unknown degree. Instead of becoming too cheap to meter, frontier intelligence may be about to become too expensive for most of humanity to afford.
Second-order effects; compute, memory, and energy are about to become much more important than they already are. In the blog they describe the new model as not just an improvement, but having 'dramatically higher scores' than Opus 4.6 in coding and reasoning, and as being 'far ahead' of any other current models. If this is the new reality, then scale is about to become king in a whole new way. It would also mean, as usual, that Jensen wins again.
Mythos is a new tier, bigger than Opus, and more capable. It will be beyond the current state of the art by a wide margin, a dramatic jump in intelligence. Things have been accelerating under the surface since December. Capabilities are increasing faster, the time between model releases is getting shorter. I think OpenAI's recent decisions and refocus have been driven by the fact that it is increasingly obvious to some people working in the industry that we are beginning to take off. I think there have been multiple unannounced breakthroughs over the last few months, in more than one lab. I think there have been advances in capabilities, in alignment, and in architecture. The loop that started in December has begun to bear fruit. This model will be huge, very expensive to serve, and incredibly powerful. It has already finished training, and is in early rollout to some enterprise partners so they can prepare for what's about to arrive. Once we get benchmarks I think it will become clearer where we are; we are already inside the singularity.
"AI is not doing end-to-end science yet. But this project proves that I could create a set of prompts that can get Claude to do frontier science. This wasn’t true three months ago."
https://t.co/RSdHgyDykd
this is easily the best thing you’ll read this year.
you’ll never have to to hear the words skill issue or code mogged ever again.
bookmark it.
re read it.
tell your agents to read it.
enjoy generational wealth.
wave bye bye permanent underclass.
Interesting observation: In KuorumAI we built a notes system via MCP with clear instructions to actively update a short-term scratchpad on every conversation. Both Claude 4.6 and GPT-5.4 fail to follow it reliably. The only model that does it consistently is @grok 4.2 beta.
i’ll be honest, with all this anthropic chatter that
>their models are conscious
>they have the recipe for super intelligence this year
>the models know they’re being tested and can game benchmarks.
imma little concerned…