This is a super exciting release - Claude Fable 5 is the same underlying model as Mythos but with added safeguards. The benchmarks are great and it's SOTA on everything by a margin but I'll add that *qualitatively* also, this is a major-version-bump-deserving step change forward (imo of the same order as Claude 4.5 was in November), peaking especially for long problem-solving sessions on very difficult problems. You can give it a lot more ambitious tasks than what you're used to, the model "gets it" and it will just go, and it's never felt this tempting to stop looking at the code at all (but don't do this in prod!). The model still has quirks that people will run into and the safeguards are configured to be a little too trigger happy for launch, which can hopefully be tuned over time.
I feel a lot of things changing as working software increasingly comes out on a tap. Jevons paradox kicks in and I feel my own demand for software growing substantially. You can ask for anything - explainers, visualizers, dashboards, bespoke single-use apps (e.g. a full wandb that is hyper-specific just for your project), you can 10X your test suite, auto-optimize code, run giant research projects with custom HTML for the results, anything! "Free your mind" (Matrix ref). Really looking forward to all the things people build!
This article is literally wow.
i read it 2 years ago, and coming back to it today, it still feels new.
few tutorials teach computers in a way that permanently changes how you think. this is one of them.
If you've never built a VM before, you're missing one of the biggest "aha" moments in computer science.
As I wrote this, I saw X go into meltdown over tokens.
You've seen the headlines: “Uber blows yearly AI budget in just one quarter.” “Meta employee burns 281 billion tokens in April.”
But the problem isn't spending. Spending works. Since 2023, the top quartile of our AI spenders doubled their revenue. The bottom quartile? Flat.
It's blind spending. We don’t know which spend worked.
A sales team has qualified leads. A support team has resolved conversations. These are units you can measure against. All a token tells you is the meter ran, not whether the work was worth it or not.
Finance says, “halve the budget,” engineering says, “double it,” and you don’t know who’s right because there is no shared language of value. There’s no attribution, and no attribution means no allocation.
For example, right now, all work, no matter the size or shape, defaults to frontier models. But meeting summaries and calendar updates don’t require GPT-5.5 Pro.
In isolation this seems trivial, but re-route just 10% of a $10M AI bill from frontier to GPT-4 level intelligence and you’ve saved nearly one million dollars. This sounds like a made-up stat, but it’s not. It truly is that much cheaper.
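A back-of-the-envelope version of that claim, as a rough sketch. The 5% price ratio is an assumption for illustration (the post gives no exact figure), and `routing_savings` is a hypothetical helper, not anyone's real billing tool.

```python
def routing_savings(total_bill: float, share_rerouted: float, cheap_price_ratio: float) -> float:
    """Dollars saved by moving a share of spend to a cheaper model.

    cheap_price_ratio is the cheaper model's cost as a fraction of the
    frontier model's cost for the same work (an assumed input).
    """
    rerouted_spend = total_bill * share_rerouted
    return rerouted_spend * (1.0 - cheap_price_ratio)


# Assumption: the cheaper tier costs 5% of the frontier price for the same work.
saved = routing_savings(10_000_000, 0.10, 0.05)
print(f"${saved:,.0f} saved")
```

Under that assumption the saving is about $950,000 on the $1M that was re-routed. The closer the cheap model's price gets to zero, the closer the saving gets to the full million.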
This is the future of finance: not blindly rubber-stamping or rejecting AI spend, but allocating it with the same rigor companies apply to headcount.
this is the guide I wish someone had handed me on day one of CS.
it builds up, from a single transistor, why your computer spends most of its life just waiting for memory instead of doing math. barely a calculator at all.
understanding this one idea rewired how I read every program I write, and I tried hard to make it click for you too.
if you've never really understood the memory wall, you're missing one of the biggest "aha" moments in computer science.
it took me 20+ days of deep research and a lot of work to make it as simple as possible, and honestly there's still a long way to go. every piece of feedback would mean the world to me 🙏
here's my article: https://t.co/vKr1JCRRe7
This is the best site on the internet to learn harness engineering.
Free. Completely.
Most AI engineers have never heard the term.
https://t.co/bwDbTTYsjM
Bookmark this site.
Then read this setup ↓
imo there’s a pretty solid default recipe that everyone should use to optimize a system of
Agent = Model + Harness
you should “train” both
1. Build v1 agent using a sensible base harness and some task specific prompting + tools
2. Harness Engineering using eval tasks that roughly match prod
this is often enough - most companies can get acceptable perf doing this. then they collect traces, mine them for patterns, and make slight tweaks from there
3. SFT using data collected from traces or synthetic data. This is often a good candidate for “distillation tasks” to train a cheaper model while maintaining existing performance
4. RL if you have the bandwidth, ability, and desire to create environments and design rewards that represent the tasks you want your agent to be good at. This moves beyond the SFT behavior of “copying” data from an existing model to surpassing it in some dimension
5. Light harness engineering again to squeeze any more juice (ex: slight prompting) using the trained model that’s better at your task distribution
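To make step 2 concrete, here is a minimal sketch of comparing harness variants against eval tasks. Everything in it is hypothetical: the "harness" is reduced to a system prompt, `toy_model` is a stand-in for a real model call, and exact-match scoring is only one possible grader.

```python
from typing import Callable, Dict, List, Tuple

# A "model" here is any function from prompt text to output text (a stand-in).
Model = Callable[[str], str]


def make_agent(model: Model, system_prompt: str) -> Callable[[str], str]:
    """Wrap a model in a minimal harness: here, just a system prompt."""
    def agent(task: str) -> str:
        return model(f"{system_prompt}\n\nTask: {task}")
    return agent


def pass_rate(agent: Callable[[str], str], eval_tasks: List[Tuple[str, str]]) -> float:
    """Fraction of eval tasks where the agent's answer matches the expected one."""
    passed = sum(1 for task, expected in eval_tasks if agent(task).strip() == expected)
    return passed / len(eval_tasks)


def best_harness(model: Model, prompts: Dict[str, str], eval_tasks: List[Tuple[str, str]]) -> str:
    """Return the name of the harness variant with the highest pass rate."""
    return max(prompts, key=lambda name: pass_rate(make_agent(model, prompts[name]), eval_tasks))


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in model: uppercases the task only if told to."""
    task = prompt.rsplit("Task: ", 1)[1]
    return task.upper() if "uppercase" in prompt.split("\n\n")[0] else task


eval_tasks = [("abc", "ABC"), ("hi", "HI")]
prompts = {"v1": "Answer the task.", "v2": "Answer in uppercase."}
print(best_harness(toy_model, prompts, eval_tasks))
```

The model never changes here; only the harness does, which is the point of steps 1-2.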
this loop will largely be productized as a general purpose recipe for building and improving agents
we’re still in the earliest innings of the world’s companies getting comfortable with steps 1-2 of this loop. Harness engineering will probably be the dominant way ppl will optimize agents
but i expect a large number of companies to onboard through this entire loop on some trial project of interest in the next year
Big Tech just ran out of money building AI and what they're doing to cover it up should be illegal.
Google, Amazon, Microsoft, and Meta are spending a combined $700 BILLION this year on AI infrastructure.
This eats up 94% of their total operating cash flow.
The richest companies in human history are almost broke. And instead of slowing down, they're covering it up with the biggest financial engineering operation since 2008:
Google just sold $80 billion in stock to fund AI infrastructure. That was their first equity raise in 20 YEARS.
The last time Google needed to sell stock, YouTube didn't even exist. Sundar Pichai admitted the thing keeping him up at night is "compute capacity."
The company that prints $100 billion a year in ad revenue just told Wall Street it isn't enough anymore.
Amazon's free cash flow is projected to go NEGATIVE this year for the first time ever. Morgan Stanley estimates a $17 billion deficit and Bank of America says $28 billion.
The most profitable logistics machine on Earth is about to burn more cash than it generates, and they quietly filed with the SEC saying they may need to raise even more debt and equity to keep building.
All four hyperscalers are now borrowing hundreds of billions in bonds to keep the AI buildout alive. These were the most cash-rich companies in human history, and they're leveraging themselves to the teeth to build infrastructure that nobody has proven will generate enough revenue to pay for itself.
And the cracks are already starting to show:
Broadcom makes the custom AI chips that power Google, Meta, OpenAI, and Anthropic. This week their AI revenue TRIPLED year over year, sales grew 48%, and profits smashed every Wall Street estimate.
The reward for all of that was $320 billion in value erased in a single trading session.
Their CEO Hock Tan went on the earnings call and exposed three things about the AI industry:
Google is already shopping for cheaper AI chip alternatives. Broadcom abandoned its strategy of selling complete AI systems and is now retreating to selling bare chips at lower margins.
And despite supposedly "unprecedented demand," Tan refused to raise his full-year forecast, which tells you everything about what he's actually seeing behind the curtain.
Wall Street heard all three and hit the sell button so hard it dragged AMD, Intel, and the entire chip sector down with it.
When a company triples its AI revenue and gets punished because tripling isn't fast enough, the expectations have left the atmosphere entirely.
And here's the really scary part...
These companies ARE your retirement account. Apple, Microsoft, Amazon, Google, Meta, and Nvidia make up roughly 30% of the S&P 500. If you have a 401k or an index fund, you are already exposed to this bet whether you chose to be or not.
Every single one of these companies is telling you AI will generate trillions in revenue. But right now the math says they're spending trillions FIRST and hoping the revenue shows up later.
If the revenue catches up, this becomes the greatest infrastructure buildout in human history. Bigger than railroads and bigger than the internet.
If it doesn't, the companies that make up a third of the American stock market just leveraged their balance sheets into the largest write-down cycle since 2000.
And unlike the dot-com crash, this time the bubble companies aren't random startups with no revenue. They're the backbone of the entire global economy.
Bill Ackman was asked how he would underwrite SpaceX at $750 billion and his answer was the most honest thing anyone has said about the biggest IPO in history (Save this).
"You underwrite SpaceX the way you underwrite a venture capital investment."
His business school professor taught him a framework that has guided his entire career: people, opportunity, context, deal.
On all three of the first criteria (People, Opportunity, and Context), Ackman's verdict was the same: SpaceX is one of one, and nothing else in the market comes close.
He even acknowledged feeling bad for Blue Origin before noting that their being so far behind is not harmful to SpaceX but rather a structural tailwind that leaves SpaceX with a near monopoly on low cost orbital access for years to come.
And at $1.75 trillion, the number SpaceX is actually targeting on June 12, the question is no longer whether this is the best business on earth, but what the present value math looks like when you extend it five years forward and stress test every assumption about Starlink, launch economics, and AI compute revenue.
He said that even Amazon is going to have to become a bigger SpaceX customer, because Blue Origin is so far behind that Amazon has no real alternative for low-cost orbital access.
He also said something that almost no one is giving enough weight heading into Thursday's listing: "Time has become increasingly valuable in the AI era. You lose a month, you lose a couple months today, and it means a lot."
The Colossus and Macro Hard facilities are compounding infrastructure assets where every month of operational delay means less contracted revenue, less negotiating leverage with customers like Google and Anthropic, and a progressively weaker moat against the hyperscalers who are now racing to build competing compute capacity.
Come join Milk Road Pro for our full SpaceX IPO breakdown, how we're stress-testing the Deal leg of Ackman's framework at $1.75 trillion, what our five-year revenue model actually looks like, and our full AI thesis.
Link below.
Google has published a paper that might end the transformer era.
For the last 7 years, every major AI, ChatGPT, Claude, Gemini, has been built on the exact same architecture: The Transformer.
But Transformers have a fatal flaw.
To remember context, they have to process every single word against every other word. It’s called quadratic complexity. As your prompt gets longer, the compute cost explodes.
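A small sketch of what "quadratic" means here, counting only token-to-token comparisons. It ignores everything else in a real model (heads, layers, caching, optimized kernels), and `attention_pairs` is a name made up for this example.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of token-to-token score computations in full self-attention."""
    return num_tokens * num_tokens


for n in (1_000, 2_000, 4_000):
    print(n, attention_pairs(n))
```

Each doubling of the prompt length multiplies the count by four, which is why long prompts get expensive so quickly.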
The alternative is the old-school RNN (Recurrent Neural Network). RNNs are incredibly cheap and fast, but they have a fixed memory size. If you give them a long document, they get amnesia.
Until today.
Google researchers published Memory Caching: RNNs with Growing Memory.
And it fixes the biggest bottleneck in AI.
Instead of an RNN having a fixed, rigid memory that constantly overwrites itself, Google gave it a "save" button.
The technique allows the RNN to cache checkpoints of its hidden states as it reads.
The memory capacity of the RNN can now dynamically grow as the sequence gets longer.
They built four different variants, including sparse selective mechanisms where the AI actively chooses exactly which checkpoints matter most.
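A toy illustration of the "save button" idea as described in this post. This is not the paper's algorithm: the scalar update rule, the fixed checkpoint interval, and the name `run_with_cache` are all assumptions, and how cached states get read back (for example the sparse selection mentioned above) is left out entirely.

```python
import math
from typing import List, Tuple


def run_with_cache(inputs: List[float], checkpoint_every: int) -> Tuple[float, List[float]]:
    """Run a toy scalar RNN and save a copy of its state every few steps."""
    state = 0.0
    cache: List[float] = []
    for step, x in enumerate(inputs, start=1):
        # Fixed-size recurrent update: the single state is overwritten each step.
        state = math.tanh(0.5 * state + x)
        if step % checkpoint_every == 0:
            cache.append(state)  # the "save button": memory grows with length
    return state, cache


_, short_cache = run_with_cache([0.1] * 8, checkpoint_every=4)
_, long_cache = run_with_cache([0.1] * 64, checkpoint_every=4)
print(len(short_cache), len(long_cache))  # prints: 2 16
```

The live state stays one number no matter how long the input is, while the cache grows with the sequence.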
The results rewrite the rules of efficiency.
On long-context understanding and recall-intensive tasks, these new Memory-Cached RNNs closed the gap with Transformers.
They achieved competitive accuracy without the explosive, quadratic compute cost, bridging the gap between the cheap efficiency of an RNN and the massive capability of a Transformer.
We have spent billions scaling Transformers because we thought they were the only way an AI could remember a long conversation.
But Google just proved we don't need to process the whole history every single time.
We just needed a smarter cache.
Tri Dao wrote the code running inside ChatGPT, Claude, and Gemini. Nobody alive understands the GPU bottleneck better.
Now he's calling the top. Nvidia holds 90% of AI compute today. He says that ends in three years.
His reason: as inference splits into specialized chips for agents, batch jobs, and chat, the one-size-fits-all GPU stops winning.
The man whose code is Nvidia's moat just told you the moat is draining.
in 1988 a physicist named Jack Crenshaw got tired of compiler books being impossible to read
so he wrote his own series called Let's Build a Compiler and posted it on a BBS
it starts with a parser that understands exactly one digit
then each installment adds one new idea until you end up with a real compiler
one small step at a time instead of 500 pages of theory first
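a rough sketch of what that first step can look like. the original series is written in Pascal and emits 68000 assembly; this Python version, the function name `compile_digit`, and the exact error text are stand-ins for illustration, not Crenshaw's code.

```python
def compile_digit(source: str) -> str:
    """Accept exactly one decimal digit and emit one load instruction."""
    if len(source) != 1 or source not in "0123456789":
        raise SyntaxError(f"Integer expected, got {source!r}")
    # Load the constant into a register; everything later builds on this.
    return f"MOVE #{source},D0"


print(compile_digit("7"))  # prints: MOVE #7,D0
```

anything that isn't a single digit is rejected, which is exactly the point: the language grows one small feature per installment.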
Everything You Need To Know About
Inference Engines and Running LLMs Locally at Home
Explains why Inference Engines exist in the first place
- Prefill is not Decode
- VRAM is not bandwidth
- Fit is not speed
- KV Cache is the real memory problem
- Quantization only matters if the engine has good kernels for it
- Batching is not scheduling
- MoE and the routing problem
- How long context changes the serving problem
- Multi-GPU changes the interconnect problem
- Production: latency, p99s, backpressure, routing, metrics, and failure behavior
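To make the KV cache point concrete, here is a rough sizing sketch using the commonly used estimate for a standard attention layout. The model dimensions below are hypothetical, not a specific model, and real engines add overhead (paging, fragmentation) that this ignores.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes for one or more sequences."""
    # Factor of 2: one tensor for keys and one for values at every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 values.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token // 1024, "KiB per token")       # prints: 128 KiB per token
print(full_context / 2**30, "GiB at 8192 tokens")  # prints: 1.0 GiB at 8192 tokens
```

The cache scales linearly with both context length and batch size, so it competes with the weights for VRAM as soon as you serve long prompts or many users.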
Then maps the Engines including:
- llama.cpp → portability king
- MLX / MLX-LM → Apple Silicon weapon
- ExLlamaV3 → multi-GPU consumer CUDA / local MoE
- vLLM → default open-source production server
- SGLang → long-context, MoE, routing, ugly workloads
- TensorRT-LLM → max NVIDIA performance
- NVIDIA Dynamo → fleet orchestration
The point of this article is not “use vLLM” or “use TensorRT-LLM” or “use llama.cpp”
But rather to fully grasp how Inference Engines are the traffic cop, memory manager, kernel dispatcher, scheduler, cache accountant, parallelism planner, API surface, and sometimes the deployment framework
Do not pick the engine first
- Pick the hardware
- Pick the workload
- Pick the serving model
Then the engine becomes obvious
Open source / Local AI FTW
Very excited to share our interview with @polynoamial on AI for math — the Erdős unit distance problem, saturating the IMO, the future of math research, and more!
Mathematicians and scientists often peak in their 20s. Why?
Maybe older scientists become stuck in their ways. Or maybe younger researchers feel free to be more creative.
But @jacobkimmel's hypothesis is that this isn't because of social factors at all - it's evolution:
"We use Prometheus for monitoring."
I hear this in almost every interview. Then I ask one question and the whole thing falls apart.
"Why do logs and metrics need different pipelines?"
Silence.
Most people jump into Prometheus and Grafana without understanding what they're actually solving. They know the tools. They can't explain the problem.
With observability, you're solving two completely different problems.
Logs tell you what happened. An error occurred. A request came in. A database query failed. These are events. Stories your application tells.
Metrics tell you how things are performing right now. Latency is 200ms. CPU is at 75%. You processed 500 requests per minute. These are measurements.
Different data types. Different collection methods. Different storage. That's where people get confused.
Last month in my DevOps bootcamp, we built a complete observability system for microservices on Kubernetes.
For logs, we used Fluentd sidecars that share a volume with the application container.
The app writes logs to the volume.
Fluentd reads and forwards them.
Clean separation of concerns.
At a small scale, you send logs straight to CloudWatch.
But when you're generating thousands of log lines per second, you add layers.
Lambda for formatting.
Kinesis for buffering.
OpenSearch for fast queries across petabytes of data.
S3 for long-term backup.
We kept 7 days in OpenSearch for active investigation. 30 days in CloudWatch. Years in S3 for compliance. Each layer has different cost and performance characteristics.
For metrics, Prometheus scrapes application endpoints every 30 seconds.
Developers instrument their code with Prometheus client libraries.
They expose a /metrics endpoint.
Prometheus pulls the data automatically.
We created ServiceMonitors that tell Prometheus which pods to scrape based on labels.
As soon as new pods come up, Prometheus discovers and scrapes them.
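To show what Prometheus actually pulls, here is a hand-rolled sketch of the text a /metrics endpoint returns. In practice the Prometheus client library generates this for you. The function `render_counter`, the metric name, and the sample values are all hypothetical.

```python
from typing import Dict


def render_counter(name: str, help_text: str, samples: Dict[str, float], label: str) -> str:
    """Render one counter metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label_value, value in sorted(samples.items()):
        # One line per label value, e.g. http_requests_total{code="200"} 1027
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {"200": 1027, "500": 3},
    label="code",
))
```

Each scrape reads the current values. Prometheus stores them with a timestamp, which is what lets Grafana plot rates and spikes later.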
Then Grafana visualizes everything.
We imported pre-built dashboards from https://t.co/5wE21Lb4Q8 for Kubernetes monitoring.
And built custom panels for application-specific metrics.
Logs and metrics run in parallel.
When something breaks, metrics show you the spike. The error rate jumped. Latency went from 100ms to 2 seconds.
Then you check the logs. Filter for that time window. Find the stack traces. See exactly what failed.
You can't troubleshoot with just one. You need both perspectives.
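That workflow, sketched in a few lines. The log tuple layout, the helper name `errors_near`, and the sample data are made up for illustration. In a real setup this filter would be a query in OpenSearch or CloudWatch rather than Python.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

LogLine = Tuple[datetime, str, str]  # (timestamp, level, message)


def errors_near(logs: List[LogLine], spike_at: datetime, window: timedelta) -> List[LogLine]:
    """Return ERROR lines within +/- window of the time a metric spiked."""
    start, end = spike_at - window, spike_at + window
    return [line for line in logs if start <= line[0] <= end and line[1] == "ERROR"]


logs = [
    (datetime(2024, 1, 1, 12, 0, 0), "INFO", "request received"),
    (datetime(2024, 1, 1, 12, 5, 10), "ERROR", "database query failed"),
    (datetime(2024, 1, 1, 13, 0, 0), "ERROR", "unrelated failure"),
]
spike = datetime(2024, 1, 1, 12, 5, 0)  # when the latency graph jumped
for ts, level, msg in errors_near(logs, spike, timedelta(minutes=2)):
    print(ts.isoformat(), level, msg)
```

The metric gives you the timestamp, and the logs give you the cause.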
We implemented it, troubleshot everything in a live call, generated real metrics and logs, and built dashboards in Grafana.
That's the difference between watching tutorials and actually understanding how systems work in production.
"Dijkstra said … Programming is not a craft. It is closer to mathematics than to carpentry, and the moment you treat it as a craft, you guarantee that the software you produce will be full of the kind of bugs that craftsmanship cannot catch. The fix, in his view, was to teach programming the way mathematics is taught. You should be able to prove your program correct before you run it."
Don't we have a half century of experience showing he was just wrong?