this is f*cking gold
How to build your first AI agent (Full guide)
if I had this a year ago, I would've shipped my first agent in a day instead of 2 weeks
in the right hands, this changes everything:
Everyone is chasing self-improving models.
The bigger opportunity is self-improving products. Your users generate the best learning signal every single day.
I think that's where the real competitive advantage will come from.
A tricky LLM interview question:
You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces.
So you add KV cache compression and evict 90% of the cached tokens.
VRAM usage stays the same and the GPU still runs out of memory.
Why?
(answer below)
Evicting 90% of the KV cache can free almost none of the memory it was using.
This sounds counterintuitive, but it follows directly from how production servers store the cache today.
The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues.
This is the dominant memory cost for reasoning models.
A 32K-token CoT caches ~32K tokens of KV vectors, and Qwen3-32B with 4-bit weights will run out of memory at around 24K tokens on a 24GB GPU.
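A rough back-of-the-envelope check of those numbers. The model shape used here (64 layers, 8 KV heads, head dimension 128) and the FP16 cache are assumptions for illustration, not figures stated in this thread:

```python
GIB = 1024 ** 3

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per generated token (keys + values, all layers)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def kv_gib(tokens, layers=64, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Total KV cache size in GiB for a sequence of `tokens` tokens."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value) / GIB

if __name__ == "__main__":
    for tokens in (24_000, 32_000):
        print(f"{tokens} tokens -> {kv_gib(tokens):.2f} GiB of KV cache")
```

Under these assumptions the cache costs 256 KiB per token, so 24K tokens is close to 6 GiB. Add roughly 16 GB for 4-bit weights of a 32B-parameter model and there is very little headroom left on a 24GB card.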
One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it.
But this does not solve the memory problem yet.
The reason is paged attention, which is the memory manager behind vLLM and most production servers.
Under the hood, it splits GPU memory into fixed-size physical blocks, each holding the KV for about 16 tokens.
A block returns to the allocator only when every slot inside it is empty.
The eviction logic selects tokens by importance, and the tokens it keeps are scattered across blocks...
...so despite eviction, almost every block is left with at least some survivor tokens.
For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, the large majority of blocks will most likely still hold at least one token.
This means the allocator frees almost nothing.
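A quick way to sanity-check this is to ask how many blocks empty out completely. The sketch below assumes evicted tokens are spread uniformly at random across slots, which is a simplification (real eviction policies are not uniform, so actual numbers can differ). The function name is hypothetical:

```python
def prob_block_fully_evicted(total_tokens, evicted_tokens, block_size=16):
    """Probability that all `block_size` slots of one block were evicted,
    assuming the evicted set is a uniformly random subset of all tokens."""
    p = 1.0
    for i in range(block_size):
        # Slot i is evicted given the previous i slots of the block were evicted.
        p *= (evicted_tokens - i) / (total_tokens - i)
    return p

if __name__ == "__main__":
    total, evicted, block = 16_000, 14_000, 16
    p = prob_block_fully_evicted(total, evicted, block)
    n_blocks = total // block
    print(f"evicted tokens: {evicted / total:.1%}")
    print(f"expected freed blocks: {p * n_blocks:.0f} of {n_blocks} ({p:.1%})")
```

Under this uniform model, evicting 87.5% of tokens frees only about 12% of blocks. The gap between those two numbers is the whole problem.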
Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout.
Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 39, then 16,001, then 41, so the cache is no longer in token order.
Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds.
This introduces another bookkeeping cost that an in-order layout inherently avoids.
So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.
There's another problem.
Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected).
But fast attention kernels used in production, like FlashAttention, never save those scores.
They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast.
So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.
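To make the contrast concrete, here is a minimal pure-Python sketch for a single query. It is not FlashAttention itself, only the online-softmax idea behind tiled kernels, with the 1/sqrt(d) scaling omitted for brevity. Function names are illustrative:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Eager attention: every score is materialized, so eviction can inspect them."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, scores

def tiled_attention(q, keys, values, tile=2):
    """Online softmax: only one tile of scores exists at any time."""
    dim = len(values[0])
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        tile_scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(tile_scores))
        scale = math.exp(running_max - new_max)  # rescale earlier partial results
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(tile_scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max  # tile_scores is discarded after this iteration
    return [a / running_sum for a in acc]
```

Both functions produce the same output, but only the naive one returns the per-token scores. The tiled version has nothing left to hand to an eviction policy once the loop finishes.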
NVIDIA published a method called TriAttention to solve both these problems.
It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters.
For the memory problem, it runs a compaction pass every 128 decoded tokens.
The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.
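A toy sketch of what a compaction pass does to block occupancy. This is not the published implementation, only an illustration of the idea: the cache is modeled as a flat list of slots holding token positions, with `None` marking an evicted slot, and all names are hypothetical:

```python
BLOCK_SIZE = 16

def blocks_in_use(slots):
    """Count blocks that still hold at least one live token."""
    n_blocks = (len(slots) + BLOCK_SIZE - 1) // BLOCK_SIZE
    return sum(
        any(s is not None for s in slots[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE])
        for b in range(n_blocks)
    )

def compact(slots):
    """Slide survivors forward in their original order; the tail becomes empty."""
    survivors = [s for s in slots if s is not None]
    return survivors + [None] * (len(slots) - len(survivors))

if __name__ == "__main__":
    # 64 tokens in 4 blocks; keep every 8th token, evict the rest.
    slots = [pos if pos % 8 == 0 else None for pos in range(64)]
    print("blocks in use before compaction:", blocks_in_use(slots))
    print("blocks in use after compaction: ", blocks_in_use(compact(slots)))
```

Before compaction every block holds two survivors, so none can be freed. After compaction the eight survivors fit in one block, still in token order, and the other three blocks can go back to the allocator.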
On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.
KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.
You can find the NVIDIA write-up here: https://t.co/ZwXv7VezVu
I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching.
Read it below.
Google just dropped a free 8-minute lesson on building your first AI agent.
This is the clearest explanation of AI agents and loops you'll find anywhere.
People are paying $500 for courses that teach less than this.
Watch it, then read the step-by-step guide on building loops for your agents below.
Andrej Karpathy:
"Remove yourself as the bottleneck. Maximize your leverage. Put in very few tokens, and a huge amount of stuff happens on your behalf."
loop engineering is the exact thing that gets you there.
in a hand-run session you do two things. you decide what the agent runs next, and you check its output before the next step. both are manual, and both are the ceiling on how far the agent gets without you.
loop engineering moves both steps into the system. the diagram below shows the operating structure that surrounds the loop:
→ a trigger decides what to run, whether that's a message, an event, or a schedule, so the agent starts without you there to kick it off.
→ the loop is the maker that produces the work, thinking, acting, and observing until it's done or the brakes stop it.
→ a separate checker grades the output, because a model grading its own work justifies what it already did instead of catching where it failed. the checker's findings return to the maker as the next instruction, and the cycle repeats until nothing is left to fix.
→ state lives on disk, not in context, since the model forgets everything between runs. an MD file or a knowledge graph holds what's done and what's still open, so a loop can pick up again days later.
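the structure above can be sketched in a few lines. this is a minimal illustration, not a real framework: `maker` and `checker` are placeholder callables you would back with model calls, and the JSON state file stands in for the MD file or knowledge graph.

```python
import json
from pathlib import Path

def run_loop(maker, checker, state_path, first_task, max_passes=3):
    """Maker/checker loop whose state survives between runs on disk."""
    path = Path(state_path)
    if path.exists():
        state = json.loads(path.read_text())  # resume where a previous run stopped
    else:
        state = {"passes": 0, "open": [first_task], "done": []}

    while state["open"] and state["passes"] < max_passes:
        instruction = state["open"].pop(0)
        output = maker(instruction)          # the loop that produces the work
        findings = checker(output)           # separate grader; empty list = accepted
        state["done"].append(instruction)
        state["open"].extend(findings)       # findings become the next instructions
        state["passes"] += 1
        path.write_text(json.dumps(state, indent=2))  # persist after every pass
    return state
```

a trigger (cron job, webhook, message handler) would simply call `run_loop` with the same `state_path` each time, so a run started days later picks up whatever is still in `open`.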
for that state layer, Zep's Graphiti is a clean open-source option, a temporal knowledge graph that invalidates stale facts and returns context through vector, full-text, and graph search in one call.
repo: https://t.co/8CboBlWffX
two things decide whether an unattended loop holds up.
the exit has to be set before the loop runs, not while it's running. a loop with no stop condition burns tokens, and the cost climbs fast once sub-agents and long runs stack up. a clean exit reads like "all tests pass and lint is clean, stop after two passes."
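one way to pin the exit down before the run starts is to write it as data. a small sketch: the class name, the token budget, and its default value are assumptions for illustration, while the two-pass limit mirrors the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExitPolicy:
    """Stop conditions fixed before the loop runs (frozen, so a run can't change them)."""
    max_passes: int = 2
    max_tokens: int = 200_000  # hypothetical budget, tune per project

    def should_stop(self, tests_pass, lint_clean, passes, tokens_used):
        """Return (stop, reason). Success is checked before the hard limits."""
        if tests_pass and lint_clean:
            return True, "success"
        if passes >= self.max_passes:
            return True, "pass limit reached"
        if tokens_used >= self.max_tokens:
            return True, "token budget exhausted"
        return False, ""
```

the loop calls `should_stop` after every pass and logs the reason, so a run that ended on a limit is never mistaken for one that ended on success.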
and the checker only catches failures inside a run. the harness around the loop, the prompts, tools, and checks wrapped around the model, still drifts and breaks in production as models change. catching that needs observability on every run, not a green checkmark.
Comet's Opik is built for that layer, an open-source tool that traces every call and turns a failing production trace into a regression test so the same break can't recur.
repo: https://t.co/Qxk9BHZBlx
your job stops being the hands inside the loop. it becomes designing the machine that runs without you, then watching the traces closely enough to trust it.
the model is becoming a commodity. the loop around it is where the real engineering lives now.
I wrote the full breakdown. the article is quoted below.
stay tuned for more on this!
A senior Anthropic engineer just dropped an 11-page PDF on "Loop Engineering" for agentic systems.
The shift: you stop prompting the agent. You build the system that prompts it instead.
Schedule → Discover → Build → Verify → Repeat
Every loop runs one turn, five moves:
• Discovery: it finds its own work - failing CI, open issues, recent commits - instead of being handed a list.
• Handoff: each task gets an isolated git worktree so parallel agents don't collide.
• Verification: a second agent, told to assume the code is broken, reviews the first. The "thing that can say no."
• Persistence: results get written to disk, never left in a context window that gets flushed.
• Scheduling: an automation wakes it on a timer. That's what makes it a loop.
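The handoff step can be sketched with plain `git worktree`. This is a minimal illustration rather than anything taken from the PDF: the branch naming scheme, directory layout, and function name are assumptions.

```python
def worktree_command(task_id, base_dir="../worktrees", base_ref="main"):
    """Build the git command that gives one task its own isolated checkout.

    Each task gets a new branch and a separate working directory, so
    parallel agents never edit the same files on disk.
    """
    branch = f"agent/{task_id}"      # hypothetical naming scheme
    path = f"{base_dir}/{task_id}"
    return ["git", "worktree", "add", "-b", branch, path, base_ref]

if __name__ == "__main__":
    # To actually create the worktree, pass the list to subprocess.run(cmd, check=True).
    print(" ".join(worktree_command("fix-ci-1042")))
```

When the task is verified and merged, `git worktree remove <path>` cleans up the checkout.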
The key insight: an agent grading its own work always praises it.
This 11-page PDF changed how I'm building agentic systems today.
Read it now, then explore the article below.
The Anthropic Agents team just dropped an 11-page paper on "Loop Design: The Anthropic Playbook for Agentic Systems".
Everyone is obsessed with prompts.
This paper argues that's the wrong abstraction.
The future is Human → Loop → Agents.
The biggest takeaway:
An agent reviewing its own work will almost always approve it.
The highest-leverage component in an agentic system is an independent verifier.
This paper completely changes how you think about agent design.
In 9 minutes, this Senior AI Engineer at Supabase showed exactly how to build an agent skill that ships.
Supabase already shipped one of the most opinionated agent skills in production.
[Tonight he shows the 3 rules their team uses to ship agent skills in 2026.]
↓ Save and watch this for the weekend
1) critical security rules go directly in skill.md
2) skills should point to living documentation rather than duplicate it
3) skills without MCP underperform. MCP without skills misses environment-specific constraints.
Most skills bloat. Supabase's stays opinionated.
bookmark & watch this. then read the complete article below
Stop telling Claude, “write the function.”
Stop telling Claude, “fix this error.”
Stop telling Claude, “make the tests pass.”
You’re treating a billion-dollar AI engineer like Stack Overflow with autocomplete.
Here are 11 insane coding prompts you can copy-paste right now:
I have kids. I work in AI every day. And honestly? I have no idea what their careers will look like in 15 years. But I know what will carry them through.
First, and this might sound unromantic: make money and save it for them. We can debate educational philosophy all day, but the world is changing so fast that financial security might be the most practical gift we can give. Buy some gold bars. Seriously.
Second, nurture their imagination. AI rewards people with initiative and wild ideas. The kid who daydreams, who asks weird questions, who wants to try ten things at once? That kid will thrive. AI can execute. AI can be disciplined. What AI can't do is dream up something nobody's thought of before.
Third, build resilience. There are no more iron rice bowls (guaranteed lifetime jobs). Any stable, predictable job is exactly the kind of job AI will learn to replace. Our kids will likely switch directions many times in their lives. Learn something new, get replaced, pivot, repeat. It's more like being a hunter than a farmer. Schools don't teach this. Schools teach you to follow a linear path: high school, college, grad school, stable job. That linear path is becoming the most dangerous one.
Last, invest in their ability to connect with other humans. Not networking. Not schmoozing. Real emotional connection. Building trust, offering support, making people feel seen. As AI handles more of the rational, analytical work, the human ability to genuinely relate to other humans becomes more rare and more valuable.
I don't have all the answers. But I know that imagination, resilience, and genuine human warmth aren't going out of style anytime soon.
#AI #Parenting #Education #FutureOfWork
"People are mistaken when they think technology automatically improves. It does not automatically improve. It only improves if a lot of people work very hard to make it better."
-Elon Musk
We cannot understand the true nature of the Universe, unless we question deeply.
I want to know what is real, even if the answer is total obliteration of my consciousness.
🇨🇳 CHINA UNLEASHES ROBOCOP HAMSTER BALLS ON ITS CITIZENS
Beijing's latest dystopian toy: a 21mph armored sphere that hunts protestors.
This mechanical death ball packs net guns, tear gas, and smoke bombs.
It rolls through water, bounces off walls, and never gets tired of chasing dissidents.
In other words, China watched every sci-fi movie about oppressive governments and said "hold my beer."
Your move, Boston Dynamics.
Nothing says "harmonious society" like robot hamster balls armed to the teeth.
Source: TheIndianHunts