Karpathy said something you'll regret ignoring:
"Remove yourself as the bottleneck. Maximize your leverage. Put in very few tokens, and a huge amount of stuff happens on your behalf."
Loop engineering is the exact thing that does that.
In a hand-run session, the operator handles two things:
- deciding what the agent runs next
- and checking its output before the next step
Both are manual, and both decide how far the agent gets on its own without the operator.
Loop engineering moves both steps into the system.
A core operating structure surrounds the loop, and the diagram below depicts it.
- A schedule decides what to run
- Loop is the maker that produces the work
- A separate checker agent grades the output
- A file on disk holds the state they both read.
The loop runs until one of three things happens: the work is done, it hits max iterations, or the budget is exhausted.
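A minimal sketch of that structure, assuming the maker and checker are plain callables. The names `run_loop`, `toy_maker`, and `toy_checker` are hypothetical stand-ins for real agent calls, and the token cost is a made-up number.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LoopResult:
    reason: str        # "done", "max_iterations", or "budget_exhausted"
    iterations: int
    tokens_spent: int
    output: str


def run_loop(
    maker: Callable[[str], Tuple[str, int]],   # returns (output, tokens used)
    checker: Callable[[str], List[str]],       # returns a list of findings
    task: str,
    max_iterations: int = 5,
    token_budget: int = 10_000,
) -> LoopResult:
    instruction = task
    output = ""
    spent = 0
    for i in range(1, max_iterations + 1):
        output, cost = maker(instruction)
        spent += cost
        findings = checker(output)
        if not findings:                       # checker found nothing left to fix
            return LoopResult("done", i, spent, output)
        if spent >= token_budget:              # stop before paying for another pass
            return LoopResult("budget_exhausted", i, spent, output)
        # The checker's findings become the maker's next instruction.
        instruction = task + "\nFix: " + "; ".join(findings)
    return LoopResult("max_iterations", max_iterations, spent, output)


# Toy stand-ins (assumptions): the maker only fixes the version when told to.
def toy_maker(instruction: str) -> Tuple[str, int]:
    text = "version=2.0" if "Fix:" in instruction else "version=1.0"
    return text, 100


def toy_checker(output: str) -> List[str]:
    return [] if "version=2.0" in output else ["stale version string"]


result = run_loop(toy_maker, toy_checker, "bump version")
print(result.reason, result.iterations)  # prints: done 2
```

All three exits are decided before the first iteration runs, so the loop cannot talk itself into one more pass.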
Here are some practical engineering considerations:
1) A model grading its own output justifies what it already did instead of catching where it failed.
That's why a separate checker's findings return to the maker as the next instruction. And the cycle repeats until the checker finds nothing left to fix.
2) A loop with no stop condition burns tokens, and the cost climbs fast once sub-agents and long runs add up.
That's why the exit must be set before the loop runs, not while it is running.
A simple exit could be:
↳ fix only the major issues, run one final pass, and stop after two loops, with "all tests pass and lint clean" as the rule that ends it.
3) State has to live on disk, not in context.
The model forgets everything between runs, so an MD file or a knowledge graph holds what is done and what is still open.
Each run reads it and writes back to it, which lets a loop pick up again after days.
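A rough sketch of state on disk, assuming the MD file is a simple checklist. The file name `STATE.md`, the checkbox format, and the helper names are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional


def read_state(path: Path) -> Dict[str, List[str]]:
    """Parse a Markdown checklist into done and open task lists."""
    state: Dict[str, List[str]] = {"done": [], "open": []}
    if not path.exists():
        return state
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("- [x] "):
            state["done"].append(line[6:])
        elif line.startswith("- [ ] "):
            state["open"].append(line[6:])
    return state


def write_state(path: Path, state: Dict[str, List[str]]) -> None:
    lines = ["# Loop state"]
    lines += [f"- [x] {task}" for task in state["done"]]
    lines += [f"- [ ] {task}" for task in state["open"]]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def run_once(path: Path, work: Callable[[str], None]) -> Optional[str]:
    """One run: read the file, do the next open task, write the result back."""
    state = read_state(path)
    if not state["open"]:
        return None                      # nothing left to do
    task = state["open"].pop(0)
    work(task)                           # stand-in for the agent doing the task
    state["done"].append(task)
    write_state(path, state)
    return task


state_file = Path(tempfile.mkdtemp()) / "STATE.md"
write_state(state_file, {"done": [], "open": ["bump version", "add missing test"]})

first = run_once(state_file, work=lambda task: None)    # run 1
second = run_once(state_file, work=lambda task: None)   # run 2, could be days later
print(read_state(state_file))
```

Nothing is carried in memory between the two `run_once` calls. The second run only knows what the file tells it.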
4) The lower the verification bar, the safer the loop.
Boring, repetitive checks like a stale version string or a missing test are trivial to verify, so a loop runs them with little risk while the operator is away.
Judgment-heavy work is loopable too, but only as far as the checker can confirm the result.
Let's look at how an unattended loop fails in two ways.
1) It reports done when nothing is actually verified.
The separate checker exists to prevent that, but the loop merges code faster than anyone reads it, so over weeks the team stops understanding its own codebase while every check stays green.
Green tests say the code passed the tests, not that anyone knows what shipped. Someone still has to read what the loop merges.
2) The checker keeps a running loop honest, but it only catches failures inside a run.
The harness around the loop, like the prompts, tools, and checks wrapped around the model, still drifts and breaks in production as models change.
That repair loop is usually run by hand based on observability traces.
My co-founder wrote a detailed walkthrough (with code) on making that harness repair itself, where a failing trace gets diagnosed, the fix is verified against the exact input that failed, and the failure is locked as a regression test so it cannot recur.
Read it below.
If you have:
Hermes Agent
Claude Code & Codex Handoffs
Obsidian + QMD Memory System
Run Agentic Loops
Fleet Tailscale Mesh
Cron Jobs + Kanban Board
Agentic Workflows
Congrats, you are in the top 1% of the AI god stack
Here's a simple loop: Tell codex to maintain your repos, wake up every 5 minutes and direct work to threads. That makes it easy to parallelize+steer work as needed.
I use an orchestrator skill combined with my triage+autoreview+computer use skills, so some work can land autonomously. https://t.co/FbBoJTIcfd
https://t.co/8389roVnOm
if you're not working with unlimited tokens like @steipete and @bcherny, you could do your loop with claude code + caveman.
event -> trigger -> action -> eval -> feedback
- event: create a "wiki" to render claude generated md files as context
- trigger: click "review with claude" on a page; it drops a line in a queue file
- action: claude cowork / code reads the queue and writes edits right into the page (green add, red cut, amber note) ~thanks @nbaschez for roughdraft syntax~
- evaluate: you read those marks in the wiki and judge
- feedback: accept/reject decisions; reply sends it back to claude to redo
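a minimal sketch of the queue-file part of that flow. the JSON-lines format, the file name `review_queue.jsonl`, and the helper names are assumptions, not what the wiki actually uses.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def enqueue(queue: Path, page: str, note: str = "") -> None:
    """trigger: append one line to the queue file."""
    with queue.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"page": page, "note": note}) + "\n")


def drain(queue: Path) -> List[Dict[str, str]]:
    """action: read every pending request, then clear the queue."""
    if not queue.exists():
        return []
    lines = queue.read_text(encoding="utf-8").splitlines()
    items = [json.loads(line) for line in lines if line.strip()]
    queue.write_text("", encoding="utf-8")
    return items


def send_feedback(queue: Path, page: str, accepted: bool, reply: str = "") -> None:
    """feedback: a rejection goes back on the queue with the reviewer's reply."""
    if not accepted:
        enqueue(queue, page, note=reply)


queue = Path(tempfile.mkdtemp()) / "review_queue.jsonl"
enqueue(queue, "loops.md")                       # click "review with claude"
pending = drain(queue)                           # agent picks up the request
send_feedback(queue, "loops.md", accepted=False, reply="tighten the intro")
redo = drain(queue)                              # agent sees the redo request
print(pending, redo)
```

the eval step stays human here: the only thing the code does with a judgment is put a rejected page back in line.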
what is agent looping
for the last two years we prompted agents one task at a time. that is starting to change
instead of asking an agent to build the landing page and then driving every step yourself, you set up a loop that handles discovery, planning, the work, checking, and iterating until the goal is met
looping is a setup you build. almost any agent harness can run it, it just depends on how you wire it up
at its simplest, looping is one agent working on itself:
> researches
> drafts
> checks the draft against a goal
> fixes what is weak
> runs that cycle again until the work clears the requirements
you are not prompting each step anymore. the agent repeats the cycle for you
the bigger version is a fleet looping. you give an orchestrator agent a goal, it breaks the goal into pieces, hands each piece to a specialist agent, and those specialists hand smaller jobs to their own subagents
the whole tree keeps looping through discovery, planning, execution, and verification until the goal is met
one agent looping is like a person redoing their own draft. a fleet looping is a whole team running a project end-to-end
you create a goal, and the system runs the loop until it finishes within the reqs you set
open and closed looping:
OPEN LOOPING is exploratory. it still has conditions and a goal, but you give the agent or the fleet a wide space to move in. it can try different paths, discover things, build something you did not fully spec out
this is the exciting end, it is what Peter and others are doing, and tbh it is where I want to spend more time
the catch is cost, an open loop with real room to explore burns an insane amount of tokens. for the 90 percent of people without an unlimited budget it is not runnable yet, and pointed at projects with a loose standard it turns into a slop machine
CLOSED LOOPING is bounded. a human designs the end-to-end path first:
> clear goal
> defined steps
> an eval at each step
> a point where it stops or hands back to you (and feeds back performance data)
the agents still loop, but inside a framework you built. it gets better every run because each pass feeds the next, and it runs on a normal budget because the path is tight.
for most marketing work, closed is the one that pays off today.
> the orchestrator owns the goal
> the specialists own the steps
> the subagents do the narrow work
> an eval gate makes sure it's not slop
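a rough sketch of a closed loop, assuming each step is a (name, work, eval) triple and a step that keeps failing its eval hands back to the human. `run_closed_loop` and the toy steps are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (step name, function that does the work, eval gate for that step)
Step = Tuple[str, Callable[[str], str], Callable[[str], bool]]


def run_closed_loop(goal: str, steps: List[Step], max_attempts: int = 3) -> Dict:
    artifact = goal
    log = []                                   # performance data fed back to you
    for name, work, passes in steps:
        for attempt in range(1, max_attempts + 1):
            candidate = work(artifact)
            ok = passes(candidate)
            log.append((name, attempt, ok))
            if ok:
                artifact = candidate           # only gated work moves forward
                break
        else:
            # every attempt failed the gate: stop and hand back
            return {"status": "handed_back", "failed_step": name,
                    "artifact": artifact, "log": log}
    return {"status": "done", "failed_step": None, "artifact": artifact, "log": log}


# toy steps (assumptions): the second gate can never pass, to show the hand-back
steps: List[Step] = [
    ("draft", lambda s: s + " | draft", lambda s: "draft" in s),
    ("headline", lambda s: s + " | headline", lambda s: len(s) < 5),
]

result = run_closed_loop("launch page", steps)
print(result["status"], result["failed_step"])  # prints: handed_back headline
```

the path is fixed up front, so the worst case is bounded: steps times max attempts, then it stops.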
best accounts to follow from each frontier lab to stay constantly up to date
Anthropic
@karpathy - must-follow account for AI; recently joined Anthropic
@bcherny - Claude Code creator, always shares great tips
@trq212 - also a Claude Code developer; writes amazing articles on CC
OpenAI
@polynoamial - works on reasoning research, shares a lot of technical details
@gabriel1 - Sora developer, great career path
@jxnlco - works on dev experience, shares a lot about Codex
Google AI
@OfficialLoganK - all the major Google Gemini and AI Studio updates
@ammaar - product and design; shares great things about vibe-coding in Google AI Studio
@fofrAI - cool use cases for generative models
Cursor
@leerob - the loudest voice behind Cursor updates
@ericzakariasson - shares great insights on using Cursor
@mntruell - Cursor’s CEO; major releases and usage updates
xAI
@milichab - recently joined xAI, shares updates on Grok
@skcd42 - also covers major Grok releases
@elonmusk - Elon does a great job reposting and hyping all xAI products
who else did I miss?
“design a RAG pipeline for 10M docs with zero hallucination”
apparently this was asked in a Google L5 interview round. came across it somewhere on the internet and honestly it’s a way more interesting system design problem than most classic distributed systems questions
1. ingest + normalize docs
- remove duplicates, standardize formats, extract metadata, maintain version history
2. hybrid retrieval (BM25 + embeddings)
- BM25 handles exact keyword matching while embeddings capture semantic meaning
- semantic search alone usually struggles with precision at massive scale
3. ANN retrieval + reranking
- ANN (approximate nearest neighbor) search quickly pulls top candidate chunks from millions of docs
- then a reranking step improves relevance by deeply comparing the query against each retrieved chunk
4. source confidence scoring
- every retrieved chunk gets scored based on freshness, trust level, overlap and retrieval consistency
- low-confidence context should never heavily influence generation
5. constrained generation
- the model is only allowed to answer using retrieved context (nothing new to be invented outside of the retrieved context)
6. citation-backed responses
- every major claim links back to exact chunks, documents or timestamps
7. hallucination fallback layer
- if retrieval confidence drops below a threshold: “insufficient evidence found”
8. continuous evals
- run adversarial queries, retrieval recall benchmarks and hallucination tests continuously
9. caching + memory layer
- cache high-frequency enterprise queries and retrieval paths (improves latency and output)
10. observability everywhere
- trace retrieval paths, chunk rankings, token attribution and failure points
Also at 10M docs, retrieval quality matters more than the frontier model itself.
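a minimal sketch of steps 2 and 7, assuming the two rankings are fused with reciprocal rank fusion (one common choice, the list above doesn't name a fusion method). the doc ids, confidence scores, and threshold are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a doc's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # highest score first, doc id as a deterministic tie-breaker
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def select_context(
    fused: List[Tuple[str, float]],
    confidence: Dict[str, float],
    threshold: float = 0.5,
    top_n: int = 3,
) -> Dict:
    """Keep only confident chunks; fall back when none clear the threshold."""
    kept = [doc for doc, _ in fused[:top_n] if confidence.get(doc, 0.0) >= threshold]
    if not kept:
        return {"status": "insufficient evidence found", "citations": []}
    return {"status": "generate", "citations": kept}


bm25_hits = ["d1", "d2", "d3"]       # keyword ranking (hypothetical ids)
dense_hits = ["d2", "d4", "d1"]      # embedding ranking (hypothetical ids)

fused = rrf_fuse([bm25_hits, dense_hits])
print([doc for doc, _ in fused])     # prints: ['d2', 'd1', 'd4', 'd3']

print(select_context(fused, {"d2": 0.9, "d1": 0.2, "d4": 0.7}))
print(select_context(fused, {}))     # no confident chunk -> fallback
```

d2 wins because both retrievers rank it high, and d1 is dropped from the context even though it ranks second, because its confidence is under the threshold.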
🚨LATEST: The US has officially lifted the chip ban on China, per Reuters.
Alibaba, Tencent, and ByteDance are among 10 Chinese firms now approved to buy Nvidia's H200 chips.
China previously represented a market worth up to $8B annually and nearly a quarter of Nvidia’s revenue before the October 2023 export controls crushed the company’s market to nearly zero.
$NVDA shares surged to a new 52-week high, up +8% today after the U.S. Department of Commerce's approval.
Security things from the last few days:
- CopyFail (linux pwn'd)
- CopyFail 2/Dirty Frag
- 13 advisories in Next.js
- Over 70 CVEs addressed in macOS 26.5
- ~50 CVEs addressed in iOS 26.5
- YellowKey (Windows BitLocker pwn'd entirely)
- GreenPlasma (Windows privilege escalation)
- CVE-2026-21510 and CVE-2026-21513 confirmed to be used by Russia for Windows RCE
- CVE-2026-32202 separately confirmed to be used by Russia for sensitive document access
- Mini-Shai Hulud (over 300 JS and Python packages compromised via GitHub Action cache poisoning)
- Google confirms they have identified AI-powered exploitation of zero days in an unidentified "open-source, web-based system administration tool"
- Canvas (popular LMS used in most schools) pwn'd entirely
- PAN-OS (palo alto networks) pwn'd with a 9.3 severity CVE-2026-0300
Are you scared yet?
KIMI FOUNDER JUST DROPPED A 40-MINUTE MASTERCLASS.
The exact architecture behind a $20B valuation — there's no faster way to learn how to build AI agents right now.
Bookmark this for the weekend.
40 minutes. zero fluff. from the person who built it.
Optimization → Linear Attention → Sub-Agents → Open Systems → Cash
FRONTEND IS DEAD
BACKEND IS DEAD
CLOUD COMPUTING IS DEAD
MOBILE DEV IS DEAD
DEVOPS IS DEAD
DATA SCIENCE IS DEAD
UI/UX IS DEAD
FULL STACK IS DEAD
GAME DEV IS DEAD
OPEN SOURCE IS DEAD
STARTUPS ARE DEAD
SAAS IS DEAD
APIs ARE DEAD
DATABASES ARE DEAD
MICROSERVICES ARE DEAD
SERVERLESS IS DEAD
KUBERNETES IS DEAD
DOCKER IS DEAD
VERSION CONTROL IS DEAD
DEBUGGING IS DEAD
TESTING IS DEAD
As an AI engineer, please learn:
-Prompt caching & semantic caching tradeoffs
-KV cache management at scale
-Speculative decoding vs quantization
-RAG evaluation (RAGAS + human evals)
-Cost monitoring & hidden token leaks
-Agent guardrails & infinite loop detection
Stop buying more VRAM.
Everyone’s posting Qwen 3.6 configs running insanely fast on 12GB cards.
But do you actually understand the flags making it possible? Weights are only half the story. KV cache is eating your VRAM alive.
The secret isn’t just 4-bit weights; it’s the KV cache sorcery everyone’s missing.
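A back-of-the-envelope sketch of why the KV cache matters, assuming standard full-attention layers that store one key and one value vector per KV head per token. The model dimensions below are hypothetical round numbers, not Qwen's actual config.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: float = 2.0,   # 2.0 = fp16, 1.0 = 8-bit cache
    batch: int = 1,
) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer, per token."""
    return int(2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch)


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 32k context.
fp16_cache = kv_cache_bytes(32, 8, 128, 32_768, bytes_per_elem=2.0)
q8_cache = kv_cache_bytes(32, 8, 128, 32_768, bytes_per_elem=1.0)

print(fp16_cache / GIB)  # prints: 4.0
print(q8_cache / GIB)    # prints: 2.0
```

On a 12GB card that fp16 cache alone would take a third of the VRAM before any weights are loaded, and it grows linearly with context length. Halving the cache precision or the context halves it.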
Here’s the annotated command & real tricks explained:
@elonmusk@grok #Ai
// Agentic Harness Engineering //
Pay attention to this one, AI devs.
(bookmark it)
Most coding-agent harnesses are still tuned by hand or brittle trial-and-error self-evolution.
This new work introduces Agentic Harness Engineering, a framework that makes harness evolution observable. They do this through three layers: components as revertible files, experience as condensed evidence from millions of trajectory tokens, and decisions as falsifiable predictions checked against task outcomes.
Each edit becomes a contract you can verify or revert.
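One way to read "a contract you can verify or revert", as a rough sketch and not the paper's implementation: an edit to a harness file ships with a predicted score gain, and it is rolled back if the measured gain falls short. The function name, the file, and the toy `evaluate` are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def apply_edit_with_contract(
    component: Path,
    new_text: str,
    evaluate: Callable[[], float],
    predicted_gain: float,
) -> Dict:
    """Apply an edit, check its prediction against the outcome, revert on a miss."""
    old_text = component.read_text(encoding="utf-8")
    baseline = evaluate()
    component.write_text(new_text, encoding="utf-8")
    score = evaluate()
    kept = (score - baseline) >= predicted_gain    # the falsifiable prediction
    if not kept:
        component.write_text(old_text, encoding="utf-8")   # revert the file
    return {"baseline": baseline, "score": score, "kept": kept}


component = Path(tempfile.mkdtemp()) / "system_prompt.txt"
component.write_text("You are a coding agent.", encoding="utf-8")


def evaluate() -> float:
    # Toy stand-in for running a task suite against the current harness.
    text = component.read_text(encoding="utf-8")
    return 1.0 if "run the tests" in text else 0.5


good = apply_edit_with_contract(
    component, "You are a coding agent. Always run the tests.", evaluate, predicted_gain=0.25
)
bad = apply_edit_with_contract(component, "Be brief.", evaluate, predicted_gain=0.1)

print(good["kept"], bad["kept"])  # prints: True False
print(component.read_text(encoding="utf-8"))
```

After the second edit misses its prediction, the file is back to the version that earned its place.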
Results: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations, beating human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO.
The evolved harness also transfers across model families with +5.1 to +10.1 point gains, while using 12% fewer tokens than the seed on SWE-bench-verified.
Harness work is the biggest hidden cost in most agent systems. This is the first credible recipe for letting the harness improve itself without drifting into noise.
Paper: https://t.co/9fEgqwlTSf
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX