Loop engineering shifts AI work from prompting agents to designing the system around them: goals, tools, context, stopping rules, and verification.
The loop is easy; the hard part is preventing context rot, enabling safe tool use, and proving "done" instead of trusting the agent.
The next bottleneck in agentic systems isn't the model. It's the skills we give the agent.
Instead of manually tweaking prompts, new systems like SkillOpt, GEPA, and EvoSkill optimize skill files through rollout, evaluation, reflection, and validation.
Prompt engineering is becoming loop engineering.
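To make the rollout, evaluation, reflection, and validation cycle concrete, here is a minimal sketch. It is not how SkillOpt, GEPA, or EvoSkill are implemented; `optimize_skill`, `toy_agent`, and `toy_reflect` are hypothetical names, and the agent and reflection step are toy stand-ins for real model calls.

```python
from typing import Callable, List, Tuple

Task = Tuple[str, str]  # (input, expected output)

def evaluate(skill: str, tasks: List[Task],
             run_agent: Callable[[str, str], str]) -> float:
    """Rollout + evaluation: run the agent on each task, return the pass rate."""
    passed = sum(1 for prompt, expected in tasks
                 if run_agent(skill, prompt) == expected)
    return passed / len(tasks)

def optimize_skill(skill: str, train: List[Task], holdout: List[Task],
                   run_agent: Callable[[str, str], str],
                   reflect: Callable[[str, List[Task]], str],
                   rounds: int = 3) -> Tuple[str, float]:
    """Keep a revised skill only if it scores better on held-out tasks."""
    best, best_score = skill, evaluate(skill, holdout, run_agent)
    for _ in range(rounds):
        # Rollout on training tasks and collect the failures.
        failures = [(p, e) for p, e in train if run_agent(best, p) != e]
        if not failures:
            break
        candidate = reflect(best, failures)              # reflection
        score = evaluate(candidate, holdout, run_agent)  # validation
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins (assumptions): a real system would call a model here.
def toy_agent(skill: str, prompt: str) -> str:
    return prompt.upper() if "uppercase" in skill else prompt

def toy_reflect(skill: str, failures: List[Task]) -> str:
    return skill + " Always answer in uppercase."

best, score = optimize_skill("Answer briefly.", [("a", "A")], [("b", "B")],
                             toy_agent, toy_reflect)
```

The design choice worth noting is the held-out set: a revised skill file is accepted only when it improves on tasks the reflection step never saw, which guards against overfitting the skill to its own failures.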
How much we can automate with AI depends on whether the outcome is verifiable.
If the output can be checked reliably, automation can go much further. If not, we need stronger human judgment in the loop.
If AI Is So Powerful, Why Do Humans Still Need to Understand the Work?
We can outsource the mechanics of thinking (writing, coding, content generation, and similar tasks) to AI agents, but we cannot delegate understanding. Our own understanding determines how effectively we can direct the machine. Without strong domain understanding, we cannot be good directors.
Why Agents Can't Replace the Search Stack
Agentic search models can learn how to surface relevant results, but their training data still shapes their strengths and blind spots. LLMs cannot reliably evaluate what they do not know or information introduced after their training cutoff, and there is still no real substitute for missing knowledge.
MIT on Cognitive Debt: Use AI as a Finishing Tool
MIT study key findings:
- Researchers coined "cognitive debt" to describe how AI spares mental effort short-term but causes long-term costs including diminished critical thinking and shallow information processing.
- When brain-only users switched to AI, they showed increased brain connectivity, suggesting AI used properly could enhance learning.
- Researchers recommend delaying AI integration until learners have first engaged in sufficient self-driven cognitive effort.
The takeaway: use AI as a finishing tool, not a starting one.
Own the Outcome, Not the Interface
While AI may commoditize many simple, bounded, and generic workflows, it also creates a big opportunity for us to turn model capabilities into measurable business outcomes. The opportunity is in owning the outcome, not the interface.
For what it's worth, I think we should be careful not to build thin AI wrappers. Instead, we should focus on high-value metrics, tight feedback loops, and using evidence to expand automation where it actually works.
The real advantage comes from combining AI with our domain context (which frontier labs do not have), workflow ownership, evals, integrations, and the living memory we build from real customer and operational interactions.
We can treat AI initiatives as a portfolio of experiments: move quickly, measure rigorously, learn continuously, and double down where AI improves a metric the business already cares about.
Enterprise AI: Governance Must Be Executable
Most enterprise AI failures come from slow organizational processes, not technical limitations. Governance needs to become executable and automated, not dependent on manual approval chains. AI investments should be treated like a portfolio of bets, with learning and upside prioritized over fixed upfront ROI. Delivery should be hypothesis-driven: build, evaluate, and iterate based on evidence. Trust should grow progressively, from shadow mode to advisory mode to controlled autonomy.
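One way to read "executable governance" is a promotion rule that code can evaluate instead of an approval chain. The sketch below is an illustration under assumptions: the `Mode` names mirror the shadow, advisory, and controlled-autonomy stages above, while `Evidence`, `MIN_DECISIONS`, and the agreement thresholds are hypothetical values that a real risk owner would set.

```python
from dataclasses import dataclass
from enum import IntEnum

class Mode(IntEnum):
    SHADOW = 0      # AI runs silently; output is logged and compared, never used
    ADVISORY = 1    # AI suggests; a human decides
    AUTONOMOUS = 2  # AI acts within agreed limits

@dataclass
class Evidence:
    decisions: int    # evaluated decisions observed in the current mode
    agreement: float  # fraction matching the human or ground-truth outcome

# Hypothetical promotion thresholds (assumptions, not recommendations).
MIN_DECISIONS = 200
MIN_AGREEMENT = {Mode.SHADOW: 0.90, Mode.ADVISORY: 0.97}

def next_mode(current: Mode, ev: Evidence) -> Mode:
    """Promote one step when the evidence clears the bar; otherwise stay put."""
    if current is Mode.AUTONOMOUS:
        return current
    if ev.decisions >= MIN_DECISIONS and ev.agreement >= MIN_AGREEMENT[current]:
        return Mode(current + 1)
    return current
```

With these example thresholds, `next_mode(Mode.SHADOW, Evidence(500, 0.95))` promotes to advisory mode, while the same evidence in advisory mode is not enough to reach autonomy.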
The Harness Is the Bottleneck, Not the Model
A new method boosts frozen LLM agents by 88.5% by fixing the runtime wrapper, not the model. The model is not the bottleneck; the harness is.
Anthropic IPO: AI's Value Is Growth, Not Cost-Cutting
Interesting Axios piece on Anthropic's IPO timing: the company is leaning heavily on enterprise revenue at a time when businesses are increasingly scrutinizing AI spending.
A recent survey found that 40% of respondents reported cost savings of less than 10%.
I think that framing misses the bigger picture: in large enterprises, AI's value is not primarily about cost-cutting, but about enabling entirely new capabilities, unlocking new workflows, and driving growth.
Research and Trench Engineering: Operating Without a Map
Proven research and trench engineering are not separate skills at frontier labs, but two expressions of the same ability: operating without a map. Research output is not the paper but a refined ability to make progress when certainty is unavailable, and trench engineering at modern AI infrastructure scale is less about accumulating every detail and more about compressing complexity into useful abstractions that predict reality.
Loop Engineering: Adding Agents Moves the Bottleneck
Many of us have heard about "loop engineering" and the idea of spinning up thousands of agents. One caveat I think we shouldn't miss: adding agents doesn't remove the human bottleneck; it moves it.
Coding is only one part of developer work. Reviews, planning, testing, and coordination still determine how much parallel work we can safely absorb.
The real limit is our review bandwidth, not how many agents we can launch. As engineers, we need to watch for "quiet success," where agents succeed in ways we no longer fully understand or track. The risk isn't just failing loudly; it's succeeding quietly while we fall hundreds of commits behind.
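A back-of-the-envelope model shows why the bottleneck moves rather than disappears. This is a deliberately crude sketch: `review_backlog` is a hypothetical helper, and it assumes constant rates and an empty queue at the start.

```python
def review_backlog(agents: int, prs_per_agent_per_day: float,
                   reviews_per_day: float, days: int) -> float:
    """Unreviewed changes after `days`, assuming constant rates and an empty start."""
    produced_per_day = agents * prs_per_agent_per_day
    # The backlog grows only by what exceeds daily review capacity.
    return max(0.0, produced_per_day - reviews_per_day) * days
```

With illustrative numbers, 10 agents opening 3 PRs a day against a team that can review 12 a day leaves 90 unreviewed changes after one work week. Launching more agents raises the first term only; the second term is still human review capacity.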
Co-Design Surface: Model, Harness, Workflow, Evals
The model, harness, workflow, and evaluation loop are no longer separate stack pieces but co-design surfaces that compound together.
Automating the Loop Without Automating Away the Judgment
We can automate more and more of the AI engineering workflow, but we should be careful not to automate away the judgment that makes the work valuable. The real question is not just what parts of the engineering loop can be automated, but who defines what "good" looks like, what failure modes matter, and what trade-offs are acceptable.
Without that human judgment, we risk producing agent slop at scale: systems that generate artifacts, open PRs, and optimize metrics, but still miss the actual product, user, or domain context. To me, the future AI engineering loop should be highly automated but deeply human-guided.
You Can Outsource Thinking but Not Understanding
"You can outsource your thinking but you cannot outsource your understanding." - Karpathy
AI can help you move faster, think broader, and produce better work. But it cannot replace your responsibility to understand the goal, judge the output, and decide whether the final result is actually good.
To get better results from AI, you need more than better prompts. You need a system.
That system has three layers:
Layer 1: Spec
A strong output starts with a strong spec. The basic loop is: tight scope → clear checkpoint → review the output → adjust and repeat. The goal is to avoid vague, oversized tasks. Instead, you want small, focused specs that are easier to review, improve, and execute.
A good spec should do three things:
- Clarify the real goal, not just the task
- Keep the work small and compartmentalized
- Force key decisions to be verified explicitly
The more precise the spec is, the less the AI has to assume. And the fewer assumptions it makes, the better the output becomes.
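The three properties above can be checked mechanically before any work is delegated. The following is a minimal sketch, not a standard format: the `Spec` fields, `lint_spec`, and the `MAX_TASK_WORDS` limit are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Spec:
    goal: str        # the outcome that matters, not just the task
    task: str        # the small unit of work being delegated
    checkpoint: str  # what gets reviewed before continuing
    decisions_to_verify: List[str] = field(default_factory=list)

MAX_TASK_WORDS = 60  # hypothetical size limit used to flag oversized tasks

def lint_spec(spec: Spec) -> List[str]:
    """Return a list of problems; an empty list means the spec passes."""
    problems = []
    if not spec.goal.strip():
        problems.append("missing goal")
    if len(spec.task.split()) > MAX_TASK_WORDS:
        problems.append("task too large; split it")
    if not spec.checkpoint.strip():
        problems.append("missing checkpoint")
    if not spec.decisions_to_verify:
        problems.append("no decisions flagged for explicit verification")
    return problems
```

A word count is a crude proxy for scope, but even a crude check forces the question of whether a task should be split before the model starts making assumptions.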
Layer 2: Verifier
"Models are not animals, they are ghosts." - Karpathy
AI models can produce confident, polished answers that still may be wrong, incomplete, or misaligned with your standards. That is why verification needs to be part of the workflow from the beginning.
A good verification system should include three things:
- Clear evaluation criteria: before the model starts producing the final output, define what "great" looks like.
- A second AI model as the critic: use another model to review the final output, challenge assumptions, identify gaps, and point out weaknesses the first model missed.
- External signals where possible: past examples, documentation, user feedback, metrics, expert review, real-world constraints.
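The three ingredients above combine naturally into a generate-check-revise loop. The sketch below is an assumption-laden illustration: `generate_with_verification` is a hypothetical helper, and `toy_generate`, `has_units`, and `critic` are toy functions standing in for a model call, an external signal, and a second reviewing model.

```python
from typing import Callable, List, Optional, Tuple

# A check returns a problem description, or None when the draft passes.
Check = Callable[[str], Optional[str]]

def generate_with_verification(
    generate: Callable[[str, List[str]], str],
    checks: List[Check],
    task: str,
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Regenerate until every check passes or the round budget runs out."""
    feedback: List[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        feedback = [msg for check in checks if (msg := check(draft)) is not None]
        if not feedback:
            break
    return draft, feedback  # non-empty feedback means unresolved problems

# Toy stand-ins (assumptions): real versions would call models or tools.
def toy_generate(task: str, feedback: List[str]) -> str:
    return "result: 42 (units: ms)" if feedback else "result: 42"

def has_units(draft: str) -> Optional[str]:  # external signal, e.g. a schema check
    return None if "units" in draft else "state the units"

def critic(draft: str) -> Optional[str]:     # second model acting as reviewer
    return None if draft.startswith("result") else "answer the question directly"

draft, open_issues = generate_with_verification(
    toy_generate, [has_units, critic], "measure latency")
```

Returning the remaining feedback alongside the draft matters: when the loop exhausts its rounds, the caller can see that "done" was not proven rather than trusting the last draft.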
The more external signal you provide, the stronger the verification loop becomes.
"Give Claude a way to verify its work. If Claude has a feedback loop, you get 2-3X the quality of final results." - Boris
Layer 3: Environment
The final layer is the environment around the AI. A better environment gives the model better context, better tools, and clearer boundaries. Over time, this improves the quality of every interaction.
A strong AI workspace should include four parts:
- A proper CLAUDE.md file: explains how you work, what standards matter, and how Claude should approach your projects.
- An LLM knowledge base: reusable context such as project details, important decisions, examples, terminology, patterns, lessons learned.
- A reusable skill set: workflows for tasks you repeat often, such as writing specs, reviewing code, improving prompts, auditing outputs, generating test cases, summarizing research, creating implementation plans.
- Clear guardrails: rules for what AI can and cannot do, divided into three buckets: always do / ask first / never do. Enforce these with pre-tool and post-tool hooks.
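The three guardrail buckets can be expressed as data and enforced by a check that runs before each tool call. This is a generic sketch, not the hook API of Claude Code or any other tool: `RULES`, `pre_tool_hook`, and the `tool:argument` pattern format are hypothetical, and "always" is interpreted here as "pre-approved, run without asking".

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Hypothetical rules: (glob pattern over "tool:argument", bucket).
# First match wins, so the most restrictive rules come first.
RULES: List[Tuple[str, str]] = [
    ("shell:rm -rf *", "never"),
    ("write:*.env", "never"),
    ("shell:git push*", "ask"),
    ("shell:pytest*", "always"),
    ("read:*", "always"),
]

def pre_tool_hook(tool: str, argument: str) -> str:
    """Classify a proposed tool call as 'always', 'ask', or 'never'."""
    call = f"{tool}:{argument}"
    for pattern, bucket in RULES:
        if fnmatchcase(call, pattern):
            return bucket
    return "ask"  # anything unrecognized defaults to asking first
```

For example, `pre_tool_hook("shell", "git push origin main")` returns `"ask"`, and an unlisted command such as a `curl` call also falls back to `"ask"`. Defaulting unknown actions to the middle bucket keeps the rule list short without making it permissive.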
Final Thought
The best AI workflows do not remove your judgment. They amplify it. You can let AI help with thinking, drafting, reviewing, and iterating. But you still need to own the goal, the standards, and the final decision.
"You can outsource your thinking but you cannot outsource your understanding." - Karpathy
Who Did the Real Work: Me, Claude, or the Reviewers?
Imagine I'm assigned a new task in an area where I have limited context, limited domain knowledge, or not enough time to do a proper investigation. I'm still excited to see what the final outcome could look like, so instead of deeply researching the problem, I write a few rough sentences and delegate the implementation to Claude. The output looks promising, so I share it with an SME and a few engineers to review the code and validate the approach.
Their feedback is clear: the solution looks like it works, but it is wrong in important ways. Some API fields are obsolete, some assumptions are invalid, and the implementation needs correction. After incorporating their feedback, the final result becomes acceptable.
So the real question is: who actually did the work? The AI that generated the first draft, the humans who caught the mistakes, or the person who orchestrated the whole loop?
AI did not replace the work - it changed where the work happened.
Claude Code Is 98% Traditional Software, Not AI
Claude Code's real architectural strength is not a complex "AI brain," but a large deterministic software harness around a relatively thin model-driven loop. The attached paper estimates that only ~1.6% of the codebase is AI decision logic, while ~98.4% is operational infrastructure: permission gates, tool routing, context compaction, session recovery, extensibility, sandboxing, hooks, and persistence.
In other words, the model decides what to do, but conventional software decides what it is allowed to do, how actions execute, how context is managed, and how failures are recovered.
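The division of labor described above can be sketched in a few lines. This is not Claude Code's actual architecture or code; `run_loop`, `scripted_model`, and the tool names are hypothetical, and the "model" is a fixed script so the example runs without any API.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Action = Tuple[str, str]  # (tool name, argument)

def run_loop(
    model: Callable[[List[str]], Optional[Action]],  # returns None when done
    tools: Dict[str, Callable[[str], str]],
    allowed: Set[str],
    max_steps: int = 10,
) -> List[str]:
    """Thin model-driven loop wrapped in deterministic checks."""
    history: List[str] = []
    for _ in range(max_steps):      # stopping rule owned by the harness
        action = model(history)     # the only model-driven decision
        if action is None:
            break
        name, arg = action
        if name not in allowed or name not in tools:  # permission gate
            history.append(f"denied: {name}")
            continue
        try:
            history.append(f"{name}: {tools[name](arg)}")  # tool routing
        except Exception as exc:    # failure recovery: record it, keep going
            history.append(f"error: {name}: {exc}")
    return history

# Scripted stand-in for a model (assumption): tries a read, then a delete.
def scripted_model(history: List[str]) -> Optional[Action]:
    plan = [("read", "notes.txt"), ("delete", "notes.txt")]
    return plan[len(history)] if len(history) < len(plan) else None

tools = {"read": lambda path: f"contents of {path}",
         "delete": lambda path: "deleted"}
log = run_loop(scripted_model, tools, allowed={"read"})
```

Here the model proposes both actions, but only the read executes; the delete is recorded as denied because the harness, not the model, owns the permission set. Even in this toy, most of the lines are conventional control flow rather than model logic.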
Two Profiles I Now Index On
Two kinds of people I increasingly value.
- Creative builders with product sense: they spot the right thing to build and prototype it fast; taste is scarce, typing isn't.
- Deep systems experts for the hard parts: the places where "trust but verify" matters most, where subtly wrong is still wrong.