Been iterating on @tomosman's loop.
This one's winning:
/goal produce a verified, code-derived behavioral spec for this web platform, captured in one canonical spreadsheet that carries every feature from spec -> tested -> fixed -> verified.
Why: we need a single source of truth that maps every feature to its expected behavior *as the code implements it*, so that gaps and bugs surface and the platform can be driven to a known-good state. The spreadsheet is the source of truth.
Work on the current repo. Do Phase 0 and Phase 1 under this goal; when the spec is complete, switch into the /loop below to drive testing and remediation. Keep moving through phases without stopping, except at a real checkpoint (defined below).
Phase 0 - Plan (first): Detect the stack, the feature surface (routes, pages, components, API endpoints, background jobs, auth, settings…), and the test infra that already exists (unit/integration/e2e, browser automation, seeds/fixtures, a runnable dev server).
Propose (a) how you'll inventory features, (b) the spreadsheet schema, and (c) how you'll test in the loop given what's available. Proceed once the plan holds.
Phase 1 - Catalog & spec: Read the code and, for every feature, write a user story + the expected behavior as implemented, citing the file/function. Where the code is ambiguous, or behavior is undefined, log an open question - don't guess. Record every feature as a row in the canonical spreadsheet (create with the xlsx skill). Exit: every discoverable feature has a row.
One row, concretely:
| Area | User story | Expected behavior (from code) | Status | Defects | Type | Notes / source |
|---|---|---|---|---|---|---|
| Auth | As a returning user I want to log in with email+password so I can reach my dashboard | `POST /api/login` validates via bcrypt, sets httpOnly session cookie, 302 -> `/dashboard`; bad creds -> 401 + inline error | Spec'd | - | - | `api/auth/login.ts`, `LoginForm.tsx` |
Canonical artifact: exactly one .xlsx, updated in place across every phase and loop iteration - never fork into per-phase or per-iteration files. Status flows Spec'd -> Tested-Pass / Tested-Fail -> Fixed -> Verified. The main thread is the single writer.
Agentic execution:
- Delegate breadth to subagents: fan feature discovery and per-area testing across subagents so the main thread stays focused.
- Verify by running, not claiming - report real command/test output; state skips and unknowns plainly.
- Checkpoint (pause, ask, end the turn) only for a destructive/irreversible action, a fix needing a genuine product decision, or input only I can give. Otherwise, keep going.
- Self-check at each phase/loop boundary via a fresh-context subagent: re-verify the spreadsheet against the code (Phase 1) and against actual results (each loop pass).
/loop Quality cycle - once the spec is complete, iterate test -> fix -> re-test until clean.
Each iteration, in order:
1. Test: exercise every user story not yet Verified against the running app, preferring the strongest method available (browser/e2e automation > existing suites > documented static check only where execution truly isn't possible). Record actual pass/fail in the same spreadsheet; log every defect with its type (functional/logistical or UX). No app-behavior changes in this step.
2. Fix: think hard about root cause, then fix every functional/logistical and UX defect logged this iteration - cause, not symptom. Scope: only logged defects; no new features, no unrelated refactors. Update each row's status.
3. Re-test: re-run every story touched by a fix using the same method; set Verified, or back to Tested-Fail with notes if the fix didn't hold.
Exit when all user stories are Verified and no open functional/UX defects remain. Safety cap: if a story is still failing after 3 full iterations, stop, leave it Tested-Fail with root-cause notes, and report it rather than looping further.
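To make the status flow and the safety cap in the prompt above concrete, here's a minimal sketch in Python. It is not part of the original prompt. The `Story` class, the `ALLOWED` table, and the choice to count an iteration as failed when a re-test sends a story from Fixed back to Tested-Fail are all assumptions for illustration, as is the Tested-Pass -> Verified edge (the prompt only lists the statuses in order).

```python
from dataclasses import dataclass, field

# Status flow named in the prompt:
# Spec'd -> Tested-Pass / Tested-Fail -> Fixed -> Verified.
# The exact edges below are an assumption (e.g. Tested-Pass -> Verified).
ALLOWED = {
    "Spec'd": {"Tested-Pass", "Tested-Fail"},
    "Tested-Pass": {"Verified"},
    "Tested-Fail": {"Fixed"},
    "Fixed": {"Verified", "Tested-Fail"},
    "Verified": set(),
}

MAX_FAILED_ITERATIONS = 3  # safety cap from the prompt


@dataclass
class Story:
    """One spreadsheet row (hypothetical shape, reduced to what the sketch needs)."""
    name: str
    status: str = "Spec'd"
    failed_iterations: int = 0
    notes: list = field(default_factory=list)

    def move_to(self, new_status: str, note: str = "") -> None:
        """Apply a status change, rejecting anything outside the allowed flow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status} -> {new_status} is not allowed")
        # A fix that did not hold counts as one failed iteration (assumption).
        if self.status == "Fixed" and new_status == "Tested-Fail":
            self.failed_iterations += 1
        self.status = new_status
        if note:
            self.notes.append(note)

    def should_stop(self) -> bool:
        """True once the story has hit the cap and should be reported, not looped."""
        return (
            self.status == "Tested-Fail"
            and self.failed_iterations >= MAX_FAILED_ITERATIONS
        )


story = Story("log in with email+password")
story.move_to("Tested-Fail", "first test run failed")
for attempt in range(1, 4):
    story.move_to("Fixed")
    story.move_to("Tested-Fail", f"fix attempt {attempt} did not hold")
print(story.should_stop())  # True
```

Rejecting illegal transitions in code is one way to keep a single-writer spreadsheet honest: a row can't jump from Spec'd to Verified without a test result in between.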
The most valuable skill sets on the planet right now:
1. people who can set up agents properly, manage them, and run local AI models
2. marketers who know how to build distribution
3. robotics engineers who can do all three: build the hardware, wire in the AI, and source manufacturing etc
4. curators who are good at yapping and can do short form video in their sleep
5. the builder-distributor. The one person who can both ship the product AND get it in front of people
6. IRL community builders
The best agent loops need the right tools
→ https://t.co/rld639yw28
Verify changes in a real browser
→ https://t.co/9cfWpDEn6E
No port conflicts. Worktree-friendly.
→ https://t.co/hM1tFKcfee
Emulate third-party APIs
→ https://t.co/zt2ZnrXivv
Image + video gen via CLI
If you're on your way to building a billion dollar company that involves a web app, here are some of my notes on architecting the frontend.
if you don't do this, it's probably fine, but one day you'll hire someone to fix it, and that person could be doing higher-value work if you make a few key optimizations on day 1
you don't even have to learn anything you're gonna tell your agents to do it anyways!
okay here it goes:
- Make your server code generate an OpenAPI spec, which then generates all the relevant client-side code. Never do this by hand. Typing backend types by hand instead of generating them should be banned
- You need to make a decision on how the client talks to the backend. rest/graphql works in which case please just use tanstack query. other libraries will look similar but tanstack query truly is goated.
- if you want linear style sync setups or offline mode, think about this HARD and architect it from day 1. Bolting this on later is so tedious.
- People like using plain react router but things have gotten a lot better since then. Try their new framework mode or just even use tanstack router. Use route data loaders.
- If you store a lot of state in query params, make that a first class citizen and make sure it's type safe. use nuqs or tanstack router.
- Most apps just need a single state management solution for server state and that's it. If you have other bespoke needs, I quite like zustand and xstate/store.
- If you have a super interactive app where things come in and out of view, there's a lot of frontend state to maintain, music is playing and whatnot, lock in and learn xstate. Trust me, if you wanna keep ur sanity, you need to model ur frontend as a state machine, otherwise you're gonna be deep in useEffect hell
- React compiler is here my friends, the days of useMemo and useCallback are gone. Update your priors accordingly
- Tailwind is easy and fun but makes it really hard to maintain a large app with consistent styling. You need an "agent-first design system/component library" but maybe this is a rant for another day
- Don't be afraid to hack your routing library to fit your needs more closely. A lot of apps have "drawers" to show additional info. You should 100% be able to say "here's a route, make it a drawer" and everything should be handled from there.
- Managing loading and error states using isPending and isError is madness. Lean into Suspense and ErrorBoundary.
- Figuring out a blessed path for websockets and SSE on day 1 i think will pay dividends in the long term if you're building anything AI related.
- If you're building a SPA, don't use next.js. it literally makes no sense. Why would you do this.
- Definitely deploy on Cloudflare or vercel. There are other services but trust, they have weird missing features.
- Assuming you build something people want, the next job is to build the factory so it can efficiently build the thing. Act accordingly.
React → https://t.co/a4QDSs9wxd
Next.js → https://t.co/nDDXqUmgw5
@aisdk is more relevant than ever, given the intense model competition landscape. Just today, GLM 5.2, an open model, surpassed Opus 4.8 in our Next.js Evals (https://t.co/aporqgIfIh) 🤯
But the world needs a practical solution for how to build and deploy agents. Just like React needed Next.js to solve the task of building an actual web application. And that's eve.
Announcing mattpocock/skills v1
- Achieved a 63% reduction in token cost for skill descriptions
- Split skills into model-invocable and user-invocable skills, adding /codebase-design, /domain-modeling, and /grilling
- (UPDATED) /writing-great-skills - rewritten from the ground up, encoding my skill-writing best practices
- (UPDATED) /diagnose -> /diagnosing-bugs - now model-invocable, awesome for fixing hard bugs
- (NEW) /ask-matt: a router skill that teaches you how all the engineering skills work together
Karpathy said something you'll regret ignoring:
"Remove yourself as the bottleneck. Maximize your leverage. Put in very few tokens, and a huge amount of stuff happens on your behalf."
Loop engineering is the exact thing that does that.
In a hand-run session, the operator handles two things:
- deciding what the agent runs next
- and checking its output before the next step
Both are manual, and both decide how far the agent gets on its own without the operator.
Loop engineering moves both steps into the system.
A core operating structure surrounds the loop, and the diagram below depicts it.
- A schedule decides what to run
- The loop is the maker that produces the work
- A separate checker agent grades the output
- A file on disk holds the state they both read.
The loop runs until one of three things happens: the work is done, it hits max iterations, or the budget is exhausted.
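A rough sketch of that structure in Python, assuming the maker and checker are plain callables standing in for agent calls. `run_loop`, `toy_maker`, `toy_checker`, and the flat `cost_per_call` budget accounting are all made up for illustration, not taken from any real framework.

```python
from typing import Callable


def run_loop(
    maker: Callable[[str], str],
    checker: Callable[[str], list],
    task: str,
    max_iterations: int = 5,
    token_budget: int = 10_000,
    cost_per_call: int = 1_000,  # assumed flat cost per agent call
) -> tuple:
    """Run maker -> checker until done, max iterations, or budget exhausted.

    Returns (last_output, stop_reason).
    """
    instruction = task
    output = ""
    spent = 0
    for _ in range(max_iterations):
        # Each pass costs one maker call plus one checker call.
        if spent + 2 * cost_per_call > token_budget:
            return output, "budget exhausted"
        output = maker(instruction)
        findings = checker(output)
        spent += 2 * cost_per_call
        if not findings:
            return output, "done"
        # The checker's findings become the maker's next instruction.
        instruction = task + "\nFix these findings:\n" + "\n".join(findings)
    return output, "max iterations"


def toy_maker(instruction: str) -> str:
    # Stand-in for an agent: only fixes the version when told it is stale.
    return "version=2.0" if "stale version" in instruction else "version=1.0"


def toy_checker(output: str) -> list:
    # Stand-in for a separate grader with a trivially verifiable rule.
    return [] if output == "version=2.0" else ["stale version string"]


result, reason = run_loop(toy_maker, toy_checker, "bump the version")
print(result, reason)  # version=2.0 done
```

All three stop conditions are arguments fixed before the first call, so nothing inside the loop can talk itself into running longer.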
Here are some practical engineering considerations:
1) A model grading its own output justifies what it already did instead of catching where it failed.
That's why a separate checker's findings return to the maker as the next instruction. And the cycle repeats until the checker finds nothing left to fix.
2) A loop with no stop condition burns tokens, and the cost climbs fast once sub-agents and long runs add up.
That's why the exit must be set before the loop runs, not while it is running.
A simple exit could be:
↳ fix only the major issues, run one final pass, and stop after two loops, with "all tests pass and lint clean" as the rule that ends it.
3) State has to live on disk, not in context.
The model forgets everything between runs, so an MD file or a knowledge graph holds what is done and what is still open.
Each run reads it and writes back to it, which lets a loop pick up again after days.
4) The easier a result is to verify, the safer the loop.
Boring, repetitive checks like a stale version string or a missing test are trivial to verify, so a loop runs them with little risk while the operator is away.
Judgment-heavy work is loopable too, but only as far as the checker can confirm the result.
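For point 3, here's a minimal sketch of state on disk, assuming a JSON file instead of the MD file or knowledge graph mentioned above (JSON just keeps the example short). The file name, the `done`/`open` keys, and the helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_state(path: Path) -> dict:
    """Read what is done and what is still open; start empty on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"done": [], "open": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then swap it in, so a crash can't leave half a file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def run_once(path: Path, backlog: list) -> Optional[str]:
    """One run: read state, do the next open item, write state back."""
    state = load_state(path)
    for item in backlog:
        if item not in state["done"] and item not in state["open"]:
            state["open"].append(item)
    if not state["open"]:
        save_state(path, state)
        return None
    item = state["open"].pop(0)
    # The real agent work for `item` would happen here.
    state["done"].append(item)
    save_state(path, state)
    return item


with tempfile.TemporaryDirectory() as workdir:
    state_file = Path(workdir) / "loop_state.json"
    backlog = ["bump version string", "add missing test"]
    first = run_once(state_file, backlog)   # a "run" today
    second = run_once(state_file, backlog)  # a later run picks up where it left off
    final_state = load_state(state_file)

print(first, "|", second)
```

Each call to `run_once` shares nothing in memory with the previous one, which is the point: the file is the only thing that carries over.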
An unattended loop fails in two ways.
1) It reports done when nothing is actually verified.
The separate checker exists to prevent that, but the loop still merges code faster than anyone reads it, so over weeks, the team stops understanding its own codebase while every check stays green.
Green tests say the code passed the tests, not that anyone knows what shipped. Someone still has to read what the loop merges.
2) The checker keeps a running loop honest, but it only catches failures inside a run.
The harness around the loop, like the prompts, tools, and checks wrapped around the model, still drifts and breaks in production as models change.
That repair loop is usually run by hand based on observability traces.
My co-founder wrote a detailed walkthrough (with code) on making that harness repair itself, where a failing trace gets diagnosed, the fix is verified against the exact input that failed, and the failure is locked as a regression test so it cannot recur.
Read it below.
180k views. 9000 visits. 313 PR teams. 119 countries. 224 stars.
day 3.
that's where https://t.co/aq7DBz7en5 landed since I open-sourced it tuesday.
I’ve spent basically the last 48 hours replying to messages and emails, helping people set up their PR agent teams and seeing how they’re actually using it.
who's running it:
founders doing their own PR, PR agencies, SEO teams, comms leads.
real teams, real pitches, real beats.
what their agents are now doing:
- monitoring for newsjackable stories
- generating real story angles journalists will run
- pulling fit-checked journalist lists with current bylines in minutes
- scoring stories before pitching, so they only show up in inboxes when they actually have something
what i'm building next:
- better setup instructions for humans (and agents)
- video walkthroughs from install to first pitch
- more skills that cover the end-to-end pr workflow in 2026
thanks to everyone who tried it and sent feedback!
curl -fsSL https://t.co/aq7DBz7en5 | bash
open source. mit. free forever.
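An HTML skill for coding agents, built to generate clean architecture diagrams, plan pages, and visual docs.
https://t.co/rrYGQlWbYJ
effective-html is a set of HTML skills for agents, focused on self-contained, good-looking HTML deliverables. It has three sub-skills: html for general pages, html-diagram for full-screen architecture and system diagrams (SVG first), and html-plan for plan pages.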
Not a fan of Knowledge Graphs, but recently I started using them more often for a surprising reason: to build non-trivial private verifiers for agentic search. For those who don't know, building a private eval set for a scaffolded LLM in 2026 is really challenging, like seriously hard. It takes a lot of effort to find a question that's non-trivial to a scaffolded LLM yet still answerable. To find those question-answer pairs, I built a knowledge graph extractor where you can throw a corpus at it, and it extracts the entity relations using qwen3.6-35b-a3b-MTP on an L4 at 70 tps (which is really good for such a low-budget GPU). Then I mark out the longest path in the graph and use it to generate challenging question-answer pairs. The idea is to find those genuinely multi-hop fact chains that are verifiable from the corpus, to stress-test the agentic search system.
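A toy sketch of the longest-path step in Python, assuming the entity relations have already been extracted as (head, relation, tail) triples. The triples below are made up, the function names are hypothetical, and the brute-force search is only reasonable for small graphs. This is not the extractor described above, just the idea of turning the longest chain into one multi-hop question.

```python
from collections import defaultdict


def longest_chain(triples: list) -> list:
    """Find the longest simple path (no repeated entity) by exhaustive DFS."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))

    best = []

    def dfs(node, visited, path):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for relation, tail in graph.get(node, []):
            if tail not in visited:
                visited.add(tail)
                path.append((node, relation, tail))
                dfs(tail, visited, path)
                path.pop()
                visited.remove(tail)

    for start in list(graph):
        dfs(start, {start}, [])
    return best


def chain_to_qa(chain: list) -> tuple:
    """Nest the relations into one question; the answer is the last entity."""
    phrase = chain[0][0]
    for _, relation, _ in chain:
        phrase = f"the {relation} of {phrase}"
    return f"What is {phrase}?", chain[-1][2]


# Made-up triples, read as "the <relation> of <head> is <tail>".
triples = [
    ("Acme", "founder", "Dana Lee"),
    ("Dana Lee", "birthplace", "Oslo"),
    ("Oslo", "country", "Norway"),
    ("Acme", "headquarters", "Berlin"),
]

chain = longest_chain(triples)
question, answer = chain_to_qa(chain)
print(question)  # What is the country of the birthplace of the founder of Acme?
print(answer)    # Norway
```

Every hop in the chain maps back to a triple from the corpus, which is what makes the answer checkable.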
Here's a simple loop: Tell codex to maintain your repos, wake up every 5 minutes and direct work to threads. That makes it easy to parallelize+steer work as needed.
I use an orchestrator skill combined with my triage+autoreview+computer use skills, so some work can land autonomously. https://t.co/FbBoJTIcfd
https://t.co/8389roVnOm
this is a great read and really solid work. fable is proof that these models are getting crazy good but you still have to manage the context.
here's some DSPy (ax) code to build your own version of the harness in the article.
experiment → independent verifier → correction → verified memory.
“Skillify it” is the way you should write most tasks
Write markdown that makes code
Don’t make elaborate Foxconn factories to call agents
Let agents make their own tools, kaizen style
Today we're announcing that hybrid agentic inference is coming to Perplexity Computer.
Computer can split tasks between a local model running on your machine and frontier models in the cloud. This keeps private data on your device and maximizes token efficiency.
Coming soon.
@grok I thought Rhino 3D was desktop software. Is it accurate that Hermes is able to reliably control Rhino? The demo is suspect. Why use Claude and Rhino? Why not just have Hermes build the entire thing in Blender with local LLMs? That would've been a better demo.