This "loop" automation is nuts inside of Codex.
"/goal go over every single feature in this app create a user story with expected behaviour based on the code keep a single canonical spreadsheet tracking the features status
- when done switch loop to testing every user story and documenting all errors
- when done fix every logistical error or ux error
- test every user behaviour again post fix"
Shoutout to @MatthewBerman for the heads up.
Hundreds of user stories being worked through like it's nothing.
very timely and as always super clear content from the HF team as companies start to consider fine-tuning models for cost/perf reasons
SFT is the first thing youโll come across
โImitation: Make this look like thatโ
if you have a working agent system, are producing โgoodโ traces, and want to use a much cheaper model
then SFT can help you transfer knowledge from those good traces into another model โ> ideally saving you a lot of time/money!
๐ค A coding agent is not like a regular agent.
A regular agent just reasons and responds. A coding agent has to navigate real codebases, run shell commands, track file changes across a session, and loop back when something breaks. That's a different job, and it needs a different system underneath it.
and that system is the harness: everything built around the model so it can actually do the job, not just talk about it. The model alone doesn't know your repo, your tools, or what happened three turns ago. The harness is what does.
It has 6 components, and each one fixes a different way agents fail. Swipe through to see what they are.
#AIAgents #CodingAgents #DeveloperTools #AgenticAI #SoftwareEngineering #ClaudeCode
Anthropic Product Lead:
"At Anthropic, our engineers are running swarms of 300+ agents daily.
Give your agents 100+ tools - just donโt load them all into context."
In a 30-minute talk, the Anthropic team shows how to deploy agents to production.
Claude + loops + routines + dynamic workflows - thatโs the secret.
Watch the talk, then save the playbook below.
SOMEONE ON REDDIT JUST DROPPED THE CLEANEST AI AGENT BREAKDOWN I'VE SEEN.
7 steps, no fluff:
โซ๏ธ set a measurable goal before touching any model
โซ๏ธ match the model to the task LLM for general, LRM for complex reasoning, SLM for routing
โซ๏ธ pick a framework: LangChain, CrewAI, n8n, or Google ADK
โซ๏ธ add memory or your agent resets every single session
โซ๏ธ connect tools via MCP or function calling
โซ๏ธ compress context before costs spiral
โซ๏ธ test edge cases, not just the happy path
the part nobody mentions: steps 4 and 7 are where most agents die in production
memory makes it intelligent...
testing makes it production-ready... Skip both and your agent works in the demo, breaks in real life
Agentic AI is a $9B market growing 46% annually
the engineers who actually know how to build these won't be available at current rates for long
Vector embeddings do more than just search. ๐๐ฒ๐ฟ๐ฒ'๐ ๐ต๐ผ๐ ๐ด๐ฟ๐ฎ๐ฝ๐ต-๐ฏ๐ฎ๐๐ฒ๐ฑ ๐ฐ๐น๐๐๐๐ฒ๐ฟ๐ถ๐ป๐ด + ๐ต๐๐ฏ๐ฟ๐ถ๐ฑ ๐ฟ๐ฒ๐๐ฟ๐ถ๐ฒ๐๐ฎ๐น ๐๐ผ๐ฟ๐ธ ๐๐ผ๐ด๐ฒ๐๐ต๐ฒ๐ฟ ๐๐ผ ๐ฐ๐น๐๐๐๐ฒ๐ฟ ๐น๐ถ๐๐ฒ ๐ป๐ฒ๐๐ ๐ฑ๐ฎ๐๐ฎ.
The demo of the week this week is ๐ช๐ฒ๐ฎ๐๐ถ๐ฒ๐ ๐๐ต๐ฟ๐ผ๐ป๐ถ๐ฐ๐น๐ฒ๐, a newspaper-themed UI that fetches live news headlines, embeds them in real time, and automatically clusters them into stories.
Here's what's happening under the hood:
Live news articles are fetched, embedded, and stored in @weaviate_io. Then a ๐ต๐๐ฏ๐ฟ๐ถ๐ฑ ๐๐ถ๐บ๐ถ๐น๐ฎ๐ฟ๐ถ๐๐ function scores how closely each pair of articles relates โ combining dense vector (semantic) similarity with character n-gram similarity. The n-gram side catches surface-level matches: shared entity names, proper nouns, specific phrases that pure semantic search might blur together.
Those similarity scores form a weighted graph. Then the ๐๐ฒ๐ถ๐ฑ๐ฒ๐ป ๐ฎ๐น๐ด๐ผ๐ฟ๐ถ๐๐ต๐บ runs community detection over it to assign cluster memberships.
Each cluster then goes through a few more steps:
โข The most central article (closest in meaning to all others) is selected as the cluster's representative
โข An AI model reads the top five headlines and generates a short topic summary
โข Clusters expire after 8 hours without a new article, keeping the index fresh
And then there's the search bar with a tunable slider between three modes:
1๏ธโฃ Keyword (BM25)
2๏ธโฃ Balanced hybrid
3๏ธโฃ Pure semantic
It's a really clean way to see how the same query returns different results depending on the search mode you chose.
Try it here: https://t.co/cXfFP4OVuK
Thereโs also a copy prompt so you can easily start building yourself!
Memento-Skills is a self-evolving skill harness. It's a great paper to understand loop-engineered agent systems.
The editable surface is the skill library.
The feedback signal is judge-scored task success.
Learning happens through external artifact mutation.
Observe -> Read Skill -> Execute -> Judge/Feedback -> Write Skill Update
The router selects a skill for the current task. A query about โextracting patient medication historyโ and a query about โsummarizing patient historyโ may look semantically close, but the correct execution behavior is different.
Their proposed router is trained for behavioral similarity which I find very useful:
- generate synthetic user goals for each skill
- generate hard negatives that share terminology but should not use that skill
- train with multi-positive InfoNCE (contrastive training objective for retrieval)
interpret the embedding score as a soft Q-value
- route to the skill most likely to produce successful execution
So the router is asking: โWhich skill historically leads to the right outcome for this kind of task?โ
The repo also has a gateway layer between the agent and skill execution. The execution engine is a ReAct-style loop: build messages, expose allowed tools, let the LLM act, process tool results, update state, and stop on success/failure/loop conditions.
That makes execution observable. The system can tell whether a skill produced artifacts, repeated itself, got blocked by policy, timed out, or claimed success without evidence.
They also have Reflector / Judge. In benchmark mode, the judge knows the gold answer. In production, this could be an eval rubric, human review, unit test, or downstream signal.
The Skill Evolution Engine then has three update modes:
- Utility update: track empirical success rate per skill.
- Optimize existing skill: patch prompts, code, or guardrails for a specific failure mode.
- Discover new skill: create a new skill when repeated failures show the current abstraction is wrong.
The skill is effectively policy. So the model stays fixed, but the action distribution changes because the retrieved skill changes.
Turning skills from static context files into versioned, evaluated, self-improving policy artifacts is still hard. But I expect coding platforms to integrate this into user-facing systems soon.
As frontier model access gets constrained by pricing, policy, rate limits, and availability shocks, more performance will have to come from the substrate around the model.
Building a useful agent is getting easier. Running them in production is still hard.
We built Managed Deep Agents so your team can focus on agent behavior instead of rebuilding the runtime around it.
https://t.co/wiQVO5luru
how to add an LLM council into your /goal workflow
/goal is the easiest way into agent looping. it works best when you already know what you want and have enough context to start, the agent fills in the rest as it goes
what I added recently is a review between models inside the run. one model hosts the loop, another sits as the reviewer and checks the work before it moves on
today I ran an AI workflow mapping for a client. GPT 5.5 xhigh hosted the loop on codex CLI, and I put a review gate in two places, first on the plan, then on every delivery. claude opus 4.8 high did the reviewing (bring back fable pls) and fed notes back to the main agent. the output came back noticeably sharper than a single-model run
under the hood its just the codex model calling the claude CLI. nothing fancy. you write a skill that goes both ways, one that lets codex call claude and one that lets claude call codex, and you have a council
the value is simple, a second model catches what the one driving the loop talks itself into. easy to set up, and it improves the output every time
One of the best designers Iโve ever worked with is now a principal engineer.
He wants to stay anonymous, but he now designs and builds 95%+ in coding harnesses and the terminal.
His workflow is basically:
โ Get AI to create a design md first
โ Ask AI to generate the components
โ Give AI feedback until it feels right (taste!)
He thinks doing this is now basically a core skill for designers (and frankly any builder).
Not saying you shouldn't design in Figma, but I agree with him that it's important to learn the above too.
Model strategy for @harvey:
We are working on the first model in our legal foundation model series, inspired by @cursor_ai's Composer. Two goals:
1. Allow us to serve frontier intelligence across our product surface areas at an affordable price and a strong security posture.
2. Create the foundations for law firms to build their own specialized models and own their own intelligence.
The model series will focus on complex client matters that span months and take dozens of associates. The agentic system will learn to control legal tech tools, sub agents and ask for help from frontier models or human partners, much like a senior associate.
Weโve open sourced benchmarks for evaluating our initial post training work that represents work done by associates and in-house lawyers. We are scaling these significantly using synthetic and human pipelines as well as building private evals for firms.
Open sourcing this data has allowed us to quickly validate the feasibility of post training open weight models for legal work. With our research partners weโve already shown promising results post training open source models to approach frontier performance:
1. @baseten - novel compaction strategies for analyzing large data rooms.
2. @FireworksAI_HQ - matching frontier performance by using frontier as an advisor.
3. @appliedcompute - improving performance and reducing cost of large scale review tables.
4. @trajectorylabs & @nvidia - sovereign continual learning over client matters.
We plan to continue to invest heavily in working with research partners and open sourcing our data, models and research as much as possible. We believe open research in legal will be important to building trust in the frontier ecosystem.
We are also scaling our research team. Harvey Labs is our internal research group, responsible for pushing the frontier of legal intelligence and working closely with labs, research partners, and academia to bring the frontier of agent research into Harvey.
Labs is run by @nikogrupen and @ItsJulioPereyra - Niko worked on multi-agent RL at Google Brain and Julio clerked and worked in BigLaw. We believe this pairing is crucial for building frontier legal AI systems. Together they have already made significant progress in scaling our data and training efforts.
The long term goal of Harvey Labs is to contribute to the research and infrastructure required for the legal industry to create a frontier ecosystem. We believe that the best version of legal super intelligence is one where each law firm, enterprise and government owns their own specialized version.
We are hiring for Harvey Labs across the post training, agent and data stack and open to acquiring talented teams / neolabs in this space. If interested please DM me.