very timely and as always super clear content from the HF team as companies start to consider fine-tuning models for cost/perf reasons
SFT is the first thing you’ll come across
“Imitation: Make this look like that”
if you have a working agent system, are producing “good” traces, and want to use a much cheaper model
then SFT can help you transfer knowledge from those good traces into another model —> ideally saving you a lot of time/money!
🤖 A coding agent is not like a regular agent.
A regular agent just reasons and responds. A coding agent has to navigate real codebases, run shell commands, track file changes across a session, and loop back when something breaks. That's a different job, and it needs a different system underneath it.
and that system is the harness: everything built around the model so it can actually do the job, not just talk about it. The model alone doesn't know your repo, your tools, or what happened three turns ago. The harness is what does.
It has 6 components, and each one fixes a different way agents fail. Swipe through to see what they are.
#AIAgents #CodingAgents #DeveloperTools #AgenticAI #SoftwareEngineering #ClaudeCode
Anthropic Product Lead:
"At Anthropic, our engineers are running swarms of 300+ agents daily.
Give your agents 100+ tools - just don’t load them all into context."
In a 30-minute talk, the Anthropic team shows how to deploy agents to production.
Claude + loops + routines + dynamic workflows - that’s the secret.
Watch the talk, then save the playbook below.
SOMEONE ON REDDIT JUST DROPPED THE CLEANEST AI AGENT BREAKDOWN I'VE SEEN.
7 steps, no fluff:
▫️ set a measurable goal before touching any model
▫️ match the model to the task LLM for general, LRM for complex reasoning, SLM for routing
▫️ pick a framework: LangChain, CrewAI, n8n, or Google ADK
▫️ add memory or your agent resets every single session
▫️ connect tools via MCP or function calling
▫️ compress context before costs spiral
▫️ test edge cases, not just the happy path
the part nobody mentions: steps 4 and 7 are where most agents die in production
memory makes it intelligent...
testing makes it production-ready... Skip both and your agent works in the demo, breaks in real life
Agentic AI is a $9B market growing 46% annually
the engineers who actually know how to build these won't be available at current rates for long
Vector embeddings do more than just search. 𝗛𝗲𝗿𝗲'𝘀 𝗵𝗼𝘄 𝗴𝗿𝗮𝗽𝗵-𝗯𝗮𝘀𝗲𝗱 𝗰𝗹𝘂𝘀𝘁𝗲𝗿𝗶𝗻𝗴 + 𝗵𝘆𝗯𝗿𝗶𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝘄𝗼𝗿𝗸 𝘁𝗼𝗴𝗲𝘁𝗵𝗲𝗿 𝘁𝗼 𝗰𝗹𝘂𝘀𝘁𝗲𝗿 𝗹𝗶𝘃𝗲 𝗻𝗲𝘄𝘀 𝗱𝗮𝘁𝗮.
The demo of the week this week is 𝗪𝗲𝗮𝘃𝗶𝗲𝘄 𝗖𝗵𝗿𝗼𝗻𝗶𝗰𝗹𝗲𝘀, a newspaper-themed UI that fetches live news headlines, embeds them in real time, and automatically clusters them into stories.
Here's what's happening under the hood:
Live news articles are fetched, embedded, and stored in @weaviate_io. Then a 𝗵𝘆𝗯𝗿𝗶𝗱 𝘀𝗶𝗺𝗶𝗹𝗮𝗿𝗶𝘁𝘆 function scores how closely each pair of articles relates — combining dense vector (semantic) similarity with character n-gram similarity. The n-gram side catches surface-level matches: shared entity names, proper nouns, specific phrases that pure semantic search might blur together.
Those similarity scores form a weighted graph. Then the 𝗟𝗲𝗶𝗱𝗲𝗻 𝗮𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺 runs community detection over it to assign cluster memberships.
Each cluster then goes through a few more steps:
• The most central article (closest in meaning to all others) is selected as the cluster's representative
• An AI model reads the top five headlines and generates a short topic summary
• Clusters expire after 8 hours without a new article, keeping the index fresh
And then there's the search bar with a tunable slider between three modes:
1️⃣ Keyword (BM25)
2️⃣ Balanced hybrid
3️⃣ Pure semantic
It's a really clean way to see how the same query returns different results depending on the search mode you chose.
Try it here: https://t.co/cXfFP4OVuK
There’s also a copy prompt so you can easily start building yourself!
Memento-Skills is a self-evolving skill harness. It's a great paper to understand loop-engineered agent systems.
The editable surface is the skill library.
The feedback signal is judge-scored task success.
Learning happens through external artifact mutation.
Observe -> Read Skill -> Execute -> Judge/Feedback -> Write Skill Update
The router selects a skill for the current task. A query about “extracting patient medication history” and a query about “summarizing patient history” may look semantically close, but the correct execution behavior is different.
Their proposed router is trained for behavioral similarity which I find very useful:
- generate synthetic user goals for each skill
- generate hard negatives that share terminology but should not use that skill
- train with multi-positive InfoNCE (contrastive training objective for retrieval)
interpret the embedding score as a soft Q-value
- route to the skill most likely to produce successful execution
So the router is asking: “Which skill historically leads to the right outcome for this kind of task?”
The repo also has a gateway layer between the agent and skill execution. The execution engine is a ReAct-style loop: build messages, expose allowed tools, let the LLM act, process tool results, update state, and stop on success/failure/loop conditions.
That makes execution observable. The system can tell whether a skill produced artifacts, repeated itself, got blocked by policy, timed out, or claimed success without evidence.
They also have Reflector / Judge. In benchmark mode, the judge knows the gold answer. In production, this could be an eval rubric, human review, unit test, or downstream signal.
The Skill Evolution Engine then has three update modes:
- Utility update: track empirical success rate per skill.
- Optimize existing skill: patch prompts, code, or guardrails for a specific failure mode.
- Discover new skill: create a new skill when repeated failures show the current abstraction is wrong.
The skill is effectively policy. So the model stays fixed, but the action distribution changes because the retrieved skill changes.
Turning skills from static context files into versioned, evaluated, self-improving policy artifacts is still hard. But I expect coding platforms to integrate this into user-facing systems soon.
As frontier model access gets constrained by pricing, policy, rate limits, and availability shocks, more performance will have to come from the substrate around the model.
Building a useful agent is getting easier. Running them in production is still hard.
We built Managed Deep Agents so your team can focus on agent behavior instead of rebuilding the runtime around it.
https://t.co/wiQVO5luru
how to add an LLM council into your /goal workflow
/goal is the easiest way into agent looping. it works best when you already know what you want and have enough context to start, the agent fills in the rest as it goes
what I added recently is a review between models inside the run. one model hosts the loop, another sits as the reviewer and checks the work before it moves on
today I ran an AI workflow mapping for a client. GPT 5.5 xhigh hosted the loop on codex CLI, and I put a review gate in two places, first on the plan, then on every delivery. claude opus 4.8 high did the reviewing (bring back fable pls) and fed notes back to the main agent. the output came back noticeably sharper than a single-model run
under the hood its just the codex model calling the claude CLI. nothing fancy. you write a skill that goes both ways, one that lets codex call claude and one that lets claude call codex, and you have a council
the value is simple, a second model catches what the one driving the loop talks itself into. easy to set up, and it improves the output every time
One of the best designers I’ve ever worked with is now a principal engineer.
He wants to stay anonymous, but he now designs and builds 95%+ in coding harnesses and the terminal.
His workflow is basically:
→ Get AI to create a design md first
→ Ask AI to generate the components
→ Give AI feedback until it feels right (taste!)
He thinks doing this is now basically a core skill for designers (and frankly any builder).
Not saying you shouldn't design in Figma, but I agree with him that it's important to learn the above too.
Model strategy for @harvey:
We are working on the first model in our legal foundation model series, inspired by @cursor_ai's Composer. Two goals:
1. Allow us to serve frontier intelligence across our product surface areas at an affordable price and a strong security posture.
2. Create the foundations for law firms to build their own specialized models and own their own intelligence.
The model series will focus on complex client matters that span months and take dozens of associates. The agentic system will learn to control legal tech tools, sub agents and ask for help from frontier models or human partners, much like a senior associate.
We’ve open sourced benchmarks for evaluating our initial post training work that represents work done by associates and in-house lawyers. We are scaling these significantly using synthetic and human pipelines as well as building private evals for firms.
Open sourcing this data has allowed us to quickly validate the feasibility of post training open weight models for legal work. With our research partners we’ve already shown promising results post training open source models to approach frontier performance:
1. @baseten - novel compaction strategies for analyzing large data rooms.
2. @FireworksAI_HQ - matching frontier performance by using frontier as an advisor.
3. @appliedcompute - improving performance and reducing cost of large scale review tables.
4. @trajectorylabs & @nvidia - sovereign continual learning over client matters.
We plan to continue to invest heavily in working with research partners and open sourcing our data, models and research as much as possible. We believe open research in legal will be important to building trust in the frontier ecosystem.
We are also scaling our research team. Harvey Labs is our internal research group, responsible for pushing the frontier of legal intelligence and working closely with labs, research partners, and academia to bring the frontier of agent research into Harvey.
Labs is run by @nikogrupen and @ItsJulioPereyra - Niko worked on multi-agent RL at Google Brain and Julio clerked and worked in BigLaw. We believe this pairing is crucial for building frontier legal AI systems. Together they have already made significant progress in scaling our data and training efforts.
The long term goal of Harvey Labs is to contribute to the research and infrastructure required for the legal industry to create a frontier ecosystem. We believe that the best version of legal super intelligence is one where each law firm, enterprise and government owns their own specialized version.
We are hiring for Harvey Labs across the post training, agent and data stack and open to acquiring talented teams / neolabs in this space. If interested please DM me.
This is a fantastic article on how moats can be made with software and AI agents and aligns with my thesis linked in the replies
Bolting agents into systems of records is just a “state machine with APIs for agents” - it makes accessing the data easier
But real value lies in capturing the traces - the data on how state changed, ie how decisions were made and why actions were taken
This is the opportunity with the right context layers that sit above the systems of record and integrate silo’d sources of data
“That’s the context graph, and that will be the single most valuable asset for companies in the era of AI”
Once you have that context graph in place, you can use it to create self-improving loops that make agents produce better outcomes as they do more work
Vercel cooked something genuinely special here. 🤯
They open-sourced the exact framework they use to run 100+ AI agents internally. And the way it works changes how you think about building agents.
It's called Eve. An agent is a folder. Tools are files. Skills are markdown files. Channels are files. The folder structure IS your agent.
One command to start:
npx eve@latest init my-agent
No plumbing. No boilerplate. Eve handles durable execution, sandboxed compute, human approvals, evals, tracing, and deployment all built in.
Add a tool? Drop a TypeScript file. Add a skill? Drop a markdown file. Add Slack? One command. Add a schedule? One more file. Deploy it? vercel deploy.
How Vercel already runs on Eve:
→ Data analyst agent handles 30K+ questions per month in Slack
→ Sales agent costs $5K/year and returns 32x that
→ Support agent solves 92% of tickets on its own
→ 29% of all Vercel deployments now come from agents
Their bet: Next.js ended the era of hand-rolling websites. Eve ends the era of hand-rolling agents.