Eric Wilson

This "loop" automation is nuts inside of Codex. "/goal go over every single feature in this app create a user story with expected behaviour based on the code keep a single canonical spreadsheet tracking the features status - when done switch loop to testing every user story and documenting all errors - when done fix every logistical error or ux error - test every user behaviour again post fix" Shoutout to @MatthewBerman for the heads up. Hundreds of user stories being worked through like it's nothing.

tomosman's tweet photo. This "loop" automation is nuts inside of Codex.

"/goal go over every single feature in this app create a user story with expected behaviour based on the code keep a single canonical spreadsheet tracking the features status
- when done switch loop to testing every user story and documenting all errors
- when done fix every logistical error or ux error
- test every user behaviour again post fix"

Shoutout to @MatthewBerman for the heads up.

Hundreds of user stories being worked through like it's nothing.

141

459K

CalaspellEd retweeted

Hugo Baraúna

@hugobarauna

3 days ago

https://t.co/bftaI6Levm

428

951

43K

CalaspellEd retweeted

Viv

@Vtrivedy10

2 days ago

very timely and as always super clear content from the HF team as companies start to consider fine-tuning models for cost/perf reasons SFT is the first thing you’ll come across “Imitation: Make this look like that” if you have a working agent system, are producing “good” traces, and want to use a much cheaper model then SFT can help you transfer knowledge from those good traces into another model —> ideally saving you a lot of time/money!

Who to follow

Kalyan Kumar Pichuka

@PichukaKumar

Senior Data Scientist. Passionate about solving complex problems. I am one who thinks Maths is fun

rancio

@r_rancio

AI follower! geology and geophysics data scientist!

David Utt

@utt_david

Practitioner of the Dark of Arts of Design. Former @RITtigers @mfadt. Currently - Senior UX Designer @wolterskluwer. Hug more often

CalaspellEd retweeted

Jamin Ball

@jaminball

2 days ago

https://t.co/zMyDgMMdOc

166

416

43K

CalaspellEd retweeted

Data Science Dojo

@DataScienceDojo

3 days ago

🤖 A coding agent is not like a regular agent. A regular agent just reasons and responds. A coding agent has to navigate real codebases, run shell commands, track file changes across a session, and loop back when something breaks. That's a different job, and it needs a different system underneath it. and that system is the harness: everything built around the model so it can actually do the job, not just talk about it. The model alone doesn't know your repo, your tools, or what happened three turns ago. The harness is what does. It has 6 components, and each one fixes a different way agents fail. Swipe through to see what they are. #AIAgents #CodingAgents #DeveloperTools #AgenticAI #SoftwareEngineering #ClaudeCode

DataScienceDojo's tweet photo. 🤖 A coding agent is not like a regular agent.

A regular agent just reasons and responds. A coding agent has to navigate real codebases, run shell commands, track file changes across a session, and loop back when something breaks. That's a different job, and it needs a different system underneath it.

and that system is the harness: everything built around the model so it can actually do the job, not just talk about it. The model alone doesn't know your repo, your tools, or what happened three turns ago. The harness is what does.

It has 6 components, and each one fixes a different way agents fail. Swipe through to see what they are.

#AIAgents #CodingAgents #DeveloperTools #AgenticAI #SoftwareEngineering #ClaudeCode

CalaspellEd retweeted

Dan Farrelly | Inngest.com

@djfarrelly

3 days ago

https://t.co/hewvRXczwq

799

104

559K

CalaspellEd retweeted

Cobus Greyling

@CobusGreylingZA

3 days ago

First we had prompt engineering, then context engineering, followed by harness engineering. Then loop engineering, now we have fleet engineering...

CalaspellEd retweeted

Matthew Berman

@MatthewBerman

3 days ago

Just launched Loop Library - a curated list of agent loops you can use right now. Find loops, submit your own, tokenmaxx!! https://t.co/7bVzOyZMrt

117

350

830K

Eric Wilson @CalaspellEd

3 days ago

@JebraFaushay more masturbatory nonsense from gen x... shut up

CalaspellEd retweeted

Movez

@0xMovez

4 days ago

Anthropic Product Lead: "At Anthropic, our engineers are running swarms of 300+ agents daily. Give your agents 100+ tools - just don’t load them all into context." In a 30-minute talk, the Anthropic team shows how to deploy agents to production. Claude + loops + routines + dynamic workflows - that’s the secret. Watch the talk, then save the playbook below.

133

239K

CalaspellEd retweeted

Vaishnavi

@_vmlops

3 days ago

SOMEONE ON REDDIT JUST DROPPED THE CLEANEST AI AGENT BREAKDOWN I'VE SEEN. 7 steps, no fluff: ▫️ set a measurable goal before touching any model ▫️ match the model to the task LLM for general, LRM for complex reasoning, SLM for routing ▫️ pick a framework: LangChain, CrewAI, n8n, or Google ADK ▫️ add memory or your agent resets every single session ▫️ connect tools via MCP or function calling ▫️ compress context before costs spiral ▫️ test edge cases, not just the happy path the part nobody mentions: steps 4 and 7 are where most agents die in production memory makes it intelligent... testing makes it production-ready... Skip both and your agent works in the demo, breaks in real life Agentic AI is a $9B market growing 46% annually the engineers who actually know how to build these won't be available at current rates for long

_vmlops's tweet photo. SOMEONE ON REDDIT JUST DROPPED THE CLEANEST AI AGENT BREAKDOWN I'VE SEEN.

7 steps, no fluff:

▫️ set a measurable goal before touching any model
▫️ match the model to the task LLM for general, LRM for complex reasoning, SLM for routing
▫️ pick a framework: LangChain, CrewAI, n8n, or Google ADK
▫️ add memory or your agent resets every single session
▫️ connect tools via MCP or function calling
▫️ compress context before costs spiral
▫️ test edge cases, not just the happy path

the part nobody mentions: steps 4 and 7 are where most agents die in production

memory makes it intelligent...
testing makes it production-ready... Skip both and your agent works in the demo, breaks in real life

Agentic AI is a $9B market growing 46% annually

the engineers who actually know how to build these won't be available at current rates for long

CalaspellEd retweeted

Victoria Slocum

@victorialslocum

3 days ago

Vector embeddings do more than just search. 𝗛𝗲𝗿𝗲'𝘀 𝗵𝗼𝘄 𝗴𝗿𝗮𝗽𝗵-𝗯𝗮𝘀𝗲𝗱 𝗰𝗹𝘂𝘀𝘁𝗲𝗿𝗶𝗻𝗴 + 𝗵𝘆𝗯𝗿𝗶𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝘄𝗼𝗿𝗸 𝘁𝗼𝗴𝗲𝘁𝗵𝗲𝗿 𝘁𝗼 𝗰𝗹𝘂𝘀𝘁𝗲𝗿 𝗹𝗶𝘃𝗲 𝗻𝗲𝘄𝘀 𝗱𝗮𝘁𝗮. The demo of the week this week is 𝗪𝗲𝗮𝘃𝗶𝗲𝘄 𝗖𝗵𝗿𝗼𝗻𝗶𝗰𝗹𝗲𝘀, a newspaper-themed UI that fetches live news headlines, embeds them in real time, and automatically clusters them into stories. Here's what's happening under the hood: Live news articles are fetched, embedded, and stored in @weaviate_io. Then a 𝗵𝘆𝗯𝗿𝗶𝗱 𝘀𝗶𝗺𝗶𝗹𝗮𝗿𝗶𝘁𝘆 function scores how closely each pair of articles relates — combining dense vector (semantic) similarity with character n-gram similarity. The n-gram side catches surface-level matches: shared entity names, proper nouns, specific phrases that pure semantic search might blur together. Those similarity scores form a weighted graph. Then the 𝗟𝗲𝗶𝗱𝗲𝗻 𝗮𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺 runs community detection over it to assign cluster memberships. Each cluster then goes through a few more steps: • The most central article (closest in meaning to all others) is selected as the cluster's representative • An AI model reads the top five headlines and generates a short topic summary • Clusters expire after 8 hours without a new article, keeping the index fresh And then there's the search bar with a tunable slider between three modes: 1️⃣ Keyword (BM25) 2️⃣ Balanced hybrid 3️⃣ Pure semantic It's a really clean way to see how the same query returns different results depending on the search mode you chose. Try it here: https://t.co/cXfFP4OVuK There’s also a copy prompt so you can easily start building yourself!

victorialslocum's tweet photo. Vector embeddings do more than just search. 𝗛𝗲𝗿𝗲'𝘀 𝗵𝗼𝘄 𝗴𝗿𝗮𝗽𝗵-𝗯𝗮𝘀𝗲𝗱 𝗰𝗹𝘂𝘀𝘁𝗲𝗿𝗶𝗻𝗴 + 𝗵𝘆𝗯𝗿𝗶𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝘄𝗼𝗿𝗸 𝘁𝗼𝗴𝗲𝘁𝗵𝗲𝗿 𝘁𝗼 𝗰𝗹𝘂𝘀𝘁𝗲𝗿 𝗹𝗶𝘃𝗲 𝗻𝗲𝘄𝘀 𝗱𝗮𝘁𝗮.

The demo of the week this week is 𝗪𝗲𝗮𝘃𝗶𝗲𝘄 𝗖𝗵𝗿𝗼𝗻𝗶𝗰𝗹𝗲𝘀, a newspaper-themed UI that fetches live news headlines, embeds them in real time, and automatically clusters them into stories.

Here's what's happening under the hood:

Live news articles are fetched, embedded, and stored in @weaviate_io. Then a 𝗵𝘆𝗯𝗿𝗶𝗱 𝘀𝗶𝗺𝗶𝗹𝗮𝗿𝗶𝘁𝘆 function scores how closely each pair of articles relates — combining dense vector (semantic) similarity with character n-gram similarity. The n-gram side catches surface-level matches: shared entity names, proper nouns, specific phrases that pure semantic search might blur together.

Those similarity scores form a weighted graph. Then the 𝗟𝗲𝗶𝗱𝗲𝗻 𝗮𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺 runs community detection over it to assign cluster memberships.

Each cluster then goes through a few more steps:
• The most central article (closest in meaning to all others) is selected as the cluster's representative
• An AI model reads the top five headlines and generates a short topic summary
• Clusters expire after 8 hours without a new article, keeping the index fresh

And then there's the search bar with a tunable slider between three modes:
1️⃣ Keyword (BM25)
2️⃣ Balanced hybrid
3️⃣ Pure semantic

It's a really clean way to see how the same query returns different results depending on the search mode you chose.

Try it here: https://t.co/cXfFP4OVuK

There’s also a copy prompt so you can easily start building yourself!

CalaspellEd retweeted

Muratcan Koylan

@koylanai

4 days ago

Memento-Skills is a self-evolving skill harness. It's a great paper to understand loop-engineered agent systems. The editable surface is the skill library. The feedback signal is judge-scored task success. Learning happens through external artifact mutation. Observe -> Read Skill -> Execute -> Judge/Feedback -> Write Skill Update The router selects a skill for the current task. A query about “extracting patient medication history” and a query about “summarizing patient history” may look semantically close, but the correct execution behavior is different. Their proposed router is trained for behavioral similarity which I find very useful: - generate synthetic user goals for each skill - generate hard negatives that share terminology but should not use that skill - train with multi-positive InfoNCE (contrastive training objective for retrieval) interpret the embedding score as a soft Q-value - route to the skill most likely to produce successful execution So the router is asking: “Which skill historically leads to the right outcome for this kind of task?” The repo also has a gateway layer between the agent and skill execution. The execution engine is a ReAct-style loop: build messages, expose allowed tools, let the LLM act, process tool results, update state, and stop on success/failure/loop conditions. That makes execution observable. The system can tell whether a skill produced artifacts, repeated itself, got blocked by policy, timed out, or claimed success without evidence. They also have Reflector / Judge. In benchmark mode, the judge knows the gold answer. In production, this could be an eval rubric, human review, unit test, or downstream signal. The Skill Evolution Engine then has three update modes: - Utility update: track empirical success rate per skill. - Optimize existing skill: patch prompts, code, or guardrails for a specific failure mode. - Discover new skill: create a new skill when repeated failures show the current abstraction is wrong. The skill is effectively policy. So the model stays fixed, but the action distribution changes because the retrieved skill changes. Turning skills from static context files into versioned, evaluated, self-improving policy artifacts is still hard. But I expect coding platforms to integrate this into user-facing systems soon. As frontier model access gets constrained by pricing, policy, rate limits, and availability shocks, more performance will have to come from the substrate around the model.

koylanai's tweet photo. Memento-Skills is a self-evolving skill harness. It's a great paper to understand loop-engineered agent systems.

The editable surface is the skill library.
The feedback signal is judge-scored task success.
Learning happens through external artifact mutation.

Observe -> Read Skill -> Execute -> Judge/Feedback -> Write Skill Update

The router selects a skill for the current task. A query about “extracting patient medication history” and a query about “summarizing patient history” may look semantically close, but the correct execution behavior is different.

Their proposed router is trained for behavioral similarity which I find very useful:
- generate synthetic user goals for each skill
- generate hard negatives that share terminology but should not use that skill
- train with multi-positive InfoNCE (contrastive training objective for retrieval)
interpret the embedding score as a soft Q-value
- route to the skill most likely to produce successful execution

So the router is asking: “Which skill historically leads to the right outcome for this kind of task?”

The repo also has a gateway layer between the agent and skill execution. The execution engine is a ReAct-style loop: build messages, expose allowed tools, let the LLM act, process tool results, update state, and stop on success/failure/loop conditions.

That makes execution observable. The system can tell whether a skill produced artifacts, repeated itself, got blocked by policy, timed out, or claimed success without evidence.

They also have Reflector / Judge. In benchmark mode, the judge knows the gold answer. In production, this could be an eval rubric, human review, unit test, or downstream signal.

The Skill Evolution Engine then has three update modes:
- Utility update: track empirical success rate per skill.
- Optimize existing skill: patch prompts, code, or guardrails for a specific failure mode.
- Discover new skill: create a new skill when repeated failures show the current abstraction is wrong.

The skill is effectively policy. So the model stays fixed, but the action distribution changes because the retrieved skill changes.

Turning skills from static context files into versioned, evaluated, self-improving policy artifacts is still hard. But I expect coding platforms to integrate this into user-facing systems soon.

As frontier model access gets constrained by pricing, policy, rate limits, and availability shocks, more performance will have to come from the substrate around the model.

CalaspellEd retweeted

Shubham Saboo

@Saboo_Shubham_

4 days ago

MUST READ. Google's new guide on building and evaluating Agent Skills. It also covers meta-skills and self-improving Agent skills. 100% free.

Saboo_Shubham_'s tweet photo. MUST READ.

Google's new guide on building and evaluating Agent Skills.

It also covers meta-skills and self-improving Agent skills.

100% free. https://t.co/RjIAmZrP0d

396

486

21K

CalaspellEd retweeted

LangChain

@LangChain

3 days ago

Building a useful agent is getting easier. Running them in production is still hard. We built Managed Deep Agents so your team can focus on agent behavior instead of rebuilding the runtime around it. https://t.co/wiQVO5luru

CalaspellEd retweeted

Shann³

@shannholmberg

4 days ago

how to add an LLM council into your /goal workflow /goal is the easiest way into agent looping. it works best when you already know what you want and have enough context to start, the agent fills in the rest as it goes what I added recently is a review between models inside the run. one model hosts the loop, another sits as the reviewer and checks the work before it moves on today I ran an AI workflow mapping for a client. GPT 5.5 xhigh hosted the loop on codex CLI, and I put a review gate in two places, first on the plan, then on every delivery. claude opus 4.8 high did the reviewing (bring back fable pls) and fed notes back to the main agent. the output came back noticeably sharper than a single-model run under the hood its just the codex model calling the claude CLI. nothing fancy. you write a skill that goes both ways, one that lets codex call claude and one that lets claude call codex, and you have a council the value is simple, a second model catches what the one driving the loop talks itself into. easy to set up, and it improves the output every time

shannholmberg's tweet photo. how to add an LLM council into your /goal workflow

/goal is the easiest way into agent looping. it works best when you already know what you want and have enough context to start, the agent fills in the rest as it goes

what I added recently is a review between models inside the run. one model hosts the loop, another sits as the reviewer and checks the work before it moves on

today I ran an AI workflow mapping for a client. GPT 5.5 xhigh hosted the loop on codex CLI, and I put a review gate in two places, first on the plan, then on every delivery. claude opus 4.8 high did the reviewing (bring back fable pls) and fed notes back to the main agent. the output came back noticeably sharper than a single-model run

under the hood its just the codex model calling the claude CLI. nothing fancy. you write a skill that goes both ways, one that lets codex call claude and one that lets claude call codex, and you have a council

the value is simple, a second model catches what the one driving the loop talks itself into. easy to set up, and it improves the output every time

307

432

29K

CalaspellEd retweeted

Peter Wang

@BrainsAndTennis

4 days ago

https://t.co/DVyPORQgPe

186

416

45K

CalaspellEd retweeted

Peter Yang

@petergyang

4 days ago

One of the best designers I’ve ever worked with is now a principal engineer. He wants to stay anonymous, but he now designs and builds 95%+ in coding harnesses and the terminal. His workflow is basically: → Get AI to create a design md first → Ask AI to generate the components → Give AI feedback until it feels right (taste!) He thinks doing this is now basically a core skill for designers (and frankly any builder). Not saying you shouldn't design in Figma, but I agree with him that it's important to learn the above too.

577

646

78K

CalaspellEd retweeted

Gabe Pereyra

@gabepereyra

4 days ago

Model strategy for @harvey: We are working on the first model in our legal foundation model series, inspired by @cursor_ai's Composer. Two goals: 1. Allow us to serve frontier intelligence across our product surface areas at an affordable price and a strong security posture. 2. Create the foundations for law firms to build their own specialized models and own their own intelligence. The model series will focus on complex client matters that span months and take dozens of associates. The agentic system will learn to control legal tech tools, sub agents and ask for help from frontier models or human partners, much like a senior associate. We’ve open sourced benchmarks for evaluating our initial post training work that represents work done by associates and in-house lawyers. We are scaling these significantly using synthetic and human pipelines as well as building private evals for firms. Open sourcing this data has allowed us to quickly validate the feasibility of post training open weight models for legal work. With our research partners we’ve already shown promising results post training open source models to approach frontier performance: 1. @baseten - novel compaction strategies for analyzing large data rooms. 2. @FireworksAI_HQ - matching frontier performance by using frontier as an advisor. 3. @appliedcompute - improving performance and reducing cost of large scale review tables. 4. @trajectorylabs & @nvidia - sovereign continual learning over client matters. We plan to continue to invest heavily in working with research partners and open sourcing our data, models and research as much as possible. We believe open research in legal will be important to building trust in the frontier ecosystem. We are also scaling our research team. Harvey Labs is our internal research group, responsible for pushing the frontier of legal intelligence and working closely with labs, research partners, and academia to bring the frontier of agent research into Harvey. Labs is run by @nikogrupen and @ItsJulioPereyra - Niko worked on multi-agent RL at Google Brain and Julio clerked and worked in BigLaw. We believe this pairing is crucial for building frontier legal AI systems. Together they have already made significant progress in scaling our data and training efforts. The long term goal of Harvey Labs is to contribute to the research and infrastructure required for the legal industry to create a frontier ecosystem. We believe that the best version of legal super intelligence is one where each law firm, enterprise and government owns their own specialized version. We are hiring for Harvey Labs across the post training, agent and data stack and open to acquiring talented teams / neolabs in this space. If interested please DM me.

865

879

206K

Eric Wilson

@CalaspellEd

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users