Eugene Ostapkovich

5 days ago

Very good advice on self-improving agents. (bookmark it) This is something I am seeing in my own experiments with coding agents and harnesses for long-horizon tasks. What I have found is that stronger models do not always evolve better agents. The current believe in self-evolving agents is that a bigger model writes better prompt and skill edits, so devs put their best model in the evolver seat. New research shows that intuition is mostly wrong. The work separates two abilities that usually get conflated. Producing harness updates stays flat across model capability, so Qwen3.5-9B writes edits roughly as good as Claude Opus 4.6. Benefiting from those updates follows an inverted-U that peaks at mid-tier models, while weak models fail to even activate the edits and strong models have little headroom left. This is important to understand as it tells you where to spend. Put a cheap model on the evolver and your expensive model on the solver, because the gains land solver-side, not evolver-side. Paper: https://t.co/8kJwR7NhmV Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

omarsar0's tweet photo. Very good advice on self-improving agents.

(bookmark it)

This is something I am seeing in my own experiments with coding agents and harnesses for long-horizon tasks.

What I have found is that stronger models do not always evolve better agents.

The current believe in self-evolving agents is that a bigger model writes better prompt and skill edits, so devs put their best model in the evolver seat.

New research shows that intuition is mostly wrong.

The work separates two abilities that usually get conflated. Producing harness updates stays flat across model capability, so Qwen3.5-9B writes edits roughly as good as Claude Opus 4.6. Benefiting from those updates follows an inverted-U that peaks at mid-tier models, while weak models fail to even activate the edits and strong models have little headroom left.

This is important to understand as it tells you where to spend. Put a cheap model on the evolver and your expensive model on the solver, because the gains land solver-side, not evolver-side.

Paper: https://t.co/8kJwR7NhmV

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

740

109

55K

SPDenver retweeted

Ben Miller

@bensen

4 days ago

OpenClaw runs natively on Windows leveraging MXC "The Windows node and gateway run contained, so your system stays secure. You can easily install and use @OpenClaw in @Windows with its own companion app and set up your own claws or connect to existing ones, available in open-source. We are invested in continuing to make #OpenClaw run securely on Windows." #MSBuild

bensen's tweet photo. OpenClaw runs natively on Windows leveraging MXC
"The Windows node and gateway run contained, so your system stays secure. You can easily install and use @OpenClaw in @Windows with its own companion app and set up your own claws or connect to existing ones, available in open-source. We are invested in continuing to make #OpenClaw run securely on Windows." #MSBuild

SPDenver retweeted

Boris Cherny

@bcherny

16 days ago

In the next version of Claude Code: run /usage to see a breakdown of which Skills, Agents, MCPs, and Plugins are using your tokens CLI today, coming to Desktop next

bcherny's tweet photo. In the next version of Claude Code: run /usage to see a breakdown of which Skills, Agents, MCPs, and Plugins are using your tokens

CLI today, coming to Desktop next https://t.co/HK8XQO6bBA

338

388

282K

Federal CTO @PlanetTech | Author | International Speaker | Football (soccer) player/coach | #SharePoint #Azure #O365 #DevOps

30 days ago

What's New in Aspire 13.3 https://t.co/VGBvMAURZb

Who to follow

Patrick Curran

@PCfromDC

Mark

@delsk

My opinions are 100% my own.

Phil Harding

@phillipharding

Architect, Developer, Consultant, #Office365, #SPFx, #SharePoint, #Teams, #PowerAutomate, #Flow, #Azure

about 1 month ago

This is one way to solve it with auto-memory 👇 https://t.co/ti6TVx5sC4

SPDenver retweeted

about 1 month ago

// Agentic Harness Engineering // Pay attention to this one, AI devs. (bookmark it) Most coding-agent harnesses are still tuned by hand or brittle trial-and-error self-evolution. This new work introduces Agentic Harness Engineering, a framework that makes harness evolution observable. They do this through three layers: components as revertible files, experience as condensed evidence from millions of trajectory tokens, and decisions as falsifiable predictions checked against task outcomes. Each edit becomes a contract you can verify or revert. Results: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations, beating human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. The evolved harness also transfers across model families with +5.1 to +10.1 point gains, while using 12% fewer tokens than the seed on SWE-bench-verified. Harness work is the biggest hidden cost in most agent systems. This is the first credible recipe for letting the harness improve itself without drifting into noise. Paper: https://t.co/9fEgqwlTSf Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

omarsar0's tweet photo. // Agentic Harness Engineering //

Pay attention to this one, AI devs.

(bookmark it)

Most coding-agent harnesses are still tuned by hand or brittle trial-and-error self-evolution.

This new work introduces Agentic Harness Engineering, a framework that makes harness evolution observable. They do this through three layers: components as revertible files, experience as condensed evidence from millions of trajectory tokens, and decisions as falsifiable predictions checked against task outcomes.

Each edit becomes a contract you can verify or revert.

Results: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations, beating human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO.

The evolved harness also transfers across model families with +5.1 to +10.1 point gains, while using 12% fewer tokens than the seed on SWE-bench-verified.

Harness work is the biggest hidden cost in most agent systems. This is the first credible recipe for letting the harness improve itself without drifting into noise.

Paper: https://t.co/9fEgqwlTSf

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

233

140K

about 1 month ago

A2A v1 Is Here: Cross-Platform Agent Communication in Microsoft Agent Framework for .NET https://t.co/dP9VuNgQUd

SPDenver retweeted

.NET @dotnet

about 1 month ago

Microsoft Agent Framework 1.0 is here — stable APIs, multi‑agent orchestration, long‑running workflows, and full C# + Python support. The VS Code Foundry Toolkit adds “Create Agent,” built‑in skills, & Agent Inspector for visual debugging. Catch the demo: https://t.co/OKCvwOiln7

dotnet's tweet photo. Microsoft Agent Framework 1.0 is here — stable APIs, multi‑agent orchestration, long‑running workflows, and full C# + Python support. The VS Code Foundry Toolkit adds “Create Agent,” built‑in skills, & Agent Inspector for visual debugging. Catch the demo: https://t.co/OKCvwOiln7 https://t.co/rfjpoaHifn

163

10K

SPDenver retweeted

about 1 month ago

// Tool Attention Is All You Need // New research proposes a practical fix for the hidden "MCP tax." The work introduces a dynamic tool gating mechanism built on an Intent Schema Overlap score from sentence embeddings, paired with a state-aware gating function that enforces preconditions and access scopes. A two-phase lazy schema loader keeps a compact summary pool in context and only promotes full JSON schemas for the top-k gated tools. On a simulated 120-tool benchmark, tool tokens dropped from 47.3k to 2.4k per turn (95% reduction) while effective context utilization rose from 24% to 91%. Why does it matter? As MCP ecosystems grow, naive tool exposure will silently wreck both cost and reasoning quality. Dynamic tool gating and lazy schema might help your setup. Paper: https://t.co/ak4Koy93Ah Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

omarsar0's tweet photo. // Tool Attention Is All You Need //

New research proposes a practical fix for the hidden "MCP tax."

The work introduces a dynamic tool gating mechanism built on an Intent Schema Overlap score from sentence embeddings, paired with a state-aware gating function that enforces preconditions and access scopes.

A two-phase lazy schema loader keeps a compact summary pool in context and only promotes full JSON schemas for the top-k gated tools.

On a simulated 120-tool benchmark, tool tokens dropped from 47.3k to 2.4k per turn (95% reduction) while effective context utilization rose from 24% to 91%.

Why does it matter?

As MCP ecosystems grow, naive tool exposure will silently wreck both cost and reasoning quality. Dynamic tool gating and lazy schema might help your setup.

Paper: https://t.co/ak4Koy93Ah

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

218

224

29K

SPDenver retweeted

about 2 months ago

Karpathy's autoresearch repo started an impressive trend. Agents can now train AI models to build SoTA agentic systems. And to think this is just scratching the surface. Ultimately, it boils down to good research questions or hypotheses. LLMs are not great at this (yet).

361

449

77K

SPDenver retweeted

about 2 months ago

NEW paper from Microsoft Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded? Microsoft researchers introduce the Universal Verifier, which discusses lessons learned from building best-in-class verifiers for web tasks. It's built on four principles: non-overlapping rubrics, separate process vs. outcome rewards, distinguishing controllable from uncontrollable failures, and divide-and-conquer context management across full screenshot trajectories. It reduces false positive rates to near zero, down from 45%+ (WebVoyager) and 22%+ (WebJudge). Without reliable verifiers, both benchmarks and training data are corrupted. One interesting finding is that an auto-research agent reached 70% of expert verifier quality in 5% of the time, but couldn't discover the structural design decisions that drove the biggest gains. Human expertise and automated optimization play complementary roles. Paper: https://t.co/fWhG9I8vPP Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

omarsar0's tweet photo. NEW paper from Microsoft

Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded?

Microsoft researchers introduce the Universal Verifier, which discusses lessons learned from building best-in-class verifiers for web tasks.

It's built on four principles: non-overlapping rubrics, separate process vs. outcome rewards, distinguishing controllable from uncontrollable failures, and divide-and-conquer context management across full screenshot trajectories.

It reduces false positive rates to near zero, down from 45%+ (WebVoyager) and 22%+ (WebJudge).

Without reliable verifiers, both benchmarks and training data are corrupted.

One interesting finding is that an auto-research agent reached 70% of expert verifier quality in 5% of the time, but couldn't discover the structural design decisions that drove the biggest gains. Human expertise and automated optimization play complementary roles.

Paper: https://t.co/fWhG9I8vPP

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

335

344

38K

2 months ago

Microsoft Agent Framework Version 1.0 https://t.co/mfRcVYcv4g

SPDenver retweeted

ollama

@ollama

2 months ago

Visual Studio Code now integrates with Ollama via GitHub Copilot. If you have Ollama installed, any local or cloud model from Ollama can be selected for use within Visual Studio Code.

ollama's tweet photo. Visual Studio Code now integrates with Ollama via GitHub Copilot.

If you have Ollama installed, any local or cloud model from Ollama can be selected for use within Visual Studio Code. https://t.co/11BqI12oSV

117

511

347K

SPDenver retweeted

Anthropic

@AnthropicAI

2 months ago

New on the Anthropic Engineering Blog: How we use a multi-agent harness to push Claude further in frontend design and long-running autonomous software engineering. Read more: https://t.co/HWvmXk1ykn

316

922

SPDenver retweeted

2 months ago

This is one of the most interesting papers on self-improving agents for this year. (bookmark this one) Most self-improving AI systems hit the same wall: the mechanism that generates improvements is fixed and can't improve itself. This new work from Meta and collaborators breaks through this limitation. They introduce Hyperagents, self-referential agents where the self-improvement process itself is editable. The DGM-Hyperagent combines a task agent and a meta agent into a single modifiable program, enabling metacognitive self-modification. It autonomously discovers innovations like persistent memory and performance tracking, and these meta-improvements transfer across domains and compound across runs. Why does it matter? - On paper review, DGM-H improved from 0.0 to 0.710 test accuracy. - On robotics reward design, it went from 0.060 to 0.372. - Transfer hyperagents achieved 0.630 on Olympiad-level math grading in a domain they were never trained on. This is a step toward AI systems that don't just find better solutions but continuously improve how they search for improvements. Paper: https://t.co/Q0f7zWhNMD Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

omarsar0's tweet photo. This is one of the most interesting papers on self-improving agents for this year.

(bookmark this one)

Most self-improving AI systems hit the same wall: the mechanism that generates improvements is fixed and can't improve itself.

This new work from Meta and collaborators breaks through this limitation.

They introduce Hyperagents, self-referential agents where the self-improvement process itself is editable.

The DGM-Hyperagent combines a task agent and a meta agent into a single modifiable program, enabling metacognitive self-modification.

It autonomously discovers innovations like persistent memory and performance tracking, and these meta-improvements transfer across domains and compound across runs.

Why does it matter?

- On paper review, DGM-H improved from 0.0 to 0.710 test accuracy.

- On robotics reward design, it went from 0.060 to 0.372.

- Transfer hyperagents achieved 0.630 on Olympiad-level math grading in a domain they were never trained on.

This is a step toward AI systems that don't just find better solutions but continuously improve how they search for improvements.

Paper: https://t.co/Q0f7zWhNMD

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

543

113

815

42K

2 months ago

Build a real-world example with Microsoft Agent Framework, Microsoft Foundry, MCP and Aspire https://t.co/YdllGQv5Re

3 months ago

Azure DevOps Remote MCP Server (public preview) https://t.co/uh6puhANHf

SPDenver retweeted

3 months ago

Pay attention to this one if you are building terminal-based coding agents. OpenDev is an 81-page paper covering scaffolding, harness design, context engineering, and hard-won lessons from building CLI coding agents. It introduces a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction. The industry is shifting from IDE plugins to terminal-native agents. Claude Code, Codex CLI, and others have proven the model works. This paper formalizes the design patterns that make these systems reliable, covering topics like event-driven system reminders to counteract instruction fade-out, automated memory across sessions, and strict safety controls for autonomous operation. Paper: https://t.co/tpAZFaSnog Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

omarsar0's tweet photo. Pay attention to this one if you are building terminal-based coding agents.

OpenDev is an 81-page paper covering scaffolding, harness design, context engineering, and hard-won lessons from building CLI coding agents.

It introduces a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction.

The industry is shifting from IDE plugins to terminal-native agents.

Claude Code, Codex CLI, and others have proven the model works.

This paper formalizes the design patterns that make these systems reliable, covering topics like event-driven system reminders to counteract instruction fade-out, automated memory across sessions, and strict safety controls for autonomous operation.

Paper: https://t.co/tpAZFaSnog

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

284

145K