The team's been deep on the different shapes a browsing agent can take.
Agents fail in three ways on long runs: they quit early, grade their own work on a curve, and forget the rules after a compaction.
We'd been fighting them in a browser for months.
Three things we'd add from here 👇
Today we are launching Steel Skills.
Five agent skills for the web. Install one or the whole set. Runs in Claude Code, Cursor, Codex, opencode, Pi, or any compatible agent.
A safety net for the runs where your code or agent never gets the chance to clean up.
Read more about it in this writeup by @junhssss
https://t.co/Pc0S2YLHzX
Stop paying for a browser after the work is done or your agent dies.
Steel sessions now take an inactivity timeout: if the agent stops driving the browser for the window you set, Steel releases it and the meter stops.
Read more ↓
The honest fix was always "release the session yourself." And you should.
Now you can just set the timeout, and Steel handles the cleanup for you.
Just set the window longer than the longest gap you expect between commands.
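A rough sketch of the idea, not Steel's actual API: `IdleTrackedSession`, `reap_if_idle`, and `pick_idle_timeout` are hypothetical names, and the 1.5x margin is an assumption. It shows an idle clock that resets on every command, and one way to pick a window longer than the longest gap you have seen.

```python
import time
from typing import Callable, List


class IdleTrackedSession:
    """Hypothetical stand-in for a browser session; NOT Steel's API."""

    def __init__(self, idle_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock
        self.released = False
        self.last_command_at = clock()

    def reap_if_idle(self) -> bool:
        """Release the session if no command arrived within the window."""
        idle_for = self.clock() - self.last_command_at
        if not self.released and idle_for >= self.idle_timeout_s:
            self.released = True
        return self.released

    def command(self, name: str) -> str:
        """Run a command; every command resets the idle clock."""
        if self.reap_if_idle():
            raise RuntimeError("session already released")
        self.last_command_at = self.clock()
        return f"ran {name}"


def pick_idle_timeout(gaps_s: List[float], margin: float = 1.5) -> float:
    """Choose a window longer than the longest observed gap (margin is assumed)."""
    if not gaps_s:
        raise ValueError("need at least one observed gap")
    return max(gaps_s) * margin
```

If the longest pause between commands in past runs was 30 seconds, `pick_idle_timeout([2.0, 30.0, 5.0])` gives a 45 second window under these assumptions. Releasing the session yourself is still the first line of defense; the timeout only covers the runs that never get there.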
Most of what you ship to an agent gets compressed before it acts: docs, SDKs, blog posts, distilled by the model first.
Errors are the exception. They reach the agent intact.
We've been rebuilding ours around that. ↓
@0xbosta How we think about talking to an agent now, in three channels:
✦ hard error: it failed, here's the fix
✦ soft warning: it worked, but you'll regret it
✦ agent notes: ambient per-site context
Full writeup by @0xbosta ↓
https://t.co/DilGBvMrEt
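One way to picture the three channels as a response shape. This is an illustrative sketch only: `AgentResponse`, `render_for_agent`, and every field name are hypothetical, not Steel's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentResponse:
    """Hypothetical response envelope with the three channels."""
    ok: bool
    error: Optional[str] = None                         # hard error: what failed
    fix: Optional[str] = None                           # hard error: what to do next
    warnings: List[str] = field(default_factory=list)   # soft warnings
    notes: List[str] = field(default_factory=list)      # ambient per-site notes


def render_for_agent(resp: AgentResponse) -> str:
    """Flatten the response into short lines the agent reads as-is."""
    lines: List[str] = []
    if not resp.ok:
        lines.append(f"ERROR: {resp.error}")
        if resp.fix:
            lines.append(f"FIX: {resp.fix}")
    lines.extend(f"WARNING: {w}" for w in resp.warnings)
    lines.extend(f"NOTE: {n}" for n in resp.notes)
    return "\n".join(lines)
```

The hard error carries its own fix, so the agent does not have to guess the next step. Warnings and notes ride along on successful calls too.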
What's new @ Steel - Changelog #027
✦ New Agent Traces docs: overview, timeline + exports, and the API
✦ Leaderboard refresh: filterable /results index and cleaner benchmark pages
✦ More benchmark entries, plus tooling to find new results worth listing
✦ Plus docs-delivery fixes and browser-seconds metering under the hood
Link below ↓
Latest: @AnthropicAI Claude Opus 4.8 is sitting #1 on OSWorld @XLangNLP at 83.4%, above the human baseline. Added the day it shipped.
https://t.co/41hgmZzPuW
Browser-agent benchmarks are getting crowded, stale, and hard to compare.
We are collecting the benchmarks that actually matter for browser and computer-use agents, so you don't have to chase them down.
(Just rebuilt the whole leaderboard. Live now.)