Save money running Harbor rollouts ‼️
Sometimes cost is more important that reliability or reproducibility when running rollouts (e.g. during rapid iteration).
Now in Harbor you can configure resource enforcement policies to save money.
🚨 stop zipping job results 🚨
... upload results to Harbor Hub instead
The hub makes it easy to share results with team members, customers, or simply save for later in a centralized place.
Example of a TB2.1 job in 🧵
the harbor community will be @ CAIS - come say hi!
9am Tue @ RLEval workshop
Harbor & Terminal-Bench 3.0 talk by @alexgshaw / me
10:30am Tue @ RLEval workshop
OpenThoughts-Agent talk by @AlexGDimakis
4pm Tue @ Agent Software Engineering workshop
Harbor Adapters & Harbor Index talk by @LinShi592021
9am Wed: Keynote by @andykonwinski
1/🧵Can AI agents automate U.S. healthcare workflows end to end given just clinician & insurer apps and operations, medical policy library? Introducing CHI-Bench: 75 long-horizon realistic healthcare workflows × 30 frontier agents. Best agent solves only 28% #AIinHealthcare 👇
On Evals - getting messages on “ok so how do I actually start learning this?”
there is no better way than by just doing so you can copy this to Claude Code and get started today
<instructions>
1. Go look up the @harborframework and the Terminal Bench 2.0 dataset. Go look up the Harbor Skills GitHub repo for help. Pick 1 Task in the dataset and explain every single piece that’s in that task folder
2. Explain what my agent sees when it does the task, what it has to output, and how we know if it got the problem right?
3. Now let’s actually run a Task using the built in Claude Code integration, it’s just a flag
4. Once that’s done let’s read the ATIF file that was produced together and help me understand what just happened. Did we pass the task? If not can we dig into why it failed? Go check the verifier logic to see what went wrong.
5. Ok let’s try to improve our agent by adjusting the prompt. And let’s rerun on a few tasks? Is this helping?
6. Ok we’re doing evals! Using this same format, help me make my own. Let’s do this together
…
</instructions>
Spend a few days reading a bunch of traces, actually running evals, understanding traces, internalizing agent failure modes, and being super in the loop of what the agent sees and does
Have fun! Evals are super important, they don’t have to be scary. DM if I can help or just tweet out what you’re doing, someone will help I promise, we’re all learning
ha! yes absolutely
@harborframework is a really powerful way to build and run a suite of evals for agents.
harbor lets you define a dataset of tasks. each task:
- defines the execution env (dockerfile/compose)
- the prompt (instructions.md)
- the verifier (deterministic, LLM-judge/etc)
then run it against a cartesian multiple of:
- agent (off the shelf claude/codex/customized - just impl a simple python class)
- model
- arbitrary args
use -n to repeat enough to get stat-sig, -k to control concurrency (definitely use a cloud sandbox provider like https://t.co/GDU2ByOpGT to run 100s of trials in parallel, FD- i consult for islo)
We built Harbor to evaluate agents.
But why limit ourselves to just agents?
Today we're adding first-class support for evaluating skills, MCPs, prompts, and services.
Ablate your agents.
Separating the agent sandbox and verifier sandbox now supported in harbor!
https://t.co/cd7BLnovZT
Nice writeup below from harbor community member @rishi_desai2 on why this is an important design decision to prevent reward hacking.
Reward hacking is an arms race between coding agents and RL envs.
A common eval flaw: the agent and verifier share the same sandbox.
If the agent can tamper with the grader, “pass” may just mean “cheated.”
We're releasing support for running verification in a separate sandbox. Tasks pre-configure artifacts to move from the agent sandbox into the verifier sandbox for the grading phase, improving the security boundary between agent and verifier.
Blog post below. Happy building!
Great write up by @adithya_s_k about @harborframework .
I want to add some thoughts around coding agents = CUA and Harbor coding envs = computer envs.
One of the reasons we built Terminal-Bench was because we saw that terminals/code were/was a powerful way for language models to control a computer. We’ve always viewed TB as a computer-use benchmark.
Coding agents = CUA means measuring coding agents is essentially the same thing as measuring general purpose agents. This is becoming more obvious with products like Claude Cowork, which is essentially a non-technical interface around Claude Code, and OpenAI’s push to making Codex a more general purpose tool.
We see this on the Harbor side too. Users create coding tasks. But they also create finance, law, accounting, engineering, general computer work, etc. tasks as well. Terminal-Bench 3.0 will cover all of these domains.
The implication is that Harbor becomes a tool for representing and measuring agents’ abilities to perform arbitrary computer work, which right now is the exact scope that users build agents to automate.
In fact, the Harbor Framework (as opposed to the Harbor Format) is just one opinionated way of performing rollouts on Harbor tasks. It works particularly well for agent evals. But there is no reason people can’t/shouldn’t implement other means of performing rollouts on Harbor tasks (e.g. @PrimeIntellect, @GenReasoning, and @tinkerapi all support some variation of a Harbor rollout). We’ll have some releases around this soon.
To summarize, coding agents = CUA, Harbor’s coding environments = computer environments, which means the scope of Harbor is probably broader than you think (as our users will attest!)
As agents get more clever, so do their attempts at benchmark hacking.
Last Monday, we found one of our RL runs jumped ~20% on SWE-Bench-Pro over a weekend, reaching ~64% which would make it #1 on the leaderboard.
This was clearly benchmark hacking and we patched the exploit.
But this revealed deeper hacks across multiple public benchmarks, some of which were impossible to fix through environment design alone.
Evals need to evolve beyond just outcome based pass rates to better observability into how the agent is arriving at them.
These were our findings:
https://t.co/ncyf4liW7C
Examples below 👇
1/