Harbor Framework @harborframework - Twitter Profile

7 days ago

Save money running Harbor rollouts ‼️

Sometimes cost is more important than reliability or reproducibility when running rollouts (e.g. during rapid iteration).

Now in Harbor you can configure resource enforcement policies to save money. https://t.co/HtqK9PCRk2

0

7

0

1K

Harbor Framework

@harborframework

8 days ago

https://t.co/cJAoKpB4jd

0

233

Harbor Framework

@harborframework

8 days ago

🚨 stop zipping job results 🚨

... upload results to Harbor Hub instead

The hub makes it easy to share results with team members, customers, or simply save for later in a centralized place.

Example of a TB2.1 job in 🧵 https://t.co/hvaAd5zIai

1

18

2

3

4K

Harbor Framework

@harborframework

8 days ago

https://t.co/zue55Xqfar

1

0

563

Harbor Framework

@harborframework

9 days ago

come hang out at CAIS!

Ryan Marten

@ryanmart3n

9 days ago

the harbor community will be @ CAIS - come say hi!

9am Tue @ RLEval workshop: Harbor & Terminal-Bench 3.0 talk by @alexgshaw / me
10:30am Tue @ RLEval workshop: OpenThoughts-Agent talk by @AlexGDimakis
4pm Tue @ Agent Software Engineering workshop: Harbor Adapters & Harbor Index talk by @LinShi592021
9am Wed: Keynote by @andykonwinski

3

22

6

2

3K

0

10

2

1

1K

Harbor Framework

@harborframework

12 days ago

https://t.co/Gqa7aevGNB

0

1

0

156

Harbor Framework

@harborframework

13 days ago

healthcare benchmark, built on harbor!

Weiran Yao

@iscreamnearby

15 days ago

1/🧵Can AI agents automate U.S. healthcare workflows end to end given just clinician & insurer apps and operations, medical policy library? Introducing CHI-Bench: 75 long-horizon realistic healthcare workflows × 30 frontier agents. Best agent solves only 28% #AIinHealthcare 👇 https://t.co/YoEtfHlVbu

12

42

24

63K

2

14

2

2K

harborframework retweeted

Viv

@Vtrivedy10

13 days ago

On Evals - getting messages on “ok so how do I actually start learning this?”

there is no better way than by just doing so you can copy this to Claude Code and get started today

<instructions>
1. Go look up the @harborframework and the Terminal Bench 2.0 dataset. Go look up the Harbor Skills GitHub repo for help. Pick 1 Task in the dataset and explain every single piece that’s in that task folder
2. Explain what my agent sees when it does the task, what it has to output, and how we know if it got the problem right?
3. Now let’s actually run a Task using the built in Claude Code integration, it’s just a flag
4. Once that’s done let’s read the ATIF file that was produced together and help me understand what just happened. Did we pass the task? If not can we dig into why it failed? Go check the verifier logic to see what went wrong.
5. Ok let’s try to improve our agent by adjusting the prompt. And let’s rerun on a few tasks? Is this helping?
6. Ok we’re doing evals! Using this same format, help me make my own. Let’s do this together …
</instructions>

Spend a few days reading a bunch of traces, actually running evals, understanding traces, internalizing agent failure modes, and being super in the loop of what the agent sees and does

Have fun! Evals are super important, they don’t have to be scary. DM if I can help or just tweet out what you’re doing, someone will help I promise, we’re all learning

23

335

29

605

21K

harborframework retweeted

Hao Wang

@MogicianTony

15 days ago

This largely eliminates the reward hacks we found using BenchJack and makes benchmarks much more reliable. Great work!

0

4

1

0

861

harborframework retweeted

Rotem Tamir

@rotemtam

16 days ago

ha! yes absolutely @harborframework is a really powerful way to build and run a suite of evals for agents.

harbor lets you define a dataset of tasks. each task:
- defines the execution env (dockerfile/compose)
- the prompt (instructions.md)
- the verifier (deterministic, LLM-judge/etc)

then run it against a cartesian multiple of:
- agent (off the shelf claude/codex/customized - just impl a simple python class)
- model
- arbitrary args

use -n to repeat enough to get stat-sig, -k to control concurrency (definitely use a cloud sandbox provider like https://t.co/GDU2ByOpGT to run 100s of trials in parallel, FD- i consult for islo)
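
The agent × model combinations with repeats described above can be sketched in plain Python. This is only an illustration of how such a trial matrix grows, not Harbor's API: `Trial`, `build_trials`, and the task, agent, and model names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trial:
    """One run of one agent/model pair on one task (hypothetical type)."""
    task: str
    agent: str
    model: str
    attempt: int


def build_trials(tasks, agents, models, repeats):
    """Return one Trial per (task, agent, model) combination, repeated `repeats` times."""
    return [
        Trial(task, agent, model, attempt)
        for task, agent, model in product(tasks, agents, models)
        for attempt in range(repeats)
    ]


# Placeholder names, chosen only for the example.
trials = build_trials(
    tasks=["task-a", "task-b"],
    agents=["claude", "codex"],
    models=["model-x"],
    repeats=3,
)
print(len(trials))  # 2 tasks * 2 agents * 1 model * 3 repeats = 12
```

The count multiplies quickly, which is why the concurrency setting and a parallel sandbox provider matter once the matrix reaches hundreds of trials.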

1

2

1

0

377

harborframework retweeted

Alex Shaw

@alexgshaw

16 days ago

FrontierCS now in @harborframework

1

32

4

13

3K

Harbor Framework

@harborframework

16 days ago

We built Harbor to evaluate agents.

But why limit ourselves to just agents?

Today we're adding first-class support for evaluating skills, MCPs, prompts, and services.

Ablate your agents. https://t.co/xnwI9N2DU9

0

42

2

29

6K

Harbor Framework

@harborframework

17 days ago

Separating the agent sandbox and verifier sandbox now supported in harbor! https://t.co/cd7BLnovZT Nice writeup below from harbor community member @rishi_desai2 on why this is an important design decision to prevent reward hacking.

Rishi Desai

@rishi_desai2

17 days ago

Reward hacking is an arms race between coding agents and RL envs.

A common eval flaw: the agent and verifier share the same sandbox.

If the agent can tamper with the grader, “pass” may just mean “cheated.” https://t.co/KAMAtmFIz2

5

40

7

27

17K

0

16

1

14

2K

harborframework retweeted

Alex Shaw

@alexgshaw

17 days ago

Evaluate biomedical agents using @harborframework . Congrats to the @phylo_bio team on a great benchmark!

0

24

4

5

3K

Harbor Framework

@harborframework

20 days ago

https://t.co/Ie5td5LnYl

0

1

0

310

Harbor Framework

@harborframework

20 days ago

We're releasing support for running verification in a separate sandbox. Tasks pre-configure artifacts to move from the agent sandbox into the verifier sandbox for the grading phase, improving the security boundary between agent and verifier. Blog post below. Happy building!
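
A rough sketch of the idea, using two local directories in place of real sandboxes: only the artifacts a task declares are copied into a fresh verifier directory, and grading happens there. This is not Harbor's implementation; `grade_in_separate_sandbox`, `verify`, and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def grade_in_separate_sandbox(agent_dir, artifacts, verify):
    """Copy only the declared artifacts into a fresh directory, then grade there."""
    with tempfile.TemporaryDirectory() as tmp:
        verifier_dir = Path(tmp)
        for rel in artifacts:
            src = Path(agent_dir) / rel
            if src.is_file():
                dst = verifier_dir / rel
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)
        # The verifier never sees the agent's working directory itself.
        return verify(verifier_dir)


def verify(sandbox):
    """Hypothetical check: the task expects answer.txt to contain 42."""
    answer = sandbox / "answer.txt"
    return answer.is_file() and answer.read_text().strip() == "42"


with tempfile.TemporaryDirectory() as tmp:
    agent_dir = Path(tmp)
    (agent_dir / "answer.txt").write_text("42\n")
    # An undeclared file left by the agent is not copied across.
    (agent_dir / "fake_grader.py").write_text("print('pass')\n")
    passed = grade_in_separate_sandbox(agent_dir, ["answer.txt"], verify)

print(passed)
```

Because the grader's own files never live where the agent can write, tampering is limited to the contents of the declared artifacts.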

2

33

2

9

3K

harborframework retweeted

Alex Shaw

@alexgshaw

21 days ago

Great write up by @adithya_s_k about @harborframework . I want to add some thoughts around coding agents = CUA and Harbor coding envs = computer envs.

One of the reasons we built Terminal-Bench was because we saw that terminals/code were/was a powerful way for language models to control a computer. We’ve always viewed TB as a computer-use benchmark.

Coding agents = CUA means measuring coding agents is essentially the same thing as measuring general purpose agents. This is becoming more obvious with products like Claude Cowork, which is essentially a non-technical interface around Claude Code, and OpenAI’s push to making Codex a more general purpose tool.

We see this on the Harbor side too. Users create coding tasks. But they also create finance, law, accounting, engineering, general computer work, etc. tasks as well. Terminal-Bench 3.0 will cover all of these domains.

The implication is that Harbor becomes a tool for representing and measuring agents’ abilities to perform arbitrary computer work, which right now is the exact scope that users build agents to automate.

In fact, the Harbor Framework (as opposed to the Harbor Format) is just one opinionated way of performing rollouts on Harbor tasks. It works particularly well for agent evals. But there is no reason people can’t/shouldn’t implement other means of performing rollouts on Harbor tasks (e.g. @PrimeIntellect, @GenReasoning, and @tinkerapi all support some variation of a Harbor rollout). We’ll have some releases around this soon.

To summarize, coding agents = CUA, Harbor’s coding environments = computer environments, which means the scope of Harbor is probably broader than you think (as our users will attest!)

3

110

8

108

13K

harborframework retweeted

Poolside

@poolsideai

24 days ago

As agents get more clever, so do their attempts at benchmark hacking.

Last Monday, we found one of our RL runs jumped ~20% on SWE-Bench-Pro over a weekend, reaching ~64% which would make it #1 on the leaderboard. This was clearly benchmark hacking and we patched the exploit.

But this revealed deeper hacks across multiple public benchmarks, some of which were impossible to fix through environment design alone.

Evals need to evolve beyond just outcome based pass rates to better observability into how the agent is arriving at them.

These were our findings: https://t.co/ncyf4liW7C

Examples below 👇 1/