Mandeep

@themandeepc

Building network-based computer use agents @ Systems engineer and PL enthusiast

London

Joined December 2016

485 Following

117 Followers

179 Posts

Pinned Tweet

Mandeep

@themandeepc

3 months ago

I think folks are being misled by "high performance" on browser use "benchmarks". It's not appreciated enough just how different they are to LLM benchmarks, and why they're difficult to do right and currently extremely flawed.

LLM benchmarks are "closed world": the model generates text, and you verify it against some fixed ground truth that doesn't change. Even 'hard' benchmarks like Humanity's Last Exam fit this pattern. The benchmark dataset fully defines the expected inputs, outputs, and validation function.

Browser use benchmarks, however, are fundamentally different because they're not closed world. "Actions" - things that change state on a website - are especially difficult. You can't go around willy nilly and mutate state on Twitter, Salesforce, etc, every time you run the evals. That especially applies to the websites we care about: internal enterprise software being the most obvious category.

Even data retrieval can be difficult: websites and data change. Restaurant availability changes every hour, flight availability/prices change even faster. It's _slightly_ easier than actions since you can cache the HTML and make it closed world, as some benchmarks do, but this doesn't work for actions, and ages badly. Other benchmarks get around this by trying to fix the date of a check ("find me flights on 1 March 2024"). Ofc that trick doesn't work for most tasks (like that flights example - you can't view historical flight availability).

Then there are CAPTCHAs, which exist on basically every high-value web task (even if hidden). Current benchmarks exclude all these 'inconvenient' tasks, which massively skews them to be totally unrepresentative of how humans use websites.

Pure computer use benchmarks have it easier because they're often closed world: the start and desired end state can be well-defined and evaluated inside a network-less container. Updating an Excel sheet does no harm (which tbf represents a lot of economic work). But once you're doing things in a browser, on websites over the internet, this nice property doesn't apply anymore.

WebArena's answer to this conundrum was to create 'fake' websites that were supposed to be representative of real ones. The problem is, they're not. OSWorld makes it kinda closed world by providing cached versions of HTML, but this only really works for data retrieval. They're also very unrepresentative. WebVoyager is especially egregious: just 15 (!!) websites are represented, and the tasks are ridiculously easy. Take a look yourself: https://t.co/vdMyAYawfy

So, how does this translate to the claims made by browser startups? Well, WebVoyager (the extremely easy one) is the benchmark the avg browser startup reports 85%+ accuracy on. Claude's performance is reported for computer use, and against OSWorld, which is dominated by closed-world tasks. So really, high reported accuracies should be taken with a huge grain of salt, and there's still a long way to go before computer use is solved.

That said, there's at least one other team thinking about these problems (@yutori_ai, with their release of Navi-Bench). From first principles, this is a really tricky problem to solve. The infra and data to properly benchmark web agent performance are extremely nascent and underdeveloped. It's a problem we think a lot about at Indices -- please reach out (DM) if you do too!
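To make the "closed world" point concrete, here's a minimal sketch (not any real benchmark's harness). Everything in it is hypothetical and made up for illustration: `Task`, `exact_match`, `run_closed_world`, `toy_agent`, and `live_price`, which stands in for a live website whose answer depends on when you ask.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    """One benchmark item: a fixed input plus a fixed ground truth."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Validation function: case-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def run_closed_world(
    tasks: Sequence[Task],
    agent: Callable[[str], str],
    validate: Callable[[str, str], bool] = exact_match,
) -> float:
    """Score an agent against a dataset that fully defines inputs and outputs."""
    passed = sum(validate(agent(t.prompt), t.expected) for t in tasks)
    return passed / len(tasks)


def toy_agent(prompt: str) -> str:
    """Hypothetical agent: a lookup table standing in for a model."""
    answers = {"2 + 2": "4", "capital of France": "paris"}
    return answers.get(prompt, "")


def live_price(hour: int) -> str:
    """Hypothetical live site: the 'right' answer drifts with time."""
    return f"{100 + hour} GBP"


# Closed world: the ground truth never changes, so the score is reproducible.
closed_tasks = [Task("2 + 2", "4"), Task("capital of France", "Paris")]
closed_score = run_closed_world(closed_tasks, toy_agent)

# Open world: ground truth was recorded at hour 0, but the eval runs at hour 5.
# The agent reads the live site correctly and still fails validation.
recorded = Task("cheapest flight price", live_price(hour=0))
stale_score = run_closed_world([recorded], lambda _prompt: live_price(hour=5))
```

In this toy setup the closed-world score is stable across runs, while the live task marks a correct agent as wrong purely because the recorded answer went stale. Actions are worse still, since there's no recorded answer to compare against without mutating real state.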

Mandeep

@themandeepc

about 19 hours ago

Building AI systems without evals is like building software without tests. Vibes-based evals will probs go down the path of vibes-based software engineering: you’re consumed by fear that any change will break your system in unexpected ways. It’s hard, but design good evals.

Mandeep

@themandeepc

1 day ago

@zeeg @chaliy the other big benefit of code mode (or cli or related) is composition and piping, right? without that, the LLM can be forced to write or repeat large blobs as args to tools.

Mandeep

@themandeepc

2 days ago

"Included in your plan until Jun 22, then switch to usage credits to continue" I guess the era of subsidised (& reasonably priced) coding agents is coming to an end.

Mandeep

@themandeepc

2 days ago

@skull8888888888 @jeffzwang this would be fun!

Mandeep

@themandeepc

3 days ago

@zeeg obviously not your point, but 'sorted set' feels like a misnomer, and is unfortunate naming from redis

Mandeep

@themandeepc

4 days ago

Saw a guy on the tube arguing with (and being gaslit by) Copilot. Incredible how many normal people's experience of AI is garbage "Enterprise AI" tools that genuinely suck. Huge difference between the frontier and what most ppl's experience of AI is.

Mandeep

@themandeepc

5 days ago

@typecraft_dev would recommend Alex Petrov's Database Internals too.

Mandeep

@themandeepc

7 days ago

reviving the "member of technical staff" title might be one of the best things the labs did. better title for cultures where SWEs work on a variety of technical projects, and less weird politics competing for "senior staff software engineer level II". I hope it sticks

Mandeep

@themandeepc

10 days ago

@perplexity_ai @PPLXDevs isn't this just exposing lower-level APIs rather than a high-level, rigid search API? and then relying on agents to compose them as appropriate for a given task? I'm either missing something (maybe!), or the post is a long-winded way of saying the above.

Mandeep

@themandeepc

10 days ago

@jianmjn 👋

Mandeep

@themandeepc

10 days ago

@willccbb (assuming this is about Effect the typescript project, and not effects more generally)

Mandeep

@themandeepc

10 days ago

@willccbb seems like it started as bringing effect system-like behaviour to typescript, and is now an assortment of distributed systems and reliability primitives for typescript applications

Mandeep

@themandeepc

10 days ago

@ctjlewis Probably, but they did also provide genuinely fun stuff & marketing of it (cafe and shuttle). Surprised how fast their fortunes changed after they raised a huge val + grind thought leadership.

Mandeep

@themandeepc

11 days ago

@willccbb @PradyuPrasad and otel is a specification, not a company 😭

Mandeep

@themandeepc

11 days ago

@HarryStebbings @nico_laqua real q tho: can we see the corgi tattoo?

Mandeep

@themandeepc

12 days ago

@gabriel1 also (historically) whittles numbers down to where you have resources to even interview remaining candidates. tho, when university became pretty common, that became leetcode/OA test scores, or similar.

Mandeep

@themandeepc

12 days ago

@garybasin @martin_casado “Best” isn’t just raw intelligence though. I think Claude models still have better coding taste, for instance. After a certain point of capability, taste > more capability

Mandeep

@themandeepc

13 days ago

@steipete @jjpcodes what do you not like about them?

Mandeep

@themandeepc

15 days ago

@petereliaskraft @qianl_cs Nice!
