We furthered AI research by reproducing CRUX #1 for Windows using @getnenai's infrastructure, without needing to buy a Windows machine. Check out our blog post: https://t.co/Mjz1D5myF9
Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.
So true, that's why we're building this at Nen.
Btw, Daytona Windows desktops are still in private preview. If you want access to one right now, we're already available https://t.co/Emp6NK1UHq
Computer use will be one of the biggest drivers of compute usage for agents over the next 2-3 years.
Because the world is too slow to change, agents need to work in the world that exists.
Most legacy applications are Windows apps, and private equity firms, banks, enterprises, etc. move very slowly.
Everyone knows that headless APIs are the future, but the transition will take time.
As part of battle-testing our infrastructure, I used Nen to autonomously create and publish an app to the Windows store. This is an extension of the work by the CRUX eval team (https://t.co/SejTDvAnXj) and @random_walker and @sayashk. Check out my writeup: https://t.co/156hLCmLce
We're launching Computer Use Desktops for Windows. Launch a fully-provisioned Windows desktop from your CLI with just `nen desktop create` in a few seconds.
See the benchmarks here --> https://t.co/Ewzn2XBMHl
Anthropic just launched a new memory system within 3 weeks of their last release https://t.co/DKncYPuKDJ
I wrote a blog post that explains why https://t.co/LDKOexpji1
When it comes to computer-use, 80 is the new 70.
Today, we broke a new barrier on the OS-World benchmark with an 80.4% success rate. Holo3 is officially #1 globally for computer-use agents, and it's not even close. 🏅
👉 See for yourself: https://t.co/jTUnRY3nYr
A massive congratulations to the whole team. They set a high standard with chart-topping results two weeks ago and continue to raise the bar.
4. Here are some ideas:
- Reimplement a common CLI (grep, cat, ls) in a new language. You know how it should work, but you probably don't know how it's implemented.
- Implement a protocol client (HTTP, SMTP) and test it with a real server.
- Have fun: build a nostalgic game from scratch.
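To make the first idea concrete, here is a minimal sketch of a grep-like filter in Python. It is only an illustration: the function name `grep` and its `ignore_case` / `invert` parameters are made up for this example, and it covers a tiny slice of what the real tool does.

```python
import re
from typing import Iterable, Iterator, Tuple


def grep(pattern: str, lines: Iterable[str],
         ignore_case: bool = False,
         invert: bool = False) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) for each line the regex matches.

    Hypothetical helper for illustration; loosely mirrors `grep -n`,
    with `ignore_case` like `-i` and `invert` like `-v`.
    """
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        matched = regex.search(text) is not None
        # With invert=True, keep the lines that did NOT match.
        if matched != invert:
            yield number, text


# Small in-memory demo so the sketch runs without any files.
demo_lines = ["alpha\n", "Beta\n", "gamma beta\n"]
for number, text in grep("beta", demo_lines, ignore_case=True):
    print(f"{number}:{text}")
```

A natural next step for the delegation exercise is to compare this output against the real `grep` on the same input and have the agent close the gaps.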
> treat it like an engineer you're delegating to
Definitely easier said than done for a BC engineer (Before Claude?) like myself, who is used to making incremental changes, testing them, and reviewing by eye. I wanted to share some concrete steps that helped me break through -->
Opus 4.7 is live in Claude Code today!
The model performs best if you treat it like an engineer you're delegating to, not a pair programmer you're guiding line by line. Here are three workflow shifts we recommend for this model 🧵
https://t.co/bD5JO1xDMS
@SmokeStarlight @random_walker What aspect of security? I'm not the author, but my understanding is that this was all conducted in a VM with agent-only credentials, so there should be no concerns there.
> We gave an AI agent an Apple Developer account, a Mac VM, and one task: build and publish an iOS app. It succeeded, at a cost of about $1,000.
Great research on what it takes for an agent to do real-world work. Some interesting areas for improvement:
📢📢A double launch today! We’re releasing a paper analyzing the rapidly growing trend of “open-world evaluations” for measuring frontier AI capabilities. We’re also launching a new project, CRUX (Collaborative Research for Updating AI eXpectations), an effort to regularly conduct such evaluations ourselves.
I think open-world evals are the most important development in AI evaluation over the past year. Our paper explains why we need them, what they can and can’t tell us, and how to do them well.
In CRUX #1, we tasked an agent with building and publishing a simple iOS app to the Apple App store. The paper has many “lessons from the trenches” from running this experiment. We hope you find it interesting! CRUX #2 will be about AI R&D automation.
The core team is @sayashk, @PKirgis, @steverab, Andrew Schwartz, and me. We’re delighted to have assembled an amazing group of collaborators, many of whom have conducted important open-world evaluations: @fly_upside_down, @RishiBommasani, @DubMagda, @ghadfield, @ahall_research, @sarahookr, @sethlazar, @snewmanpv, @DimitrisPapail, @shostekofsky, @hlntnr, and @CUdudec.
Paper: https://t.co/M15jgh4PCP
HTML version: https://t.co/iuVW7RAlr5
CRUX website: https://t.co/g937gpS65j
Of course, whether the app itself is useful or interesting isn't part of the evaluation. But this paper shows that agents are capable of the mechanics of doing computer work: manipulating software, understanding workflows, and self-correcting for efficiency. Very impressive!
We automated 80% of recruiting at @AudaciousHQ and I'm hosting a workshop mid-April to show you exactly how.
Here's what our system does right now:
1. Every candidate we meet gets matched against every open role in our portfolio. Automatically. No spreadsheets. No "let me think about who this might be good for."
2. Personalized outreach goes out warm without anyone writing a single message. We have detailed notes on 5K+ engineers and sales leaders. Their background, what they care about, what they're looking for next. We know their dog's name.
3. Within 30 seconds of ending a call, the candidate is tagged in our ATS, notes are uploaded, and a write-up is drafted.
4. And we built a recall system you can talk to. Describe the person you need: "senior engineer, distributed systems, wants to go early stage, based in SF". It pulls the best matches from our entire database instantly.
We went from drowning in admin to spending all of our time on the only thing that actually matters: building real relationships with people.
Comment below if you want the details.
cc: @jatingargiitk, the mastermind!
What if computer-use agents could do real work?
We built Gym-Anything: a framework that turns any software into a computer-use agent environment.
We used it to create CUA-World: 200+ real software applications and 10,000+ tasks and environments across all major occupation groups, from medical imaging to financial trading.
🧵