Zi @dongyangzi - Twitter Profile

about 1 month ago

We furthered AI research by reproducing CRUX #1 for Windows using @getnenai's infrastructure without needing to buy a Windows machine-- check out our blog post https://t.co/Mjz1D5myF9

Sayash Kapoor @sayashk

about 2 months ago

Benchmarks are saturated more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: Open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks. https://t.co/CrvbEd9l7f

Zi

@dongyangzi

about 1 month ago

So true, that's why we're building this at Nen. Btw, Daytona Windows desktops are still in private preview. If you want access to one right now, we're already available https://t.co/Emp6NK1UHq

Ivan Burazin

@ivanburazin

about 1 month ago

Computer use will be one of the biggest drivers of compute usage for agents over the next 2-3 years. Because the world is too slow to change, agents need to work in the world that exists. Most legacy applications are Windows apps, and private equity, banks, enterprises, etc move very slowly. Everyone knows that headless APIs are the future, but the transition will take time.

Zi

@dongyangzi

about 1 month ago

As part of battle-testing our infrastructure I used Nen to autonomously create and publish an app to the Windows store. This is an extension of the work by the CRUX eval team (https://t.co/SejTDvAnXj) and @random_walker and @sayashk. Check out my writeup https://t.co/156hLCmLce

Nen

@getnenai

about 1 month ago

We're launching Computer Use Desktops for Windows. Launch a fully-provisioned Windows desktop from your CLI with just `nen desktop create` in a few seconds. See the benchmarks here --> https://t.co/Ewzn2XBMHl

Zi

@dongyangzi

about 1 month ago

Anthropic just launched a new memory system within 3 weeks of their last release https://t.co/DKncYPuKDJ I wrote a blog post that explains why https://t.co/LDKOexpji1

about 1 month ago

@sarahwooders thoughts on https://t.co/DKncYPuKDJ?

dongyangzi retweeted

H @hcompany_ai

about 1 month ago

When it comes to computer-use, 80 is the new 70. Today, we broke a new barrier on the OS-World benchmark with an 80.4% success rate. Holo3 is officially #1 globally for computer-use agents, and it's not even close. 🏅 👉 See for yourself: https://t.co/jTUnRY3nYr A massive congratulations to the whole team. They set a high standard with chart topping results two weeks ago and continue to raise the bar.

Zi

@dongyangzi

about 2 months ago

4. Here are some ideas
- Reimplement a common CLI (grep, cat, ls) in a new language. You know how it should work but you probably don't know how it's implemented
- Implement a protocol client (HTTP, SMTP) and test it with a real server
- Have fun-- build a nostalgic game from scratch

Zi

@dongyangzi

about 2 months ago

> treat it like an engineer you're delegating to

Definitely easier said than done for a BC engineer (Before Claude?) like myself used to making incremental changes, testing it, and reviewing by eye. Wanted to share some concrete steps that helped me break through -->

cat

@_catwu

about 2 months ago

Opus 4.7 is live in Claude Code today! The model performs best if you treat it like an engineer you're delegating to, not a pair programmer you're guiding line by line. Here are three workflow shifts we recommend for this model 🧵 https://t.co/bD5JO1xDMS

Zi

@dongyangzi

about 2 months ago

3. Be lazy.

Zi

@dongyangzi

about 2 months ago

@SmokeStarlight @random_walker what aspect of security? Not the author here but my understanding is that this was all conducted in a VM with agent-only credentials, so no concerns there

Zi

@dongyangzi

about 2 months ago

> We gave an AI agent an Apple Developer account, a Mac VM, and one task: build and publish an iOS app. It succeeded, at a cost of about $1,000.

Great research on what it takes for an agent to do real world work. Some interesting areas for improvement:

Arvind Narayanan

@random_walker

about 2 months ago

📢📢A double launch today! We’re releasing a paper analyzing the rapidly growing trend of “open-world evaluations” for measuring frontier AI capabilities. We’re also launching a new project, CRUX (Collaborative Research for Updating AI eXpectations), an effort to regularly conduct such evaluations ourselves.

I think open-world evals are the most important development in AI evaluation over the past year. Our paper explains why we need them, what they can and can’t tell us, and how to do them well.

In CRUX #1, we tasked an agent with building and publishing a simple iOS app to the Apple App store. The paper has many “lessons from the trenches” from running this experiment. We hope you find it interesting! CRUX #2 will be about AI R&D automation.

The core team is @sayashk, @PKirgis, @steverab, Andrew Schwartz, and me. We’re delighted to have assembled an amazing group of collaborators, many of whom have conducted important open-world evaluations: @fly_upside_down, @RishiBommasani, @DubMagda, @ghadfield, @ahall_research, @sarahookr, @sethlazar, @snewmanpv, @DimitrisPapail, @shostekofsky, @hlntnr, and @CUdudec.

Paper: https://t.co/M15jgh4PCP
HTML version: https://t.co/iuVW7RAlr5
CRUX website: https://t.co/g937gpS65j

Zi

@dongyangzi

about 2 months ago

Of course whether the app itself is useful or interesting isn't part of the evaluation. But this paper shows that agents are capable of the mechanics of doing computer work, manipulating software, understanding workflows, self-correcting for efficiencies. Very impressive!

Zi

@dongyangzi

about 2 months ago

3. This was done without any additional agent harness improvements using Claude Opus 4.6

Zi

@dongyangzi

about 2 months ago

@hellosprice leads the most amazing recruiting team-- learn from the best!

Samantha Price

@hellosprice

about 2 months ago

We automated 80% of recruiting at @AudaciousHQ and I'm hosting a workshop mid-April to show you exactly how. Here's what our system does right now:

1. Every candidate we meet gets matched against every open role in our portfolio. Automatically. No spreadsheets. No "let me think about who this might be good for."
2. Personalized outreach goes out warm without anyone writing a single message. We have detailed notes on 5K+ engineers and sales leaders. Their background, what they care about, what they're looking for next. We know their dog's name.
3. Within 30 seconds of ending a call, the candidate is tagged in our ATS, notes are uploaded, and a write-up is drafted.
4. And we built a recall system you can talk to. Describe the person you need: "senior engineer, distributed systems, wants to go early stage, based in SF". It pulls the best matches from our entire database instantly.

We went from drowning in admin to spending all of our time on the only thing that actually matters: building real relationships with people.

Comment below if you want the details. cc: @jatingargiitk the master mind!

Zi

@dongyangzi

about 2 months ago

Awesome work by @PranjalAggarw16 bringing computer-use research to the real economy!

Pranjal Aggarwal ✈️ ICLR'26

@PranjalAggarw16

about 2 months ago

What if computer-use agents could do real work? We built Gym-Anything: a framework that turns any software into a computer-use agent environment. We used it to create CUA-World: 200+ real software, 10,000+ tasks and environments, across all major occupation groups, from medical imaging to financial trading. 🧵

Zi

@dongyangzi

about 2 months ago

@sarahwooders Makes sense, thanks for sharing!
