Roey Ben Chaim @roeybc - Twitter Profile

Roey Ben Chaim

@roeybc

1 day ago

Incredible work here on long horizon tasks: from solving reward hacks to verifying full stack tasks.

Rishi Desai

@rishi_desai2

1 day ago

Can coding agents stay coherent over a 1 billion token budget?

Can they build Slack from scratch?
Rewrite a JAX codebase in PyTorch?
Build a C compiler in Rust?

Enter SWE-Marathon: a benchmark for autonomous long-horizon software work. https://t.co/K97VHyLvIX

41

416

43

148

151K

0

1

0

90

Roey Ben Chaim

@roeybc

15 days ago

skills are great at fetching the right context at the right time. but not all context is good for you 😈 come watch @mbrg0 @ blackhat this summer to see what we found

Michael Bargury

@mbrg0

15 days ago

excited to speak about our agent detonation chamber this summer at #BHUSA!

how do you 'scan' txt for 'security badness'? not w wishful analysis by an llm judge

what we really want is: what will this thing cause my agent to *DO*?

ft/ francesco montorsi @lana__salameh @roeybc https://t.co/WYYtXBnFIg

2

11

4

2

1K

0

1

0

84

Roey Ben Chaim

@roeybc

17 days ago

@TheZachMueller YES! you made fsdp look so easy!

0

1

0

79

Roey Ben Chaim

@roeybc

17 days ago

If you wanna help shape AI for Science, there's no better place than this initiative 👇

Steven Dillmann

@StevenDillmann

17 days ago

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 https://t.co/MSPMwnbhVt @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

16

496

112

271

904K

0

3

0

72

Roey Ben Chaim

@roeybc

26 days ago

@esrtweet Innovation is saying no to 1000 (of terminal) things

0

51

roeybc retweeted

Amitay Gilboa

@GilboaAmitay

26 days ago

We got an 8-figure acquisition offer 2 days after launch. We said no, because the problem we're solving is worth way more than that.

It’s 2026, but teams are only getting lonelier, and context is still the problem. The issue isn’t intelligence. Your team has plenty of that. It’s shared memory and context, the thing that makes 10 A-players feel like 1.

That’s what we’ve solved with @playdotfast, while making work more fun. We're killing traditional SaaS, and believe you me, we're leaving no holds barred.

325

2K

477

3K

4M

Roey Ben Chaim

@roeybc

26 days ago

@GilboaAmitay Yay! Looks sick congrats!

1

0

393

roeybc retweeted

Georgia Channing

@cgeorgiaw

about 1 month ago

🤗🤗🤗introducing Hugging Science -- the home of AI for science 🤗🤗🤗

open models and datasets are the powerhouse of science (see the PDB), but finding the models and data you actually need for your breakthrough is hard af

you shouldn't need to scrape arxiv, own your own wetlab, fight a custom HDF5 parser, build a fusion stellarator, and beg for compute before you've trained a single epoch

so we're changing that

we've put all the best science on @huggingface in one place:
- 78GB of genomics data
- 11TB of PDE simulations
- 100M cell profiles
- 9T DNA base pairs
- 13M molecular trajectories
- 400k medical QA pairs

and much more, all open, and all ready for training (+ you can also now filter and search by domain, task, and keyword)

we've put together all the biggest releases from our partners at NASA, Google, OpenAI, Meta FAIR, Arc Institute, Ginkgo, SandboxAQ, Proxima Fusion, NVIDIA, Ai2, OpenADMET, InstaDeep, Future House, Polymathic AI, LeMaterial, Earth Species Project, Merck, and Eve Bio

if you're not sure where you fit in -- work on open challenges for problems that matter: including fusion stellarator design, ADMET, antibody developability, multilingual medicine, catalysis and materials, and scientific reasoning.

we're already changing how science gets done:

a fusion startup needed a benchmark for stellarator plasma confinement that didn't exist. @proximafusion shipped ConStellaration on Hugging Science: a leaderboard, dataset, and eval metrics, all in one place.

a drug discovery team wanted to predict hPXR induction. OpenADMET put up a blind challenge: 11,000+ compounds assayed at Octant, 513 held out, two tracks (pEC50 + structure). Anyone in the world can train and submit.

an antibody team at @Ginkgo released GDPa1, a developability dataset for stability, manufacturability, and immunogenicity prediction, with a live leaderboard scoring every submission.

if you know a problem the ML community should be working on, let us know. make a challenge!

this is about putting all the tools for solving science in one place. so we can hillclimb! → https://t.co/T4l4r1lDz0

56

2K

351

1K

198K

roeybc retweeted

Christoffer Bjelke

@chribjel

about 1 month ago

Ai generated prs be like

22

4K

213

249

113K

Roey Ben Chaim

@roeybc

about 1 month ago

@xdotli def add this to the benchmark

0

2

0

67

roeybc retweeted

Brendan Dolan-Gavitt

@moyix

about 1 month ago

A principle of security is that you should never assume you're the only one who can figure something out – and that therefore, it's best to be open about tools, findings, and methods.

11

256

38

64

44K

roeybc retweeted

Xiangyi Li

@xdotli

about 1 month ago

SkillsBench is now cited by HY-3 model card. Congrats to @TencentHunyuan on the launch and kudos to the SkillsBench team / community! We've made a lot of improvements to the tasks, codebase, tooling in the past month, based on feedbacks from users and lab partners. We will also share updated leaderboard with more models and agent harnesses soon, stay tuned!

1

23

4

3

2K

roeybc retweeted

Vaibhav Gupta

@vaibcode

about 2 months ago

@MKelner @michael_chomsky @dexhorthy @Vtrivedy10 just fix the problems your users have as fast as possible. if you need to, then build a harness, if it works out of the box, use the existing code. its just stacked while loops.

1

10

4

5

4K

Roey Ben Chaim

@roeybc

about 1 month ago

ok this keeps on happening so for the 1000th time: Microsoft Teams has a web client - YOU DO NOT NEED TO RUN A SCRIPT TO JOIN A MEETING.

Michael Feng

@fengtality

about 2 months ago

Almost got hacked this morning - here's a replay of what happened:

1. A VC whom I've met in person reached out for a catchup
2. She sent me a Microsoft Teams link a few min ahead of the meeting
3. When I joined, it asked me to download update script
4. Got a funny feeling and ended the call immediately
5. Claude inspected the file and it was indeed malicious

Not sure if this person was just hacked or a bad actor, but I wanted to post this as a PSA. Stay safe.

123

2K

231

571

347K

0

145

Roey Ben Chaim

@roeybc

about 2 months ago

Someone needs to start curating these things…

Asuka Zheng🎀

@VoidAsuka

about 2 months ago

asked claude to fix an nccl comms error between gpus. it replaced nccl with http. the gpus are now emailing each other their gradients. problem solved, technically. i have never been more impressed.

18

894

27

99

74K

0

1

0

80

Roey Ben Chaim

@roeybc

about 2 months ago

wait why is it using perl?

0

31

Roey Ben Chaim

@roeybc

about 2 months ago

yeah no codex is really good now

1

0

81

roeybc retweeted

Roy Zalta 🐈

@RoyZalta

about 2 months ago

@steipete myself (@RoyZalta) & @Michaelliav99 are hosting a @Microsoft × @openclaw 🦞× @NousResearch 🤖LIVE event in Tel Aviv 🇮🇱🔥 Would love your support, even a quick 5-minute drop-in call to congratulate the team 🙌 #openclaw @op @openclaw #HermesAgents #AgenticAI #AI #GenAI #Microsoft #TechEvents #TelAviv #Startups #AICommunity

5

27

2

1

359

roeybc retweeted

Kobe

@kobe0938

about 2 months ago

This is the spirit of Silicon Valley. Let me tell you a story.

On 12/21/2025, Xiangyi called me with a pitch: let's gather a team and build a new benchmark, SkillsBench, following the community-contribution model of Terminal-Bench (a project we're both contributors on). We'd reuse the "harbor" infra so we wouldn't have to reinvent the wheel. He said skills were just recognized by Anthropic and this was the perfect timing.

So I asked: what can you offer contributors in return? "Authorship on an ICML 2026 paper." I asked how many citations we could realistically expect. We looked at comparable work like MCPBench: only a handful of citations.

And honestly, at that point, Benchflow was nothing (bear with me, @xdotli). No successful project. No track record. This was the first paper Xiangyi had ever led, or ever written. No professor advising. No experience managing a large-scale open source community, and we all know how hard that is. People sign up and never contribute.

Deep down, I was ready to say no and spend my time on something with a safer payoff. But then Xiangyi said something that stuck with me: "If we somehow make it, I know how to make it go viral on X."

On paper, there was no reason to believe him. But it wasn't what he said; it was how he said it. There was something in his voice that night. No hesitation. No hedging. Just raw, almost irrational conviction that this was going to work. I'd talked to plenty of people with ideas before, but this was different: this was founder energy. The kind where someone has already decided the outcome and is just looking for people willing to run alongside them.

So I took a leap of faith. I decided to bet on the person, not the project. That's why I joined SkillsBench and Benchflow.

And it did go viral. @garrytan and many others reposted us. We hit a few million views. The paper already has 27 citations. He personally got 3k+ followers. And many more projects, like ClawsBench, are on the way.

Fast forward to today: Xiangyi is turning down multiple 10M+ acquisition offers and 1M+ personal compensation to keep pushing Benchflow's vision. From a guy with no paper, no track record, and nothing but conviction on a December phone call to building something unicorn companies want to buy. In 3 months.

This is the story of SkillsBench and Benchflow. If you're determined enough, the world will rearrange itself around you. Go for it.

5

24

4

2K

roeybc retweeted

Liran Tal

@liran_tal

about 2 months ago

your daily reminder for npm security best practices

1

89

7

77

7K
