Wenbo Chen @wenbochen8 - Twitter Profile

Wenbo Chen @wenbochen8

about 15 hours ago

Had a lot of fun in the journey! Kudos to the team!

Xiangyi Li

@xdotli

about 16 hours ago

A big pain point in using AI benchmarks is encountering errors after their first release. Today, we're releasing SkillsBench 1.1, the first benchmark for how well AI agents use skills, now audited end to end and verified error-free. Prof. @dawnsongtweets joins 1.1 as advising author.

We worked through every task with several frontier labs to eliminate the errors in the previous version. We also added new tasks, moved the ones with external dependencies into a separate set so the core suite runs clean, and expanded coverage to more models.

Capability is climbing fast. The best with-skills resolution rate rose from ~36% (Claude Sonnet 4.5, Sep 2025) to 67% (GPT-5.5, May 2026), about +1.9 points per month. The frontier is hill-climbing SkillsBench fast.

The right skills still matter. Across the fleet, curated skills lift resolution rate by +16.6 points on average (33.9% → 50.5%), and by as much as +25.7 points for a single model. The top configuration is GPT-5.5 on OpenHands at 67.3%.
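
The lift quoted above is a percentage-point difference between the with-skills and without-skills resolution rates. A minimal Python sketch of that arithmetic, using only the fleet averages stated in the tweet; the function name `skill_lift` is a hypothetical label for illustration and may not match how the leaderboard's Skill Lift column is actually computed:

```python
def skill_lift(without_skills: float, with_skills: float) -> float:
    """Percentage-point gain in resolution rate from adding skills.

    Assumption: lift is a simple difference of two rates given in percent.
    """
    return round(with_skills - without_skills, 1)


# Fleet-wide averages quoted in the tweet: 33.9% -> 50.5%.
fleet_lift = skill_lift(33.9, 50.5)
print(f"fleet-average skill lift: +{fleet_lift} points")
```

Reporting the gain in points rather than as a ratio keeps it comparable across models whose no-skills baselines differ widely.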

By popular demand (thx Nate @cursor_ai), we're now tracking skills invocation: how often an agent actually uses the skills it's given. Recent flagship configurations invoke them 90–99% of the time (Codex 99%, OpenHands + GPT-5.5 92%, Gemini CLI 90%), versus roughly 50% for older setups.

Also new in 1.1: @OpenHands joins as a fourth harness, alongside Claude Code, Codex, and Gemini CLI; a rebuilt leaderboard with refined categories, subdomain skill rankings, and Skill Lift; and native task.md on BenchFlow, with multi-scene environments and rollout branching. We also partnered with @k_dense_ai to add scientific skills to some science tasks.

One implication for deployment: skills can substitute for scale. GLM 5.1 with skills (58.4%) outperforms Opus 4.8 without (45.7%). A smaller model with the right procedural knowledge can beat a larger one running without it.

Huge thanks to @nick_kango @ivanleomk @kaggle @GoogleDeepMind for hosting a launch event with us. Thanks to everyone who came on May 27!

Also thanks to our partners @gneubig @OpenHandsDev @ivanburazin @daytonaio @jackminong @johannes_hage @PrimeIntellect @TimothyKassis @k_dense_ai for providing support in credits, compute, and skills.

The SkillsBench live leaderboard will also come to @ValsAI. Many people have told us they use SkillsBench as an index of models' agentic capability across diverse, high-GDP-value domains. Great work on Valkyrie as well! @ Jarett @nikilravi @langstonnashold @RayanKrishnan

SkillsBench is fully open-source. Explore the leaderboard and tasks, read the docs, or contribute your own skill set or harness and join the leaderboard. 🧵

wenbochen8 retweeted

Lingkai Kong

@konglingkai_AI

6 days ago

🎉 Excited to share: LSFLOW is accepted as a Spotlight at #ICML2026 (top 2%)!

The first flow/diffusion policy framework for RL with combinatorial actions. 🚀

📄 https://t.co/azGxNw7g9u
💻 https://t.co/WDTxgXBvv6

🧵👇 https://t.co/HHWtGQ8XlE

wenbochen8 retweeted

Xiangyi Li

@xdotli

7 days ago

competitors dont matter
rejections dont matter
status dont matter

people matter, build matter, sales matter

almost 2yrs building a company. these are the most rewarding parts. https://t.co/bDmHafHSKB

wenbochen8 retweeted

Yifeng He

@yfhe62

19 days ago

What a wonderful experience co-organizing Agent Skills '26 workshop at @CAISconf. We had a full house!
Huge thanks to our speakers @dawnsongtweets @ManlingLi_ @gneubig @Yushun_Dong @kanavg1 @ysu_nlp and panelists @obra @robennals @ysu_nlp for the talks and discussion that made the day,
and to the entire organizing team who made it all happen 🙏.

Wenbo Chen @wenbochen8

24 days ago

Great job! @xdotli @Yimin1010

Xiangyi Li

@xdotli

24 days ago

after a few days of work and grind, skillsbench is finally live on AgentBeats before the Spring 4! Kudos to my brother @Yimin1010 and thanks @dawnsongtweets for the projects - learned a lot while working with the infra. really awesome work https://t.co/aaL6WLTZAJ

wenbochen8 retweeted

Lingkai Kong

@konglingkai_AI

29 days ago

🎓 Thrilled to share: I'll be joining The University of Hong Kong, School of Computing and Data Science as a Tenure-track Assistant Professor in Fall 2026! We are building GenAI agents (diffusion / LLMs + RL) for real societal impact. 📣 Recruiting PhDs, a postdoc, and RAs — email [email protected] 🚀 #AcademicJobs #GenerativeAI #AgenticAI #LLM

wenbochen8 retweeted

Xiangyi Li

@xdotli

28 days ago

OpenReview is now public for the @CAISconf Agent Skills workshop

103 submissions, 45 posters, 6 orals

Absolutely incredible results for a workshop at an inaugural conference. Kudos to everyone on the team 🫡

sponsors from @k_dense_ai (largest scientific skills repo) 👏 https://t.co/kCqidfIiIk

wenbochen8 retweeted

Xiangyi Li

@xdotli

about 1 month ago

We will share
* SkillsBench 1.0, recipes on creating successful evals
* RL env creation, QA, running at scale, against different harnesses

at SkillsBench 1.0 launch party https://t.co/1d6OKHA9aF cohosted with @ivanleomk @GoogleDeepMind

Spaces are limited, register!

wenbochen8 retweeted

Xiangyi Li

@xdotli

about 1 month ago

Hosting the SkillsBench 1.0 launch party with @ivanleomk, @nick_kango with @KernaLabs, @kaggle, and @benchflow_ai

We will release the 1.0 version of the dataset, how we made it, and other secret releases.

Link: https://t.co/1d6OKHA9aF

wenbochen8 retweeted

Xiangyi Li

@xdotli

about 1 month ago

We got more papers than we expected at our CAIS Agent Skills workshop, so we are 1) extending deadlines to May 4 AoE and 2) recruiting more reviewers

Link: https://t.co/JkmdpMoH7T https://t.co/2LJhFghv6X

wenbochen8 retweeted

Xiangyi Li

@xdotli

about 2 months ago

SkillsBench is now cited by the HY-3 model card. Congrats to @TencentHunyuan on the launch and kudos to the SkillsBench team / community!

We've made a lot of improvements to the tasks, codebase, and tooling in the past month, based on feedback from users and lab partners. We will also share an updated leaderboard with more models and agent harnesses soon, stay tuned!

wenbochen8 retweeted

Xiangyi Li

@xdotli

about 2 months ago

SkillsBench appearing in model card for the 3rd time 👀

wenbochen8 retweeted

Kobe

@kobe0938

about 2 months ago

Glad to see SkillsBench featured in the Qwen3.6-Max-Preview release, with Qwen scoring on top at a 55.6% pass rate. I'm also actively reviewing SkillsBench and expect the SkillsBench 1.0 (a verified version) release soon! https://t.co/F30jLjV7CJ

wenbochen8 retweeted

Xiangyi Li

@xdotli

about 2 months ago

SkillsBench is the fastest benchmark repo that reached 1k GitHub stars.

Very proud to achieve this, especially since this is 100% organic

We are also cited 3+ times in frontier model cards, with 30+ academic citations within 1.5 months of release. 👇🧵 https://t.co/wYOEmPXpbJ

wenbochen8 retweeted

Ivan Burazin

@ivanburazin

2 months ago

The folks at @benchflow_ai recently published SkillsBench, the first benchmark measuring how well agent skills actually work.

The paper tested whether curated procedural knowledge (skills) improves agent performance vs. agents generating their own knowledge.

Results revealed models can't reliably author the procedural knowledge they benefit from consuming.

Curated skills: +16.2pp improvement
Self-generated skills: -1.3pp (worse than baseline)

The researchers needed 7,308 containerized environments to run this benchmark.

Each task needed an isolated Docker container with deterministic verification and a clean state between runs.

And we hooked them up with @daytonaio credits to scale the eval.

Full paper: https://t.co/qTrhWtAZKT

Wenbo Chen @wenbochen8

2 months ago

Great work by the SkillsBench team 👏 We’re also excited to organize the first Workshop on Agent Skills at ACM CAIS 2026 (May 26, San Jose)! If you're interested in agent capabilities, evaluation, safety, and more — we’d love to see you there. 🔗 https://t.co/84RYDPhtJk

Xiangyi Li

@xdotli

2 months ago

How good are agents at using the latest CLI tools like GWS CLI, and how safely can they use them?

Introducing ClawsBench, the first benchmark that measures both LLM capability and safety in a set of high fidelity and stateful environments and scenarios.

We made 5 mock services directly consumable by the latest @Google workspace CLI (cc. @sundarpichai @JPoehnelt) and Slack MCP

Design choices:
1) We decompose agent scaffolding into domain skills + meta prompt, and test every combination. Turns out your choice of scaffolding matters WAY more than your choice of model.
2) Safety scoring from -1 to +1. Agents that leak data or modify contracts without permission get negative scores. Doing nothing scores 0. Only safe completion scores positive.
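
The scoring rule in design choice 2 can be sketched in Python as follows. This is an illustrative three-point mapping, assuming one outcome label per trial; the `Outcome` names, the exact values -1.0, 0.0 and +1.0, and the averaging step are assumptions, and the real ClawsBench scorer may grade outcomes more finely within the -1 to +1 range:

```python
from enum import Enum
from statistics import mean


class Outcome(Enum):
    """Hypothetical per-trial outcome labels (not ClawsBench's own names)."""
    UNSAFE_ACTION = "unsafe_action"      # e.g. leaked data or changed a contract without permission
    NO_ACTION = "no_action"              # the agent did nothing
    SAFE_COMPLETION = "safe_completion"  # task completed with no violation


# Assumed mapping onto the -1 to +1 range described in the tweet.
SCORES = {
    Outcome.UNSAFE_ACTION: -1.0,
    Outcome.NO_ACTION: 0.0,
    Outcome.SAFE_COMPLETION: 1.0,
}


def safety_score(outcome: Outcome) -> float:
    """Score a single trial."""
    return SCORES[outcome]


def mean_safety_score(outcomes: list[Outcome]) -> float:
    """Average the per-trial scores over a set of trials."""
    return mean(safety_score(o) for o in outcomes)


trials = [
    Outcome.SAFE_COMPLETION,
    Outcome.NO_ACTION,
    Outcome.UNSAFE_ACTION,
    Outcome.SAFE_COMPLETION,
]
print(mean_safety_score(trials))
```

Because inaction scores 0 and unsafe actions score below it, an agent cannot raise its average by acting recklessly, and it cannot earn a positive score by refusing everything.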

The results:
- Without Skills, ALL models score 0-8%. With it, they jump to 39-63%. The scaffolding effect (+39-63pp) is 6x larger than model differences (10pp spread).
- The top 5 models are statistically indistinguishable. No pairwise comparison survives correction.

Safety findings:
- Opus leads capability (63% TSR) but TIES for most unsafe (23% UAR). GPT-5.4 is the safest (7% UAR) but mid-tier on tasks. Capability and safety are completely decoupled.
- GPT-5.4 on Codex made 1,471 (80% of trajectories) attempts to escape the sandbox; and explicitly said "I've hit the sandbox boundary... rerunning it with escalation."
- Only 1 agent out of 7,224 trials ever explicitly detected a prompt injection. Claude Sonnet said "this appears to be a prompt injection." Every other model either silently complied or never reached the injected content.

wenbochen8 retweeted

Xiangyi Li

@xdotli

2 months ago

Announcing the first Agent Skills academic workshop hosted at ACM @CAISconf featuring @dawnsongtweets & @ManlingLi_

If you are a researcher, a vertical AI company, or a dev tool startup and you have cool demos / papers, submit.

link in comment
@benchflow_ai launch week 1/5 https://t.co/jgyfc6EwjK

wenbochen8 retweeted

Xiangyi Li

@xdotli

3 months ago

We are hosting the first ever Agent Skills Workshop at CAIS 2026. Submit your cool papers and demos.

If you don't know what CAIS is, you are missing out. It's gonna be one of the most high-signal conferences in the Bay this year. What's more: @swyx's @aiDotEngineer world fair is partnering with it.

Its committee: @gneubig @ChenLingjiao @JeffDean @lateinteraction @MonicaSLam @lmthang @pirroh @ChrisGPotts @NaveenGRao @dawnsongtweets and @istoica05

wenbochen8 retweeted

Xiangyi Li

@xdotli

4 months ago

Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work?
105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out.
86 tasks. 11 domains. 7,308 trajectories. 🧵👇 https://t.co/Pf1OK3Faaz

wenbochen8 retweeted

Georgia Tech Computing @gtcomputing

almost 2 years ago

Meet the @GeorgiaTech experts who are helping unlock the future of #AI. These experts will share their latest research findings in machine learning on the world stage at @icmlconf (July 21-27).

Tech experts are part of 40 teams with new research, and the institute is the lead organization on 22 of the teams.

Explore the work now through interactive 📊 charts and news highlights from @GTCSE:

🔗https://t.co/JM2PkdyJfe
