Yifeng He @yfhe62 - Twitter Profile

Yifeng He

@yfhe62

about 12 hours ago

@demisama_ There is a paper in my field specifically discussing this lol https://t.co/C7KlJA9oOW

0

32

Yifeng He

@yfhe62

about 12 hours ago

@demisama_ At least focus on a category of tasks (or at least one task), and at least show some generalization across datasets 👀. I feel this is often limited by compute and labor.

1

0

119

yfhe62 retweeted

Xiangyi Li

@xdotli

about 15 hours ago

A big pain point in using AI benchmarks is encountering errors after their first release. Today, we're releasing SkillsBench 1.1, the first benchmark for how well AI agents use skills, now audited end to end and verified error-free. Prof. @dawnsongtweets joins 1.1 as advising author.

We worked through every task with several frontier labs to eliminate the errors in the previous version. We also added new tasks, moved the ones with external dependencies into a separate set so the core suite runs clean, and expanded coverage to more models.

Capability is climbing fast. The best with-skills resolution rate rose from ~36% (Claude Sonnet 4.5, Sep 2025) to 67% (GPT-5.5, May 2026), about +1.9 points per month. The frontier is hill-climbing SkillsBench fast.

The right skills still matter. Across the fleet, curated skills lift resolution rate by +16.6 points on average (33.9% → 50.5%), and by as much as +25.7 points for a single model. The top configuration is GPT-5.5 on OpenHands at 67.3%.

By popular demand (thx Nate @cursor_ai), we're now tracking skills invocation: how often an agent actually uses the skills it's given. Recent flagship configurations invoke them 90–99% of the time (Codex 99%, OpenHands + GPT-5.5 92%, Gemini CLI 90%), versus roughly 50% for older setups.

Also new in 1.1: @OpenHands joins as a fourth harness, alongside Claude Code, Codex, and Gemini CLI; a rebuilt leaderboard with refined categories, subdomain skill rankings, and Skill Lift; and native task.md on BenchFlow, with multi-scene environments and rollout branching. We also partnered with @k_dense_ai to add scientific skills to some science tasks.

One implication for deployment: skills can substitute for scale. GLM 5.1 with skills (58.4%) outperforms Opus 4.8 without (45.7%). A smaller model with the right procedural knowledge can beat a larger one running without it.

Huge thanks to @nick_kango @ivanleomk @kaggle @GoogleDeepMind for hosting a launch event with us. Thanks to everyone who came on May 27!

Also thanks to our partners @gneubig @OpenHandsDev @ivanburazin @daytonaio @jackminong @johannes_hage @PrimeIntellect @TimothyKassis @k_dense_ai for providing support in credits, compute, and skills.

SkillsBench live leaderboard will also come to @ValsAI. Many people have told us they use SkillsBench as an index to measure models' agentic capability over diverse, high-GDP-value domains. Great work on Valkyrie as well! @ Jarett @nikilravi @langstonnashold @RayanKrishnan

SkillsBench is fully open-source. Explore the leaderboard and tasks, read the docs, or contribute your own skill set or harness and join the leaderboard. 🧵

13

93

25

32

11K

Yifeng He

@yfhe62

2 days ago

The day your mom comes by

0

1

0

27

Yifeng He

@yfhe62

4 days ago

Why does Claude sometimes jump between Traditional and Simplified Chinese during inference?

0

30

Yifeng He

@yfhe62

7 days ago · Scotia

Not really a founder, but at the Founders Tree 🤣

0

1

0

32

yfhe62 retweeted

Xiangyi Li

@xdotli

8 days ago

the most fun part of the journey

0

21

5

2

2K

Yifeng He

@yfhe62

8 days ago

Some of the best ways to use the most powerful AI 🍨@TillamookDairy

0

22

Yifeng He

@yfhe62

9 days ago

https://t.co/alcbe9FZwW

0

3

1

0

255

Yifeng He

@yfhe62

10 days ago

Guys I just found the best place to work from home.

0

1

0

43

Yifeng He

@yfhe62

11 days ago

0

1

0

33

Yifeng He

@yfhe62

12 days ago

Suddenly started token maxing in the middle of a wine tasting 🫥

1

2

0

70

Yifeng He

@yfhe62

13 days ago

Any must-try coffee / tea places in Seattle? 👀

0

36

yfhe62 retweeted

Amber Liu

@JIACHENLIU8

21 days ago

So glad to be presenting our work “Agent-Native Research Artifact” at the very first Conference on Agentic AI Systems @CAISconf! Incredible to see researchers fly in from all over the world for it. As we’re building the systems for AI Scientists, our publication format is lagging behind. For hundreds of years, the paper PDF has been primarily designed to convince human reviewers with a polished story. Therefore, we end up stripping out the ‘messy’ intermediate thinking processes, the important but "trivial" details, and the failed attempts that actually drive discovery. We lose the true intellectual lineage of the work. To design the best media for research knowledge from first principles, we propose agent-native research artifacts to capture the full cognitive trajectory and the underlying human intuition. Let's rethink how we document and share science. Paper: https://t.co/aYIAQZ8cOq

0

67

10

13

5K

yfhe62 retweeted

Xiangyi Li

@xdotli

19 days ago

Agent Skills '26 workshop: if you missed it, here's a full 🧵👇

0

9

2

1

815

Yifeng He

@yfhe62

19 days ago

Shout-out to the hard work of the organizing team @wenbochen8 @xuandongzhao @kywch500 @Yimin1010 @kobe0938 @shenghan_zheng @yfhe62 Hao Chen, @Yushun_Dong @yanliu_usc @HanchungLee Yue Zhao, Emilio Ferrara, @dawnsongtweets, and the sponsors from @k_dense_ai. Thanks @wenbochen8 for hosting the workshop the whole day!

0

1

0

60

Yifeng He

@yfhe62

19 days ago

What a wonderful experience co-organizing the Agent Skills '26 workshop at @CAISconf. We had a full house! Huge thanks to our speakers @dawnsongtweets @ManlingLi_ @gneubig @Yushun_Dong @kanavg1 @ysu_nlp and panelists @obra @robennals @ysu_nlp for the talks and discussion that made the day, and to the entire organizing team who made it all happen 🙏.

1

7

3

0

808

Yifeng He

@yfhe62

19 days ago

As the chair of our submission and reviewing process with 100+ submissions and 150+ reviewers, I gained first-hand experience using OpenReview and scripting its APIs 🤣. I also had the honor of chairing the oral presentation session and moderating the panel session — and it was a real pleasure to meet all of our panelists and accepted oral presenters @zxlzr @zijun_wang2002 @richard_epsilla @JIACHENLIU8 @holzsec @ReshabhSharma01.

1

5

2

0

152

Yifeng He

@yfhe62

19 days ago

@E4poci Great job eve!

1

0

32
