BenchFlow

5 days ago

play poker with agents @benchflow_ai incredible work by @devfun!

2

12

3

1

4K

benchflow_ai retweeted

meg.ai 🇨🇦

@MeganRisdal

6 days ago

The agent skills hackathon from @benchflow_ai @xdotli is a great hands-on way to learn the importance of writing good skills before Kaggle's 5 Days of AI Agents event in a couple of weeks. $20K in prizes! https://t.co/pDw85eW2FE

1

16

4

3

1K

benchflow_ai retweeted

Kaggle @kaggle

7 days ago

When an AI agent succeeds, was it the model or the skill it was given? Launching today with @xdotli and @benchflow_ai — the BenchFlow AI Agent Skills Community Hackathon. Build skills that lift agent capability without crossing safety boundaries.

3

56

9

37

12K

7 days ago

First Skills Uplift competition. Join on Kaggle!

Kaggle @kaggle

7 days ago

When an AI agent succeeds, was it the model or the skill it was given? Launching today with @xdotli and @benchflow_ai — the BenchFlow AI Agent Skills Community Hackathon. Build skills that lift agent capability without crossing safety boundaries.

3

56

9

37

12K

0

1

0

200

benchflow_ai retweeted

8 days ago

@Yushun_Dong another amazing talk by professor @ysu_nlp on Language as a Scaffold for Agents Intelligence

1

8

2

1

755

benchflow_ai retweeted

8 days ago

Another timely and 🔥 talk by @Yushun_Dong

1

3

1

591

8 days ago

visit San Carlos at the CAIS conference! floor -1, turn left at the elevator. amazing talks happening!

8 days ago

Kicking off the Agent Skills 26' @CAISconf with a full room of listeners of the awesome 'Building Organizational Memory' by Prof. @gneubig

Also kudos to @OpenHandsDev for supporting the experiments at SkillsBench 1.1! Blog post soon 🔜 https://t.co/HioVAwrmPi

3

43

6

3K

0

162

benchflow_ai retweeted

8 days ago

SkillsBench is now among the top environments on @OpenReward with 32k tool calls!

0

13

3

4

1K

benchflow_ai retweeted

8 days ago

Great contribution to this field by adding richer domains and skills to agentic evals curated by experts @harvey

icymi you can run this benchmark with any agent using @benchflow_ai https://t.co/KZWnzxZ6yG

1

21

3

12

3K

10 days ago

open source + the frontier 🙋‍♂️

better model = open weights + envs + compute + evals

build your envs and evals with @benchflow_ai

10 days ago

Excited to co-host the @GoogleDeepMind Enterprise Build Day event with @agihouse_org @AlexaOrent on Coding Agents and Open Source and Frontier!

Join us on May 30th and build!
https://t.co/ZhR0KLsfld https://t.co/bO49oi7N7H

3

23

3

5

5K

4

0

1K

benchflow_ai retweeted

Niels Rogge @NielsRogge

10 days ago

@xdotli @benchflow_ai @Yimin1010 @bingran_bry @kywch500 Looks really cool, should we integrate evals into the respective papers and leaderboards on https://t.co/tOqTY2ZA6h?

1

4

3

0

1K

10 days ago

mine open source tasks to curate your own eval set and environments. hillclimb for your 1) latent space (models) and 2) memory space (skills and agents.md)

10 days ago

releasing previews to benchlabs

dm / reply for beta access! pretty excited about what you can achieve in creating personal evals that have high signal. kudos to the @benchflow_ai community in making this!

@Yimin1010 @bingran_bry @kywch500 https://t.co/7IPjkCGGtt

6

18

5

3

6K

1

3

0

720

11 days ago

we are officially at 500 followers 🥳

1

3

0

2K

15 days ago

skills x evals

15 days ago

OpenReview is now public for the @CAISconf Agent Skills workshop

103 submissions, 45 posters, 6 orals

Absolutely incredible results for a workshop at an inaugural conference. Kudos to everyone on the team 🫡

sponsors from @k_dense_ai (largest scientific skills repo) 👏 https://t.co/kCqidfIiIk

1

16

7

5

2K

0

529

15 days ago

agents x tools (mcps, skills) x sandboxes

15 days ago

it's done. codex subscription is supported in @benchflow_ai in @daytonaio sandboxes

evaluate + train agents and skills using benchflow with your subscription starting now

made by creators of skillsbench. it's good. try it

repo link 👇🧵 https://t.co/vtN8QFCTxl

1

9

1

929

0

233

16 days ago

> new benchmark release
> programbench by swebench creators
> general-agents by primeintellect
> this guy loved benchmarks since 2024.
> passion code until late night to try it out with configs
> he shares how you can have the fun without setting up

try: https://t.co/tlnD6BWKJN

16 days ago

Run ProgramBench by @jyangballin @OfirPress @KLieret with any agent you want with @benchflow_ai

SWE-Bench was my starting point for running and learning about benchmarks. My first principles of a good benchmark are that it should 1) reflect or predict how agents or models are used in real life and 2) be challenging for SOTA agents at the time of release.

SkillsBench was a massive success because it predicted something fundamental: that agents will be deployed heavily in other domains. Remember the famous bar charts by Anthropic? We got there earlier than that. Another thing it got right is that people will use skills to enable that deployment. Similarly, SWE-Bench is a good example because it predicted agentic coding. Terminal-Bench is a good example of showcasing the power of a terminal-based harness. The recently launched ProgramBench is interesting because it aims to predict agents generating whole repos from specs.

In ProgramBench's case, I heard people wanted to 1) customize the agent harness, 2) customize initial prompts, and 3) customize verifiers. These are all doable now in benchflow.

0

21

0

10

2K

0

1

356

benchflow_ai retweeted

18 days ago

Introducing @harvey LAB in benchflow-ai/benchmarks

Skills have significantly increased agent deployment in diverse domains outside of coding and in more complex environments outside of the terminal.

Kudos to Harvey for an amazing open benchmark that demonstrates this 👇🧵 https://t.co/7tdWbyNlPl

1

32

4

19

4K

benchflow_ai retweeted

26 days ago

SkillsBench being mentioned everywhere in the bay now 🔥🔥 thx @ivanleomk @kobe0938

We just merged our 94th task and will release the 1.0 version of our dataset on 5/27

Big news ahead. Stay tuned 👀 https://t.co/I9QTSUnq20

2

28

3

5

2K

about 1 month ago

👋