The agent skills hackathon from @benchflow_ai @xdotli is a great hands-on way to learn the importance of writing good skills before Kaggle's 5 Days of AI Agents event in a couple of weeks. $20K in prizes! https://t.co/pDw85eW2FE
When an AI agent succeeds, was it the model or the skill it was given?
Launching today with @xdotli and @benchflow_ai: the BenchFlow AI Agent Skills Community Hackathon. Build skills that lift agent capability without crossing safety boundaries.
Kicking off Agent Skills '26 at @CAISconf with a full room listening to the awesome 'Building Organizational Memory' by Prof. @gneubig
Also kudos to @OpenHandsDev for supporting the experiments at SkillsBench 1.1! Blog post soon
Great contribution to this field by adding richer domains and skills to agentic evals curated by experts @harvey
icymi you can run this benchmark with any agent using @benchflow_ai
Excited to co-host the @GoogleDeepMind Enterprise Build Day event with @agihouse_org @AlexaOrent on Coding Agents and Open Source and Frontier!
Join us on May 30th and build!
https://t.co/ZhR0KLsfld
mine open source tasks to curate your own eval set and environments
hillclimb for your 1) latent space (models) and 2) memory space (skills and agents.md)
releasing previews to benchlabs
dm / reply for beta access! pretty excited about what you can achieve in creating personal evals that have high signal. kudos to the @benchflow_ai community for making this!
@Yimin1010 @bingran_bry @kywch500
OpenReview is now public for the @CAISconf Agent Skills workshop
103 submissions, 45 posters, 6 orals
Absolutely incredible results for a workshop at an inaugural conference. Kudos to everyone on the team 🫡
sponsors from @k_dense_ai (largest scientific skills repo)
it's done. codex subscription is supported in @benchflow_ai in @daytonaio sandboxes
evaluate + train agents and skills using benchflow with your subscription starting now
made by creators of skillsbench. it's good. try it
repo link 🧵
> new benchmark release
> programbench by swebench creators
> general-agents by primeintellect
> this guy has loved benchmarks since 2024.
> passion code until late night to try it out with configs
> he shares how you can have the fun without setting up
try: https://t.co/tlnD6BWKJN
Run ProgramBench by @jyangballin @OfirPress @KLieret with any agent you want with @benchflow_ai
SWE-Bench is my starting point for running and learning about benchmarks. My first principles for a good benchmark: it should 1) reflect or predict how agents or models are used in real life and 2) be challenging for SOTA agents at the time of release.
SkillsBench found massive success because it predicted something fundamental: agents will be deployed heavily in other domains. Remember the famous bar charts by Anthropic? We got there earlier than that. Another thing it got right is that people will use skills to enable that deployment. Similarly, SWE-Bench is a good example because it predicted agentic coding. Terminal-Bench is a good example of showcasing the power of a terminal-based harness. The recently launched ProgramBench is interesting because it aims to predict agents generating whole repos from specs.
In ProgramBench's case, I heard people wanted to 1) customize the agent harness, 2) customize initial prompts, and 3) customize verifiers. All of these are doable now in benchflow.
Introducing @harvey LAB in benchflow-ai/benchmarks
Skills have significantly increased agent deployment in diverse domains outside of coding and in more complex environments outside of the terminal.
Kudos to Harvey for an amazing open benchmark that demonstrates this 🧵
SkillsBench being mentioned everywhere in the bay now. thx @ivanleomk @kobe0938
We just merged our 94th task and will release version 1.0 of the dataset on 5/27
Big news ahead. Stay tuned
You can study every great golf swing & watch hundreds of instructional videos, but until you're on the range practicing reps, you won't actually learn. The same is true for AI.
To go from chatbot to capable agent, models need a different kind of training called reinforcement learning. It's how agents learn by running thousands of tasks in simulated environments, getting scored, & improving with every iteration.
The demand for these high-quality training environments is exploding. Frontier labs including @AnthropicAI, @OpenAI, @GoogleDeepMind, & @xai are spending billions to build them. Enterprises are just getting started.
The companies building this infrastructure today are in their early days, but they're laying the groundwork for how all AI will be trained in the near future.