We built AstaBench to give the field a shared, transparent way to measure whether AI can do rigorous scientific work.
We’re pleased to see adoption by @AISecurityInst via Inspect Evals and by @GenReasoning, which added an AstaBench task to OpenReward.
🎉 We're now supporting the Agent Data Protocol as a default agentic trajectory format.
Any trajectories you log to @OpenReward can be exported in the ADP format.
Thanks to @gneubig and @yueqi_song for the collaboration!
🧪 We’re experimenting with new features that allow for easier sampling with popular agentic harnesses.
Core use cases:
- Collecting diverse agentic midtraining data
- Evaluating the latest models on agentic environments
Try it out!
🔥🐴 Firehorse.
Run any model with any harness on any @OpenReward environment.
⚖️ Evaluate the latest models on environment endpoints.
🗂️ Collect agentic data for midtraining and SFT from open models.
🧪 Early experimental library. More support soon.
Link below.
🎲 Introducing KellyBench, a new long-horizon evaluation for frontier models.
KellyBench evaluates models within a year-long sports betting market, a challenging and highly non-stationary environment.
Every frontier model we test loses money. They struggle to design ML strategies, manage risk, and adapt as the world changes.
Link and thread below.
Recently, I integrated @OpenReward into SkyRL (@NovaSkyAI), including an example demonstrating training with @modal. To verify the code, I ran several experiments—which proved to be a highly enriching experience! 😋
https://t.co/4zyGhp08ZY
timelapse 27 :)
- submitted the Rust reasoning algo env to the meta rl hack (actually built a Python one first, then moved to the Rust one); created a Rust dataset of around 1000 problems, will take it to 2.5k next
- defined the whole reward logic (not optimal, I think) and designed the way validation works; will refine it & push to @PrimeIntellect & @OpenReward envs.
- have some other tasks as well, deadline is tomorrow so need to finish this
- this week was pretty rough, like peak locked in, so will chill & just relax for a few days
Introducing GLM-5.1: The Next Level of Open Source
- Top-Tier Performance: #1 in open source and #3 globally across SWE-Bench Pro, Terminal-Bench, and NL2Repo.
- Built for Long-Horizon Tasks: Runs autonomously for 8 hours, refining strategies through thousands of iterations.
Blog: https://t.co/hmyDe4Nel3
Weights: https://t.co/CuUjXcPKJD
API: https://t.co/fz6reja4fb
Coding Plan: https://t.co/Nk8Y98HNhU
Coming to https://t.co/WCqWT0qCQb in the next few days.
🌍 Environments of the Week
The theme this week... environments for science 👩‍🔬.
First up, LLM-SR Bench by @ParshinShojaee et al. is an environment for evaluating language model agents on scientific equation discovery tasks.
https://t.co/zzx4Hv46LS
🪐 Researcher Credits
We’re announcing researcher credits for OpenReward: helping researchers develop the next generation of environments and evaluations.
Read more and apply below.
https://t.co/MMl97BSqip
🌍 Environments of the Week
It's been a week since we launched @OpenReward. Here are some of our favourite environments this week - some newly added, some heavily used, and some hidden gems.
First, the most used environment of the week is EndlessTerminals by @gandhikanishk with 830k+ tool calls.
https://t.co/ZpustB7zYK
🧵
Cool idea from @AashaySachdeva: unified environment interfaces like @OpenReward can enable LLM meta-learning research!
Pleased with where things are going, with more parts of the stack accessible publicly. For example, I now look forward to weekly @tinkerapi roundups as much as John Oliver episodes!
Played around with this. This was exactly something I was looking for!
Tried a few things -
Creating an env - pretty dope! Claude was able to port it end to end from GitHub with only minor issues. One-shotted @ShashwatGoel7's OpenForecaster env here. A lot more people should contribute their own envs. I hope they launch monetisation here.
Running a curator over env tasks during RL - When there are so many tasks, which one should you focus on? This is the auto-curriculum/meta-learning bit. I am still not able to beat random/pass@k, but I think the signals are there: over the long run this will help with diversity. This obviously follows a power law, every run will have top envs dominating, but I feel those 20% random tasks will give a big boost to any model.
Optimising the GEPA optimiser - GEPA is great but pretty slow. What if we could teach a model to do this better? This was on my list for so long; finally, with OpenReward, I was able to attempt it.
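For the curator item above, here is a rough sketch of what mixing curated picks with random ones could look like. It is not the author's implementation: `TaskCurator`, `explore_frac`, the 0.5 prior, and the `p * (1 - p)` weighting are all hypothetical choices made for illustration, and reading "those 20% random tasks" as a 20% uniform-exploration share is an assumption. Rewards are assumed to lie in [0, 1].

```python
import random
from collections import defaultdict


class TaskCurator:
    """Hypothetical curator: mixes score-weighted picks with uniform random picks."""

    def __init__(self, task_ids, explore_frac=0.2, seed=0):
        if not task_ids:
            raise ValueError("need at least one task")
        self.task_ids = list(task_ids)
        self.explore_frac = explore_frac  # assumed share of purely random picks
        self.rng = random.Random(seed)
        # Running mean reward per task; unseen tasks start at 0.5 (assumed prior).
        self.mean_reward = defaultdict(lambda: 0.5)
        self.counts = defaultdict(int)

    def _weight(self, task_id):
        # Assumed heuristic: tasks solved about half the time get the most weight.
        p = self.mean_reward[task_id]
        return p * (1.0 - p) + 1e-6

    def sample(self):
        # With probability explore_frac, ignore the scores and pick uniformly.
        if self.rng.random() < self.explore_frac:
            return self.rng.choice(self.task_ids)
        weights = [self._weight(t) for t in self.task_ids]
        return self.rng.choices(self.task_ids, weights=weights, k=1)[0]

    def update(self, task_id, reward):
        # Incremental mean; the first observation replaces the prior.
        self.counts[task_id] += 1
        n = self.counts[task_id]
        self.mean_reward[task_id] += (reward - self.mean_reward[task_id]) / n
```

In a training loop you would call `sample()` to choose the next task and `update(task_id, reward)` after each rollout. Setting `explore_frac=1.0` recovers the uniform-random baseline the curator is being compared against.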
.@benchflow_ai started in 09/24 as a Unity for benchmarks and a hosting hub, with early users from Stanford and Princeton. That was 4 months before R1 dropped.
We stopped after 9 months with 0 traction.
Today our latest work SkillsBench is #1 trending on @OpenReward. Game of eval is just on
OpenReward serves hundreds of RL environments through a single API with autoscaled compute. Plug into Tinker to train agents on millions of tasks from anywhere.
https://t.co/sn5rSdamdl