That’s it for now. We’d really appreciate your feedback on what you like and what could be better.
Note that training support is a research preview for now, and CPU capacity may be limited based on demand. Please bear with us if we need to scale up our clusters!
Learn more about our vision below:
https://t.co/40hSMvl59B
Real intelligence isn’t just about getting the right answer, but most importantly, navigating a complex world that keeps changing.
Our Complex Worlds Hackathon was a small step toward that shift.
What stood out wasn’t just technical strength, but intent. Builders chose hard, meaningful problems, such as robotics for elder care and better nursing systems, fully aware that they are difficult but worth solving.
After the event, people came up to me saying: “I knew it was tough, I knew it may not win the prize, but I want to work on it.”
That’s what stayed with me: brilliant minds, grounded in humanity. The future of true intelligence.
🌍 Last month we hosted the Complex Worlds Hackathon in London.
Participants built an impressive range of environments, spanning synthetic game pipelines, arable farm management, dynamic vehicle routing, hospital triage, robotics, cybersecurity, and more.
Congrats to our winners, Julie Huang and Khalid A!
This Saturday, we’re hosting the Complex Worlds Hackathon in London, in partnership with @join_ef and @airstreet. The response from the community has been incredible.
But this isn’t just a hackathon for engineers.
It’s about a deeper question: how intelligence actually develops.
Today’s AI is powerful, but often short-sighted. It struggles with long-term planning, adapting to change, and operating in messy, real-world conditions. That’s because intelligence doesn’t emerge from static questions. It emerges from interaction, feedback, and experience, from the environments we place agents in.
This is why at @GenReasoning, we’re focusing on building long-horizon reinforcement learning environments: worlds where agents must act over hundreds or thousands of steps, adapt to non-stationarity, and develop capabilities that don’t show up in short tasks.
If we want AI that can truly operate in the real world - in science, business, creativity - we need to rethink the environments we train it in. That’s what this weekend's hackathon is about.
The next leap in AI isn’t just bigger models; it’s better environments. Intelligence isn’t built from a single answer; it’s built from experience.
AI seems very smart these days, but can it actually make good decisions over time? Can it adapt when the world changes? And what does the next frontier of AI capability really look like?
Last week, my AI company @GenReasoning released a research paper testing frontier models in a sports betting market environment. The result was striking: every model we tested lost money.
That sparked strong interest from the community, including a front-page feature in the @FinancialTimes. But for me, the real story is not the headline. It is what this result reveals about the next frontier in AI.
Today’s models can often analyse well in the moment. But real-world intelligence requires more than analysis. It requires judgment over time - the ability to adapt, manage risk, respond to changing conditions, and stay coherent across a long horizon.
That is why I sat down with my co-founders, @rosstaylor90, @latent_spaced, and @Kipothy, to talk through what we built, what we found, and what it means.
For me, this points to a much bigger question in AI: how do we build systems that do not just produce strong answers, but understand context more deeply, adapt over time, and make better decisions in the real world? Big, meaningful work depends on this - whether in drug discovery or space exploration. In each case, progress depends on evolving with scientific, social, and cultural context over time, not just getting one static answer right.
🌄 Beyond SWE: The Future of Long Horizon Environments
A discussion with our founders about KellyBench, and the need for new environments that require agents to adapt over time and act under uncertainty.
0:00:17 What is KellyBench?
0:02:10 Openendedness, non-stationarity and continual learning
0:03:40 Analytical versus operational capabilities
0:04:13 Why are models bad at KellyBench?
0:05:39 Situational awareness in dynamic environments
0:06:37 Feature stability and real-world non-stationarity
0:07:07 The power of context
0:07:34 "The first principle is that you must not fool yourself"
0:08:20 Machiavelli, fortuna and the ability to adapt to change
0:09:23 How can models improve on evals like KellyBench?
0:10:12 Limitations of KellyBench: data availability and market odds timing
0:11:44 Implications beyond quant finance / sports betting
0:13:26 Civilisations as the ultimate time horizon
0:14:05 Would a mega prompt / better elicitation do much better on the benchmark?
0:14:52 What new types of capability is GR excited about?
0:17:48 Taste and the ability to pursue long-term goals even if they aren't immediately rewarding
0:18:56 Deep learning as an example of a method that took a long time to bear fruit
0:19:25 Optimism about the future of AI
🌍🇬🇧 Complex Worlds Hackathon, London
We're hosting an RL environments hackathon in London on the 25th April, partnering with @join_ef and @airstreet
Come join us to build the next generation of RL environments that model complex worlds over long horizons!
https://t.co/xp2EGT5vv9
“AI models from Google, OpenAI and Anthropic lost money betting on football matches over a Premier League season, in a new study by @GenReasoning suggesting even the most advanced systems struggle to analyse the real world over long periods of time.
The “KellyBench” report released this week by AI start-up General Reasoning highlights the gap between AI’s rapidly advancing capabilities in certain tasks, such as writing software, and its shortcomings in other kinds of human problems.”
https://t.co/tePM7wqWun
Most real-world decisions don’t happen in a single moment. They unfold over time, under uncertainty, with real consequences.
Succeeding in the real world isn’t just about being right once. It’s about staying right as conditions change, managing downside, and making decisions over time.
This gap matters.
🎲 Introducing KellyBench, a new long-horizon evaluation for frontier models.
KellyBench evaluates models within a year-long sports betting market, a challenging and highly non-stationary environment.
Every frontier model we test loses money. They struggle to design ML strategies, manage risk, and adapt as the world changes.
Link and thread below.
🌍 Environments of the Week
It's been a week since we launched @OpenReward. Here are some of our favourite environments this week - some newly added, some heavily used, and some hidden gems.
First, the most used environment of the week is EndlessTerminals by @gandhikanishk with 830k+ tool calls.
https://t.co/ZpustB7zYK
🧵
Cool idea from @AashaySachdeva: unified environment interfaces like @OpenReward can enable LLM meta-learning research!
Pleased with where things are going, with more parts of the stack accessible publicly. For example, I now look forward to weekly @tinkerapi roundups as much as John Oliver episodes!
.@benchflow_ai started in 09/24 as unity for benchmarks and a hosting hub with early users from Stanford and Princeton, 4 months before R1 dropped.
We stopped after 9 months with 0 traction.
Today our latest work SkillsBench is #1 trending on @OpenReward. Game of eval is just on
Played around with this. This was exactly something I was looking for!
Tried a few things -
Creating an env - pretty dope! End to end, Claude was able to port it from GitHub with only minor issues. One-shotted @ShashwatGoel7's OpenForecaster env here. A lot more people should contribute their own envs. I hope they launch monetisation here.
Running a curator over env tasks during RL - When there are so many tasks, which one should you focus on? This is the auto-curriculum/meta-learning bit. I am still not able to beat random/pass@k, but I think the signals are there that over the long run this will help with diversity. This obviously follows a power law: every run will have top envs dominating, but I feel those 20% random tasks will give a big boost to any model.
Optimising the GEPA optimiser - GEPA is great but pretty slow. What if we could teach a model to do this better? This was on my list for so long; finally, with OpenReward, I was able to attempt it.
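For anyone wondering what the curator idea above could look like in practice, here is a minimal sketch in plain Python. It is not the author's implementation and does not use the OpenReward API. The `TaskCurator` class, the `p * (1 - p)` scoring rule, and the `random_fraction` parameter are all assumptions made for illustration. The only detail taken from the post is keeping roughly 20% of picks uniformly random.

```python
import random
from collections import defaultdict


class TaskCurator:
    """Hypothetical curator: favour high-signal tasks, keep a random slice."""

    def __init__(self, task_ids, random_fraction=0.2, seed=0):
        if not task_ids:
            raise ValueError("task_ids must not be empty")
        self.task_ids = list(task_ids)
        self.random_fraction = random_fraction  # share of uniformly random picks
        self.rng = random.Random(seed)
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)

    def record(self, task_id, solved):
        """Store the outcome of one rollout on a task."""
        self.attempts[task_id] += 1
        self.successes[task_id] += int(solved)

    def score(self, task_id):
        """Assumed signal: p * (1 - p), highest for tasks solved about half the time."""
        n = self.attempts[task_id]
        if n == 0:
            return 0.25  # unseen tasks get the maximum possible score
        p = self.successes[task_id] / n
        return p * (1.0 - p)

    def sample(self):
        """Pick the next task to train on."""
        # Exploration: a fixed share of picks ignores the scores entirely.
        if self.rng.random() < self.random_fraction:
            return self.rng.choice(self.task_ids)
        weights = [self.score(t) for t in self.task_ids]
        # If every task is always solved or never solved, fall back to uniform.
        if sum(weights) == 0:
            return self.rng.choice(self.task_ids)
        return self.rng.choices(self.task_ids, weights=weights, k=1)[0]
```

In a training loop you would call `sample()` to choose a task, run the rollout, then call `record()` with the result. Whether a scheme like this beats plain random sampling is exactly the open question the post raises.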
OpenReward serves hundreds of RL environments through a single API with autoscaled compute. Plug into Tinker to train agents on millions of tasks from anywhere.
https://t.co/sn5rSdamdl
rl environment companies have a version of the same problem as traditional human data vendors. the know-how is sold instead of compounding. labs receive environments, tasks, and verifiers directly. they can inspect the verifier design, extend the architecture to new domains, and eventually automate the creation of more complex and diverse environments without the original builder. just less obvious.
hosted environments could change this. labs get reward signals without ever receiving the docker image. IP stays protected. openreward is an exciting first step towards getting the best environment builders onto a platform where high-quality private environments can be accessed before they saturate, and builders keep what they build.
330+ environments, 4.5M+ tasks through one API is impressive. Love seeing native integration with Miles and Slime, makes spinning up RL experiments so much cleaner 🔥