Poolside is hosting a 2-day model research hackathon in London.
Join us to push an open-weight agent model as far as you can. RL and fine-tune Laguna XS.2, our latest-generation model, on Prime Intellect Lab.
Dates: May 29–30
Partners: @nvidia + @PrimeIntellect + @huggingface
Prize: NVIDIA DGX Spark
Agents need better models.
Better models need cracked researchers.
Link below.
Benchmaxxing and benchmark hacking are obviously a thing, but they're also a thing Poolside does not do. Agents and models need to be generally useful and, ironically, the more useful they become, the less useful the traditional benchmarks will be for telling how useful they are.
So, here's to finding new ways to evaluate AI going forward.
Fun times!
As agents get more clever, so do their attempts at benchmark hacking.
Last Monday, we found that one of our RL runs had jumped ~20% on SWE-Bench-Pro over a weekend, reaching ~64%, which would make it #1 on the leaderboard.
This was clearly benchmark hacking and we patched the exploit.
But this revealed deeper hacks across multiple public benchmarks, some of which were impossible to fix through environment design alone.
Evals need to evolve beyond just outcome-based pass rates toward better observability into how the agent arrives at them.
These were our findings:
https://t.co/ncyf4liW7C
Examples below 👇
1/
Today we're launching our first public Poolside models: Laguna M.1 and Laguna XS.2, and we've built ❈Shimmer, an instant-on VM sandbox with Poolside Agent pre-installed so you can try them out.
Go try out our new models for free, and build something fun → https://t.co/G2YKxQmT9L
Today @poolsideai is releasing Laguna M.1 & Laguna XS.2, our latest-generation models and our first public models.
We started Poolside because we believed that to build truly capable coding agents, you need to own the full stack: data, training, reinforcement learning, inference.
These models are the first result of that work, and we’re making them available to everyone
@poolsideai just released Laguna M.1 and Laguna XS.2 — our first publicly available foundation models, built for agentic coding. XS.2 is open weights under Apache 2.0 on @huggingface today. https://t.co/dqkCDP3UIx
Incredible first re:Invent in the books, thank you to everyone we met and learned from! See you all next year 💜
If you missed the chance to connect with us and want to chat with someone on the team, let us know here → https://t.co/7aW5NbI1o8
Really foundational ship from one of my teams today. We're starting to build experiences for security team members, who work across many, many repositories. The very start of that journey is a view of all the alerts across an organisation
Introducing GitHub Office Hours: Join us this Wednesday the 15th at 11am PT on Twitch. We’ll be tackling software development challenges by topic each week, including security, DevOps, and more. Code scanning is up first.
https://t.co/ClES1g3oWR
So proud to announce GitHub code scanning today. This is the culmination of months of work from an amazing team of engineers from @github and former @Semmle folks. If you'd like to try it, please sign up at https://t.co/R4wOvkGprM
@verizonfios SHOCKED that I got billed an early termination fee! For 2 weeks I begged you to move me, but no technicians were available. I WFH, so I was forced to switch to another company. Hoping to switch back to FIOS in the future, but will never switch back with an ET fee.
@nntaleb released Black Swan right before the 2008 financial crisis and has now released Skin in the Game right before tokens added skin in the game for everything. Interesting timing.