@communicating @roocode @Zai_org @openrouter Here are the programming exercises for our evals suite: https://t.co/IfWVZeJlPs
Full results here: https://t.co/NEeYO8Ah4g
Your take isn't spicy enough!😂
My sense is that there are trade-offs with all of these tools, and in the long run I wouldn't bet against giving these models more tools and letting them judge which is most appropriate given the constraints.
It would be nice to have some eval data backing these takes (we're working on that).
@soyhenryxyz @GosuCoder This is amazing. The next big push on evals is going to be testing various orchestration configurations and showing some data that backs our intuition about them, so I'd love to help.
@GosuCoder @bindureddy I just updated the Roo Code evals - https://t.co/NEeYO8Ah4g - o4 Mini (High) doesn't come near the top tier of coding models, but the price-to-performance ratio is reasonable.
@cdossman @roocode I think it's on pace to be slightly below Sonnet 3.7 and Gemini 2.5. The price-to-intelligence ratio of 4.1 mini seems to be trending really well...
@mattpocockuk The Aider polyglot benchmarks are a good start. I wired up the Cursor-like product I’m working on to run the benchmarks and see how it compares to the publicly available scores.
Amazingly, the chyron is not the most foolish thing about this picture.
To get ahead of a potential refugee crisis caused by great suffering in Central America, it would make sense to use our resources to help reduce that suffering.
This is self-defeating.
EXCLUSIVE: DHS test of steel prototype for border wall, Trump's preference, showed it could be sawed through.
We've obtained a never-before-seen photo.
Our report, with @JuliaEAinsley. https://t.co/VfRTSt36mr
Let me get this straight: Trump wants the federal court to postpone indefinitely hearing a case claiming that he is illegally profiting from his Washington hotel, because the government shutdown, created by Trump, prohibits his attorneys from working.
https://t.co/VuZejVDCoP