Gemini 2.5 Pro #1 across ALL categories, tied #1 with Grok-3/GPT-4.5 for Hard Prompts and Coding, and edged out across all others to take the lead
There's finally a proper benchmark for @openclaw model performance.
I just found that @kilocode built an open source benchmark that tests models across 23 real-world openclaw tasks like scheduling meetings, writing code, triaging email, etc.
gpt-5.3-codex is sitting at number one. tbh that matches my experience.
gemini 3 flash in second place. didn't expect that.
curious to see where gpt-5.4 will land on this.
@agihippo Well, the issue is, I'd like to do a coffee place like in Italy, hole in the wall, 1 euro caffe normale at the bar stand, maybe occasional cappuccino, but there is no place for it in the Bay Area.
AI folks radically overestimate how much LLMs help for practical bio lab work and so get weirdly fixated on biorisk scifi scenarios. Lab work is gated by a researcher's personal pain tolerance, relentlessness, and a huge body of tacit knowledge passed down by apprenticeship.