@scaling01 better than gpt-5.5 on swe bench pro, ehh?
also better than opus on svg-bench, while being much smaller than opus. big if true, and if the capabilities carry over and it's not just another "bench-maxxed" model.
BrowserComp looks promising for computer use agents
@yacineMTB i did a similar thing, albeit with obsidian markdown files as the source of truth for website content. yours is definitely much cooler.
https://t.co/qdc60gpbhW
@scaling01 Claude Code ain't even that great of a harness to begin with + these models, especially deepseek, really shine when you turn them into a horde of subagents trying to solve a problem, simply because they are so cheap to run
@scaling01 Mythos 90% seems a stretch tho (says an ignorant being who is yet to try out mythos and is basing his views on what his friends using it said)