i have low conviction on model routers - very open to changing my mind but this is a snapshot of my current thoughts
- i don't think it's good to not be aware of what model you're using. coding with LLMs is a skill you develop and getting a feel for models is part of that
- people (at scale) don't have this skill right now which is why a lot of companies are complaining that people are using expensive models for dumb things. a model router promises to solve this without the user having to do anything but i think the issue is missing feedback loops to the user. id rather we figure out how to help users get smarter
- i dont even know how much you can model route when factoring in things like prompt cache. only so much you can do
- their effectiveness is a bit exaggerated by the same dynamic that's impacting everything AI. so many companies desperately searching for opportunities and trying anything. model routing is the one thing models labs cannot do so everyone is jumping on it
We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top
DeepSWE, built by @datacurve, writes its tasks from scratch rather than adapting them from public GitHub issues or pull requests, so no model has seen the solutions during training. That matters because SWE-Bench Pro, the benchmark it replaces in our Coding Agent Index, had grown gameable, with some models recovering the fix from the repository's commit history instead of solving the task.
The swap reorders the index: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. Claude Code with Fable 5 (max), which enters directly on the refreshed index, leads at 77. SWE-Bench Pro had been flattering some combinations and penalizing others.
More below.
FYI Claude Code is mostly a vibe-coded product (as they say, 100% written by Claude)
It's the worst harness for Opus 4.6 among ANY harness on Terminal-Bench 2
NVIDIA IS LITERALLY GIVING AWAY FREE AI INFERENCE
I literally set it up in 5 minutes and couldn't believe it was free
DeepSeek, MiniMax, Kimi, GLM, Llama - all on NVIDIA's DGX Cloud via clean OpenAI-compatible API.
Setup in 5 min:
→ https://t.co/2zMHb4Q8zV → grab API key
→ base_url = https://t.co/zUbPzFZA7J
→ drop it into any OpenAI SDK
We've been using it. Yes, it slows down under heavy load. Yes, free tier has limits.
But for solo devs, indie hackers, and students learning AI engineering?
This is the best free playground that exists right now.
Stop paying $20/mo to experiment. Use this first.
"girl i'm running OpenClaw on 2 Mac studios and 4 Mac minis running 3 local models and 8 different agents that I constantly monitor with a cute monitoring dashboard, i need 4 screens just to look at it."
"OK BUT WHAT THE FUCK ARE YOU USING IT FOR?"
Introducing Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs.
Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.
Muse Spark is available today at https://t.co/wHkMPH82ZH and the Meta AI app. We’re also making it available in private preview via API to select partners, and we hope to open-source future versions of the model.
Learn more: https://t.co/PloE9q5x96