meta and google looking to raise additional capital to fund their capex as FCF goes negative
compute resources for AI is pushing the hyperscalers past their limits as they want to win
these high cost will eventually be passed onto the consumer, secure local operations asap 👀
🧩@Gradient_HQ Puzzle Mastermind!
Play from now til June 6th 1AM EDT!
🏆 Top 10 Participants will receive Quiz Mastermind on Discord!
Rules:
- username must match DC username
- play as many times & anytime you want during event duration
./ puzzle here: https://t.co/TWMxUb6FTw
AI companies in the 2020s is what Dell and Intel was to the boomers back in the 90s
Dell did so well back in the day 2,700 employees became millionaires and nicknamed “Dellionaires”
AI is a appreciative asset, open source leading the way
TwinRouterBench is a new benchmark designed for step level routing in long horizon, multi turn agentic workflows. Differing from traditional routing benchmarks that focus on single prompt routing, TwinRouterBench evaluates how well a "router" can choose the right model for each individual step of a complex task.
It implements dual tracks evaluation between fast development and realistic testing:
Track 1: Static Track (Fast Offline Track)
• 970 router visible prefixes from 520 trajectory instances.
• Covers 5 diverse benchmarks: SWE-bench, BFCL, mtRAG, QMSum, and PinchBench.
• Each example comes with an execution verified target tier (cheapest sufficient model tier).
• Uses deterministic scoring (based on tier correctness, trajectory membership, and token cost) no LLM judges needed.
Ideal for: training routers, rapid iteration, and cheap offline evaluation.
Track 2: Dynamic Track (Live Validation Track)
• Full evaluation harness on SWE-bench Verified (500 tasks).
• Reports results on a 100 case heldout split (disjoint from static data).
• Router must choose a real model from a locked pool at every step.
• Measures real outcomes: Official task resolution success, Actual API spend (real dollars), Includes failure penalties for unresolved tasks
By providing both a Static (fixed) and Dynamic (flowing) track, TwinRouterBench solves the problem where a router looks good on paper but fails when the agent actually has to live with its choices.
TwinRouterBench is set for the agentic era where every step is measured in routing vs just one shot prompt testing. This benchmark targets the realism distortion by testing routing within the actual context of multi step, stateful agent trajectories.
M3 delivers. outperforms Opus 4.7 in BrowserComp.
MiniMax’s first 1M context model pretty much 5x context from its previous models of around 204,800
and it supports video input too, multimodal built in as core.. 🥴🎥
beautiful.
Claude Opus 4.8, the latest frontier from Anthropic is available on @commonstack_ai
Model is able to independently work for longer than its predecessors.
Same cost as Claude Opus 4.7 and better performance.
OpenAI is gearing up to release its first hardware device. A mobile device, likely to compete with Apple as they bought Johnny Ives sometime ago.
For a long time this has been part of the ecosystem they don’t have control over.
Would you buy one or trust it with your info?
“AI will replace all human jobs, everyone will be useless”
meanwhile many of the fear mongering labs selling this narrative are workforcemaxxing with their headcount.
there’s some truth to having repetitive work eliminated but alot of it is probably greatly exaggerated.
and agentic token usage growth expectations look like this btw.
if you cannot supply, the price goes vertical. bullish on local and distributed edge ai inference.
Gemini 3.5 Flash from Google is on @commonstack_ai!
This is Google’s most intelligent model for sustained frontier performance on agentic and coding tasks matching or surpassing many other models at a fraction of the cost.
Build and try out now!
Uncommonroute trained router matches Claude Opus 4.6 in SWE-bench Verified evaluation.
In TwinRouterBench you test realism agentic trajectories and the results are staggering:
Uncommonroute Trained vs Claude Opus 4.6
75/100 vs 74/100 (matched in resolution)
$25.66 vs $54.73 (53% cost saving with Uncommonroute trained)
Models cost more now with advanced reasoning and agentic tasks, time to save to get same quality at a better price.
AI compute doubling every couple of months since 2022.
The largest compute buildout being blocked by resource constraints and old infrastructure.
With current compute order backlog looks like many of it will be sitting idle til everything else catches up.
Multi turn harness, the construction pipeline of TwinRouterBench, designed to optimize the cost efficiency of LLM workflows.
The process:
It starts with a successful interaction (trace) generated by a high end "strong" model > It isolates the critical parts of the interaction to create more concise data points > attempts to swap out expensive model calls for cheaper ones (the 'l', 'm', 'mh', 'h' labels represent Low, Mid, Mid-High, and High tiers) > It runs these "downgraded" sequences through a Multi turn task harness. If the cheaper model still results in a successful task completion, the downgrade is accepted > The final result is a "verified tier label" for every single LLM call in a sequence, showing exactly where a cheap model is "enough" and where a powerful model is "necessary."
This benchmark provides High Fidelity Training Data, creating a dataset of "optimal routing" decisions. This data can be used to train specialized "Router" models that decide in real time which LLM to call for a specific prompt. Previously the standard benchmarks grade a whole conversation, this provides granular labels for individual turns within a complex task
Running every prompt through the most powerful model is too expensive, but using only cheap models leads to task failure. TwinRouterBench helps find the balance of actual work completion at the most effective pricing.
Traditional routing is easy for single questions but very hard for multi step agent workflows. By focusing on execution-based verification, you get a more experienced version of reality grounded towards truth.
there’s server racks sitting in warehouses that can’t be plugged in due to lack of power from a lot of the current buildout.
asynchronous compute from different availability will be the second efficiency layer on top of base hardware spec improvements.