Contrx ./

4 days ago

HexxRL: max-bidding infra is unsustainable when there’s limited electricity and other resources.

async and edge should be the next efficiency-based bet to serve https://t.co/Pr0qU18mzJ

389

Contrx ./ @contrx16

5 days ago

@wright_ban77872 @HexxRL @Gradient_HQ @Chupaa_mw You're sure to make the puzzle top ten again

contrx16 retweeted

6 days ago

HexxRL: every tech cycle delivers 10x more compute

the top blasting is not stopping just yet

if there’s a shortage there’s always latent compute somewhere https://t.co/Qra1RLX9tY

445

contrx16 retweeted

6 days ago

HexxRL: 🧩@Gradient_HQ Puzzle Mastermind!

Play from now til June 6th 1AM EDT!

🏆 Top 10 Participants will receive Quiz Mastermind on Discord!

Rules:
- username must match DC username
- play as many times & anytime you want during event duration

./ puzzle here: https://t.co/TWMxUb6FTw https://t.co/87pE9c0SOu

contrx16 retweeted

6 days ago

gradientintern: AI companies in the 2020s are what Dell and Intel were to the boomers back in the 90s

Dell did so well back in the day that 2,700 employees became millionaires and were nicknamed “Dellionaires”

AI is an appreciating asset, open source leading the way https://t.co/s7pxj7JQ0H

310

contrx16 retweeted

8 days ago

gradientintern: Nvidia releasing Nemotron 3 Ultra later this week

- 550B parameters (55B active)
- frontier open model performance
- 30% cheaper
- 5X speed increase

Completely open. Gawd lord https://t.co/2JAajUsCyd

847

contrx16 retweeted

14 days ago

gradientintern: TwinRouterBench is a new benchmark designed for step-level routing in long-horizon, multi-turn agentic workflows. Unlike traditional routing benchmarks, which focus on single-prompt routing, TwinRouterBench evaluates how well a "router" can choose the right model for each individual step of a complex task.

It implements a dual-track evaluation that balances fast development against realistic testing:

Track 1: Static Track (Fast Offline Track)
• 970 router visible prefixes from 520 trajectory instances.
• Covers 5 diverse benchmarks: SWE-bench, BFCL, mtRAG, QMSum, and PinchBench.
• Each example comes with an execution verified target tier (cheapest sufficient model tier).
• Uses deterministic scoring (based on tier correctness, trajectory membership, and token cost); no LLM judges needed.

Ideal for: training routers, rapid iteration, and cheap offline evaluation.

Track 2: Dynamic Track (Live Validation Track)
• Full evaluation harness on SWE-bench Verified (500 tasks).
• Reports results on a 100 case heldout split (disjoint from static data).
• Router must choose a real model from a locked pool at every step.
• Measures real outcomes: official task-resolution success and actual API spend (real dollars), with failure penalties for unresolved tasks.

By providing both a Static (fixed) and Dynamic (flowing) track, TwinRouterBench solves the problem where a router looks good on paper but fails when the agent actually has to live with its choices.

TwinRouterBench is built for the agentic era, where routing is measured at every step rather than in one-shot prompt tests. It targets the realism distortion by testing routing within the actual context of multi-step, stateful agent trajectories.
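To make the step-level idea concrete, here is a minimal sketch of how per-step router picks could be compared with verified target tiers. It is not the benchmark's actual scoring, which per the post also accounts for trajectory membership and token cost. The tier names, the `score_steps` function, and the sample trajectory are all assumptions made for illustration.

```python
# Assumed tier names, ordered from cheapest to most capable.
TIERS = ["low", "mid", "mid_high", "high"]


def score_steps(predicted, target):
    """Compare a router's per-step tier picks with verified target tiers.

    Returns the fraction of steps that hit the target tier exactly,
    fell below it (likely to fail the step), or went above it
    (works, but overpays).
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        raise ValueError("need at least one step to score")

    exact = under = over = 0
    for picked, needed in zip(predicted, target):
        gap = TIERS.index(picked) - TIERS.index(needed)
        if gap == 0:
            exact += 1
        elif gap < 0:
            under += 1
        else:
            over += 1

    n = len(target)
    return {"exact": exact / n, "under": under / n, "over": over / n}


# Hypothetical four-step trajectory.
picks = ["low", "high", "mid", "mid"]
targets = ["low", "mid", "mid", "high"]
print(score_steps(picks, targets))
```

Splitting misses into "under" and "over" keeps the two failure modes apart: picking too cheap a tier risks the task, while picking too expensive a tier only wastes spend.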

contrx16 retweeted

8 days ago

gradientintern: M3 delivers. outperforms Opus 4.7 in BrowseComp.

MiniMax’s first 1M-context model, roughly 5x the context of its previous models (around 204,800 tokens)

and it supports video input too, multimodal built in as core.. 🥴🎥

beautiful. https://t.co/2mSYNbmH7J

604

contrx16 retweeted

11 days ago

HexxRL: Claude Opus 4.8, the latest frontier model from Anthropic, is available on @commonstack_ai

The model can work independently for longer than its predecessors.

Same cost as Claude Opus 4.7 and better performance. https://t.co/aaEYYiTI69

contrx16 retweeted

10 days ago

HexxRL: OpenAI is gearing up to release its first hardware device. A mobile device, likely to compete with Apple, as they bought Jony Ive's hardware startup some time ago.

For a long time this has been part of the ecosystem they don’t have control over.

Would you buy one or trust it with your info? https://t.co/nZ4Z1fxXK3

420

contrx16 retweeted

9 days ago

HexxRL: “AI will replace all human jobs, everyone will be useless”

meanwhile many of the fear-mongering labs selling this narrative are workforcemaxxing with their headcount.

there’s some truth to having repetitive work eliminated but a lot of it is probably greatly exaggerated. https://t.co/Qb8fd6ybUp

857

contrx16 retweeted

9 days ago

gradientintern: and agentic token usage growth expectations look like this btw.

if you cannot supply, the price goes vertical. bullish on local and distributed edge ai inference. https://t.co/1hGWZlwFFR

747

contrx16 retweeted

12 days ago

gradientintern: Gemini 3.5 Flash from Google is on @commonstack_ai! This is Google’s most intelligent model for sustained frontier performance on agentic and coding tasks, matching or surpassing many other models at a fraction of the cost. Build and try out now! https://t.co/UVGymXYpUB

contrx16 retweeted

14 days ago

HexxRL: An Uncommonroute-trained router matches Claude Opus 4.6 in SWE-bench Verified evaluation.

TwinRouterBench tests realistic agentic trajectories, and the results are staggering:

Uncommonroute Trained vs Claude Opus 4.6

75/100 vs 74/100 (matched in resolution)
$25.66 vs $54.73 (53% cost saving with Uncommonroute trained)

Models cost more now with advanced reasoning and agentic tasks; time to get the same quality at a better price.
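The 53% figure can be checked with a few lines of arithmetic. This is a small sketch that assumes both dollar amounts are total API spend over the same 100 cases, which the post implies but does not state. The function names are made up for illustration.

```python
def cost_saving(baseline_cost, router_cost):
    """Fraction of the baseline spend saved by using the router."""
    return (baseline_cost - router_cost) / baseline_cost


def cost_per_resolved(total_cost, resolved):
    """Average spend per successfully resolved task."""
    return total_cost / resolved


# Figures quoted in the post.
router_cost, router_resolved = 25.66, 75
opus_cost, opus_resolved = 54.73, 74

print(f"saving: {cost_saving(opus_cost, router_cost):.1%}")  # saving: 53.1%
print(f"router: ${cost_per_resolved(router_cost, router_resolved):.2f} per resolved task")
print(f"opus:   ${cost_per_resolved(opus_cost, opus_resolved):.2f} per resolved task")
```

Cost per resolved task is a useful companion number because it folds the resolution count into the price, so a cheaper router that resolved fewer tasks would not look better than it is.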

957

contrx16 retweeted

12 days ago

HexxRL: AI compute has been doubling every couple of months since 2022.

The largest compute buildout is being blocked by resource constraints and old infrastructure.

With the current compute order backlog, it looks like much of it will be sitting idle til everything else catches up. https://t.co/BUhuiISRMe

contrx16 retweeted

13 days ago

gradientintern: Multi-turn harness, the construction pipeline of TwinRouterBench, designed to optimize the cost efficiency of LLM workflows.

The process:

It starts with a successful interaction (trace) generated by a high-end "strong" model
> It isolates the critical parts of the interaction to create more concise data points
> It attempts to swap out expensive model calls for cheaper ones (the 'l', 'm', 'mh', 'h' labels represent Low, Mid, Mid-High, and High tiers)
> It runs these "downgraded" sequences through a multi-turn task harness. If the cheaper model still results in a successful task completion, the downgrade is accepted
> The final result is a "verified tier label" for every single LLM call in a sequence, showing exactly where a cheap model is "enough" and where a powerful model is "necessary."

This benchmark provides high-fidelity training data, creating a dataset of "optimal routing" decisions. This data can be used to train specialized "router" models that decide in real time which LLM to call for a specific prompt. Where standard benchmarks previously graded a whole conversation, this provides granular labels for individual turns within a complex task.

Running every prompt through the most powerful model is too expensive, but using only cheap models leads to task failure. TwinRouterBench helps find the balance of actual work completion at the most effective pricing.

Traditional routing is easy for single questions but very hard for multi-step agent workflows. By focusing on execution-based verification, you get labels grounded in what actually completes the task.
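The downgrade-and-verify loop described above can be sketched roughly as follows. This is a simplified illustration, not the actual TwinRouterBench pipeline: the greedy call-by-call order is an assumption, the step that isolates critical parts of the trace is skipped, and `label_trace`, `run_harness`, and the toy harness are hypothetical names.

```python
from typing import Callable, List

# Tier labels from the post, ordered from cheapest to most capable.
TIERS = ["l", "m", "mh", "h"]


def label_trace(num_calls: int, run_harness: Callable[[List[str]], bool]) -> List[str]:
    """Greedily find the cheapest tier per call that still completes the task.

    run_harness takes one tier label per LLM call and returns True if the
    task still succeeds with that assignment (hypothetical interface).
    """
    tiers = ["h"] * num_calls  # start from the strong-model trace
    if not run_harness(tiers):
        raise ValueError("the strong-model trace must succeed before downgrading")

    for i in range(num_calls):
        for candidate in TIERS:  # try cheapest first
            if candidate == tiers[i]:
                break  # nothing cheaper worked; keep the current tier
            trial = tiers[:i] + [candidate] + tiers[i + 1:]
            if run_harness(trial):
                tiers[i] = candidate  # downgrade accepted
                break
    return tiers


# Toy harness: each call has a hidden minimum tier it needs to succeed.
REQUIRED = ["l", "mh", "m"]


def toy_harness(assignment: List[str]) -> bool:
    return all(
        TIERS.index(got) >= TIERS.index(need)
        for got, need in zip(assignment, REQUIRED)
    )


print(label_trace(len(REQUIRED), toy_harness))  # ['l', 'mh', 'm']
```

In a real harness each `run_harness` call would re-execute the task, so the number of trials is the main cost of building labels this way.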

762

contrx16 retweeted

Alex Mirran

@alex_mirran

18 days ago

https://t.co/imNJtDZqls

contrx16 retweeted