rw ./ @gradientintern - Twitter Profile

Pinned Tweet

5 months ago

⬜️⬜️⬜️ ⬜️ ⬜️ ⬜️ ⬜️⬜️ ⬜️⬜️⬜️ ⬜️⬜️⬜️ ⬜️⬜️⬜️ ⬜️ ⬜️ ⬜️ ⬜️ ⬜️ ⬜️ ⬜️⬜️⬜️ ⬜️⬜️⬜️ ⬜️ ⬜️ ⬜️⬜️⬜️ ◻️◻️ ./ training mode on… @Gradient_HQ

rw ./

@gradientintern

1 day ago

maybe after heavy dilution could be 80% less. even then it’s still well over a cool mill

rw ./

@gradientintern

1 day ago

AI companies in the 2020s are what Dell and Intel were to the boomers back in the 90s

Dell did so well back in the day that 2,700 employees became millionaires and were nicknamed “Dellionaires”

AI is an appreciating asset, open source leading the way https://t.co/s7pxj7JQ0H

rw ./

@gradientintern

3 days ago

CPU semis on a real run.

agentic AI uses CPUs as task orchestrators for memory, I/O and enforcement while the GPU handles the reasoning (core inference)

context windows, data movement and tool calling are heavily CPU-based. https://t.co/7X7iU78a4n

rw ./

@gradientintern

3 days ago

Nvidia releasing Nemotron 3 Ultra later this week

- 550B parameters (55B active)
- frontier open model performance
- 30% cheaper
- 5X speed increase

Completely open. Gawd lord https://t.co/2JAajUsCyd

NVIDIA AI

@NVIDIAAI

3 days ago

Nemotron 3 Ultra is coming this week. ⌛️

rw ./

@gradientintern

3 days ago

M3 delivers. outperforms Opus 4.7 in BrowseComp.

MiniMax’s first 1M context model, roughly 5x the context of its previous models at around 204,800 tokens

and it supports video input too, multimodal built in at the core.. 🥴🎥

beautiful. https://t.co/2mSYNbmH7J
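
A quick check of the "pretty much 5x" figure, as a minimal sketch. It assumes "1M" means either 1,000,000 or 1,048,576 tokens; the tweet does not say which, so both readings are shown. The constant and function names are illustrative only.

```python
# Previous context length quoted in the tweet.
PREVIOUS_CONTEXT = 204_800

# Two common readings of "1M" (an assumption; the tweet does not specify).
DECIMAL_1M = 1_000_000
BINARY_1M = 1_048_576  # 2**20


def context_ratio(new_context: int, old_context: int = PREVIOUS_CONTEXT) -> float:
    """Return how many times larger the new context window is."""
    return new_context / old_context


print(f"{context_ratio(DECIMAL_1M):.2f}x")  # 4.88x
print(f"{context_ratio(BINARY_1M):.2f}x")   # 5.12x
```

Either reading lands close to 5x, which is consistent with the rounded claim.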

MiniMax (official) @MiniMax_AI

3 days ago

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities

- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero

API: https://t.co/fHRdSV7BwZ
Token Plan: https://t.co/BDCycxepZw
🚀New! MiniMax Code: https://t.co/GvB4YiB6Ul

Weights & Tech Report in ~10 Days

rw ./

@gradientintern

3 days ago

@MiniMax_AI ❤️🔥

rw ./

@gradientintern

4 days ago

and agentic token usage growth expectations look like this btw.

if you cannot supply, the price goes vertical. bullish on local and distributed edge ai inference. https://t.co/1hGWZlwFFR

Florian Kronawitter

@fkronawitter1

6 days ago

JPM on '27 data center build out: "The latest analysis based on satellite images shows that over 60% of data center capacity planned for completion in 2027 has not begun construction with another 7% delayed"

rw ./

@gradientintern

6 days ago

emerging markets with critical AI supply chains significantly outgrew the US in returns

once the countries with critical supply chains are removed, the growth goes with them

the AI boom is currently not being felt equally across the globe, open source models filling some access gaps in AI https://t.co/Q9n29XbbIY

rw ./

@gradientintern

7 days ago

Many other frontier models can also be accessed on Commonstack, try them with test credits at no cost: https://t.co/EdPjdETXCQ

rw ./

@gradientintern

7 days ago

Gemini 3.5 Flash from Google is on @commonstack_ai! This is Google’s most intelligent model for sustained frontier performance on agentic and coding tasks, matching or surpassing many other models at a fraction of the cost. Build and try it out now! https://t.co/UVGymXYpUB

rw ./

@gradientintern

8 days ago

they already increased it before this, to the current 30 billion

rw ./

@gradientintern

8 days ago

ByteDance, TikTok’s parent company, is considering more than doubling its capex from 30 billion dollars to 70 billion dollars.

Infrastructure spending continues to go vertical. One of the most popular AI apps in China is Doubao, which is created by ByteDance. https://t.co/DZdFltz4cv

rw ./

@gradientintern

9 days ago

arXiv: https://t.co/SIPC79dBYX
TwinRouterBench: https://t.co/13sh98vlNt
Uncommonroute: https://t.co/Xi5f04gYTm

rw ./

@gradientintern

9 days ago

Multi-turn harness: the construction pipeline of TwinRouterBench, designed to optimize the cost efficiency of LLM workflows.

The process:

• It starts with a successful interaction (trace) generated by a high-end "strong" model.
• It isolates the critical parts of the interaction to create more concise data points.
• It attempts to swap out expensive model calls for cheaper ones (the 'l', 'm', 'mh', 'h' labels represent Low, Mid, Mid-High, and High tiers).
• It runs these "downgraded" sequences through a multi-turn task harness. If the cheaper model still results in a successful task completion, the downgrade is accepted.
• The final result is a "verified tier label" for every single LLM call in a sequence, showing exactly where a cheap model is "enough" and where a powerful model is "necessary."

This benchmark provides high-fidelity training data, creating a dataset of "optimal routing" decisions. This data can be used to train specialized "router" models that decide in real time which LLM to call for a specific prompt. Previously, standard benchmarks graded a whole conversation; this provides granular labels for individual turns within a complex task.

Running every prompt through the most powerful model is too expensive, but using only cheap models leads to task failure. TwinRouterBench helps find the balance between actual work completion and the most effective pricing.

Traditional routing is easy for single questions but very hard for multi-step agent workflows. By focusing on execution-based verification, the labels are grounded in whether the task actually succeeded.
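
The downgrade-and-verify loop described above can be sketched roughly as follows. This is a greedy illustration under stated assumptions, not the benchmark's actual pipeline: the `Harness` signature, `label_trace`, and `make_toy_harness` are hypothetical names, and the toy harness simply pretends each call has a known minimum tier.

```python
from typing import Callable, List, Sequence

# Tier labels from the tweet, ordered cheapest to most expensive.
TIERS = ["l", "m", "mh", "h"]

# Hypothetical harness signature: takes one tier per LLM call and
# returns True if the replayed task still completes successfully.
Harness = Callable[[Sequence[str]], bool]


def label_trace(num_calls: int, harness: Harness) -> List[str]:
    """Greedily downgrade each call, keeping a downgrade only if the task still succeeds."""
    labels = ["h"] * num_calls  # start from the strong-model trace
    for i in range(num_calls):
        for tier in TIERS:  # try the cheapest tier first
            if tier == labels[i]:
                break  # nothing cheaper worked; keep the current tier
            candidate = labels[:i] + [tier] + labels[i + 1:]
            if harness(candidate):
                labels = candidate  # verified downgrade is accepted
                break
    return labels


def make_toy_harness(required: Sequence[str]) -> Harness:
    """Toy stand-in: the task succeeds if every call meets its minimum tier."""
    rank = {tier: n for n, tier in enumerate(TIERS)}

    def harness(tiers: Sequence[str]) -> bool:
        return all(rank[t] >= rank[r] for t, r in zip(tiers, required))

    return harness


print(label_trace(3, make_toy_harness(["l", "h", "m"])))  # ['l', 'h', 'm']
```

A real harness would re-run the agent task end to end for each candidate, so the number of harness calls is the main cost of building such labels.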

rw ./

@gradientintern

9 days ago

TwinRouterBench is a new benchmark designed for step-level routing in long-horizon, multi-turn agentic workflows. Unlike traditional routing benchmarks that focus on single-prompt routing, TwinRouterBench evaluates how well a "router" can choose the right model for each individual step of a complex task.

It implements a dual-track evaluation that balances fast development against realistic testing:

Track 1: Static Track (Fast Offline Track)
• 970 router-visible prefixes from 520 trajectory instances.
• Covers 5 diverse benchmarks: SWE-bench, BFCL, mtRAG, QMSum, and PinchBench.
• Each example comes with an execution-verified target tier (the cheapest sufficient model tier).
• Uses deterministic scoring (based on tier correctness, trajectory membership, and token cost); no LLM judges needed.

Ideal for: training routers, rapid iteration, and cheap offline evaluation.

Track 2: Dynamic Track (Live Validation Track)
• Full evaluation harness on SWE-bench Verified (500 tasks).
• Reports results on a 100-case held-out split (disjoint from the static data).
• Router must choose a real model from a locked pool at every step.
• Measures real outcomes: official task resolution success and actual API spend (real dollars), with failure penalties for unresolved tasks.

By providing both a Static (fixed) and a Dynamic (flowing) track, TwinRouterBench addresses the problem where a router looks good on paper but fails when the agent actually has to live with its choices.

TwinRouterBench is built for the agentic era, where routing is measured at every step rather than by one-shot prompt testing. This benchmark targets the realism distortion by testing routing within the actual context of multi-step, stateful agent trajectories.
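
To make the two tracks concrete, here is a minimal sketch of the kind of quantities each one reports. It is not the benchmark's scoring code: the real static score also uses trajectory membership and token cost, which are not reproduced here, and `tier_accuracy`, `penalized_spend`, and the penalty value are hypothetical.

```python
from typing import Sequence


def tier_accuracy(predicted: Sequence[str], target: Sequence[str]) -> float:
    """Static-track style check: share of prefixes where the router picked the target tier."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / len(target)


def penalized_spend(step_costs: Sequence[float], resolved: bool,
                    failure_penalty: float) -> float:
    """Dynamic-track style cost: real spend plus a penalty if the task was not resolved."""
    return sum(step_costs) + (0.0 if resolved else failure_penalty)


# Router matched the target tier on 3 of 4 prefixes.
print(tier_accuracy(["l", "m", "h", "h"], ["l", "m", "mh", "h"]))  # 0.75

# Two steps cost 0.25 and 0.5 dollars; the task failed, so a 1.0 penalty applies.
print(penalized_spend([0.25, 0.5], resolved=False, failure_penalty=1.0))  # 1.75
```

The second function shows why a router that only picks cheap models can still score badly once unresolved tasks are charged.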

rw ./

@gradientintern

9 days ago

@MiniMax_AI M3 👀

rw ./

@gradientintern

10 days ago

@commonstack_ai 💪

gradientintern retweeted

Commonstack

@commonstack_ai

10 days ago

Great to see TwinRouterBench accepted to the #RLEval Workshop at #CAIS2026! Per-step routing is quickly becoming essential infrastructure for agentic systems: each planning, coding, retrieval, and verification call should use the cheapest sufficient model without hurting final task success. Proud to open-source TwinRouterBench and contribute a practical benchmark for this problem.

rw ./

@gradientintern

11 days ago

this will probably carry over to robotics as well

great bets can be made on heterogeneous flops, agentic models ripping cloud cpu prices higher

jensen wasn’t kidding about a new $200B market for Nvidia https://t.co/wAhzxLePU6

SemiAnalysis

@SemiAnalysis_

12 days ago

FACT ALERT 🚨 : In modern agentic coding, 42% of the time is spent on CPU doing tool use such as editing files, running Bash scripts, running lints, etc. Traditional cloud computing charges $ per CPU core. In the economy of agents, the business model is $ per token, so to increase token revenue you need to increase the amount of CPU power you have so that you can generate your tokens.
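
One way to read the 42% figure is through Amdahl's law: if the CPU share of the time is left untouched, it caps the gain from faster inference. The sketch below assumes the 42% is serial wall-clock time that does not overlap with GPU work, which the tweet does not state; the function names are illustrative.

```python
def overall_speedup(cpu_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law: total speedup when only the non-CPU share gets faster."""
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if gpu_speedup <= 0.0:
        raise ValueError("gpu_speedup must be positive")
    return 1.0 / (cpu_fraction + (1.0 - cpu_fraction) / gpu_speedup)


def speedup_ceiling(cpu_fraction: float) -> float:
    """Upper bound on total speedup as the accelerated share's time approaches zero."""
    if not 0.0 < cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be in (0, 1]")
    return 1.0 / cpu_fraction


CPU_FRACTION = 0.42  # share of time spent on CPU tool use, from the tweet

print(f"{overall_speedup(CPU_FRACTION, 5.0):.2f}x")  # 1.87x with 5x faster inference
print(f"{speedup_ceiling(CPU_FRACTION):.2f}x")       # 2.38x at best
```

Under that assumption, even unlimited GPU speed gives less than a 2.4x end-to-end gain unless CPU capacity grows too.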
