Commonstack @commonstack_ai - Twitter Profile

Pinned Tweet

17 days ago

Fraction of the bill. Same results. Fully local, open source, works with any client. Just > pipx install uncommon-route https://t.co/6bJ9WhtB52

39

133

44

100

447K

commonstack_ai retweeted

Hanchen Li

@lihanc02

12 days ago

A lot of routing work evaluates isolated prompts, but real agent systems are fundamentally multi-step and budget-constrained. Cool to see benchmarks moving toward execution-grounded, end-to-end evaluation instead of just token-level proxies. TwinRouterBench is a strong step toward realistic agentic routing evaluation — especially the separation between static supervision and dynamic SWE-bench execution. Excited to see where this goes!

1

9

3

0

1K

Commonstack

@commonstack_ai

10 days ago

Great to see TwinRouterBench accepted to the #RLEval Workshop at #CAIS2026! Per-step routing is quickly becoming essential infrastructure for agentic systems: each planning, coding, retrieval, and verification call should use the cheapest sufficient model without hurting final task success. Proud to open-source TwinRouterBench and contribute a practical benchmark for this problem.

Yuhang Yao

@yuhang_yao

12 days ago

Excited to share that TwinRouterBench has been accepted to the #RLEval Workshop at #CAIS2026 🎉

As LLM apps become long-horizon agents, one request can trigger many model calls across planning, tool use, retrieval, coding, and verification.

That makes per-step LLM routing a core infrastructure problem: sending each call to the cheapest sufficient model without breaking downstream success.

TwinRouterBench introduces:

⚡ Static track: 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench

🚀 Dynamic track: live SWE-bench Verified evaluation with official task resolution + realized API spend

Key result: a router trained on static labels achieves comparable SWE-bench resolve rate while cutting API cost by ~53% vs. an unrouted Opus 4.6 baseline.

Paper: https://t.co/dVZaZvvUk5
Code: https://t.co/ZRnswiW3T7
Dataset: https://t.co/65l1rEcOs7
Website: https://t.co/O45KZwLPCt
#LLM #AgenticAI #LLMRouting #Benchmark #SWEBench

1

19

1

0

5K

10

33

13

2

2K
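The "cheapest sufficient model" idea in the tweets above can be sketched in a few lines. This is only an illustration under stated assumptions: the model names, prices, capability tiers, and the step-to-tier table are all hypothetical, and none of it is taken from TwinRouterBench or UncommonRoute. A real router would predict the required tier from the step's visible prefix rather than look it up by step type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str             # hypothetical label, not a real model ID
    cost_per_call: float  # illustrative price units, not real pricing
    capability: int       # made-up ordinal tier; higher means stronger


# Hypothetical model pool (assumption for illustration only).
POOL = [
    Model("small", 1.0, 1),
    Model("medium", 4.0, 2),
    Model("large", 20.0, 3),
]

# Assumed minimum tier per step type; a stand-in for a learned predictor.
REQUIRED_TIER = {"retrieval": 1, "verification": 1, "coding": 2, "planning": 3}


def route(step_kind, pool=POOL):
    """Return the cheapest model whose tier meets the step's assumed need."""
    strongest = max(pool, key=lambda m: m.capability)
    # Unknown step types fall back to the strongest tier to stay safe.
    required = REQUIRED_TIER.get(step_kind, strongest.capability)
    sufficient = [m for m in pool if m.capability >= required]
    if not sufficient:
        return strongest
    return min(sufficient, key=lambda m: m.cost_per_call)


def routed_cost(steps, pool=POOL):
    """Total cost when every step is routed individually."""
    return sum(route(s, pool).cost_per_call for s in steps)


def unrouted_cost(steps, pool=POOL):
    """Total cost when every step goes to the strongest model."""
    strongest = max(pool, key=lambda m: m.capability)
    return strongest.cost_per_call * len(steps)


if __name__ == "__main__":
    trajectory = ["planning", "retrieval", "coding", "verification"]
    print(routed_cost(trajectory), unrouted_cost(trajectory))
```

With these made-up prices, the four-step trajectory costs 26.0 when routed per step and 80.0 when every call goes to the strongest model. The gap is an artifact of the invented numbers and says nothing about the ~53% figure reported in the tweet.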

Commonstack

@commonstack_ai

10 days ago

@yuhang_yao @lihanc02 @RLCommons 🎉🎉

1

0

69

commonstack_ai retweeted

Alex Mirran

@alex_mirran

14 days ago

https://t.co/imNJtDZqls

3

29

10

4

2K

Commonstack

@commonstack_ai

16 days ago

Step-level routing matters. The benchmark to measure it is open today.

Bench: https://t.co/cCyKNLxuKl
The current leader: https://t.co/6bJ9Whu8UA

Paper coming soon on arXiv.

1

6

0

142

Commonstack

@commonstack_ai

16 days ago

How do you evaluate an LLM router fairly?

Most benchmarks look at prompts, but routers operate at an agentic-step level. A router that saves money but breaks the task could be worse than no router.

We open-sourced TwinRouterBench to measure this honestly.
🧵 https://t.co/dsXNRIJ2ga

6

45

17

3

2K
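One way to see why a router that saves money but breaks the task can be worse than no router is to score spend against resolved tasks instead of looking at spend alone. The sketch below is a hypothetical illustration, not TwinRouterBench's scoring code: the `summarize` helper, the cost-per-resolved-task metric, and the run data are all assumptions made up for this example.

```python
def summarize(runs):
    """Summarize a list of (resolved, spend) pairs, one pair per task.

    Hypothetical helper for illustration; not the benchmark's official scorer.
    """
    if not runs:
        raise ValueError("need at least one run")
    resolved = sum(1 for ok, _ in runs if ok)
    spend = sum(cost for _, cost in runs)
    return {
        "resolve_rate": resolved / len(runs),
        "total_spend": spend,
        # Infinite when nothing resolves: the money bought no finished tasks.
        "cost_per_resolved": spend / resolved if resolved else float("inf"),
    }


# Made-up data: an unrouted baseline and an over-aggressive router.
baseline = [(True, 10.0), (True, 10.0), (True, 10.0), (False, 10.0)]
cheap_router = [(True, 4.0), (False, 4.0), (False, 4.0), (False, 4.0)]

if __name__ == "__main__":
    print(summarize(baseline))
    print(summarize(cheap_router))
```

In this invented example the aggressive router spends less in total but pays more per resolved task, because most of its cheaper calls end in failed tasks.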

Commonstack

@commonstack_ai

16 days ago

Conflict of interest? Acknowledged! We know our router (UncommonRoute) currently leads the leaderboard. Open submissions, locked pricing, public scoring code. If a different router wins, the leaderboard will say so.

3

8

0

186

Commonstack

@commonstack_ai

29 days ago

Run Claude Code with Commonstack in 4 steps:
- generate an API key
- set 4 environment variables
- run claude
- /status to verify

Set it up now in 5 minutes with @alex_mirran.

8

41

21

2

4K

Commonstack

@commonstack_ai

about 1 month ago

Quickstart -> https://t.co/Lnf9xBw2bJ.

0

7

0

132

Commonstack

@commonstack_ai

about 1 month ago

GPT-5.5 is live on https://t.co/L4uejEYZ40! 🚀🚀 Use the strong reasoning and coding capabilities of GPT-5.5 in your application or with your favorite agentic harness.

3

29

14

0

917

Commonstack

@commonstack_ai

about 1 month ago

Here's a guide for using Commonstack with your OpenClaw agent -> https://t.co/cCV3swMcOY.

1

9

0

1

193

commonstack_ai retweeted

Alex Mirran

@alex_mirran

about 1 month ago

https://t.co/w13o7xWEYe

3

23

9

2

1K

Commonstack

@commonstack_ai

about 1 month ago

DeepSeek-V4-Flash is now live on https://t.co/ksndyxekOL Time to feed your agents!

DeepSeek

@deepseek_ai

about 1 month ago

DeepSeek-V4-Flash

🔹 Reasoning capabilities closely approach V4-Pro.
🔹 Performs on par with V4-Pro on simple Agent tasks.
🔹 Smaller parameter size, faster response times, and highly cost-effective API pricing.

3/n

17

2K

136

134

408K

2

13

5

0

806

