江枫 @maple_cq - Twitter Profile

江枫

@maple_cq

5 days ago

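A very good article, well worth reading. Finally, a model release that focuses on cutting token costs rather than on benchmark scores. This is the right way to kick off the token economy.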

Fuli Luo

@_LuoFuli

6 days ago

Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions. Read the full technical blog: https://t.co/B5tp4tdnim The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving. To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline. Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.

54

934

104

406

128K

0

14

江枫

@maple_cq

5 days ago

@_LuoFuli A very good technical article.

0

20

江枫

@maple_cq

6 days ago

MiniMax is heading back for an A-share listing. My guess is that DeepSeek is not far from going public either.

0

1

0

93

江枫

@maple_cq

6 days ago

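Dell's surge shows that the AI supply chain is starting to shift: customers are moving from buying individual GPUs to buying complete AI infrastructure. The past three years belonged to the compute era; the next three belong to the AI Factory era. Infrastructure such as power, liquid cooling, networking, and servers will start to grow substantially. AI is finally returning to the physical world!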

0

1

0

88

江枫

@maple_cq

6 days ago

@MichaelDell I didn't expect Dell to pull off such an impressive turnaround on the back of AI.

0

3

0

1

42

江枫

@maple_cq

6 days ago

@00Sindirella Hello

0

6

江枫

@maple_cq

6 days ago

@james84_ Hello

0

8

江枫

@maple_cq

6 days ago

NVIDIA and Microsoft are probably about to team up on an AI PC. That gives NVIDIA yet another source of orders.

NVIDIA

@nvidia

7 days ago

A new era of PC. 25.0528, 121.5990

2K

28K

2K

3K

12M

0

1

0

33

江枫

@maple_cq

6 days ago

@MarioNawfal Looks a lot like a retired old man.

0

35

江枫

@maple_cq

6 days ago

@unusual_whales Is it because they found that token spending now costs more than people do?

0

9

江枫

@maple_cq

6 days ago

The biggest benefit of X rolling out auto-translation is that it is now easier to read posts from foreign-language bloggers @BrianFeroldi

0

1

0

56

江枫

@maple_cq

7 days ago

Codex just got even more valuable: computer use is now available too.

OpenAI

@OpenAI

7 days ago

Windows users, this one’s for you. Computer use now works on Windows, so Codex can take action on your Windows computer. And with Windows support for Codex in the ChatGPT mobile app, you can start, review, and steer tasks on the go while work continues on your Windows machine. An early experience, but we’re working on more ways to keep your work moving, wherever you are.

884

9K

958

2K

1M

0

16

江枫

@maple_cq

7 days ago

@jarredsumner Was the Rust rewrite motivated by the memory consumption of loop execution?

0

55

江枫

@maple_cq

7 days ago

@googlegemma How do the specs compare with the Raspberry Pi 5?

0

12

江枫

@maple_cq

7 days ago

Isn't this just doing the job of n8n and Dify? Does anyone still use n8n or Dify these days?

ClaudeDevs

@ClaudeDevs

8 days ago

New in Claude Code (research preview): dynamic workflows. Claude writes an orchestration script on the fly, then spins up a large fleet of coordinated subagents in parallel to take on your most complex tasks. Use the word "workflow" in a prompt to get started. https://t.co/re4SG3AyDm

366

10K

955

6K

4M

0

1

0

63

江枫

@maple_cq

8 days ago

Why can some posts be auto-translated while others can't? Is there a setting I need to change somewhere?

0

18

江枫

@maple_cq

9 days ago

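@MaxForAI Even someone like me, who doesn't do web development, can use AI to write a frontend and a backend and then deploy a website, so layoffs at internet companies are no surprise. Working in the chip industry, I feel embedded development is next.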

0

1K

江枫

@maple_cq

9 days ago

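When Xiaomi entered the phone market, it slashed phone prices to the bone. Now that Xiaomi has entered the token market, it looks like token prices are about to be slashed too.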

Fuli Luo

@_LuoFuli

9 days ago

Behind the MiMo API Price Reduction: The deepest price cut, up to 99%, is for Input (Cache Hit). The core reason is our inference framework now supports hierarchical KV cache optimization for SWA. Production inference engine tests show this optimization increases cached token capacity by 5x, equivalent to an 80% reduction in caching costs. Combined with Cache Read Overlap among multiple Full Attention modules in the Hybrid model, actual costs are further reduced. Prices for Input (Cache Miss) and Output are also reduced by 60%-80%. This mainly benefits from the extreme 1:7 Full:SWA sparsity ratio brought by the model architecture (the prefill compute of the 70-layer MiMo-V2.5-Pro roughly equals a 10-layer GQA model). This kept our original inference costs well below the industry average, naturally leaving a 2x-3x profit margin in pricing. This price adjustment simply reflects our decision to pass these structural cost efficiencies directly to developers. Operating at these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even. We previously advised LLM companies not to "blindly cut prices" precisely because very few model architectures and inference optimizations can keep API costs from running at a loss. If more architectures that save compute and KV cache emerge, along with better inference Infra to drive down API costs, this will form an excellent virtuous cycle in the industry. More crucially, affordable, high-performance model APIs will drive real, sustained, and at-scale inference demand. This upstream demand pulls forward the development of the entire AI infrastructure chain—including chips, servers, optical transceivers, PCBs, liquid cooling, power, energy storage, and data centers—serving as a strategic fulcrum for a systemic revaluation of AI hardware. 
In the long run, this injects more affordable and accessible compute into both training and inference pipelines, accelerating the parallel evolution of global AGI across multiple regions and technical routes. For more technical details, we will release a detailed Blog post later.
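The "5x capacity, equivalent to an 80% reduction in caching costs" figure in the post above can be checked with a short calculation. This is a minimal sketch, assuming the cost of the caching hardware stays fixed so that the cost per cached token scales as the inverse of capacity; the function name `cache_cost_reduction` is hypothetical and not part of any MiMo tooling.

```python
def cache_cost_reduction(capacity_multiplier: float) -> float:
    """Fractional drop in per-token cache cost when the same hardware
    holds `capacity_multiplier` times as many cached tokens.

    Assumption (not from the post): hardware cost is unchanged, so the
    cost per cached token is proportional to 1 / capacity.
    """
    if capacity_multiplier <= 0:
        raise ValueError("capacity_multiplier must be positive")
    return 1.0 - 1.0 / capacity_multiplier


if __name__ == "__main__":
    # 5x more cached tokens on the same hardware
    print(f"{cache_cost_reduction(5):.0%}")  # prints 80%
```

Under the same assumption, the relationship is non-linear: doubling capacity halves the per-token cost, while going from 5x to 10x only moves the reduction from 80% to 90%.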

154

2K

187

438

186K

0

42

江枫

@maple_cq

9 days ago

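@daluoseo Maybe people are thinking along the lines of "three cobblers together equal one Zhuge Liang" (many ordinary minds beat one genius), but that clearly doesn't apply to AI.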

0

782

江枫

@maple_cq

10 days ago

@VincentLogic We're back to square one: the ability to write text by hand is in demand again.

1

2

0

2K
