@dylan522p@iAmHenryMascot Sorry I assumed that was a broader view rather than your users. Interesting insight, I am curious whether it plays out similarly on open router https://t.co/8LpHdhbeVZ
@dah_uk@edzitron Yeah that makes sense, I’m on the $20 plan and it’s so easy to hit limits (I can do it in like 20 minutes). Though oddly it sometimes just lets one continue. Not sure if bug or expected behaviour. It’s not well documented and it changes quite often (over the last two months)
@thetreygoff My problem with Claude Opus models in software development is that they jump to conclusions without verifying assumptions and then double down until explicitly presented with counter evidence. In contrast, GPT models will not paint their incorrect fantasy world for you.
Nemotron 3 Ultra is fast and genuinely good
Compared it with 3 frontier models: DeepSeek V4, MiniMax M3, and Qwen 3.7 Max on 2 prompts
very impressive results
@dah_uk@edzitron Thanks, I see, what does the UI look like re: billing for this? Does the billing / credits kick in immediately after expiring the 5HR?
Hi. Over the last 24 hours we had three separate small incidents that affected Codex reliability. Those are three too many and we are taking active steps for them to not reproduce.
I have reset usage limits for Codex across all paid plans. May the tokens flow again.
DeepSeek just made its 75% price cut on V4-Pro permanent. Xiaomi's MiMo slashed V2.5 pricing by up to 99%, effective today. Most coverage frames this as a price war. The more interesting part is the engineering that makes these numbers sustainable.
DeepSeek's V4 paper describes a *hybrid attention architecture* that attacks the core bottleneck of long-context inference: the KV cache. Traditional transformers store key-value pairs for every token in the context. At 1 million tokens, this cache alone can fill an entire GPU's memory. V4 introduces two interleaved attention types.
Compressed Sparse Attention (CSA) compresses every 4 tokens into a single KV entry, then selects only the top-k most relevant compressed blocks per query. Heavily Compressed Attention (HCA) goes further, compressing 128 tokens into one entry and running dense attention over the result. The compressed sequence is short enough that dense attention stays cheap.
V4-Pro's KV cache at 1M tokens is 10% (!!) of V3.2's. Single-token inference FLOPs drop to 27% (!!). The model has 1.6 trillion total parameters but only activates 49 billion per token through Mixture-of-Experts routing, the knowledge capacity of a massive model at the compute cost of one thirty times smaller.
MiMo's approach is different but lands in the same place. Xiaomi's team implemented Sliding Window Attention via SGLang HiCache, reducing KV cache data transfer across GPU memory, CPU memory, and SSD to roughly 1/7 (!!) of previous volume. Cacheable tokens expanded by 5x (!!). Combined with expert parallelism optimization and input length bucketing, per-token serving cost dropped enough to make permanent pricing at these levels viable.
V4-Pro now sits at $0.87 per million output tokens. MiMo V2.5-Pro at roughly $3/M output, with Flash variants far below that. A year ago, sub-dollar output pricing meant you were using a small distilled model with real capability tradeoffs. These are frontier-class reasoners with million-token context windows.
Both companies can commit to permanent cuts because the reductions come from the architecture itself. When your attention mechanism physically processes fewer FLOPs per token and your cache occupies a fraction of the memory, the cost to serve is structurally lower. The price follows the cost curve.
Grok foundation model V9-Medium (1.5T) has finished training. Evals look good. A lot of Cursor data was added in supplementary training and there is more to come.
Fine-tuning is underway and reinforcement learning begins in a few days. 2 to 3 weeks to public release.
This will be a major improvement over the 0.5T v8-small that currently serves all Grok production traffic, especially for difficult coding tasks.
You know what that means?
I can keep generating massive training datasets.
Using Codex 5.5 as orchestrator and Deepseek v4 pro as executor.
For reference, it costed ~$60 for 200M high quality dataset.
Introducing a 100% free coding agent with DeepSeek v4 Pro
Choose any model, all free:
- DeepSeek v4 Pro/Flash
- Kimi K2.6
- MiniMax M2.7
npm i -g freebuff
@jahooma@kwlckgf that’s great to hear, I’ve just used an hour of DeepSeek Pro and it is solid, used the Flash model previously. I’ve tried out Antigravity CLI and used up the free weekly quota in about 5 min!