Closed-source models aren't worth the premium.
We generated 12 landing pages with Kimi K2.7 Code and Claude Fable 5. Kimi came in 16x cheaper with comparable quality, especially once we gave it visual context through a design MCP server.
Open-source models are already a practical choice for this kind of workflow.
This is what open-model tokenomics look like in production.
When teams are running billions of tokens, small differences in caching, throughput, and serving efficiency become product-level economics.
MiniMax M3 on Together AI is a strong example: frontier-adjacent quality, open-model economics, and a serving stack built for scale.
M3 by @MiniMax_AI is the best value in AI.
At @HedyAI_ we run close to a billion tokens through the model each day, and @togethercompute's input caching brings our cost down to $0.128/million input tokens.
That's for a model that is close to the frontier in intelligence and the second most powerful open-source model.
Unbelievable 🤯
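To put the quoted rate in daily terms, here is a rough back-of-the-envelope sketch. It assumes a round 1 billion tokens per day (the post says "close to a billion") and assumes every token is billed as an input token at the quoted $0.128/million rate; the real mix of cached input, uncached input, and output tokens is not given in the post. The function name is hypothetical.

```python
def daily_input_cost(tokens_per_day: int, usd_per_million: float) -> float:
    """Cost in USD of one day's input tokens at a flat per-million rate."""
    return tokens_per_day / 1_000_000 * usd_per_million


# Assumption: a round 1B tokens/day, all billed at the quoted input rate.
cost = daily_input_cost(1_000_000_000, 0.128)
print(f"${cost:,.2f} per day")
```

Under those assumptions the input bill comes to roughly $128 per day, which is why a difference of a few cents per million tokens matters at this volume.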
Open models are what make collective agent intelligence possible.
James Zou from Together AI and Venkat Srinivasan from NVIDIA are joining us July 1 at AI Engineer World's Fair to dig into exactly that.
GLM 5.2 is available now on @togethercompute!
Very fast speeds (200+ tps), try it out & let me know what you think! Video is not sped up.
https://t.co/pggGhspSdj
Introducing GLM-5.2 from @Zai_org, https://t.co/2XG0WEPpHd's latest flagship open model for long-horizon tasks with 1M context, flexible thinking effort, and stronger agentic coding.
Now available on Together AI, GLM-5.2 runs on research-powered inference for long-context, tool-heavy agent workloads.
@Zai_org Highlights:
* 1M context built to sustain long-horizon work
* Stronger coding with flexible effort levels to balance latency and depth
* Improved architecture with IndexShare, reducing per-token FLOPs 2.9x at 1M context
* MIT-licensed open weights for broad technical access
We tested closed and open models by asking them to build small, playable games.
Open models were much cheaper and faster, while producing games that were often close in quality.
-> Opus 4.8 was 15x more expensive than MiniMax M3
-> GPT-5.5 was 10x more expensive than Nemotron Ultra
-> Kimi K2.7 Code was 7x cheaper than Opus 4.8
For more workloads, the closed-to-open shift is becoming hard to ignore: strong quality, better tokenomics, and faster inference.
Built a visual benchmark where I asked closed and open source models to build small games.
Main takeaway: OSS models were a lot faster, cheaper, & produced games with similar quality.
Specifically:
* Opus 4.8 was 15x more expensive than MiniMax M3
* GPT-5.5 was 10x more expensive than Nemotron Ultra
* Kimi K2.7 Code was 7x cheaper than Opus 4.8
You can even play the generated games yourself. The quality gap is surprisingly small (even nonexistent for some games). Open source models are getting hard to ignore!
Of course, this doesn't extend to all tasks. There are definitely certain hard tasks where you'd benefit from using an Opus 4.8 level model.
But increasingly, you're able to do more and more tasks with cheaper and faster open source models, which is a trend I'm seeing with our customers too.
.@DecagonAI cut voice agent cost per turn nearly 6x with Together AI.
They moved from closed models to fine-tuned open models, while keeping latency low enough for real-time voice:
-> <400ms p95 model latency per turn
-> custom speculators and prompt caching
-> optimized serving on NVIDIA Blackwell
-> weekly, sometimes daily model deployment velocity
This is the closed-to-open shift: more control, better tokenomics, and production performance without being locked into proprietary APIs.
It's a great question! End user TPS does come at the cost of concurrency. It's essentially a dial you can optimize for speed, concurrency, and cost.
That being said, we've also built a lot of optimizations into our inference stack, so we can provide higher per-user TPS at equivalent concurrency, or higher concurrency at equivalent end-user TPS (two sides of the same coin), vs. leading OSS inference engines.
Check out a recent deep dive on our work optimizing coding agents for production: https://t.co/aFTWY0IWEE
Optimizing GLM 5.1 came down to three things:
-> Rewrote the indexer topk kernel
-> Fused the indexer kernel to reduce memory and launch overhead
-> Eliminated CPU overhead that was gating prefill throughput
The bigger win was in the indexer. Once we fixed that, the rest made it even faster.
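For readers unfamiliar with the term, a top-k step picks the positions of the k highest scores out of a larger set. The post does not describe the rewritten kernel, so the following is only a plain-Python, CPU-side sketch of what a top-k selection computes, not the GPU implementation. The function name and the sample scores are made up for illustration.

```python
import heapq


def topk_indices(scores, k):
    """Return the indices of the k highest scores, highest first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Hypothetical scores for five candidate positions.
scores = [0.1, 0.9, 0.3, 0.7, 0.5]
print(topk_indices(scores, 2))  # [1, 3]
```

A production kernel does the same selection in parallel on the GPU, which is where memory traffic and launch overhead start to matter.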
GLM 5.1 is available on Together AI.