The longer the context, the more memory your LLM needs. We introduce research techniques to compress that memory 200x on the fly without changing the base model.
1/ You can shrink a language model's KV cache by 200×, in a single forward pass, and it still answers correctly.
At 256k context that's 36 GiB of cache down to ~360 MiB, with no change to the base model.
Here's how we did it 👇
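For a sense of where a figure like 36 GiB comes from, here is a minimal sketch of the standard KV cache size arithmetic. It is not the compression method itself. The model configuration (36 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption, chosen only because it reproduces 36 GiB at 256k tokens; the post does not say which model was used.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration (an assumption, not taken from the post).
cache = kv_cache_bytes(
    seq_len=256 * 1024,  # 256k tokens of context
    num_layers=36,
    num_kv_heads=8,
    head_dim=128,
    bytes_per_value=2,   # fp16 / bf16
)
print(f"{cache / GIB:.1f} GiB")  # → 36.0 GiB
```

The cache grows linearly with `seq_len`, which is why long contexts dominate memory even when the model weights stay fixed.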
Baseten is live on the Respan Gateway.
Congratulations to the @RespanAI team on their Gateway launch as they bring observability, evals, and routing to agents.
Try Baseten Model APIs now on Respan.
Model selection isn't just a fancy term for "looking at benchmarks". If you're just auto-updating and going off Twitter vibes, you're not really adding any value to your business or your customers. Doing this well means deeply understanding your use cases, how much value your customers ascribe to a problem, how much margin you want to make on that product, and how much time you want to invest in growing that margin. Come hear me rant more on June 25 https://t.co/4GI8G8XFGW
Working on the Training team at Baseten, I often see companies agonize over which model to use. So many people worry about how to keep up with benchmarks and new releases.
But with post-training and specialization, and as we see a rising tide in the intelligence of many open-source models, what really matters is your learning signal. Do you have the right user metrics to say whether a model is doing poorly or well at your task, and to use that to learn and hillclimb the task?
If you want to learn more, I’m moderating a panel on June 25th in SF at 6 PM with Gamma co-founder Jon Noronha (@thatsjonsense) and Notion AI lead Sarah Sachs (@sarahmsachs) on model selection in a multi-model landscape.
Are you tired of waiting 17 minutes for an AI agent to finish a code change?
As an agent’s context grows, standard transformer attention can turn long runs into a bottleneck.
@NVIDIAAI Nemotron 3 Ultra addresses this with a hybrid architecture that replaces several attention-heavy layers with Mamba layers.
This makes long-context inference far more efficient. In benchmarked settings, this means:
→ step 300 runs as fast as step 3
→ up to 5x higher throughput
→ up to 30% lower cost
Today, Nemotron 3 Ultra, Nemotron 3.5 ASR, and Nemotron 3.5 Content Safety are available on Baseten for production AI teams.
Introducing NVIDIA Nemotron 3 Ultra.
A frontier smart open model built for long-running agents that need to plan, reason, use tools and keep working across complex coding, research and enterprise workflows.
Up to 5x faster inference and up to 30% lower cost for agentic tasks.
Learn more: https://t.co/h9XLqqYPFf
Today we're announcing MAI-Thinking-1 with Microsoft and it will be available on Baseten soon.
Microsoft built something genuinely different here: a commercial-grade thinking model trained on clean data with no distillation from third-party models and designed to be fine-tuned by the enterprises using it. Microsoft AI guarantees 100% eyes-off on post-training data and Baseten will handle the fine-tuning and deployment at scale.
The future isn't one model. It's many models, each owned by the businesses that shaped them, and MAI-Thinking-1 is a big step in that direction.
https://t.co/8w9k4jwrgq
I’m thrilled to welcome Gabe Stern to Baseten to lead Legal. Gabe is the whole package: deeply experienced, sharp, highly trusted, and commercially minded. We first got to work together at Slack, where he was an exceptional partner and played a critical role through Slack's hyper-growth & IPO. I’m personally very happy to be reunited with Gabe, and even happier that Baseten gets to benefit from his judgment, partnership, and instincts. Welcome, Gabe!
The next wave of AI companies will be built on fast, reliable infrastructure, and the trust to deploy it in production. Gabe has helped iconic technology companies scale through this exact phase. I'm excited to welcome him to Baseten as our General Counsel.
Agents append to their own context. But full self-attention is quadratic in context length, so 2x context = 4x attention work.
Nemotron 3 Ultra swaps most attention for Mamba, so those layers keep a fixed-size state and their compute cost is linear in context length. NVIDIA reports up to 5x faster inference at up to 30% lower cost.
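The scaling argument above can be sketched with a toy cost model. This is an illustration under simplifying assumptions, not a benchmark: it ignores constant factors, KV caching, and the attention layers a hybrid model still keeps. The function names are hypothetical.

```python
def attention_work(context_len):
    """Relative cost of full self-attention: every token attends to every token."""
    return context_len ** 2


def fixed_state_step_work(context_len, state_size=1):
    """Relative cost of one update to a fixed-size recurrent state.

    The cost depends only on the state size, not on how long the context is.
    """
    return state_size


def fixed_state_total_work(context_len, state_size=1):
    """Relative cost of processing the whole context: one state update per token."""
    return context_len * state_size


base = 1_000
for n in (1_000, 2_000, 4_000):
    attn_ratio = attention_work(n) / attention_work(base)
    ssm_ratio = fixed_state_total_work(n) / fixed_state_total_work(base)
    print(f"{n:>5} tokens: attention x{attn_ratio:.0f}, fixed-state x{ssm_ratio:.0f}")
```

In this model, doubling the context quadruples the attention term while the fixed-state term only doubles, and a single state update costs the same at any context length.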
10M developers use @opencode every month. This means the experience has to feel the same every hour of every day; slow or inconsistent inference breaks productivity and flow.
Enter Baseten. With Baseten's Model APIs, OpenCode achieves 5x higher TPS, sub-second TTFT, and 33% blended cost savings passed directly to users through cache token pricing.