We've launched the fastest GLM 5 API available at 190 TPS and 0.79 sec TTFT with the Baseten Inference Stack.
Ready for your coding and agentic workflows.
https://t.co/iiRmQK3D5U
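For a rough sense of what those two numbers mean together: the end-to-end time for a streamed response is approximately TTFT plus output tokens divided by TPS. A minimal sketch, assuming a steady decode rate and a hypothetical 500-token response (the response length is an illustration, not a figure from the announcement):

```python
def estimated_latency_s(ttft_s: float, tps: float, output_tokens: int) -> float:
    """Approximate end-to-end seconds: time to first token plus steady-state decode time."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return ttft_s + output_tokens / tps


# 0.79 sec TTFT and 190 TPS come from the post; 500 tokens is an assumed example length.
glm5 = estimated_latency_s(ttft_s=0.79, tps=190, output_tokens=500)
print(f"{glm5:.2f} s")
```

Under those assumptions the response lands in roughly 3.4 seconds, with most of that spent decoding rather than waiting for the first token.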
It’s Monday, and we could all use a little help thinking. Thankfully we have the new Kimi K2 Thinking to do it for us.
Kimi K2 Thinking is now live in our Model APIs with the most performant TTFT (0.3 sec) and TPS (140) on @openrouter & @ArtificialAnlys. If you’re looking for an alternative to GPT-5, use AI for coding, or are building agentic AI, you *need* to give this model a try.
Congrats @Kimi_Moonshot, you all are astounding.
Get access in the comments ➡️
speculative decoding, in this case EAGLE-3, remains one of the biggest levers to go from good to great. amazing job leapfrogging the market and getting the most out of our GPUs
This week, Baseten's model performance team unlocked the fastest TPS and TTFT for gpt-oss 120b on @nvidia hardware. When gpt-oss launched we sprinted to offer it at 450 TPS... now we've exceeded 650 TPS and 0.11 sec TTFT... and we'll keep raising the bar.
We are proud to offer the best E2E latency available with near-limitless scale, incredible performance, and the highest uptime (99.99%).
It's important to support newly released open-weight models on day 1. But it's not noteworthy. What's noteworthy is having the inference optimization muscle to immediately blow the competition out of the water on latency and throughput.
As measured by OpenRouter:
We're excited to introduce the Baseten Performance Client, a new open-source Python library that delivers up to 12x higher throughput on high-volume embedding tasks!
Stand up a new vector database, preprocess text, and run massive workloads in <2 minutes (vs. 15+ with AsyncOpenAI).
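The post doesn't show the Performance Client's API, so the sketch below deliberately does not use it. It only illustrates the general pattern that high-volume embedding jobs rely on: split inputs into batches and keep a bounded number of requests in flight. `fake_embed`, `embed_all`, and the batch and concurrency sizes are all hypothetical stand-ins, and this is not a claim about how the library gets its speedup.

```python
import asyncio


async def fake_embed(batch: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for a network call to an embedding endpoint."""
    await asyncio.sleep(0.01)  # simulate request latency
    return [[float(len(text))] for text in batch]


async def embed_all(
    texts: list[str], batch_size: int = 4, max_concurrent: int = 8
) -> list[list[float]]:
    """Embed texts in batches with bounded concurrency, preserving input order."""
    limit = asyncio.Semaphore(max_concurrent)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def run(batch: list[str]) -> list[list[float]]:
        async with limit:  # cap the number of in-flight requests
            return await fake_embed(batch)

    # gather returns results in the same order as the submitted batches
    results = await asyncio.gather(*(run(b) for b in batches))
    return [vec for batch_result in results for vec in batch_result]


vectors = asyncio.run(embed_all([f"doc {i}" for i in range(10)]))
print(len(vectors))
```

Swapping `fake_embed` for a real client call is where the choice of client starts to matter for throughput.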
New Qwen-QWQ running at 90 tokens/s generation speed on a single H100 @baseten using a new spec-dec stack.
Around 2x more than the rest of the leaderboard (https://t.co/StCzjaZ1i0).
2 things.
1. i have loved working on this team. model performance is so much fun and so rewarding.
2. persistence is key. we started working on model performance at the end of 2023, and watching us slowly become better and better has been an incredible experience.
when i tell people working in infra is like being a plumber people assume it’s because of lots of pipe connecting, when in fact it’s because i spend most of my day digging through shit
We're excited to announce that we've raised a $40M Series B to help power the next generation of AI-native products with performant, reliable and scalable inference infrastructure.
https://t.co/NAn8LduZ6I
Ready to try open source LLMs?
Switch from GPT to Mistral 7B in the smallest refactor you'll ever ship: just 3 tiny code changes.
If you're making the jump, DM us for $1,000 in free credits.
https://t.co/izLK8UUJBZ
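The tweet doesn't spell out the three changes. One plausible reading, assuming the Mistral 7B deployment exposes an OpenAI-compatible endpoint, is that only the API key, the base URL, and the model name differ. In the sketch below the Mistral-side values are hypothetical placeholders, not real Baseten settings:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ChatConfig:
    api_key_env: str  # name of the environment variable holding the key
    base_url: str     # root URL the chat client talks to
    model: str        # model identifier sent with each request


GPT = ChatConfig("OPENAI_API_KEY", "https://api.openai.com/v1", "gpt-3.5-turbo")
# Hypothetical placeholders: substitute the values from your own deployment.
MISTRAL = ChatConfig("EXAMPLE_API_KEY", "https://example.invalid/v1", "mistral-7b")


def changed_fields(old: ChatConfig, new: ChatConfig) -> list[str]:
    """Return the names of settings that differ between two configs."""
    return [f.name for f in fields(old) if getattr(old, f.name) != getattr(new, f.name)]


changed = changed_fields(GPT, MISTRAL)
print(changed)
```

If that reading holds, the rest of the application code (prompt construction, response parsing) stays untouched.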
Repurposing @tuhinone's Llama v2 truss, got FreeWilly 2 up in under a minute. `:s/meta-llama\/Llama-2-70b-chat-hf/stabilityai\/FreeWilly2`. 275GB of weights later we're running at 23 tok/s out of the box.
We keep getting asked by users if they can use the 70B parameter model in production.
We're serving the chat variant of Llama-2 70B on 2xA100 and getting pretty great throughput — it's cooking!