What does it actually cost to run an agent?
We traced a Claude Code session: 283 inference requests in 33 mins, context peaking past 150K tokens.
The economics break under conventional serving.
New blog w/ Eduardo Alvarez and @benklieger on what fixing it takes 👇
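A rough back-of-envelope sketch of why those numbers hurt. It uses only the figures quoted above (283 requests, 33 minutes, roughly 150K tokens of context) and assumes, purely for illustration, that every request re-sends about 150K tokens with no cache reuse. Real sessions grow toward the peak, so read this as a ceiling-style estimate, not a measurement.

```python
# Figures quoted in the post.
REQUESTS = 283
SESSION_MINUTES = 33
# Assumption: treat ~150K as the per-request context size for a ceiling estimate.
APPROX_CONTEXT_TOKENS = 150_000


def requests_per_minute(requests: int, minutes: float) -> float:
    """Average inference request rate over the session."""
    return requests / minutes


def input_token_ceiling(requests: int, context_tokens: int) -> int:
    """Input tokens processed if every request re-sent the full context."""
    return requests * context_tokens


if __name__ == "__main__":
    rate = requests_per_minute(REQUESTS, SESSION_MINUTES)
    ceiling = input_token_ceiling(REQUESTS, APPROX_CONTEXT_TOKENS)
    print(f"{rate:.1f} requests/min, up to ~{ceiling / 1e6:.1f}M input tokens")
```

Even as a loose estimate, one half-hour session can mean tens of millions of input tokens, most of them repeated context.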
What does it actually take to run agentic workloads at scale?
⚡Agents push token consumption, context length, and latency into extremely demanding regions. Extreme co-design on the Vera Rubin platform is built for these complex workloads, delivering 400+ tokens/sec/user on trillion-parameter MoE models.
Tech blog ➡️ https://t.co/DIxW96omML
Evaluating AI inference TCO?
Look beyond compute costs and evaluate cost per token, which reflects end-to-end system performance and actual utilization across the entire AI factory: GPUs, CPUs, storage, networking, software, and more.
Lowest cost per token isn’t achieved by optimizing peak chip specs alone.
It’s the result of deep, end-to-end co-design with our partners including @CoreWeave, @Nebiusai, @Nscale, and @togethercompute across the full stack.
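One way to make "cost per token" concrete is a minimal sketch like the one below. The function name, the hourly cost, the throughput, and the utilization values are all hypothetical and not taken from the post. The only idea borrowed from it is that the metric divides total system cost by tokens actually delivered, so utilization matters as much as peak specs.

```python
def cost_per_million_tokens(
    total_cost_per_hour: float,
    peak_tokens_per_sec: float,
    utilization: float,
) -> float:
    """Cost per 1M tokens delivered.

    total_cost_per_hour: all-in hourly cost (hardware, power, networking,
        storage, software); a hypothetical input.
    peak_tokens_per_sec: system throughput at full load.
    utilization: fraction of peak actually sustained, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return total_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Same hypothetical system, two utilization levels.
    for util in (0.5, 0.25):
        cost = cost_per_million_tokens(100.0, 10_000, util)
        print(f"utilization {util:.0%}: ${cost:.2f} per 1M tokens")
```

In this toy model, halving utilization doubles cost per token even though the chip specs are unchanged.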
🚨Today we’re rolling out Prompt Caching on GroqCloud.
Keep hot prompts in memory, cut cached token costs by 50% and slash latency.
Faster response, smarter inference.
Learn more 👇
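A small arithmetic sketch of what a 50% discount on cached tokens does to an input bill. The 50% figure comes from the announcement above. The base price, the token counts, and the function name are hypothetical, and how GroqCloud decides which tokens count as cached is not modeled here.

```python
def input_cost(
    total_input_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_discount: float = 0.5,  # 50% off cached tokens, per the announcement
) -> float:
    """Input cost when some tokens are billed at the cached rate."""
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    uncached = total_input_tokens - cached_tokens
    billable = uncached + cached_tokens * (1 - cached_discount)
    return billable / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical: 1M input tokens at $1.00 per 1M, 80% served from cache.
    without_cache = input_cost(1_000_000, 0, 1.0)
    with_cache = input_cost(1_000_000, 800_000, 1.0)
    print(f"no cache: ${without_cache:.2f}, 80% cached: ${with_cache:.2f}")
```

The saving scales with the cache hit share, so long, stable prompt prefixes benefit most.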
Introducing: Groq Code CLI
“Hold on!”, you say. Coding CLIs are everywhere. The Groq Code CLI is different. It is a template and building block for developers looking to customize and extend a CLI to be entirely their own. Leading open-source CLIs are all fantastic yet gigantic. Feature-rich: yes, but local development with such a large codebase can be unfriendly and overwhelming.
This is a project for those developers looking to dive in.
Link to the code: https://t.co/cKdIJXm8aS
And yes, it created that in 14.6s on Kimi K2.
@GroqInc
Meet your full-stack AI assistant, powered by OpenAI’s new open gpt-oss models (20B + 120B) and Groq.
Includes search, code execution, STT/TTS, code-gen editor, AI notes, and custom tool calling.
Build Fast.
OpenAI’s open models are live and already running on Groq. Try gpt-oss-20B and gpt-oss-120B today.
Groq delivers 128K context and built-in tools such as code execution and browser search. For the first time, developers and enterprises can deploy open models backed by OpenAI instantly, anywhere, at scale.
Start building now. Links in comments.
Two PhD students build a voice agent from scratch—and take on a $350B industry. Enter @GroqInc: the catalyst.
I can’t get enough of real stories like this one. Congrats to the @phonely_ai team on the beginnings of your journey. Onward.
⚡ @GroqInc powers up RAISE your HACK - the World’s Largest AI Hackathon, giving you direct access to their groundbreaking LPU architecture for blazing-fast LLM performance! Prepare to experience inference at an entirely new level.
And guess what? They’re bringing… quite a prize pool to the table!
👉 In case you missed it, @GroqInc just released Compound Beta, their first compound AI system, built by @benklieger!
Here is an open-source demo with voice-in capability to kickstart your development work. Link in the comments below! 🔥
llama 4 scout on @groqinc paired with @elevenlabs is incredible for multilingual voice agents.
insanely smooth even switching between different languages thanks to low latency.
and for those who have been asking about its turkish - i've been testing and it's pretty good. :)
🔊🔛🔥@GroqInc just dropped support for @MetaAI's Llama 4 🚀
But what if you want to have a real-time, dynamic conversation with your images? Now you can, thanks to STT, LLM, Image-to-Text, and TTS models powered by Groq and the @LiveKit SDK. Repo and details below 👇
Llama4 + MCP find me an Airbnb 🦙✨
- Giving Llama4 a "thinking" scratchpad is a superpower
- @GroqInc is so fast that it makes the additional "think" step tolerable.
I love everything about this! Full code below 👇