Very sad news for the LLM research and open-source community. Does this mean PhD researchers in frontier LLMs, or contributors to open-source LLM infrastructure like Megatron, FSDP, Verl, SGLang, and vLLM, may be using a degraded Claude model in their daily work without being notified?
As part of Dynamo 2.0, the program abstraction proposed in ThunderAgent is being standardized in Dynamo's nvext.agent_context protocol.
Inference scheduling and KV cache management with agent lifecycle awareness isn't the future anymore; it's the trend happening right now!
Excited to share that ThunderAgent has been integrated into NVIDIA Dynamo as an experimental router for agentic workloads!
ThunderAgent was designed to schedule at the granularity of agent runs, making agentic serving/RL up to 4x faster!
Huge thanks to @0xishand, @KranenKyle, and the Dynamo team. They have been exceptionally efficient and proactive: the team had already started pushing this forward even before I officially joined @nvidia.
Looking forward to seeing ThunderAgent ideas evolve further within Dynamo. Thanks also to @togethercompute for the help.
Link: https://t.co/CzteUYO0JD
@simran_s_arora @Chenfeng_X @_weilix @yinfang_chen
#AI #MLsys #Agent #Nvidia
📢 Official Announcement: Qwen Partners with Fireworks AI to Accelerate Access to Qwen Family Models
We are pleased to announce a strategic partnership between Qwen and Fireworks AI to deliver optimized, production-ready deployment of Qwen's closed weights models via the Fireworks Platform. @FireworksAI_HQ
This collaboration empowers developers and enterprises to:
✅ Deploy Qwen models with lower latency and reduced fine-tuning and inference costs
✅ Leverage enterprise-grade reliability, security, and scalability
✅ Integrate seamlessly into modern AI workflows
🔹 Get started with Qwen on Fireworks: https://t.co/SEGxfJAGM4
#Qwen #FireworksAI #OpenSourceAI #LLM #AIInfrastructure #ResponsibleAI #DeveloperCommunity
Big Update🤩: #paperclip now includes full papers from all of arXiv, PubMed Central and 150 million abstracts!🖇️
You can give your LLM all that knowledge in one line—all optimally indexed for AI agents. Much more thorough and ~100x faster than web search, and free.
Introducing Kimi K2.6 from @Kimi_Moonshot, a multimodal agentic model with Agent Swarm scaling to 300 sub-agents and long-horizon coding stability. AI natives can now use Kimi K2.6 on Together AI and benefit from reliable inference for production-scale autonomous agent workflows.
NICE Talk 141🌟 invites Hao Kang (@GT_HaoKang), Ph.D. at Georgia Tech, to discuss ThunderAgent: 4× Faster LLM Agent Inference!
Time
⏰ PST 3.07 18:00–19:00
⏰ EST 3.07 21:00–22:00
⏰ Beijing 3.08 10:00–11:00
Watch live: https://t.co/4MUXa6HIKK
Register: https://t.co/vP7exZ3tRS
In this talk, the speaker will cover:
🚀 How can we make LLM agent workflows faster, simpler, and more robust?
❌ Traditional request-level engines (vLLM, SGLang) struggle with KV cache thrashing, memory imbalance, and resource leaks.
✅ ThunderAgent introduces Program Abstraction, treating multi-step agent workflows as programs, unifying GPU, CPU, and remote tool scheduling.
With just two lines of code, ThunderAgent boosts inference throughput by 1.5–3.6× and rollout throughput by 1.8–3.9×, and reduces disk usage by 4.2×, while keeping high-concurrency workloads stable.
Join us to explore a principled, program-level approach to distributed agent inference and RL rollouts.
#AI #LLM #AgenticAI #ReinforcementLearning #DistributedSystems #ProgramAbstraction #ThunderAgent
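To make the "workflow as a program" idea above concrete, here is a minimal sketch of a table that tracks agent runs rather than individual requests. It is not ThunderAgent's API: `Phase`, `Program`, `ProgramTable`, and the eviction ordering are hypothetical names and policies chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    REASONING = "reasoning"  # an LLM call is in flight (GPU-bound)
    ACTING = "acting"        # waiting on a tool call (CPU or remote)
    DONE = "done"            # the agent run has finished


@dataclass
class Program:
    """One multi-step agent run (hypothetical structure, for illustration)."""
    program_id: str
    phase: Phase = Phase.REASONING
    cached_tokens: int = 0   # context tokens whose KV cache is held
    steps: int = 0           # number of LLM calls seen so far


class ProgramTable:
    """Tracks end-to-end state per program instead of per request."""

    def __init__(self):
        self._programs = {}

    def on_llm_request(self, program_id, prompt_tokens):
        prog = self._programs.setdefault(program_id, Program(program_id))
        prog.phase = Phase.REASONING
        prog.cached_tokens = prompt_tokens
        prog.steps += 1
        return prog

    def on_tool_call(self, program_id):
        self._programs[program_id].phase = Phase.ACTING

    def on_finish(self, program_id):
        self._programs[program_id].phase = Phase.DONE

    def eviction_candidates(self):
        """Programs whose cache could be released: finished runs first,
        then runs idle on a tool call (largest cache first).
        Programs with an LLM call in flight are never listed."""
        order = {Phase.DONE: 0, Phase.ACTING: 1}
        idle = [p for p in self._programs.values() if p.phase in order]
        return sorted(idle, key=lambda p: (order[p.phase], -p.cached_tokens))


table = ProgramTable()
table.on_llm_request("run-a", prompt_tokens=100)
table.on_llm_request("run-b", prompt_tokens=50)
table.on_tool_call("run-a")   # run-a is now waiting on a tool
print([p.program_id for p in table.eviction_candidates()])
```

A request-level engine only sees the two LLM calls; the table also knows that run-a is paused on a tool and is likely to come back with the same context, which is the kind of lifecycle information a program-aware scheduler can act on.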
I used to be a strong believer in the “Bitter Lesson.” However, my view began to shift once I realized that real-world agentic systems inevitably need to call external tools due to limitations in knowledge acquisition, precision computation, and environment interaction.
An important observation is that LLMs, especially when deployed as agents, are not purely connectionist systems. Instead, they are better understood as a hybrid of connectionism and symbolism. While we encode discrete tokens into continuous representations through neural networks, we ultimately decode them back into symbolic forms to operate in the real world.
For example, special tokens such as <EOS> serve as explicit symbolic markers that deterministically control termination. This illustrates that even within LLMs, symbolic structure plays a fundamental operational role. This reflects something deeper: humans use discrete symbols to make sense of a continuous world. We impose structure, define rules, and create abstractions so that reasoning and coordination become possible. Symbolism is not a relic of pre-neural AI; it is a mechanism for control.
If we want LLMs to be controllable, we cannot ignore their symbolic layer. The question is not whether to use symbols, but how to use them more flexibly. We need better ways to integrate discrete symbolic structure with continuous neural computation, rather than pretending that scaling alone will dissolve the need for structure.
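The <EOS> point can be seen in a few lines. This is a toy sketch, not any real inference engine: `generate` and `scripted_model` are hypothetical helpers, and the "model" simply replays a fixed script. However the next token is produced, the stop decision in the loop is a plain symbolic comparison.

```python
EOS = "<EOS>"


def generate(next_token, prompt, max_new_tokens=16):
    """Decode loop: the model proposes tokens, a symbolic rule ends the run."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == EOS:        # discrete, deterministic control point
            break
        tokens.append(tok)
    return tokens


def scripted_model(script):
    """Stand-in for a neural model: returns the next scripted token."""
    it = iter(script)

    def next_token(_tokens):
        return next(it)

    return next_token


model = scripted_model(["hello", "world", EOS, "never-emitted"])
print(generate(model, ["<BOS>"]))  # ['<BOS>', 'hello', 'world']
```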
GPT-5.3-Codex + the Codex app is the best AI coding tool available right now.
Slept on it for a bit.
Likely going to move back to a ChatGPT Pro sub from Claude MAX because of how good it is.
It's so precise, accurate and excellent at following instructions. There are trade-offs in that it has a more "machine-like" personality than Claude.
I do still love Claude.
But for getting software dev work done, Codex is the best option right now.
It's two things:
1. OpenAI is clearly investing a lot of their human talent into making Codex better.
2. They are co-designing the model and harness together.
And I believe they have the most rapid post-training capabilities, which is why there has been a new model iteration every month for the last few months.
Endorsing Codex.
Check out our ThunderAgent (https://t.co/fMko6C1M1i) and @GT_HaoKang's post 👇: 2 lines of code, up to 3.9x throughput improvement, and 4.2x disk memory savings on your agentic inference system 😉
Check out ThunderAgent, led by @GT_HaoKang, intern at @togethercompute! An agentic workflow involves multiple model and tool requests, but inference systems make scheduling decisions on a per-request basis. ThunderAgent introduces a simple "program abstraction" to track the end-to-end workflow state and improve agentic inference throughput! 🔥
🔥Modify 2 lines of code and get your agentic serving/rollout up to 3.9x faster, losslessly!
⚡️Say hello to ThunderAgent, a fast, simple, and program-aware agentic Inference System.
🥇 We propose a program abstraction to schedule all GPU and CPU resources, the first principled approach for distributed agentic inference and rollout.
🌐 Blog: https://t.co/PAcgTZzlhD
💻 Code: https://t.co/nr7XJj1L7B
📜 Paper: https://t.co/aCD6POzwkU
#AI #ThunderAgent #LLMAgent #Mlsys
1/n
Time to consider not just human visitors, but to treat agents as first-class citizens. Cloudflare’s network now supports real-time content conversion to Markdown at the source using content negotiation headers.
https://t.co/B7wYH4PtA8
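For readers unfamiliar with content negotiation, here is a rough sketch of both sides of the exchange. It assumes the agent signals its preference with an `Accept: text/markdown` header, which is my reading of the post rather than something it states; the URL is a placeholder, and `pick_representation` is a hypothetical, simplified server-side chooser, not Cloudflare's implementation.

```python
from urllib.request import Request


def markdown_request(url):
    """Build (but do not send) a request that prefers Markdown over HTML."""
    return Request(url, headers={"Accept": "text/markdown, text/html;q=0.8"})


def pick_representation(accept_header, available=("text/html", "text/markdown")):
    """Return the available media type with the highest q-value.

    Falls back to the first available type when nothing matches.
    Simplified: wildcards such as */* are not expanded.
    """
    best_type, best_q = available[0], 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0   # q defaults to 1.0 when omitted
        for param in fields[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        if media_type in available and q > best_q:
            best_type, best_q = media_type, q
    return best_type


req = markdown_request("https://example.com/docs")  # placeholder URL
print(pick_representation(req.get_header("Accept")))  # text/markdown
print(pick_representation("text/html"))               # text/html
```

The same URL serves both audiences: a browser that asks for HTML gets HTML, while an agent that asks for Markdown gets a lighter representation.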
Beyond softmax attention
Linear attention and its variants enable faster inference without growing the KV cache.
Let’s learn the core ideas behind efficient sequence modeling. 👇
https://t.co/geNiBXKdlI
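As a rough illustration of why linear attention avoids a growing KV cache, the sketch below keeps a fixed-size running state instead of storing every past key and value. It uses the elu(x) + 1 feature map, one common choice, as an assumption; the linked material may cover other variants. `LinearAttentionState` and `quadratic_reference` are illustrative names, written in plain Python for readability rather than speed.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, which keeps every feature positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionState:
    """Causal linear attention as a recurrence with O(d_k * d_v) memory."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
        self.z = [0.0] * d_k                        # running sum of phi(k)

    def step(self, q, k, v):
        fq, fk = phi(q), phi(k)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            for j, vj in enumerate(v):
                self.S[i][j] += fki * vj
        denom = sum(a * b for a, b in zip(fq, self.z))
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(len(v))
        ]


def quadratic_reference(qs, ks, vs):
    """Same outputs, computed by revisiting all past keys and values."""
    outs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in ks[: t + 1]]
        total = sum(w)
        outs.append([
            sum(wi * v[j] for wi, v in zip(w, vs[: t + 1])) / total
            for j in range(len(vs[0]))
        ])
    return outs


qs = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.7]]
ks = [[0.3, 0.1], [-0.4, 0.9], [0.2, -0.6]]
vs = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 1.0, 0.5]]

state = LinearAttentionState(d_k=2, d_v=3)
recurrent = [state.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
reference = quadratic_reference(qs, ks, vs)
print(all(abs(a - b) < 1e-9
          for r, f in zip(recurrent, reference) for a, b in zip(r, f)))
```

The state stays at 2 × 3 numbers plus a length-2 normalizer no matter how many tokens are processed, whereas the reference version needs every past key and value at each step.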
Learn how @cursor_ai partnered with Together AI, the AI Native Cloud, to deliver real-time inference for AI-powered coding.
Cursor's in-editor agents generate code while developers actively edit — requiring responses inside the editor's feedback loop. Together AI built the infrastructure to meet those strict latency targets at scale.