Starting with this exciting line of work from @jintian and colleagues at MIT. We tackle the question: can we train LLMs to parallelize autoregressive decoding automatically, backed by a performant runtime that exploits this parallelism to speed up inference?
Introducing Learned Asynchronous Decoding w/ friends from MIT/Google! LLM responses often have chunks of tokens that are semantically independent. We train LLMs to identify and decode them in parallel, speeding up inference by 1.46x geomean (AlpacaEval) w/ only 1.3% quality loss.
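For readers unfamiliar with the "geomean" figure quoted above: a geometric mean is the usual way to average per-prompt speedup ratios, since it treats a 2x speedup and a 0.5x slowdown as cancelling out. The sketch below is only an illustration; the speedup values are made up and are not results from the paper.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed as exp(mean(log(x)))."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-prompt speedups (baseline latency / parallel latency).
# These numbers are placeholders, not measurements.
speedups = [1.0, 2.0, 4.0]
print(geomean(speedups))  # approximately 2.0, versus an arithmetic mean of about 2.33
```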
A short article on our #ICML2025 paper (led by @jintian, @ellieyhc MITxGoogle): PASTA teaches LLMs to adaptively parallelize their own decode, optimizing quality & latency in concert. No hand-crafted heuristics -> learned parallelism, with realized latency improvements on GPUs.
Asynchronous decoding: multiple LLM threads write different parts of an answer in parallel.
In Feb we (MIT×Google) introduced PASTA—the first async-dec method that uses policy learning to optimize latency & quality end-to-end. See us @ E-2600, East Hall A-B, Tue 11pm #ICML.
Scaling Laws provide a valuable lens in guiding model design and computational budgets. Our recent work extends this lens to the realm of _fine-grained_ sparsity. Check out our #ICLR2025 paper, and the thread below from lead-author @jintian summarizing our findings.
📣 The Journey Matters: Our #ICLR2025 paper shows how to pretrain sparse LLMs with half the size of dense LLMs while maintaining quality. We found that the average parameter count during sparse pre-training predicts quality, not final size. An MIT/Rice/Google/ISTA collab 🧵 1/N
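To make "average parameter count during sparse pre-training" concrete, here is a minimal sketch that averages the active parameter count over training steps for a phased pruning schedule. The schedule, the `average_param_count` helper, and the 1B-parameter model are all hypothetical, and the paper's exact definition may differ.

```python
def average_param_count(dense_params, schedule):
    """Step-weighted average of active parameters.

    schedule is a list of (num_steps, density) phases, where density is the
    fraction of weights kept during that phase (1.0 means fully dense).
    """
    total_steps = sum(steps for steps, _ in schedule)
    if total_steps <= 0:
        raise ValueError("schedule must cover at least one step")
    weighted = sum(steps * density * dense_params for steps, density in schedule)
    return weighted / total_steps


# Hypothetical schedule: start dense, prune to 75%, finish at 50% density.
dense = 1e9
schedule = [(25, 1.0), (50, 0.75), (25, 0.5)]

print(average_param_count(dense, schedule))  # average over training: 7.5e8
print(dense * schedule[-1][1])               # final size: 5e8
```

In this toy schedule the final model has half the dense parameter count, while the average over training is three quarters of it. That average is the quantity the tweet says tracks quality.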
TPUs have been a key enabler for the Gemini models -- from large-scale training, to fast and cost-effective serving. Our latest generation TPUs (Ironwood) will bring more exciting compute capabilities to the fore: https://t.co/uillylLyT1
Breaking news: Google is winning on every AI front.
This is not just about Gemini 2.5 but about a reality that OpenAI and Anthropic fans have ignored for too long. Here's a non-exhaustive list:
- Gemini 2.5 Pro is the best model in the world according to benchmarks, vibe checks, high-taste testers, and firsthand testimonies. It's also fast and cheap compared to similar models (Google offers it for free on the Gemini app!)
- Gemini 2.5 Flash (to be announced soon) is much faster and much cheaper, so it sits squarely on the cost-performance Pareto frontier for cost-efficient models.
- Gemma 3 is a highly competitive open-source model, as good as or better than Llama 4 and DeepSeek models.
- That's just LLMs. Google is world-class in image (Imagen 3), video (Veo 2), voice (Chirp 3), and music (Lyria). They're integrating them all in Vertex AI.
- Deep Research with Gemini 2.5 Pro is *twice as good* as OpenAI's Deep Research, according to human testers. Other agents? Yes: Project Astra (assistant) and Project Mariner (computer interaction)
- They just launched Agent2Agent, compatible and complementary to Anthropic's MCP, which they will build in-house as well.
- And they keep publishing papers in top journals (Nature) and going to the top conferences (ICLR, NeurIPS), whereas others jealously keep their most important stuff for themselves.
- That's just the AI stuff, but Google is also a consumer software company with seven products that each have 2+ billion monthly users: Search, YouTube, Gmail, Android, Chrome, Maps, and Play Store
- A hyperscaler (Google Cloud)
- A hardware company (TPUs, Ironwood)
- And a phone company (Pixel).
How can OpenAI or Anthropic or even Meta fight such a beast?
Let’s wait for their responses to this. I’ll be here to cover any newsworthy release—even if I’ve already made my bet on who’s most likely to win.
(Read the full post in the link below.)
@SabaMugazambi @JeffDean @NormJouppi And finally, for those interested in more technical details, and codesign across multiple layers of the stack, from hardware and circuits to software and all the way up to the datacenter: https://t.co/gVuV978G6U
@Google announced the latest generation of our AI supercomputers (TPUs) -- Ironwood -- this week. Check out the blogpost in quote for the highlights. https://t.co/TMns1HbEkA
Pointers to deep-dives and more technical details in thread. [contd...👇]
@SabaMugazambi @JeffDean @NormJouppi A couple of fun videos that provide a sneak peek into TPUs and how they are plugged into our datacenters: [1] https://t.co/V43HD2SKad, [2] https://t.co/7RoiKy59WZ
Together with Lisa Hsu (Meta), we have been hosting the Computer Architecture Podcast, and we recently crossed 50K downloads. Check out our latest episode with Prof. Arka Basu: https://t.co/RRg6gq7ZMa. We discuss GPUs, but from a different vantage point than AI, which is all the rage.
A couple of excellent resources on how to think about AI systems performance, parallelism, and scaling. The one below is from colleagues at Google and focuses on TPUs. Another resource that dropped in the past month is the Ultra-scale Playbook from HF: https://t.co/xuE8SZIPzu.
Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
In addition to the ArchReasoning Challenge, please submit your work at the intersection of ML, Computer Architecture and Systems to the MLArchSys Workshop at ISCA'25 (Tokyo). CFP and topics in quote.
Please consider submitting your best work. MLArchSys is the best place to showcase your work at the intersection of ML, Computer Architecture, and Systems. Check out the call for papers and look for the new topics we included this year 🚀🔥
https://t.co/QiYsSFVcny
1/3
High-quality data is a key enabler for effective, useful, and actionable use of AI. We are working towards collecting and curating such a dataset for the computer architecture domain. Submit your favorite architecture qns to ArchReasoning Challenge (https://t.co/5JG0DUOpta).
We're excited to launch the 𝐀𝐫𝐜𝐡𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠 𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞 (https://t.co/SS4EuHt5wA). Design complex, reasoning-based questions that expose the current limitations of LLMs and contribute to the broader effort of improving AI reasoning for comp. arch. and systems.
Returning to Twitter/X after a decade-long hiatus. My excellent intern(s) at Google, with whom I have had the pleasure of working, were kind enough to nudge me to help signal-boost their work. Will also try to share updates on TPUs, AI chips & systems, and computer architecture.