andy hwang

@andrewnhwang

@thinkymachines formerly: @openai, @weHRTyou, @stanford🌲

SF / NYC

Joined July 2020

419 Following

690 Followers

57 Posts

andrewnhwang retweeted

Scale Labs @ScaleAILabs

about 2 months ago

Congrats to @thinkymachines on the release of TML-Interaction-Small and tying for the top spot on our Audio MC S2S leaderboard! 🥇 Their interaction model scores a 43.4% APR, demonstrating an impressive level of intelligence and long-context awareness compared to existing full-duplex models, without losing responsiveness in conversation.

ScaleAILabs's tweet photo. Congrats to @thinkymachines on the release of TML-Interaction-Small and tying for the top spot on our Audio MC S2S leaderboard! 🥇

Their interaction model scores a 43.4% APR, demonstrating an impressive level of intelligence and long-context awareness compared to existing full-duplex models, without losing responsiveness in conversation.

229

34K

andrewnhwang retweeted

Thinking Machines

@thinkymachines

about 2 months ago

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. https://t.co/AFJZ5kH7Ku

465

16K

12K

andy hwang @andrewnhwang

2 months ago

@MatternJustus Really exciting collection of tasks! Great work @MatternJustus and @calvinchen

255

andrewnhwang retweeted

Thinking Machines

@thinkymachines

4 months ago

We are partnering with @nvidia to power our frontier model training and platforms delivering customizable AI. https://t.co/JFlf9hJkKh

thinkymachines's tweet photo. We are partnering with @nvidia to power our frontier model training and platforms delivering customizable AI.

https://t.co/JFlf9hJkKh https://t.co/M23RB8dZxI

101

164

289

675K

Who to follow

Rachel Park

@rachelsupark

@benchmark // @modal @warpdotdev @stanford

friends and family

@friendsandfam_

A home for makers and founders @ Stanford

david

@davidtsong

intern @ maximal | prev at AI startups, VC, and stanford cs DMs open if you want help w/ benchmarks/evals/custom models

andy hwang @andrewnhwang

4 months ago

@ZitongYang0 @EmmanuelCandes @tatsu_hashimoto @percyliang @ruomingpang Congratulations Dr. Yang 🎉

484

andy hwang @andrewnhwang

4 months ago

@ProximalHQ congrats @calvinchen excited for you!

andrewnhwang retweeted

Surya

@sdand

5 months ago

inspired by SDPO, i made continualcode -- a minimal claude code that learns from your corrections in real-time, built on tinker. when you deny a diff, the model uses your correction as context to teach itself, takes a gradient step on LoRA, and retries with updated weights. claude code but it updates the model weights!

389

316

83K

andrewnhwang retweeted

Woosuk Kwon

@woosuk_k

5 months ago

Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster. The Challenge Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize. The capability gap between models and the systems that serve them is widening. Left this way, the most capable models remain bottlenecked and with full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities. And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data. We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building. Why Us vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users. Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale—in research and in production. Open Source vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls. Join Us Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us. We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks. - @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team

$woosuk_k's tweet photo. Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster. The Challenge Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize. The capability gap between models and the systems that serve them is widening. Left this way, the most capable models remain bottlenecked and with full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities. And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data. We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building. Why Us vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users. Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale—in research and in production. Open Source vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls. Join Us Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us. We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks. - @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team$

181

129

351

485K

andrewnhwang retweeted

Thinking Machines

@thinkymachines

7 months ago

Congratulations to @axiommathai on their achievement! AxiomProver, a mathematics model fine-tuned with Tinker, got top scores on the Putnam Math Competition.

669

154

301K

andrewnhwang retweeted

Soumith Chintala

@soumithchintala

7 months ago

kinda cool that Axiom started just 4 months ago, and just had amazing Math results on Putnam because they bootstrapped infra via @thinkymachines Tinker. An early proof-point that Tinker could be to AI frontier research labs what AWS was to Product startups in the 2010s.

416

63K

andrewnhwang retweeted

Yangjun Ruan

@YangjunR

7 months ago

I’ll be attending #NeurIPS starting Wednesday as part of @thinkymachines! Feel free to DM me if you’d like to catch up, chat about research, or learn more about Thinky (we have openings!)🤝 https://t.co/IjUWdrtEJj

165

17K

andrewnhwang retweeted

Mira Murati

@miramurati

7 months ago

Consider joining us https://t.co/WzNNHtd3Oa

634

584K

andrewnhwang retweeted

LMSYS Org

@lmsysorg

7 months ago

🚀 Introducing Miles — an enterprise-facing RL framework for large-scale MoE training & production, forked from slime. Slime is a lightweight, customizable RL framework that already powers real post-training pipelines and large MoE runs. Miles builds on slime but focuses on new hardware (e.g., GB300), large-scale MoE RL, and production-grade stability. Please read more about Miles' current status and roadmap here👇

lmsysorg's tweet photo. 🚀 Introducing Miles — an enterprise-facing RL framework for large-scale MoE training & production, forked from slime.

Slime is a lightweight, customizable RL framework that already powers real post-training pipelines and large MoE runs. Miles builds on slime but focuses on new hardware (e.g., GB300), large-scale MoE RL, and production-grade stability.

Please read more about Miles' current status and roadmap here👇

263

212

137K

andrewnhwang retweeted

Soumith Chintala

@soumithchintala

7 months ago

thinking machines....the people are incredible

144

220

804K

andrewnhwang retweeted

Lequn Chen @abcdabcd987

8 months ago

Wrote a blog post on why collective communication feels awkward for newer LLM workloads (disaggregated inference, RL weight update, MoE), why people don’t just use raw RDMA, how we approached it, and some behind-the-scenes stories. https://t.co/G0IiHo54qc

230

173

22K

andrewnhwang retweeted

Thinking Machines

@thinkymachines

8 months ago

Science is best shared! Tell us about what you’ve built or discovered with Tinker, so we can tell the world about it on our blog. More details at https://t.co/2z5U597QZ4

446

153

148K

andrewnhwang retweeted

Iain Dunning

@iaindunning

8 months ago

This was a real pleasure and honor, thanks @TheStalwart and @tracyalloway for the opportunity. I hope this strips back some of the mystique of the quant trading world, and adds an interesting element to peoples understanding to the "AI story" we're all living through, that is distinct from LLMs.

127

131K

andrewnhwang retweeted

Thinking Machines

@thinkymachines

8 months ago

Roadmap update: Tinker launched into private beta a month ago, and we've seen hundreds of builders and researchers train and experiment with models on our platform. This month we've added new models, expanded the cookbook, and improved overall capacity and performance.

115

543

130

266K

andrewnhwang retweeted

Thinking Machines

@thinkymachines

8 months ago

Today we’re announcing research and teaching grants for Tinker: credits for scholars and students to fine-tune and experiment with open-weight LLMs. Read more and apply at: https://t.co/EAx6uOpDCS

998

126

600

505K

andrewnhwang retweeted

Pengyu Zhao

@zpysky1125

8 months ago

MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model? On behave of pre-training lead Haohai Sun. (https://t.co/upkka9WIlV) I. Introduction As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock and go with full attention with MiniMax M2?" After explaining the backstory in one chat after another, I figured it's time to write down our journey in a blog. Honestly, I could give you the textbook debate. I could talk all afternoon about why you should build linear/sparse attention. Then, I could turn around and talk all afternoon about why you shouldn't. But what's the point of all that hand-waving? The real question is whether you should actually do it. So, let's start with the conclusion: We are always working on it. But in a real-world, industrial-grade system, the truth is that efficient attention still has some way to go before it can definitively beat full attention. As LLMs have evolved, the entire stack has become monstrously complex. We serve more scenarios, and the architecture design trade-offs are exploding: "How does it perform on code and math? What about agent scenarios? How does it handle multimodality? Does long-chain CoT still hold up? Can RL scale on top of it? Are there hidden traps with low-precision compute? How do you implement interleaved thinking, caching, or speculative decoding? ... " In short, there's a vast difference between the promise on paper and its payoff in production. You only get to claim that payoff after satisfying Condition 1...n and solving Problem 1...n. II. Why Efficient Attention? Let's do a thought experiment. If you had infinite compute, would you even bother with linear or sparse attention? Some might bring up theoretical arguments about softmax attention "oversmoothing" in an infinite context... but who knows? Under the current compute bound, no model has truly pushed softmax attention to its absolute limit. So, for all practical purposes, the race for efficient attention is a race to save compute. For our M2 design, could we aim to save tokens — achieving the same quality with fewer tokens? Well if you believe in scaling laws, to achieve this goal, you'd probably bet on other paths to get there, not efficient attention. So, the simple truth is this: Compute is finite. We need an architecture that makes better use of it — models that achieve higher performance under the same budget (training & inference). III. The Real Bottlenecks To build a model that can practically be deployed and used by the community, we have to start with what users care: Quality, Speed (TPS), and Price. Quality is non-negotiable. A useless model is useless even if it's free. So how do we make a Linear/Sparse/Hybrid Attention model that performs well enough? The biggest challenge here isn’t the architecture design — the real bottleneck is the limitations of evaluation. (As for speed and price, those are heavily influenced by the inference stack—and great models tend to attract great engineers to optimize them.) The Evaluation Trap: Goodhart's Law in Action “As long as you build the benchmark, I’ll find a way to beat it.” Over the past few years of LLM development, the pace of leaderboard progress is staggering. No matter how hard a benchmark is — even if the SOTA score starts in single digits — once it catches the industry’s attention, it’s usually crushed within a few iterations. But how do you build an evaluation system that is comprehensive and actually reflects a model's true capabilities? That’s one of the hardest — and most critical — problems in LLM development, and it becomes even more acute when you start messing with a component as fundamental as attention. Benchmarks are a Leaky Abstraction There’s no free lunch. When you reduce the complexity of attention, you pay a price. The question is, where? When we were developing MiniMax-Text-01, everyone was still evaluating MMLU, BBH, MATH, and LongBench (all of which are now saturated). From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention. Our own small-scale hybrid models confirmed this on the leaderboards. (Did we find a free lunch?) Not quite. The price paid became obvious at a larger scale: the model had clear deficits in complex, multi-hop reasoning tasks. Okay, once a problem is exposed, you can fix it. We developed proxy metrics for this specific weakness and iterated until the hybrid model seemed to match MHA. But does that proxy metric still correlate with real-world downstream performance at an even larger scale? Are there other hidden weaknesses? Who knows. We haven't run those experiments yet. The better the models get, the harder they are to evaluate. But that’s a must part of the journey — keep it up, eval teams! The High Cost of Knowing Things For complex reasoning tasks, we can sometimes find early proxy metrics that correlate well with final performance — but not for all tasks (at least, not yet). As tasks get harder, the amount of experiment compute required just to get a statistically significant signal on your metric grows astronomically — which is ironic, since we study efficient attention because compute is limited. And beyond the academic benchmarks, optimization issues often only surface at scale. You never really know what’s going to happen until you scale up. Anyone who read our M1 paper will recall the serious precision issues we hit during RL training — problems that would’ve been spotted earlier. Going back and analyzing Lightning Attention's numerical convergence with that experience in hand was incredibly clarifying. Discovering the real problems is often far harder than solving them. A Symphony of Variables There are just too many variables in model training. Different architectures behave very differently on different data distributions and with different optimizers. In a world where our data is constantly being updated, an experiment run on last month's data mix might yield the opposite conclusion today. We can’t observe everything perfectly — but we’re working on finding more reliable experimental strategies. Infrastructure: Where Theory Meets Metal Compared to full attention, the infrastructure for linear and sparse attention is much less mature. To actually get the promised results, there’s still a lot of groundwork to fill in. Take linear attention for example: If you analyze the compute intensity of existing linear architectures, many of them are memory-bound — even during training. Without extreme IO optimization, you’re basically leaving a huge amount of GPU FLOPs on the table. And inference brings even more challenges than training: How do you deliver a service that is genuinely faster and cheaper? Linear attention has linear compute complexity and constant memory usage. That means there’s a crossover point where it becomes more efficient than full attention in compute and memory. In theory, that point lies at a few thousand tokens — which isn’t particularly long for today’s large models. But that’s just theory. We need to solve a few key problems to actually approach it: Low-Precision State Storage: Linear attention is currently far more sensitive to numerical precision than full attention. Prefix Caching: In real-world applications, the cache-hit rate for conversations is very high. A new architecture must handle this gracefully. Speculative Decoding: How do you optimize speculative decoding with linear attention backbone? Well fortunately, all of these seem solvable. IV. What’s Next Scaling remains the name of the game, and context scaling is one of the key problems. Longer and longer context length is key in both pre-training and post-training. As GPU compute growth slows while data length keeps increasing, the benefits of linear and sparse attention will gradually emerge. We should start preparing now: Better Data: More multimodal, information-rich long-context data. Better Evaluation: More informative evaluation system and experimental paradigms to speed up iteration. Better Infrastructure: Mature training and inference infrastructure to fully squeeze out GPU potential. V. Addendum: the SWA code... We accidentally left the SWA inference code in the open-source release, and some people asked why it wasn’t used in the final model. Simple answer: the performance wasn't good enough. That experiment was from quite early on, before GPT-OSS was open-sourced (we were pretty surprised to see its structure, by the way). But I can share a brief summary of our failed attempt. We tried adapting CPT into a Hybrid SWA, testing both inter & intra-layer mixing. The motivation for intra-layer mixing was to balance the compute intensity across all layers, which is friendly to both PP in training and PP or AFD during inference. Unfortunately, neither worked. Performance degraded noticeably as context length grew — which is unacceptable in agentic scenarios. Our analysis showed that many global attention patterns (like retrieval head and induction head) were already established early during pre-training. CPT can hardly adjust those patterns afterwards. You surely can mitigate the issue by using data probes to identify and keep those heads as full attention — but unfortunately, it’s nearly impossible to discover them all from human priors. (And no, this issue isn’t related to attention sinks.) If you're interested in this line of research, I recommend taking a closer look at GPT-OSS, CWM, and Gemma, especially their long-context performance. Finally, we’re hiring! If you want to join us, send your resume to [email protected]. References MiniMax-01: Scaling Foundation Models with Lightning Attention MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention CWM: An Open-Weights LLM for Research on Code Generation with World Models Qwen3-Next Gemma 3 Technical Report gpt-oss-120b & gpt-oss-20b Model Card Retrieval Head Mechanistically Explains Long-Context Factuality https://t.co/tvATk8hbXZ

803

120

669

749K

andy hwang

@andrewnhwang

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users