Tianjian Li

@tli104

PhD student @jhuclsp, research scientist intern @AIatMeta FAIR. Previously @nyuniversity. I work on making LMs intelligent and interesting.

Baltimore, MD

Joined November 2022

731 Following

367 Followers

278 Posts

Pinned Tweet

Tianjian Li @tli104

10 months ago

Language models often produce repetitive responses, and this issue is further amplified by post-training. In this work, we introduce DARLING, a method that explicitly optimizes for both response diversity and quality within online reinforcement learning!

Jason Weston

@jaseweston

10 months ago

🌀Diversity Aware RL (DARLING)🌀
📝: https://t.co/MH0tui34Cb
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k
- Works for both non-verifiable & verifiable tasks
🧵1/5 https://t.co/AhEYPQwbkg

425

339

87K

11K

tli104 retweeted

Vishal

@KyrieBlunders

17 days ago

Spent a month writing CuteDSL kernels for an RL training loop. The fused decode-attention kernel I wrote benchmarked 2.2x faster than the SDPA path it replaces. Dropped it into HF generate. The decode step got 3x slower. [Metrics and Explanation below]

tli104 retweeted

Quentin Gallouédec @QGallouedec

19 days ago

If you're serious about RL, you eventually need to get your hands dirty and do the math. Precision isn't an implementation detail, the gradient flow itself depends on it, infinitely more than in supervised training. Check this out https://t.co/nyBeO4VxtR and follow @DirhousssiAmine

188

210

tli104 retweeted

Kenton Murray @kentonmurray

24 days ago

I'm excited to announce that this Fall I will be joining the Computer Science Department at George Mason University as an Asst. Prof. I'll be expanding my lab and looking for PhD students to work on Multilingual AI problems across text, video, and speech. https://t.co/f6PIESWRp4

211

17K

tli104 retweeted

Daniel Khashabi 🕊️

@DanielKhashabi

25 days ago

@ben_vandurme and I are recruiting multiple postdoc fellows at JHU. We're looking for candidates w/ strong record in language models, reasoning, coding agents, and/or AI for science. Interested candidates should send their CV and a brief summary of their research interests to [email protected] / [email protected]. Please reshare for visibility. 🙏

tli104 retweeted

Yujie Zhao @YujieZhao455906

25 days ago

🚀 Excited to share that our work, AMA-Bench, has been accepted to #ICML2026!

Most benchmarks test dialogue memory, but real agents learn through continuous environment interactions. We actually found that systems acing dialogue benchmarks completely struggle in true agentic settings! 🤯

To fix this, we introduce AMA-Bench to evaluate long-horizon memory in real applications, plus AMA-Agent, a new system designed to track causality and objective info across long trajectories. 🧠

🌐 Check it out: https://t.co/3y2wyXwVyL
See you at ICML! 🎉

31K

tli104 retweeted

Niloofar ✈️ icml

@niloofar_mire

26 days ago

Tbh i’m kinda sick of this academic doomerism vibe consuming all of bay area and the self-aggrandizing pov that frontier labs have. Sure a lot of exciting stuff is happening but we wouldn’t be where we are wo academia & there is sth to be said about the pursuit of curiosity.

607

52K

tli104 retweeted

Chinmay

@ChinmayKak

27 days ago

New blog!
Covers a lot of papers and methods about recent advances in on-policy distillation and on-policy self-distillation: their wins, their failure modes, and my opinion about the same!
Link below, please do check it out, and RT/QT if you like it :) https://t.co/UoTivfW3u4

511

552

71K

tli104 retweeted

Aran Komatsuzaki

@arankomatsuzaki

28 days ago

i was playing with Codex /goal on some lesser-known open conjectures, mostly 20–50y old. after letting it run autonomously for 8h+, i was already seeing what looked like publishable progress, even if not full resolutions.

weakly held take: people overrate “open for decades” as a proxy for importance. unsolved ≠ important. a lot of old problems are just boring-but-hard, or maybe hard in the bad way / structurally not that productive.

imo the higher-value thing is often accelerating recent research directions where the community actually has live taste / consensus that the topic matters. these aren’t necessarily “harder” in some intrinsic sense. there are just way fewer participants because the prerequisite stack is brutal, vs more approachable combinatorics / Erdős-style problems. so the marginal AI researcher there may be much higher-value than grinding on random half-century-old open problems.

my stronger take: current models can already push some frontiers forward rapidly 95%-automatically, not “solve smooth 4D Poincaré today,” but real progress. it’s underpriced because the domain people are conservative or slow to retool around AI, and the AI people mostly don’t know which deep problems exist / matter.

256

43K

tli104 retweeted

Hamza Elshafie

@hamzaelshafie

29 days ago

New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels"

This post is my attempt to dissect ThunderKittens from the bottom up. I approached TK by asking what each abstraction is really buying us: which hardware detail it corresponds to, how it maps onto the underlying layouts the GPU actually wants, what boilerplate it removes, and which parts of the GPU programming model still remain visible to us as kernel authors.

The post walks through the tile abstractions TK provides: register, shared, and tensor memory tiles, global layouts, vector abstractions, warp/warpgroup compute, TMA, swizzling, Hopper WGMMA, Blackwell tcgen05, 2xSM MMA, tensor memory, Cluster Launch Control, TK’s pipeline templates, and static persistent tile scheduling.

At the end, I demonstrate TK’s lcf pipeline template by implementing a non-causal attention prefill kernel and benchmarking it against FlashAttention-2 and FlashAttention-3 on an H100 PCIe across different sequence lengths. The kernel beats FA2 across the sweep by ~1.55x on average, and closely tracks FA3, where FA3 is only ~1.05x-1.17x faster on the longer sequence lengths.

Blog link: https://t.co/t29Z6jVF87
Repo: https://t.co/3gsRd25QwL

For interested readers, I also put an extensive list of resources that I found very useful at the end.

Please note: this is my own independent writeup. I’m not affiliated with @HazyResearch, and any mistakes in the post are mine. If you spot any, please reach out!

1 / xx

375

414

40K

tli104 retweeted

Maksym Andriushchenko

@maksym_andr

about 1 month ago

💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation.

AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: tuning vLLM/SGLang hyperparameters.

Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.

Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).

One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now.

This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!

349

213

42K

tli104 retweeted

Chuanyang Jin

@chuanyang_jin

about 1 month ago

What are users thinking during their interactions with LLMs?

We introduce ThoughtTrace, the first large-scale dataset that captures what users think during real-world human–AI conversations, not just what they type.

→ 10,174 thought annotations
→ 2,155 multi-turn conversations, 17,058 turns
→ 1,058 users
→ 20 LLMs

These thoughts improve user behavior prediction (+41.7%) and model alignment (+25.6%). This opens a new paradigm of user-centric LLM research.

Full information in the thread 🧶
Read our paper: https://t.co/lRYJvGJ7bb
Check our project website: https://t.co/AupCn1YQOk

136

69K

tli104 retweeted

Lei Li

@_TobiasLee

about 1 month ago

https://t.co/Rb5eLxvoaF

tli104 retweeted

Richard Sutton

@RichardSSutton

about 1 month ago

The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.

138

979

589K

tli104 retweeted

Graham Neubig

@gneubig

about 1 month ago

Great question about how to come up with research ideas! I think that different people have different ways, but here are some of mine:

1. Working with excellent people -- most of the ideas we tackle at CMU originally come from the people I work with, not me
2. Thinking about impact -- I always ask my collaborators "who is your target audience, and what results will they need to see to be convinced that this is promising"
3. Staying close to real-world use -- as you know, I have a startup and many of the ideas that we have for research come from practical problems we need to solve there too

(and sorry about the late reply, I needed to think about this before responding)

216

135

17K

tli104 retweeted

Junli Wang

@JunliWang2021

about 1 month ago

Digital agent learning needs massive rollouts. But digital agent rollouts are painfully slow due to heavy environments. 🐌

🚀 We introduce NanoRollout, a lightweight open infra (900 lines core code) for digital agent rollout at scale, validated with three workloads:

🏋️ Large batchsize (4K) SWE Agent RL -> surpasses DeepSWE-32B
🧪 250k+ distilled coding trajectories -> SOTA ≤32B open coding agent
⚡ Fast evaluation on coding/cua/unified agent -> finish

Check our Blog: https://t.co/IBNqqbLqra

136

102

36K

tli104 retweeted

Rishabh Agarwal

@agarwl_

about 1 month ago

Training LLMs is synonymous with updating their weights. However, LLMs can also learn in-context using *frozen* weights. There is no good reason for restricting learning to being in-context or in-weights.

So a natural idea is "Learning, Fast and Slow" (FST). In FST, slow learning is LLM weights trained with RL while fast learning is context / prompt (fast weights) optimized with GEPA.

Compared to RL, FST performs better while being more data efficient, adaptable (plasticity), and forgetting less (stays closer to base models).

I think this idea of learning both fast-slow weights would be a good foundation for continual learning.

PS: Geoff Hinton (the OG) described the idea of fast weights and slow weights several years ago, and back then I remember thinking it's a very cool idea.

See more details here:
https://t.co/FACsHx7IpK

569

575

73K

tli104 retweeted

Stephen Xie

@stephenx_

about 1 month ago

Full post — inference systems, training recipes, reward design, eval, and a survey of Multiverse, Parallel-R1, NPR, ThreadWeaver + the original APR method (Pan et al., 2025): https://t.co/iFQwVISoue Co-authored with @tonylian!

tli104 retweeted

alphaXiv

@askalphaxiv

about 2 months ago

"Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding" Speculative decoding for RL rollouts! This paper speeds up post-training without changing the target policy’s sampling distribution. So a draft model proposes multiple tokens, and the policy model verifies them. An important system piece is that the draft must stay aligned with the continually updated policy, with weight sync and optional online adaptation. This gives faster rollouts, but same learning trajectory. With ~1.5-1.8x rollout speedup at 8B, and projected ~2.5x end-to-end speedup at 235B scale.

204

146

12K

tli104 retweeted

Parth Asawa

@pgasawa

about 2 months ago

Today, we’re releasing Continual Learning Bench 1.0: the first, realistic benchmark for measuring how AI systems can improve in online settings.

Benchmarks today assume models are stateless. Each example is independent, and once a system finishes a task, it moves on as if nothing happened.

But deployed AI systems should learn from experience. We tested 10+ frontier systems against novel, expert-validated tasks and find there’s still plenty of headroom for learning. (1/n)

167

999

835K

tli104 retweeted

DAIR.AI

@dair_ai

about 2 months ago

Banger paper from Meta FAIR.

They introduce Autodata, an agentic data scientist that builds high-quality training and evaluation data autonomously.

The headline result: on a CS research QA task, an Agentic Self-Instruct loop produces a 34-point gap between weak and strong solvers (43.7% vs 77.8%), while standard CoT Self-Instruct on the same setup produces a 1.9-point gap (71.4% vs 73.3%).

The agent generates questions that actually discriminate between models.

The method:

An orchestrator LLM directs a challenger agent to generate examples grounded in domain documents. A weak and a strong solver attempt them, a judge scores the outputs, and the orchestrator analyzes the failures and prompts the challenger to regenerate from new angles until quality thresholds are met.

The system also meta-optimizes itself.

An outer loop tunes the agent's instructions based on which harness changes lift validation pass rate. Over 126 accepted iterations, validation pass rate climbed from 12.8% to 42.4%. They processed 10,000+ CS papers and produced 2,117 quality-filtered QA pairs.

Existing self-instruct pipelines do not control data quality. Autodata reframes data generation as an agent loop: spend more inference compute and the data gets harder, which gives downstream RL a real lift.

Blog: https://t.co/41coXidxRI

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

221

316

40K
