Weiwei Sun

@sunweiwei12

PhD student @LTIatCMU | Interned at Google, ByteDance, Vector, Baidu | Working on LLM agents

Pittsburgh, PA

Joined June 2021

258 Following

884 Followers

163 Posts

Pinned Tweet

Weiwei Sun @sunweiwei12

9 days ago

Excited to share our new work on Reinforcing Human Behavior Simulation via Verbal Feedback.

Can human simulators learn from feedback, not just rewards? Most RL for LLMs turns feedback into a single score. But human behavior is rarely just right or wrong. It is social, contextual, subjective, and multi-dimensional. A score can tell the model what is better. Verbal feedback can tell it why.

Meet DITTO + SOUL.

Paper: https://t.co/G0cEHr53h0
Code: https://t.co/6osJizwUDi
Model: https://t.co/yIAvpbKPSd

227

162

33K

Weiwei Sun @sunweiwei12

8 days ago

@eb1aexperts Thanks!

Weiwei Sun @sunweiwei12

8 days ago

@evijit Thanks!

sunweiwei12 retweeted

Xuhui Zhou

@nlpxuhui

8 days ago

Exactly, we have similar findings in our new work! Many OPD variants actually collapse in our setting, which partially led us to DITTO: a more straightforward way to let the teacher actively "work" on the task, then have the student learn better from that. https://t.co/PjXF9GX9cf

107

113

13K

Weiwei Sun @sunweiwei12

8 days ago

@MichelIvan92347 Thanks!

Weiwei Sun @sunweiwei12

8 days ago

@shwiy1558125 Totally agree! LLM simulators are already quite useful in controlled settings like entertainment / data generation. For higher-stakes use cases, we may need more careful validation against real human data.

Weiwei Sun @sunweiwei12

8 days ago

@techietaro Good point! We’ve seen our method reach SOTA on many theory-of-mind tasks too, and we’re actively expanding to more evaluations. Verbal feedback can also be a pretty useful hint for guiding the model’s reasoning

sunweiwei12 retweeted

Xuhui Zhou

@nlpxuhui

9 days ago

Wondering how we can better simulate human behavior with reinforcement learning? Introducing DITTO: RL with verbal feedback for subjective tasks like user simulation, student modeling, character role-play, and theory of mind. The result: an 8B model that performs on par with GPT-5.4 on the new SOUL benchmark suite.

22K

Weiwei Sun @sunweiwei12

9 days ago

Co-led with @nlpxuhui. Huge thanks to our amazing collaborators: @Jiarui_Liu_ @StigLidu @judysun233 @YiqingXieNLP @1000seagull @soshsihao @mengtingwan @ylongqi @peizNLP @tongshuangwu @wellecks @gneubig Yiming @MaartenSap More to come!

414

Weiwei Sun @sunweiwei12

9 days ago

Results: with an 8B model, DITTO improves over the base model by 36% on average, outperforms standard GRPO on 8/10 SOUL tasks, and matches or exceeds GPT-5.4 on 6/10 benchmarks. The takeaway is simple: to train human-like simulators, we need training signals that are more human-like too.

447

sunweiwei12 retweeted

Jiarui Liu

@Jiarui_Liu_

16 days ago

Excited to share our new paper 🧵 MIXSD: Mixed Contextual Self-Distillation for Knowledge Injection

Supervised fine-tuning is the common way to teach LLMs new knowledge, but it often catastrophically forgets existing capabilities. We introduce MixSD: a simple, external-teacher-free method to inject knowledge with far less forgetting.

📄 https://t.co/qRpaTiI9EU

Why does SFT forget? Targets written by humans or external systems diverge from the model's own autoregressive distribution, forcing the optimizer to imitate low-probability tokens. That's what drags pretrained capabilities down.

MixSD: We hypothesize that keeping supervision close to the model's own distribution is key to avoiding forgetting. Instead of training on fixed, externally authored targets, at every token we mix between two conditionals of the base model itself: an expert conditional that sees the injected fact in context, and a naive conditional reflecting the model's prior. The result is supervision the model already finds high-probability, while still carrying the new factual signal. A Bernoulli rate λ controls the balance between memorization and retention.

Findings: SFT only retains as little as 1% of held-out capability. MixSD retains far more, up to ~100% on larger models, with near-perfect training accuracy. It also beats on-policy self-distillation at a fraction of the compute, and holds across Qwen3 1.7B, 4B, 8B and Llama-3.2.

104

10K

sunweiwei12 retweeted

Sasha Rush

@srush_nlp

17 days ago

Been working on text feedback / OPSD in Composer. Really interesting space, and much more to be explored.

277

132

39K

sunweiwei12 retweeted

Apurva Gandhi

@apurvasgandhi

27 days ago

Sub-agents are a promising inference-time scaling primitive:
• Expand an agent's working memory
• Divide-and-conquer hard problems
• Solve problems faster with parallel execution

But how do we train a model to best take advantage of sub-agents and make sure we get these benefits?

Very excited to release RAO: Recursive Agent Optimization. RAO is an end-to-end reinforcement learning approach for training LLM agents to spawn, delegate to, and coordinate with recursive copies of themselves (that can themselves spawn other agents) - turning recursive inference into a learned capability. 1/10

713

117

921

134K

sunweiwei12 retweeted

Yixin Dong @yi_xin_dong

about 1 month ago

Introducing XGrammar-2: structured generation for complex agent harnesses.

Strict tool-calling formats. Built-in DeepSeek-V4 and Qwen-3.6 support. Up to 80x speedup over XGrammar. Ready-to-use integrations with vLLM, SGLang, TensorRT-LLM, and more! ⚡

From Claude Code to OpenClaw, agents are defining more complex harnesses. XGrammar-2 ensures LLMs always interact with them in the right way.

Built in collaboration with DeepSeek, Databricks, and leading frontier AI labs to bring XGrammar-2 into latest models and products.

🧩 Structural Tag: one unified abstraction to describe any format your agent needs
🚀 Scales to 500+ strictly typed tools for complex agent harnesses
🌐 Native APIs in Python, C++, Rust, and JS, running everywhere from cloud to edge
🛠️ Integrated with vLLM, SGLang, TensorRT-LLM, and more

Excited to see what agent builders create with it!

Blog: https://t.co/N0Tbl588BH
GitHub: https://t.co/lo4yScuI2f

149

42K

Weiwei Sun @sunweiwei12

about 1 month ago

@bag_of_words1 🤷

Weiwei Sun @sunweiwei12

about 1 month ago

🪭Excited to share that Context Folding has been accepted to #ICML2026! Congrats to all collaborators! https://t.co/ohVoMQQ0el

Weiwei Sun @sunweiwei12

8 months ago

Context engineering is key to building LLM agents. Can we let agents actively manage their own context? We introduce Context-Folding, giving agents the ability to branch and compress their context. Trained with RL on Search and SWE tasks, it beats ReAct using 10× less context.

215

190

30K

Weiwei Sun @sunweiwei12

about 1 month ago

@ZhiqiuLin Thanks Zhiqiu!!
