Andrew Zhao

🚀 How should LLMs sample on hard reasoning problems during post-training and inference where direct rollouts rarely produce a correct answer? Best-of-N (e.g., GRPO) and tree search share two limitations: 🔻 Verification signals are sparse 🔻 Candidates stay within the model's own distribution We introduce BES: Bidirectional Evolutionary Search — a search framework that couples forward candidate evolution with backward goal decomposition. ✅ Works for both post-training and inference.

686

114

757

239K

Who to follow

Changyeon Kim

@cykim1006

Intern @nvidia SRL | PhD Student at KAIST w/ @kimin_le2 and @jinwoos0417

Fu-En (Fred) Yang

@FuEnYang1

Research Scientist @NVIDIAAI | Ph.D. @NTU_TW | Prev. Research Intern @NVIDIAAI | Unifying World, Language & Action for Generalist Robotics

PedroRas

@PedroRas1

_AndrewZhao retweeted

Ryan Yuan @RainbowYuhui

10 days ago

[1/2] Most image gen models output flat pixels. Users need layers. Our MRT (CVPR'2026) closes that gap and supports: • text → layers • image → layers • layers → layers 🔥 New SOTA on all three tasks 🔥 Beats Qwen-Image-Layered 🔥 10–100× faster, 50–90% less GPU memory

RainbowYuhui's tweet photo. [1/2] Most image gen models output flat pixels. Users need layers. Our MRT (CVPR'2026) closes that gap and supports:
• text → layers
• image → layers
• layers → layers

🔥 New SOTA on all three tasks
🔥 Beats Qwen-Image-Layered
🔥 10–100× faster, 50–90% less GPU memory https://t.co/zyFusxC8dy

183

145

12K

_AndrewZhao retweeted

Kristof

@CoastalFuturist

12 days ago

BREAKING NEWS: God joins Anthropic as member of technical staff

233

620

412

708K

_AndrewZhao retweeted

Pope Leo XIV

@Pontifex

12 days ago

In the era of #ArtificialIntelligence, when human dignity is threatened by new forms of dehumanization, ours is the pressing duty to remain profoundly human. We must lovingly safeguard the grandeur of humanity bestowed upon us and revealed in its fullness in Christ, the splendor of which no machine can ever replace. #MagnificaHumanitas https://t.co/6i9MWs6LJl

862

67K

14K

_AndrewZhao retweeted

James Chen

@jchencxh

15 days ago

Short blog on why I think SSL isn't scaling. tldr we're suppressing too much information from the SSL loss so that anything is learnt at all, but this limits the total set of things we may be able to learn. https://t.co/dkPDF60FpQ

222

236

34K

_AndrewZhao retweeted

Xin Eric Wang ✈️ CVPR 2026

@xwang_lk

15 days ago

We discover the 𝐀𝐬𝐲𝐦𝐦𝐞𝐭𝐫𝐢𝐜 𝐑𝐨𝐥𝐞𝐬 𝐨𝐟 𝐃𝐚𝐭𝐚 𝐆𝐚𝐭𝐢𝐧𝐠 𝐚𝐧𝐝 𝐑𝐞𝐰𝐚𝐫𝐝 𝐆𝐫𝐨𝐮𝐧𝐝𝐢𝐧𝐠 𝐢𝐧 𝐒𝐞𝐥𝐟-𝐏𝐥𝐚𝐲 𝐑𝐋: data gating, not reward grounding, is the binding constraint on stability. A strict gate stabilizes every reward we tested, including a self-consistency reward with no access to ground truth; while no reward stabilizes once the gate is removed, not even one grounded in execution truth. It challenges the common assumption that reward grounding is what governs self-play stability. The field's response to collapse has been better rewards: confidence penalties, momentum anchors, hacking detectors, all on the reward side. The binding constraint lives upstream, in the data pipeline. A self-play system has two distinct levers that prior work conflates. A DATA GATE decides which proposer-generated tasks enter the training pool. A REWARD decides how the policy updates on what's admitted. The gate decides what data exists; the reward decides how the optimizer reacts. They are not symmetric! The reward doesn't filter bad data; instead, it's maximized by it. Under self-consistency, the intrinsic–grounded gap saturates near 1.0: corrupted data receives higher reward than clean data, because intra-group agreement is easiest to maximize on ambiguous tasks. The counterintuitive consequence we call the Grounded Proposer Paradox: a proposer with ground-truth verification access collapses FASTER than an ungrounded one when paired with a self-consistency solver. Cleaner tasks form the lowest-resistance path to the spurious self-consistent attractor. The upstream agent doesn't bias the downstream one toward truth; it sharpens the corridor to the wrong fixed point. The shift: stop treating self-play stability as a reward-design problem. What enters the training loop matters more than how the optimizer scores it.

xwang_lk's tweet photo. We discover the 𝐀𝐬𝐲𝐦𝐦𝐞𝐭𝐫𝐢𝐜 𝐑𝐨𝐥𝐞𝐬 𝐨𝐟 𝐃𝐚𝐭𝐚 𝐆𝐚𝐭𝐢𝐧𝐠 𝐚𝐧𝐝 𝐑𝐞𝐰𝐚𝐫𝐝 𝐆𝐫𝐨𝐮𝐧𝐝𝐢𝐧𝐠 𝐢𝐧 𝐒𝐞𝐥𝐟-𝐏𝐥𝐚𝐲 𝐑𝐋: data gating, not reward grounding, is the binding constraint on stability. A strict gate stabilizes every reward we tested, including a self-consistency reward with no access to ground truth; while no reward stabilizes once the gate is removed, not even one grounded in execution truth.

It challenges the common assumption that reward grounding is what governs self-play stability. The field's response to collapse has been better rewards: confidence penalties, momentum anchors, hacking detectors, all on the reward side. The binding constraint lives upstream, in the data pipeline.

A self-play system has two distinct levers that prior work conflates. A DATA GATE decides which proposer-generated tasks enter the training pool. A REWARD decides how the policy updates on what's admitted. The gate decides what data exists; the reward decides how the optimizer reacts. They are not symmetric!

The reward doesn't filter bad data; instead, it's maximized by it. Under self-consistency, the intrinsic–grounded gap saturates near 1.0: corrupted data receives higher reward than clean data, because intra-group agreement is easiest to maximize on ambiguous tasks.

The counterintuitive consequence we call the Grounded Proposer Paradox: a proposer with ground-truth verification access collapses FASTER than an ungrounded one when paired with a self-consistency solver. Cleaner tasks form the lowest-resistance path to the spurious self-consistent attractor. The upstream agent doesn't bias the downstream one toward truth; it sharpens the corridor to the wrong fixed point.

The shift: stop treating self-play stability as a reward-design problem. What enters the training loop matters more than how the optimizer scores it.

114

26K

Andrew Zhao

@_AndrewZhao

16 days ago

@eliebakouch @agarwl_ @_lewtun If you have complete overlap mid/sft prompts with rl prompts, your RL is just regurgitating what you imitated tho

_AndrewZhao retweeted

Beidi Chen

@BeidiChen

18 days ago

Align with how @cursor_ai has done its RL stage — Astraflow is a new RL engine that enables asynchronous, heterogeneous, and geo-distributed RL in a native way through dataflow abstraction~ Like @FireworksAI_HQ’s sparse RL transfer design, it syncs only ≤1.1% of model weights — making remote rollout lightweight and efficient. Check it out!!!

209

146

34K

_AndrewZhao retweeted

Andrej Karpathy

@karpathy

18 days ago

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

150K

11K

14K

27M

Andrew Zhao

@_AndrewZhao

18 days ago

@thkostolansky unfortunately no, but I think there are papers who found self play lead to behaviorly more unsafe agents, such as https://t.co/BSLstQKqBk

_AndrewZhao retweeted

Dimitris Papailiopoulos

@DimitrisPapail

19 days ago

https://t.co/n10GwfKYuY

982

128

850K

Andrew Zhao

@_AndrewZhao

19 days ago

@DimitrisPapail Interesting! This remind me of ICM (https://t.co/6fJsygIls2) but in the same weights, have you tried to use CE loss as the reward to let it explore even more

365

_AndrewZhao retweeted

Ning Ding

@stingning

23 days ago

We’re releasing a 30B-A3B reasoning model that reaches gold-medal level across both physics and math Olympiad evaluations: IPhO directly, and IMO/USAMO with test-time self-verification and refinement. A simple, unified scaling recipe for proof search. https://t.co/yc2ZlLVbD2

147

921

301K

Andrew Zhao

@_AndrewZhao

23 days ago

@agarwl_ how about self-opd using gepa as privileged info

_AndrewZhao retweeted

Pingbang Hu 🇹🇼 @PingbangHu

24 days ago

New Preprint Alert ⏰ Propose Dr. Post-training 🩺 a Data Regularization framework, making your data more effective with ZERO overheads Experiments demonstrate faster training convergence across SFT, RLHF, RLVR over SOTA data selection, opening up new data optimization designs!

PingbangHu's tweet photo. New Preprint Alert ⏰

Propose Dr. Post-training 🩺 a Data Regularization framework, making your data more effective with ZERO overheads

Experiments demonstrate faster training convergence across SFT, RLHF, RLVR over SOTA data selection, opening up new data optimization designs! https://t.co/MePVPMFXol

151

102

27K

Andrew Zhao

@_AndrewZhao

25 days ago

@Kagurazaka_L @willccbb PRM is not a critic, critics are trained to output the expected sum of returns. Also it’s not classic, “classic” critics use the same context as the actor

_AndrewZhao retweeted

jietang

@jietang

25 days ago

Recent thoughts: The Shift to Long-Horizon Tasks The most likely breakthrough this year will be in long-horizon tasks. We are moving toward a stage where Large Language Models (LLMs) learn to complete extended, complex missions by interacting with Agent environments. This is perhaps where the true value of LLMs lies. Take cybersecurity as an example: imagine a model that continuously hunts for software bugs and vulnerabilities. While it sounds like a search process, it’s actually the model learning the high-level intuition and methodology of a professional hacker. Unlike humans, AI can run 24/7 without fatigue. It could potentially find exploits at a much higher frequwill ency and claim bounties on platforms like HackerOne or BugCrowd. It sounds fun, but fundamentally, it's a revolution that displaces the hacker. If even hackers are being "disrupted," one can only imagine the impact on general programmers. From One-Person to None-Person Companies Building on long-horizon capabilities, Autonomous Agent Systems (AAS) will inevitably become the next frontier. Last year, we were discussing the rise of the "One Person Company" (OPC). I didn't expect us to move so quickly toward the "None Person Company" (NPC). It’s an ironic twist—we might all end up as NPCs in this new ecosystem. Engineering the Impossible: Memory and Learning To realize the vision above, we must solve three technical pillars: Memory, Continual Learning, and Self-Judging. I used to think these would require massive paradigm shifts and years of research. However, the pressure from both the technical and application sides is so intense that we are seeing these capabilities emerge through ingenious engineering "tricks": Memory: Long context windows (1M+) and RAG have significantly bridged the gap. Continual Learning: While true continual learning remains difficult, the release cycles are shrinking. Global models are updated monthly; domestic models are catching up. If we reach weekly updates by next year, it will effectively function as continual learning. Self-Judging: This remains the most elusive, yet models like Opus 4.7 are already demonstrating early self-correction and judgment capabilities. The Self-Evolving Endgame The most difficult—and most promising—path is Self-Evolution. The current wave is incredibly fierce. I suspect that models like Claude may have already achieved a baseline for self-training: writing their own code, cleaning their own data, generating synthetic data, and then training on it. It might "waste" some compute, but it saves the most precious resources: human labor and time. In the LLM era, speed is everything. Rapid iteration is what creates the cognitive gap between leaders and followers. Claude’s rumored 2-million-chip cluster for next year is likely dedicated to exactly this: autonomous model self-training. Technical Summary: 1M Context: Necessary baseline. Memory & Continual Learning: Prerequisites, likely solved first via "tricky" engineering. Harnessing Environments: The breakthrough point. Self-Judging: The tipping point. Full Self-Training: The endgame. Redefining AGI and the Industry If this is the road to AGI, then AGI’s definition should be the sum of all human collective intelligence, not just an individual’s intelligence. It must possess the creative capacity to produce something as profound as the "Theory of Relativity"—meeting the bar set by Hassabis. During this transition, every APP will need to be reconstructed as AI-native. In fact, we might move past the concept of APPs entirely. The most significant challenge will be the reconstruction of the operating system itself. In the future, you won’t see a traditional desktop; you will see an LLM OS, where applications are "generated on demand." This challenges the 80-year-old Von Neumann architecture and represents a total upheaval of the computer science industry. The Irreversible Wave From completing long-horizon tasks to fully autonomous operations, every sector—Security, Finance, Law, E-commerce—will be reshaped. Many friends have reached out lately, asking how to transform their enterprises to keep pace with AI. But few truly realize that this irreversible process has already begun. As this massive technical wave hits, we must be prepared to act, but we must also start thinking seriously about how to regulate it.

726

148

523

188K

Andrew Zhao

@_AndrewZhao

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users