Tim Xiao

12 months ago

✨ New paper: Flipping Against All Odds

We found that large language models (LLMs) can describe probabilities but fail to sample from them faithfully.

Yes, even flipping a fair coin is hard. 🪙

🧵 Here’s what we learned and how we fixed it.

🔗 https://t.co/Auw7agOws3

1/ https://t.co/85A1sLXllx

TimZXiao retweeted

@Besteuler

1 day ago

🚀 Excited to introduce PEFT-Arena! This project grew directly out of the challenges we faced when evaluating existing PEFT methods.

We argue that PEFT evaluation should not ask only what improves. It should also ask what is forgotten. For too long, parameter-efficient finetuning methods have been evaluated by downstream accuracy. But target performance alone is not enough. A method can adapt well while silently losing pretrained capabilities.

PEFT-Arena evaluates existing popular PEFT methods through the stability-plasticity trade-off, i.e., how much the model learns on the target task, and how much general ability it still preserves after finetuning.

A good PEFT method should adapt without forgetting and achieve a strong stability-plasticity trade-off. In our evaluation, we find that Orthogonal Finetuning achieves the strongest trade-off among all methods.

Beyond task-driven benchmark scores, PEFT-Arena also provides geometry-based internal diagnostics in weight space, activation space, and interpolation paths, helping us understand not only which methods work, but why they forget.

What PEFT-Arena has done:
🌟 Adaptation + preservation: evaluates both target gains and forgetting.
🌟 Trade-off frontier: reveals stability-plasticity patterns, with OFT often leading.
🌟 Geometry diagnosis: explains forgetting via weight spectra, activation distortion, and interpolation paths.

Welcome to try PEFT-Arena and evaluate your own PEFT method!

🌐 Project page: https://t.co/y8TsHWtHSZ
📝 Paper: https://t.co/AD4SguO4sj
💻 Code: https://t.co/JdtQkiF9IP

TimZXiao retweeted

@Besteuler

6 days ago

I want to share more about why we started building Orbit: https://t.co/Rc7S1zQUel

Orbit is an OFT-centric reinforcement learning pipeline designed for ultra-efficient post-training of large language models. In simple terms, Orbit allows us to post-train a trillion-parameter LLM on a single 8-GPU node. But more importantly, Orbit represents something much larger for us, as it connects our long-standing research to a practical system that can be used at frontier scale.

One of SphereLab’s core missions is to develop a principled and unified full-stage training pipeline for large foundation models. We are especially interested in what we call the spectral scaling paradigm: the idea that the geometry and spectral structure of model weights should not be treated as incidental details, but as central objects in how we design training algorithms.

This perspective has already led us to develop a series of pretraining algorithms and systems, including POET, POET-X, and Pion:
- POET: https://t.co/1OUbqQ2I1D
- POET-X: https://t.co/5Q01hOjiMw
- Pion: https://t.co/fgAPUEzPMG

It has also guided our work on post-training and adaptation, including OFT, BOFT, OFTv2, PEFT-Arena, and OrthoMerge:
- OFT: https://t.co/mAiH8XWCqz
- BOFT: https://t.co/aQUypEWsCf
- OFTv2: https://t.co/4QRHWBB5aZ
- PEFT-Arena: https://t.co/PqP0c2Pg5F
- OrthoMerge: https://t.co/Fzjrn0zpaW

Although these projects may appear to target different stages of the model lifecycle, they are all driven by the same underlying principle: the spectrum and geometry of neural network weights matter. Preserving and controlling these structures can lead to more stable, efficient, and scalable training and adaptation.

While creating these algorithms has been extremely rewarding, we are taking a step further and building practical systems from the ground up that leverage the same principle to best scale our algorithms. That is why a small team like ours still spent so much time and effort building Orbit.

Orbit is only the starting point. We are excited to keep building systems that bring principled training algorithms to real-world foundation model development.
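The thread's central claim is that preserving the spectrum of the weights matters. As a rough illustration only (not Orbit's or OFT's actual implementation), the sketch below assumes an orthogonal-style update that left-multiplies a weight matrix W by an orthogonal matrix R. It checks the linear-algebra fact that W^T W is then unchanged, so the singular values of W are preserved, while a generic additive update changes them. The matrix values, the rotation angle, and all helper names are made up for the example.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def gram(w):
    """W^T W; its nonzero eigenvalues are the squared singular values of W."""
    return matmul(transpose(w), w)


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two same-shaped matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rotation(theta):
    """A 2x2 rotation matrix, the simplest example of an orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Hypothetical 2x3 "pretrained" weight matrix (illustrative values only).
w = [[1.0, 2.0, -0.5],
     [0.3, -1.0, 4.0]]

# Orthogonal update: W' = R W, with R^T R = I.
w_rot = matmul(rotation(0.7), w)

# Generic additive update, for contrast: W' = W + 0.1.
w_add = [[x + 0.1 for x in row] for row in w]

drift_rot = max_abs_diff(gram(w), gram(w_rot))
drift_add = max_abs_diff(gram(w), gram(w_add))
print(f"Gram drift, orthogonal update: {drift_rot:.2e}")
print(f"Gram drift, additive update:   {drift_add:.2e}")
```

The first drift should be at floating-point noise level and the second clearly nonzero. The angles between columns of W are also encoded in W^T W, so the same check shows that the orthogonal update leaves them untouched.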

7 days ago

Check out our new project! Being able to post-train the largest open-source LLMs on a single node is such a cool achievement!

AI Research Scientist at @Meta | Ph.D. at @ETH

8 days ago

🚀 Meet Orbit: OFT-based RL infrastructure for stable, efficient post-training of trillion-parameter LLMs. Orbit can train 1T+ LLMs (e.g. Kimi-2.6, DeepSeek-V4-Pro) on a single GPU node (8xB200) with an extremely small train-rollout gap! Code: https://t.co/pyyOg6s7RQ Blog: https://t.co/Rc7S1zQUel Blog in Chinese: https://t.co/rvToBFG4Iq

TimZXiao retweeted

Mridul Sharma @mriiidullll

17 days ago

🧵1/7

Ever tried building a world model for a partially observable environment from just raw observations? It's tough - but what if LLMs could help?

We explore this question in our latest preprint: 'Learning POMDP World Models from Observations Using Language-Model Priors.' https://t.co/lDMQm2qBc4

about 2 months ago

Nice work! Whether an LLM can generate faithful samples is an important topic for the agentic era. Our work on verbalized rejection sampling also studied this problem! https://t.co/Auw7agOws3

Petar Veličković

@PetarV_93

about 2 months ago

can llms reliably roll the dice? 🎲

we shed new light on stochasticity limitations of llms, discussing some ways in which things can improve: tools, prngs, and 'just giving a random number to the model™'

great work from @gu_xiangming while being a student researcher with us 🚀 https://t.co/uYo7mNkzVI
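Both tweets above concern turning an unfaithful sampler into a faithful one, and the reply names rejection sampling as one route. The sketch below is textbook rejection sampling in plain Python, not the verbalized variant from the paper. The biased sampler standing in for an LLM, its 70% heads rate, and all function names are assumptions made for illustration.

```python
import random


def rejection_sample(target, proposal, draw_proposal, rng):
    """Draw one sample from `target` using draws from `proposal`.

    target, proposal: dicts mapping outcome -> probability.
    draw_proposal: function(rng) -> outcome distributed as `proposal`.
    Requires proposal[x] > 0 wherever target[x] > 0.
    """
    # m is an upper bound on target(x) / proposal(x) over all outcomes.
    m = max(target[x] / proposal[x] for x in target)
    while True:
        x = draw_proposal(rng)
        # Accept x with probability target(x) / (m * proposal(x)).
        if rng.random() < target[x] / (m * proposal[x]):
            return x


# Hypothetical biased sampler: says "H" 70% of the time (made-up number).
proposal = {"H": 0.7, "T": 0.3}
# What we actually want: a fair coin.
target = {"H": 0.5, "T": 0.5}


def biased_flip(rng):
    return "H" if rng.random() < proposal["H"] else "T"


rng = random.Random(0)
n = 20000
raw_heads = sum(biased_flip(rng) == "H" for _ in range(n)) / n
fixed_heads = sum(
    rejection_sample(target, proposal, biased_flip, rng) == "H" for _ in range(n)
) / n
print(f"heads rate, raw sampler: {raw_heads:.3f}")
print(f"heads rate, after rejection sampling: {fixed_heads:.3f}")
```

The correction needs only the ability to evaluate the two probabilities and an external source of uniform randomness for the accept step. It does not require the biased sampler itself to improve.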

2 months ago

clock rate -> FLOPS -> heart rate?

TimZXiao retweeted

@Besteuler

3 months ago

🚀 Excited to introduce POET-X, a scalable and highly memory-efficient algorithm for LLM pretraining.

✨ LoRA-level GPU memory, better-than-AdamW pretraining performance!

POET-X finally marries training stability (from POET's spectrum preservation) and practical scalability (from our new implementation and CUDA kernels). POET-X can pretrain billion-parameter LLMs (e.g., Llama-8B) on a single NVIDIA H100, where standard optimizers like AdamW run out of memory under the same settings.

We carefully reimplemented every computation step of POET (https://t.co/EFIfegjcyc). POET-X combines many small checkpointing and parallelization tricks. While each may appear incremental, together they dramatically improve scalability and reduce memory usage by over 70% compared to the original POET.

The memory efficiency of POET-X comes from the unique parameter-efficient reparameterization (where sparsity comes in) of the weight update rule. POET-X bridges the gap between parameter efficiency and memory efficiency.

Code is now public. Feel free to try it!

➡️ Paper: https://t.co/bjznzJ5RHR
💻 Code: https://t.co/DL8ruiuA8X
🌐 Website: https://t.co/QacFuGR7WI

#AI #LLM #MachineLearning #DeepLearning

TimZXiao retweeted

@Besteuler

4 months ago

Interesting work! Doing proper normalization is definitely important for training neural networks stably. We considered hyperball normalization for convolutional neural networks back in 2018, see https://t.co/wQyH14jgsr. Besides hyperball normalization, we also proposed multiple other normalization methods for weights and activations.

Quite surprisingly, we also had to use gradient normalization to make it actually work. See Section 4 of the Decoupled Networks paper.

I somehow got the impression that many, many old ideas are worth revisiting for LLM pretraining, especially those that stabilize training (but may slightly hurt performance for conventional CNNs).

TimZXiao retweeted

Sharvaree Vadgama

@SharvVadgama

6 months ago

😍Excited to organize @GRaM_org_ 2.0 this time at #ICLR2026 🇧🇷 🌟 Looking forward to your best works on geometry-grounded representations, inductive bias, and structure in learning. This year, we also welcome works on 🌐open problems, ⚔️discussions on scale vs symmetry, 👊 position papers and more! Deadline: 30th January AOE

TimZXiao retweeted

Yuxuan Xue @yxue_yxue

6 months ago

It's time! I will present InfiniHuman at 16:30 in room S421 at #SIGGRAPHAsia2025. Please join me if you want to generate avatars with fine-grained multi-modal control!

@ympradyumna will present PhySIC at 16:30 in room S221. Join him to turn a 2D image into a 3D human + scene! https://t.co/MkYh6t0Wqp

TimZXiao retweeted

yingzhen @liyzhen2

6 months ago

An exciting PhD opportunity at StatML CDT (Imperial) + Institute of Cancer Research, with Oliver Ratmann, Richard Houlston and yours truly ☺️: "Machine Learning for Cancer Susceptibility Genetics" Oct 2026 entry, apply to StatML CDT by Jan 8 2026. RT🙏 https://t.co/oBjnoC8tCo

TimZXiao retweeted

Zhen Liu

@ItsTheZhen

6 months ago

Can we efficiently and robustly finetune flow matching models with reinforcement learning using differentiable rewards, in an amortized way?

Hint: use optimal control and match your velocity field with value gradients!

Please come by our poster “Value Gradient Guidance for Flow Matching Alignment” at #NeurIPS2025 (Exhibit Hall C, D, E, #4906, Fri, Dec 5 | 4:30pm – 7:40pm PST) and learn more about our VGG-Flow!

🔗 ArXiv: https://t.co/1Qq2jQZHgp

Joint work w/ @zdhnarsil @TimZXiao @cdomingoenrich @Besteuler

6 months ago

@Besteuler I guess the review quality for the coming ICML will be very high😆

TimZXiao retweeted

7 months ago

🤩 This is awesome. When we were doing the agentic design project (https://t.co/VKW7MdhFPr) using the Besiege game environment, we had to hack the game to get as much feedback as possible to do RL and stuff. However, I started to think differently after seeing the Genshin agent. We humans don’t need that much feedback to learn to master the game, and visual feedback is already sufficient. I am wondering what will happen if the agent learns to master many games this way. Will it develop some universal skills for game playing? Will it see the world differently?🤔

TimZXiao retweeted

@Besteuler

7 months ago

🤯 Merging many finetuned LLMs into one model, effectively? Introducing Functional Dual Anchor (FDA), a new framework for model merging.

🚀 Current merging works poorly due to the underlying parameter conflicts. FDA shifts knowledge integration to the input-representation space for seamless merging. This "dual" perspective bridges the gap between post-hoc merging and joint multi-task training, reducing the knowledge conflicts.

✨ FDAs are synthetic anchors that precisely capture a finetuned model's functional shift.

✨ FDAs can complement existing model merging methods and achieve SOTA performance.

➡️ Paper: https://t.co/DhI45Da0AQ
💻 Code: https://t.co/0gGvFOkdnh
🌐 Project: https://t.co/mv1wBHSESN

#AI #LLM #MachineLearning #DeepLearning

TimZXiao retweeted

Center of The Maze @Maze_s_Center

7 months ago

The physics prior matters in molecular structures. We model potential energy between molecules for drug design. This happens to have a coincidental yet interesting connection to my past work, hyperspherical energy (https://t.co/vN5Kiv0ULo), which considers potential energy between imaginary electrons (i.e. neurons in neural networks). But this time we are modeling real molecules for drug design. :)

Excited that our new AI-for-science paper is finally online: "Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design." Very glad to be part of the wonderful team. @Shengchao_Liu

Caltech news: https://t.co/p1Z8x4XFG9

Paper link: https://t.co/FMMq2f9UaE

Project page: https://t.co/I6PmwcmXj0

TimZXiao retweeted

8 months ago

@kenneth0stanley @ai_bread This prompt baking reminds me of verbalized machine learning. Though they don't modify the weights, they do update the parameters. https://t.co/xqHDuWkY7f

TimZXiao retweeted

Anna Kuzina @a_kzna

8 months ago

Polymer simulations, but make them Vivace ⚡ It was a pleasure to work on the Vivace architecture during my time at @MSFTResearch together with Lixin Sun and @gncsimm.

TimZXiao retweeted

8 months ago

This was an almost year-long project led by @ItsTheZhen. My biggest takeaway is that physical simulation is very effective as a reward signal, and this efficient verification is crucial for unlocking LLMs’ design novelty. This conclusion is actually aligned with our previous work https://t.co/GIqIqOvJQG, where the verification is done by a renderer.