✨ New paper: Flipping Against All Odds
We found that large language models (LLMs) can describe probabilities—but fail to sample from them faithfully.
Yes, even flipping a fair coin is hard. 🪙
🧵 Here’s what we learned—and how we fixed it.
🔗https://t.co/Auw7agOws3
1/
🚀 Excited to introduce PEFT-Arena! This project grew directly out of the challenges we faced when evaluating existing PEFT methods.
We argue that PEFT evaluation should not ask only what improves. It should also ask what is forgotten. For too long, parameter-efficient finetuning methods have been evaluated by downstream accuracy. But target performance alone is not enough. A method can adapt well while silently losing pretrained capabilities.
PEFT-Arena evaluates existing popular PEFT methods through the stability-plasticity trade-off, i.e., how much the model learns on the target task, and how much general ability it still preserves after finetuning.
A good PEFT method should adapt without forgetting and achieve a strong stability-plasticity trade-off. In our evaluation, we find that Orthogonal Finetuning achieves the strongest trade-off among all methods.
Beyond task-driven benchmark scores, PEFT-Arena also provides geometry-based internal diagnostics in weight space, activation space, and interpolation paths, helping us understand not only which methods work, but why they forget.
What PEFT-Arena has done:
🌟 Adaptation + preservation: evaluates both target gains and forgetting.
🌟 Trade-off frontier: reveals stability-plasticity patterns, with OFT often leading.
🌟 Geometry diagnosis: explains forgetting via weight spectra, activation distortion, and interpolation paths.
Welcome to try PEFT-Arena and evaluate your own PEFT method!
🌐 Project page: https://t.co/y8TsHWtHSZ
📝 Paper: https://t.co/AD4SguO4sj
💻 Code: https://t.co/JdtQkiF9IP
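To make the "trade-off frontier" idea concrete, here is a minimal sketch of how a stability-plasticity frontier could be computed. It is not PEFT-Arena's code: the method names and scores are made up for illustration, and treating plasticity as target-task gain and stability as retained general ability (both in [0, 1]) is an assumption.

```python
def pareto_frontier(scores):
    """Return the methods that no other method beats on both axes.

    `scores` maps a method name to (plasticity, stability), higher is better.
    """
    frontier = []
    for name, (plast, stab) in scores.items():
        dominated = any(
            (p >= plast and s >= stab) and (p > plast or s > stab)
            for other, (p, s) in scores.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical methods with made-up (plasticity, stability) scores.
scores = {
    "method_a": (0.80, 0.60),
    "method_b": (0.70, 0.90),
    "method_c": (0.65, 0.85),  # worse than method_b on both axes
    "method_d": (0.85, 0.40),
}
frontier = pareto_frontier(scores)
```

A method on the frontier cannot be improved on one axis without giving something up on the other, which is why reporting target accuracy alone can hide forgetting.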
I want to share more about why we started building Orbit: https://t.co/Rc7S1zQUel
Orbit is an OFT-centric reinforcement learning pipeline designed for ultra-efficient post-training of large language models. In simple terms, Orbit allows us to post-train a trillion-parameter LLM on a single 8-GPU node. But more importantly, Orbit represents something much larger for us, as it connects our long-standing research to a practical system that can be used at frontier scale.
One of SphereLab’s core missions is to develop a principled and unified full-stage training pipeline for large foundation models. We are especially interested in what we call the spectral scaling paradigm: the idea that the geometry and spectral structure of model weights should not be treated as incidental details, but as central objects in how we design training algorithms.
This perspective has already led us to develop a series of pretraining algorithms and systems, including POET, POET-X, and Pion:
- POET: https://t.co/1OUbqQ2I1D
- POET-X: https://t.co/5Q01hOjiMw
- Pion: https://t.co/fgAPUEzPMG
It has also guided our work on post-training and adaptation, including OFT, BOFT, OFTv2, PEFT-Arena, and OrthoMerge:
- OFT: https://t.co/mAiH8XWCqz
- BOFT: https://t.co/aQUypEWsCf
- OFTv2: https://t.co/4QRHWBB5aZ
- PEFT-Arena: https://t.co/PqP0c2Pg5F
- OrthoMerge: https://t.co/Fzjrn0zpaW
Although these projects may appear to target different stages of the model lifecycle, they are all driven by the same underlying principle -- the spectrum and geometry of neural network weights matter. Preserving and controlling these structures can lead to more stable, efficient, and scalable training and adaptation.
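A toy illustration of why orthogonal updates are tied to spectrum preservation: left-multiplying a weight matrix by an orthogonal matrix leaves its singular values unchanged, while a generic additive update does not. This is a sketch under simple assumptions (a 2x2 matrix, a fixed rotation standing in for a learned orthogonal matrix, and an arbitrary additive perturbation), not code from any of the projects above.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def singular_values_2x2(w):
    """Singular values of a 2x2 matrix via the eigenvalues of W^T W."""
    g = matmul(transpose(w), w)
    tr = g[0][0] + g[1][1]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    return [math.sqrt(tr / 2 + disc), math.sqrt(max(tr / 2 - disc, 0.0))]


w0 = [[2.0, 1.0], [0.5, 3.0]]                                   # "pretrained" weight
w_rot = matmul(rotation(0.7), w0)                               # orthogonal update R @ W0
w_add = [[w0[i][j] + 0.5 for j in range(2)] for i in range(2)]  # additive update

sv0 = singular_values_2x2(w0)
sv_rot = singular_values_2x2(w_rot)
sv_add = singular_values_2x2(w_add)
```

Here `sv_rot` matches `sv0` up to floating-point error because (R W)^T (R W) = W^T W for any orthogonal R, whereas `sv_add` differs.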
While creating these algorithms has been extremely rewarding, we take a step further to build practical systems from the ground up that can leverage the same principle to best scale our algorithms. That is why a small team like ours still spent so much time and effort to build Orbit.
Orbit is only the starting point. We are excited to keep building systems that bring principled training algorithms to real-world foundation model development.
🚀 Meet Orbit: OFT-based RL infrastructure for stable, efficient post-training of trillion-parameter LLMs.
Orbit can train 1T+ LLMs (e.g. Kimi-2.6, DeepSeek-V4-Pro) on a single GPU node (8xB200) with an extremely small train-rollout gap!
Code: https://t.co/pyyOg6s7RQ
Blog: https://t.co/Rc7S1zQUel
Blog in Chinese: https://t.co/rvToBFG4Iq
🧵1/7
Ever tried building a world model for a partially observable environment from just raw observations? It's tough - but what if LLMs could help?
We explore this question in our latest preprint: 'Learning POMDP World Models from Observations Using Language-Model Priors.'
Nice work! Whether an LLM can generate faithful samples is an important topic for the agentic era.
Our work on verbalized rejection sampling also studied this problem!
https://t.co/Auw7agOws3
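For readers who have not seen the classical technique the name refers to, here is a minimal sketch of ordinary rejection sampling, turning a biased coin into a fair one. It is not the verbalized variant from the paper: the biased proposal below is a plain PRNG standing in for an unfaithful sampler, and the 0.8 bias is an arbitrary assumption.

```python
import random


def rejection_sample(target_p, proposal_p, rng):
    """Draw one Bernoulli(target_p) sample using Bernoulli(proposal_p) proposals.

    Propose x ~ q, then accept with probability p(x) / (M * q(x)),
    where M = max_x p(x) / q(x). Requires 0 < proposal_p < 1.
    """
    p = {1: target_p, 0: 1.0 - target_p}
    q = {1: proposal_p, 0: 1.0 - proposal_p}
    m = max(p[x] / q[x] for x in (0, 1))
    while True:
        x = 1 if rng.random() < proposal_p else 0  # biased proposal
        if rng.random() < p[x] / (m * q[x]):       # accept/reject step
            return x


rng = random.Random(0)
n = 20000
heads = sum(rejection_sample(0.5, 0.8, rng) for _ in range(n))
fair_rate = heads / n  # close to 0.5 even though proposals are 80% heads
```

The acceptance step only needs the probabilities p(x) and q(x), which is the part an LLM can already describe well.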
can llms reliably roll the dice? 🎲
we shed new light on stochasticity limitations of llms, discussing some ways in which things can improve: tools, prngs, and 'just giving a random number to the model™'
great work from @gu_xiangming while being a student researcher with us 🚀
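One way to read "just giving a random number to the model" is that the randomness comes from outside and the model only has to do the deterministic mapping. A minimal sketch of that mapping, assuming inverse-CDF sampling and a PRNG-supplied uniform number (the function name is hypothetical and this is not the paper's code):

```python
import random


def sample_from_uniform(u, outcomes, probs):
    """Map a uniform number u in [0, 1) to an outcome via the inverse CDF."""
    cumulative = 0.0
    for outcome, p in zip(outcomes, probs):
        cumulative += p
        if u < cumulative:
            return outcome
    return outcomes[-1]  # guard against floating-point round-off


faces = [1, 2, 3, 4, 5, 6]
weights = [1 / 6] * 6
rng = random.Random(0)  # external source of randomness
rolls = [sample_from_uniform(rng.random(), faces, weights) for _ in range(6000)]
```

Given `u`, the outcome is fully determined, so the quality of the dice roll depends only on the quality of the supplied number.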
🚀 Excited to introduce POET-X, a scalable and highly memory-efficient algorithm for LLM pretraining.
✨ LoRA-level GPU memory, better-than-AdamW pretraining performance!
POET-X finally marries training stability (from POET's spectrum preservation) and practical scalability (from our new implementation and CUDA kernels). POET-X can pretrain billion-parameter LLMs (e.g., Llama-8B) on a single NVIDIA H100, where standard optimizers like AdamW run out of memory under the same settings.
We carefully reimplemented every computation step of POET (https://t.co/EFIfegjcyc). POET-X combines many small checkpointing and parallelization tricks. While each may appear incremental, together they dramatically improve scalability and reduce memory usage by over 70% compared to the original POET.
The memory-efficiency of POET-X comes from the unique parameter-efficient reparameterization (where sparsity comes in) of the weight update rule. POET-X bridges this gap between parameter efficiency and memory efficiency.
Code is now public. Feel free to try it!
➡️ paper: https://t.co/bjznzJ5RHR
💻 Code: https://t.co/DL8ruiuA8X
🌐 Website: https://t.co/QacFuGR7WI
#AI #LLM #MachineLearning #DeepLearning
Interesting work! Doing proper normalization is definitely important for training neural networks stably. We considered hyperball normalization for convolutional neural networks back in 2018, see https://t.co/wQyH14jgsr. Besides hyperball normalization, we also proposed multiple other normalization methods for weight/activation.
Quite surprisingly, we also did gradient normalization in order to make it actually work. See Section 4 of the Decoupled Networks paper.
I somehow got the impression that many old ideas are worth revisiting for LLM pretraining, especially those that stabilize the training (but may slightly hurt the performance for conventional CNNs).
😍Excited to organize @GRaM_org_ 2.0 this time at #ICLR2026 🇧🇷
🌟 Looking forward to your best works on geometry-grounded representations, inductive bias, and structure in learning.
This year, we also welcome works on
🌐open problems,
⚔️discussions on scale vs symmetry,
👊 position papers and more!
Deadline: 30th January AOE
It's time! I will present InfiniHuman at 16:30 in room S421 at #SIGGRAPHAsia2025. Please join me if you want to generate avatars with fine-grained multi-modal control!
@ympradyumna will present PhySIC at 16:30 in room S221. Join him to turn a 2D image into a 3D human + scene!
An exciting PhD opportunity at StatML CDT (Imperial) + Institute of Cancer Research, with Oliver Ratmann, Richard Houlston and yours truly ☺️:
"Machine Learning for Cancer Susceptibility Genetics"
Oct 2026 entry, apply to StatML CDT by Jan 8 2026.
RT🙏
https://t.co/oBjnoC8tCo
Can we efficiently and robustly finetune flow matching models with reinforcement learning using differentiable rewards, in an amortized way?
Hint: use optimal control and match your velocity field with value gradients!
Please come by our poster “Value Gradient Guidance for Flow Matching Alignment” at #NeurIPS2025 (Exhibit Hall C, D, E — #4906 Fri, Dec 5 | 4:30pm – 7:40pm PST) and learn more about our VGG-Flow!
🔗ArXiv: https://t.co/1Qq2jQZHgp
Joint work w/ @zdhnarsil @TimZXiao @cdomingoenrich @Besteuler
🤩 This is awesome. When we were doing the agentic design project (https://t.co/VKW7MdhFPr) using the Besiege game environment, we had to hack the game to get as much feedback as possible to do RL and stuff.
However, I started to think differently after seeing the Genshin agent. We humans don’t need that much feedback to learn to master the game, and visual feedback is already sufficient. I am wondering what will happen if the agent learns to master many games this way. Will it develop some universal skills for game playing? Will it see the world differently?🤔
🤯 Merging many finetuned LLMs into one model, effectively? Introducing Functional Dual Anchor (FDA), a new framework for model merging.
🚀 Current merging works poorly due to the underlying parameter conflicts. FDA shifts knowledge integration to the input-representation space for seamless merging. This "dual" perspective bridges the gap between post-hoc merging and joint multi-task training, reducing the knowledge conflicts.
✨ FDAs are synthetic anchors that precisely capture a finetuned model's functional shift.
✨ FDAs can complement existing model merging methods and achieve SOTA performance.
➡️ Paper: https://t.co/DhI45Da0AQ
💻 Code: https://t.co/0gGvFOkdnh
🌐 Project: https://t.co/mv1wBHSESN
#AI #LLM #MachineLearning #DeepLearning
The physics prior matters in molecular structures. We model potential energy between molecules for drug design. This happens to have a coincident yet interesting connection to my past work, hyperspherical energy (https://t.co/vN5Kiv0ULo), which considers potential energy between imaginary electrons (i.e. neurons in neural networks). But this time we are modeling real molecules for drug design. :)
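For intuition on the "electrons" analogy, here is a rough sketch of a repulsion-style energy between neuron directions on the unit sphere. It assumes a Coulomb-like inverse-distance form with exponent `s`; the exact definition in the hyperspherical energy paper may differ, and the example neurons are made up.

```python
import math


def hyperspherical_energy(neurons, s=1.0):
    """Sum of inverse-distance repulsion between unit-normalized neurons."""
    unit = []
    for w in neurons:
        norm = math.sqrt(sum(x * x for x in w))
        unit.append([x / norm for x in w])  # project onto the unit sphere
    energy = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            energy += math.dist(unit[i], unit[j]) ** (-s)
    return energy


# Made-up 2D "neurons": evenly spread directions vs. nearly parallel ones.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [1.0, 0.3]]
e_spread = hyperspherical_energy(spread)
e_clustered = hyperspherical_energy(clustered)
```

Evenly spread directions give a lower energy than clustered ones, just as mutually repelling charges settle into a spread-out configuration.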
Excited that our new AI-for-science paper is finally online: "Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design." Very glad to be part of the wonderful team. @Shengchao_Liu
Caltech news: https://t.co/p1Z8x4XFG9
Paper link: https://t.co/FMMq2f9UaE
Project page: https://t.co/I6PmwcmXj0
@kenneth0stanley @ai_bread This prompt baking reminds me of verbalized machine learning. Though they don't modify the weights, but update the parameters.
https://t.co/xqHDuWkY7f
Polymer simulations, but make them Vivace ⚡
It was a pleasure to work on the Vivace architecture during my time in @MSFTResearch together with Lixin Sun and @gncsimm.
This is an almost year-long project led by @ItsTheZhen. My biggest takeaway is that physical simulation is very effective as a reward signal, and this efficient verification is crucial for unlocking LLMs’ design novelty. This conclusion is actually aligned with our previous work https://t.co/GIqIqOvJQG, where the verification is done by a renderer.
Sharing a fascinating work: BesiegeField. It explores how LLMs can think and design directly in the space of natural language — a meaningful and fitting challenge for LLMs.
A great example of verbalized computing, where design goals are defined in words rather than formal specs.
Can LLMs design real machines — from 🚗 cars to 🏹 catapults?
Can they engineer through both 🧠 agentic workflows and 🌀 reinforcement learning (RL) — learning from physical simulation instead of text alone?
We treat machine design as “machine code writing”, where LLMs assemble mechanisms from standard parts.
To explore this, we built 🧩 BesiegeField — a real-time, physics-based sandbox where LLMs can build, test, and evolve machines through agentic planning or RL-based self-improvement.
Our findings:
1️⃣ Even top LLMs fail to build working catapults — easy for humans but highly dynamic ⚙️ and nonlinear.
2️⃣ RL helps — working designs emerge through interaction.
3️⃣ Aligning reasoning 🧩 with construction 🔩 remains a key challenge.
This marks the first step toward LLMs that learn to design through action — bridging reasoning, physics, and embodiment. 🛠️🤖
🌐 Project Website: https://t.co/ccDEEnPk2J
💻 GitHub (RL & Agentic Workflow): https://t.co/uBcjOb5gaQ
👥 Joint work w/ @Besteuler & Wenqian Zhang