@fly51fly This paper has been accepted to CVPR 2026: Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang, VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness arises primarily from misalignment in Spatial Modeling rather than Physical Modeling.
To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight learnable updates. Applying a global affine transformation to visual tokens improves LIBERO viewpoint accuracy from 48.5% to 87.1% with only 4K parameters.
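A minimal sketch of what a global affine transformation over visual tokens could look like, in plain Python. The post does not give the exact form, so the per-channel scale `gamma` and shift `beta` shared across all tokens, the identity initialization, and the function name `affine_recalibrate` are assumptions made for illustration, not the authors' implementation.

```python
def affine_recalibrate(tokens, gamma, beta):
    """Apply one shared per-channel scale and shift to every visual token.

    tokens: list of token vectors, each a list of floats of length D
    gamma, beta: lists of length D (assumed learnable parameters)
    """
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]


# Toy example with hidden size D = 4 (hypothetical; real encoders are much wider).
dim = 4
gamma = [1.0] * dim  # assumed identity initialization: scale of 1
beta = [0.0] * dim   # assumed identity initialization: shift of 0

tokens = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
identity_out = affine_recalibrate(tokens, gamma, beta)

# Only gamma and beta would be trained; the backbone stays frozen.
num_params = len(gamma) + len(beta)
```

Under this assumed form the parameter count is 2 × D, so a hidden size of around 2,048 would give roughly 4K parameters. That is one reading consistent with the figure quoted in the post, not a confirmed detail.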
For the first time, this mechanism faithfully transplants the principles of diffusion into the world of discrete symbols, achieving true “symbol-level diffusion.”
2. Introducing a time-weighted cross-entropy loss during training to avoid the over-smoothing problem caused by MSE;
3. Using an argmax + one-hot feedback iteration during inference to achieve true step-by-step discrete denoising.
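The training and inference steps listed above can be sketched roughly as follows. The post does not specify the weighting schedule or the model interface, so the linear weight `(t + 1) / num_steps`, the `model(x, t)` signature, and the `toy_model` stand-in are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def time_weighted_ce(logits, target, t, num_steps):
    """Cross-entropy for one symbol, scaled by a timestep-dependent weight.

    The linear schedule below is an assumption; the post only says the
    loss is time-weighted.
    """
    weight = (t + 1) / num_steps
    return -weight * math.log(softmax(logits)[target])


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def denoise(model, x, num_steps):
    """Iterative discrete denoising with argmax + one-hot feedback.

    x: list of per-position vectors over the vocabulary.
    model(x, t): returns one logit vector per position (hypothetical API).
    """
    for t in reversed(range(num_steps)):
        logits_per_pos = model(x, t)
        x = [
            # Commit to the most likely symbol, then feed it back as one-hot.
            one_hot(max(range(len(lg)), key=lg.__getitem__), len(lg))
            for lg in logits_per_pos
        ]
    return x


def toy_model(x, t):
    """Stand-in model that always prefers symbol 1 (illustration only)."""
    return [[0.1, 2.0, 0.3] for _ in x]


start = [[1 / 3] * 3, [1 / 3] * 3]  # uninformative starting state
result = denoise(toy_model, start, num_steps=3)
```

Feeding back a hard one-hot vector at every step keeps each intermediate state a valid discrete symbol sequence, which is the property the post describes as step-by-step discrete denoising.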