New work: On-policy distillation with question-specific rubrics as rich and fine-grained supervision.
This is especially useful for hard-to-verify domains!
Check out Siyi's thread for details:
@HeMuyu0327 @askerlee @iofu728 Yes, that is correct. The difference is that Attention Residuals treat it as a value and use it as a replacement in the following layer; we instead treat it as a delta for accumulation.
attention residual = value + replacement
delta attention residual = delta + accumulation
We're excited to release Delta Attention Residuals, a drop-in upgrade to residual connections that
learns which past layers to route from, without the routing collapse that breaks prior cross-layer
attention at scale.
Attention Residuals route over cumulative hidden states, but those are highly redundant, so routing
collapses to near-uniform (max weight ~0.2) in deep layers. Delta Attention Residuals route over
deltas (vᵢ = hᵢ₊₁ − hᵢ), i.e. what each sublayer actually contributed, and natively enable:
⚡ Sharper cross-layer routing
Deltas are structurally diverse, lifting max attention weight from ~0.2 → ~0.6 (0.62 vs 0.35 avg)
and curing routing collapse in deep layers.
−8.2% validation PPL at 7.6B
Consistent gains from 220M → 7.6B (1.7–8.2% lower PPL), beating both standard residuals and
Attention Residuals; the latter actually ends up worse than the baseline at scale (18.58 vs 17.43).
Drop-in fine-tuning of pretrained models
Additive, zero-init routing is identity at initialization, so you can convert pretrained
checkpoints (e.g. Qwen3-0.6B) into Delta Attention Residuals via standard fine-tuning, beating the
original on 8 downstream benchmarks (55.6 vs 55.0).
🪶 Parameter overhead
Delta Block adds just 589K params (0.008% at 8B) and ~3% memory, and runs faster and lighter than
Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB).
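A minimal NumPy sketch of the idea described above, under stated assumptions: the dot-product scoring with a single learned `query` vector and the scalar `gate` are hypothetical stand-ins for illustration, not the released implementation. It only shows routing over deltas vᵢ = hᵢ₊₁ − hᵢ and why an additive, zero-initialized routing term leaves the network unchanged at initialization.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def delta_attention_residual(hidden_states, query, gate):
    """Route over per-sublayer deltas instead of cumulative states.

    hidden_states: list of L+1 vectors h_0 .. h_L, each of shape (d,)
    query:         hypothetical learned routing query, shape (d,)
    gate:          hypothetical scalar, zero at initialization
    """
    h = np.stack(hidden_states)                     # (L+1, d)
    deltas = h[1:] - h[:-1]                         # v_i = h_{i+1} - h_i, shape (L, d)
    scores = deltas @ query / np.sqrt(h.shape[1])   # assumed scaled dot-product scoring
    alpha = softmax(scores)                         # non-negative routing weights, sum to 1
    routed = alpha @ deltas                         # weighted sum of deltas, shape (d,)
    return h[-1] + gate * routed, alpha             # additive: gate == 0 gives plain h_L

rng = np.random.default_rng(0)
states = [rng.normal(size=8) for _ in range(5)]     # h_0 .. h_4
out, alpha = delta_attention_residual(states, query=np.zeros(8), gate=0.0)
```

With `gate=0.0` the output equals the last hidden state, so the routed term is a no-op until fine-tuning moves the gate away from zero.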
💻 Code: https://t.co/c8E4NXCZWn
Paper: https://t.co/Mj1W07qOm2
@dandingsky Great observation. Yes, our non-negative softmax over deltas should be exactly a signed combination over the original cumulative states. A signed-weight vanilla AttnRes ablation is worth considering for future work.
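To spell out that identity (a sketch; the 0-based layer indexing and the single weight vector α are notational assumptions, not taken from the paper): with deltas v_i = h_{i+1} − h_i and softmax weights α_i ≥ 0, summation by parts gives

```latex
\sum_{i=0}^{L-1} \alpha_i v_i
  = \sum_{i=0}^{L-1} \alpha_i \,(h_{i+1} - h_i)
  = \alpha_{L-1}\, h_L \;+\; \sum_{j=1}^{L-1} (\alpha_{j-1} - \alpha_j)\, h_j \;-\; \alpha_0\, h_0 .
```

The coefficients on the cumulative states h_j can be negative and always sum to zero, even though every α_i is non-negative.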
Thanks for your attention. We totally agree with the intuition that routing over cumulative states would certainly be worse. I assumed the same about Attention Residuals, but their figure and pseudocode say otherwise. Although the original AttnRes code isn't public, our "AttnRes" baseline is a faithful reimplementation of the described mechanism.
Delta routing keeps max weight ~0.6 regardless of depth, so its edge over cumulative-state routing widens with scale, and vs. a plain baseline our largest gain shows up at our largest run (8B, −8.2%). So there is no sign of it washing out up to ~8B, but 1T is honestly an extrapolation I wouldn't bet on yet. I hope it can work there.
@nrol_ling Thanks for the attention. We currently don't have the plans or resources to scale up to 1T, but we are interested in how this works out at large scale.
Announcing the 2nd Workshop on Efficient Reasoning (ER) at @colm2026, Oct 9!
📣 We welcome submissions! Submit your work here: https://t.co/loVmlunK87
🗓️ Deadline: July 12, 2026 (AoE)
Website: https://t.co/FRgQ95CcAd
💬 Topics include (but aren't limited to):
🔹 Multimodal, spatial & embodied reasoning under efficiency constraints
🔹 Curating high-quality reasoning datasets under resource constraints
🔹 Algorithmic innovations for efficient training & RL fine-tuning
🔹 Fast inference: pruning, compression, progressive generation, KV-cache tricks
🔹 Benchmarks & theory on time-/space-complexity and faithfulness
🔹 Systems to deploy long-CoT or on-device reasoning in the wild
🔹 Safety & robustness of efficient reasoning pipelines
🔹 Real-time applications in healthcare, robotics, autonomy, and more
🤝 We invite perspectives from ML, systems, natural & social sciences, and industry practitioners to rethink reasoning under tight compute, memory, latency, and cost budgets.
Hope to see you there!
🚨 New paper alert!!
🎥 Video VLMs are strong at high-level semantics and long-range temporal understanding.
🧠 JEPA is almost the opposite: better at dense, high-frequency dynamics, local physical consistency, and fast corrective control, but less suited to rich semantic and long-horizon reasoning.
We try to get the best of both:
🧩 A VLM as a cortex-like reasoner for semantics and long-horizon planning
⚡ A JEPA branch as a cerebellum-like controller for fine-grained dynamics, physical consistency, and rapid corrections
We proudly present ThinkJEPA: a VLM-guided latent world model that FiLM-fuses the VLM's pyramid representations, which encode long-horizon semantic reasoning, into the JEPA representation for fine-grained, physically consistent dynamics prediction.
Project: https://t.co/quro6Pf8un
Paper: https://t.co/yO5rv3ZJT7
@HeMuyu0327 Thanks for your comment. You are correct: the blocks had a problem, and we have now updated them with reference to KIMI's paper.
v_i = f_i(h_i) (the output of the i-th layer's sub-layer, i.e., the increment)
h_l = Σ α_{i→l} · v_i (a weighted sum of all preceding increments)
We open-source Attention Residuals, which replace standard additive residuals with learned cross-layer attention in transformers.
Block AttnRes reduces WikiText-2 perplexity by 7.7% with only 0.03% extra parameters.
Includes visualization of how layers route information across depth.
Code: https://t.co/6aBEDlIn1Y
Blog: https://t.co/3VMF7u1PlW
We updated our implementation based on KIMI's paper.
v_i = f_i(h_i) (the output of the i-th layer's sub-layer, i.e., the increment)
h_l = Σ α_{i→l} · v_i (a weighted sum of all preceding increments)
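The two formulas above can be sketched in NumPy as follows. This is an illustration under assumptions, not the repository's code: `route_over_increments` and its `logits` argument are hypothetical names, the weights α are assumed to come from a row-wise softmax, and indexing is 0-based so that row l aggregates increments v_0 .. v_l.

```python
import numpy as np

def route_over_increments(increments, logits):
    """Compute h[l] = sum_i alpha[l, i] * v[i] over increments available to row l.

    increments: (L, d) array, row i is v_i = f_i(h_i)
    logits:     (L, L) array of raw routing scores (hypothetical input)
    """
    v = np.asarray(increments, dtype=float)
    L = v.shape[0]
    mask = np.tril(np.ones((L, L), dtype=bool))         # row l may only see i <= l
    scores = np.where(mask, logits, -np.inf)            # block later increments
    scores = scores - scores.max(axis=1, keepdims=True) # stabilize the softmax
    weights = np.exp(scores)                            # masked entries become 0
    alpha = weights / weights.sum(axis=1, keepdims=True)
    return alpha @ v, alpha

v = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
h, alpha = route_over_increments(v, np.zeros((3, 3)))   # uniform weights per row
```

With all-zero logits each row averages the increments it can see, which makes the masking easy to check by hand.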