Qingye Meng

@hilbertmeng

NLP Researcher at ColorfulClouds Tech. | Mechanistic interpretability of LLMs | Transformer architecture

Beijing, People's Republic of China

Joined September 2021

503 Following

44 Followers

216 Posts

Pinned Tweet

Qingye Meng @hilbertmeng

12 months ago

1/n [ICML 2025 paper] Glad to share our latest work, MUDDFormer, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers, matching Transformer++ trained with ~2x compute. paper: https://t.co/rTFP4oE9d7

503

Qingye Meng @hilbertmeng

18 days ago

@aHpaBean Great work! NITP enables predicting the next state instead of the next token. However, it plays a role similar to MTP. Given the wide adoption of MTP, the gain over MTP is more critical. Will the gain of NITP decrease under the MTP setting?

113

hilbertmeng retweeted

Da Xiao

@xiaoda99

18 days ago

A lightweight adaptation of our MUDD connections significantly advanced NanoGPT Speedrun WR!

hilbertmeng retweeted

Larry Dial

@classiclarryd

20 days ago

New NanoGPT Speedrun WR at 81.8 (-2.6s) from @.Lisennlp on Github with MUDD skip connections, an expressive and efficient mechanism for data dependent skips! Instead of a learned scalar or sigmoid(linear) gate, MUDD uses a 64 neuron 'MLP' to generate the coefficients. The key efficiency is in reusing the same input projection for up to 14 coefficients at once. https://t.co/LMPGKRPsk0

175

13K
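A minimal sketch of the mechanism described in the tweet above: one shared input projection feeds a 64-unit hidden layer, and a second small matrix emits all skip coefficients at once. Only the numbers 64 and 14 come from the tweet; the function names, the GELU nonlinearity, the toy width `DIM`, and the way coefficients map onto earlier layer outputs are assumptions for illustration, not the speedrun code.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gelu(v):
    """Tanh approximation of GELU (assumed nonlinearity)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def skip_coefficients(x, W_in, W_out):
    """Project the input once, then read off one coefficient per output row."""
    hidden = [gelu(h) for h in matvec(W_in, x)]  # shared projection, computed once
    return matvec(W_out, hidden)                 # all coefficients from the same hidden state

def mix_sources(sources, coeffs):
    """Data-dependent weighted sum of earlier layer outputs."""
    dim = len(sources[0])
    return [sum(c * s[d] for c, s in zip(coeffs, sources)) for d in range(dim)]

random.seed(0)
DIM, HIDDEN, NUM_COEFFS = 8, 64, 14  # 64 and 14 are from the tweet; DIM is a toy value
W_in = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(HIDDEN)]
W_out = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(NUM_COEFFS)]
x = [random.gauss(0, 1) for _ in range(DIM)]
sources = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_COEFFS)]

coeffs = skip_coefficients(x, W_in, W_out)
mixed = mix_sources(sources, coeffs)
```

The point of the shared projection is that the cost of reading `x` is paid once, however many coefficients are produced from the hidden state.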

Qingye Meng @hilbertmeng

about 1 month ago

This is my best effort to add MUDD Connections to the SOTA baseline 20260409 by bigbigbag. It's really hard to add new components to the transformer architecture in Parameter Golf because of the extra overhead they introduce against the highly optimized baseline. Although this competition is designed to surface the most parameter-efficient techniques, the 10-minute track is still very sensitive to training speed, and non-record solutions struggle to stand out. So I had to tune carefully to pick out the most important cross-layer connections, for a better tradeoff between performance and overhead on the 10-min track. Although MUDD connections are not adopted in the top solution, they provide a more comprehensive perspective that covers as special cases a) U-Net connections, b) mixing x0, c) two residual lanes. Moreover, they are promising and competitive in the non-record track (the truly parameter-efficient region). https://t.co/kYh9H2ZmUu

Alex Zhao

@cocohearts

about 1 month ago

parameter golf was a blast. 2,000+ submissions. 1,000+ verified github accounts. ideas ranging from quantization and depth recurrence to TTT LoRA, SSMs, H-nets, JEPA, and more. autoresearch made iteration dramatically faster — and led to emergent bulletin boards, issue threads, unofficial leaderboards, and agent-built writeups that helped everyone learn from everyone else. it felt like a glimpse of where interaction with AI is headed: humans setting taste and direction, agents helping explore, coordinate, and share what works. our goal was simple: make ml research accessible to anyone, anywhere. it was amazing to see that happen. full recap: https://t.co/FxvcbImWzL future events: https://t.co/qanojnkjmJ

612

322

163K

Qingye Meng @hilbertmeng

about 1 month ago

In MUDDFormer, we split the residual stream into Q/K/V/R streams, and ablation studies show the V-stream is more important than the Q/K-streams, although the Q/K-streams are also effective. Value Residual, or the V-stream, creates a clean pathway that transmits low-level information to upper layers without polluting the residual stream there, which is consistent with the increased head utilization at upper layers. In a standard Transformer, residual streams at upper layers have to erase or dilute low-level embedding information to highlight the key information for predicting the next token, so attention heads tend not to be activated. https://t.co/247wiVLEHl
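A toy sketch of the Q/K/V/R split, assuming each stream gets its own set of mixing weights over the embedding and earlier layer outputs. The weights here are fixed numbers chosen by hand to show a V-stream that reads the embedding directly; in the method described above they would be generated per token, and the function and variable names are hypothetical.

```python
STREAMS = ("Q", "K", "V", "R")

def multiway_aggregate(layer_outputs, weights):
    """Build one aggregated input per stream from all earlier outputs.

    layer_outputs: list of vectors (index 0 is the embedding).
    weights: dict mapping stream name -> one coefficient per source.
    """
    dim = len(layer_outputs[0])
    return {
        stream: [
            sum(w * out[d] for w, out in zip(weights[stream], layer_outputs))
            for d in range(dim)
        ]
        for stream in STREAMS
    }

embedding = [1.0, 0.0, 0.0]
layer1 = [0.0, 2.0, 0.0]
layer2 = [0.0, 0.0, 3.0]
outputs = [embedding, layer1, layer2]

weights = {
    "Q": [0.0, 0.5, 0.5],  # queries built from recent layers
    "K": [0.0, 0.5, 0.5],  # keys built from recent layers
    "V": [1.0, 0.0, 0.0],  # values read the low-level embedding directly
    "R": [0.0, 0.0, 1.0],  # residual stream carries only the latest output
}

streams = multiway_aggregate(outputs, weights)
```

With these hand-picked weights the V-stream delivers the embedding to the next attention block while the residual stream is free to discard it, which is the kind of clean pathway the tweet describes.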

hilbertmeng retweeted

fly51fly @fly51fly

2 months ago

[CL] Weight Tying Biases Token Embeddings Towards the Output Space A Lopardo, A Harish, C Arnett, A Gupta [EleutherAI & UC Berkeley] (2026) https://t.co/dvkwkGyPJq

hilbertmeng retweeted

Da Xiao

@xiaoda99

3 months ago

Depthwise attention/recurrence is becoming a trend! After ByteDance's HC (ICLR'24), our MUDDFormer (ICML'25) & Google's DCA (ICML'25), more labs are joining: ByteDance's VWN, DeepSeek's mHC, MoonshotAI's AttnRes, etc. MUDDFormer's key design: input-dependent weights with multiway decoupling across Q/K/V/residual streams. Only +0.23% params, 1.8×–2.4× compute advantage. This is just the beginning. More fundamental architecture innovations to come. https://t.co/QLkh9nPn9S https://t.co/2JZgW975Fh

616

Qingye Meng @hilbertmeng

3 months ago

@osieberling Decoupling residual stream into 4 streams QKVR can further improve the performance as done in MUDDFormer (or DeepCrossAttention). Full AttnRes is roughly equivalent to DynamicDenseFormer(DDFormer). https://t.co/rTFP4oE9d7

864

Qingye Meng @hilbertmeng

3 months ago

Great to see depth-wise attention mechanisms like mHC and Attention Residuals (AttnRes) proving their scalability in large-scale models and attracting more attention to this line of work, including DenseFormer, HC, DeepCrossAttention (DCA) and our MUDDFormer (ICML25). We proposed multi-way dynamic dense connections along transformer layers to address the limitations of residual connections, where DynamicDenseFormer is similar to Kimi's Full AttnRes. I'd like to compare decoupling of residual streams, PP, training stability and details on depth attention weights.

1. Decoupled residual streams
In MUDDFormer, we decouple the residual stream into four streams (Q, K, V, R), a strategy also explored in the concurrent DCA, which is effective but absent from recent practice. We are motivated by different attribution circuits, like Q-attribution and V-attribution, in mechanistic interpretability studies. Decoupled residual streams can better handle cross-layer information flow. In mHC and AttnRes, depth-wise attention is applied before each Attention and FFN block, so they can be seen as a 2-stream residual.

2. Pipeline Parallelism (PP)
Efficiency is the primary bottleneck for dense cross-layer connections. Kimi addresses this via Block AttnRes, which reduces communication by attending to block-level summaries, while HC compresses the residual stream into hyper hidden states (typically 4 times wide) to reduce communication. In DenseFormer/MUDDFormer, key-wise dilation on dense connections is also a simple approach to reduce PP overhead. If PP is not a strict requirement (e.g., in TPU-based pretraining), MUDDFormer already demonstrates strong performance, and query-wise dilation can further provide an excellent balance between performance and efficiency.

3. Training stability & depth attention weights
To stabilize the residual mapping, mHC uses the Sinkhorn-Knopp algorithm, while MUDDFormer tackles training stability with PrePostNorm in deep models. In HC and AttnRes, depth attention weights depend on key-wise layer outputs, while MUDDFormer uses a small MLP to generate weights from the query-wise hidden states.
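One way to read the two dilation variants mentioned in the thread above, sketched as index selection. This is an interpretation, not the paper's code: "key-wise" is taken to mean thinning the set of earlier outputs a layer reads from, and "query-wise" is taken to mean that only every `period`-th layer performs dense aggregation at all. The function names and the 12-layer example are hypothetical, and the connection count is only a rough proxy for overhead.

```python
def key_wise_sources(layer, dilation=1):
    """Earlier outputs read by `layer` (index 0 is the embedding).

    With dilation k, keep only sources j where (layer - j) is a multiple of k.
    """
    return [j for j in range(layer + 1) if (layer - j) % dilation == 0]

def query_wise_sources(layer, period=1):
    """Dense aggregation only at every `period`-th layer.

    Other layers return no extra sources, i.e. they fall back to the
    ordinary residual connection.
    """
    return list(range(layer + 1)) if layer % period == 0 else []

def total_connections(num_layers, source_fn):
    """Count cross-layer connections over layers 1..num_layers."""
    return sum(len(source_fn(i)) for i in range(1, num_layers + 1))

dense = total_connections(12, key_wise_sources)
key_dilated = total_connections(12, lambda i: key_wise_sources(i, dilation=2))
query_dilated = total_connections(12, lambda i: query_wise_sources(i, period=2))
print(dense, key_dilated, query_dilated)  # 90 48 48
```

In this toy count both variants roughly halve the number of connections, but they do so differently: key-wise dilation keeps every layer connected to a sparser history, while query-wise dilation keeps the full history at fewer layers.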

Qingye Meng @hilbertmeng

5 months ago

@orvieto_antonio I updated the loss landscape image. Those spikes should be downward. This time it looks right and more natural.

Qingye Meng @hilbertmeng

5 months ago

@orvieto_antonio I used Gemini to generate a loss landscape image reflecting the sharpness of the river and valley. Is this closer to what you had in mind?

203

Qingye Meng @hilbertmeng

5 months ago

Computation is more than intelligence. Intelligence may be a disguise of computation.

hilbertmeng retweeted

Lisan al Gaib

@scaling01

8 months ago

Another ByteDance Seed banger?

They introduce the looped language models (LoopLMs) Ouro 1.4B and 2.6B trained on 7.7T tokens, that match evaluation results of larger 4B and 8B models respectively.

"Ouro" 1.4B is a standard decoder-only Transformer with 24 layers (upcycled Ouro 2.6B = 48 layers), MHA, RoPE, SwiGLU and sandwich norm. This stack is repeatedly applied for T recurrent steps, avoiding the usual collapse of latent state to token space, therefore enabling deeper latent multi-hop reasoning.

Like test-time-compute this approach trades more forward passes for additional performance, but with the additional benefit that models are smaller and have more effective depth.

Additionally, they add a learned exit gate to allow early exit on simpler inputs improving the performance-cost trade-off.
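A toy sketch of the looping idea described so far: the same block is applied repeatedly to a latent state, and a gate can stop the loop early. The block, the gate, the threshold, and all names are made up for illustration; the gating rule in the actual model may work differently.

```python
import math

def looped_forward(state, block, exit_gate, max_steps=4, threshold=0.5):
    """Apply the same block up to max_steps times, stopping when the gate fires."""
    steps_used = 0
    for _ in range(max_steps):
        state = block(state)      # same weights reused at every step
        steps_used += 1
        if exit_gate(state) > threshold:
            break                 # early exit on "easy" inputs
    return state, steps_used

def toy_block(state):
    """Stand-in for the shared Transformer stack."""
    return [0.5 * v + 1.0 for v in state]

def toy_gate(state):
    """Stand-in for a learned exit gate: sigmoid of a shifted mean."""
    mean = sum(state) / len(state)
    return 1.0 / (1.0 + math.exp(-(mean - 1.6)))

final_state, steps = looped_forward([0.0, 0.0], toy_block, toy_gate)
print(final_state, steps)  # [1.75, 1.75] 3
```

Here the gate fires on the third pass, so the fourth pass is skipped; the state stays in latent space between passes rather than being decoded to a token each time.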

Training Pipeline:
The pipeline is staged: warmup → big pretrain → CT-annealing → LongCT (push context, 64k) → mid-training and then a small reasoning SFT pass to make the "Ouro-Thinking" variants. The 2.6B model is an upcycled continuation of the 1.4B run (stack doubled to 48 layers).

Benchmark results:
- in synthetic 3-hop QA tasks they found that looped models learn the task with fewer examples compared to non looped, iso-parameter model
- the looped architecture seems to help with safety as models are better able to distinguish benign prompts from harmful prompts as the number of recurrent steps increases
- furthermore they demonstrated improved faithfulness of the reasoning using linear probes to predict responses in the next recurrent step and observe low predictability - they claim: "this systematic disagreement across steps when i <= 4 is precisely what a faithful latent process should exhibit: the model is updating its decision as recurrence deepens, and intermediate predictions are not frozen rationalizations of the final output"

Some issues:
- They state: "A defining advantage of the LoopLM architecture lies in its capacity for adaptive computation allocation", but find that performance does not increase by scaling recurrence beyond the trained T=4 depth (Table 10) - no extrapolation, which means more training is necessary

- 4 recurrent steps mean 4x the FLOPs during inference. So ultimately Ouro-1.4B model with 4 recurrent steps would use more FLOPs than a Qwen3-4B, but less memory

- in the appendix, under D.1, they pose the question: "What is the performance gap between standard models and LoopLM?". For this they compare 5 different model sizes: 53M, 134M, 374M, 778M, and 1.36B with recurrent depths: 1, 2, 4, and 8, trained on 20B tokens. The Standard Transformer in this case at depth 2, 4 and 8 has effectively 2, 4 and 8 times more layers and ~params (untied).
They find that the Standard Transformer consistently outperforms LoopLM. Furthermore, LoopLM shows no performance increases with 8 recurrent steps. The 8 recurrent step 1.36B model is actually worse than the 778M model with 4 steps.

Furthermore as seen in Table 18, the performance difference decreases the larger the models get, but increases with the number of steps/recurrence
-> LoopLM is generally worse per-FLOP in compute-matched tests (untied depth wins), but it’s strong per-parameter and under memory/KV constraints.

- their RL stage did not yield significant performance gains over the final SFT checkpoint: they blame model saturation and infrastructure challenges

- they had to lower the number of recurring steps during training from 8 to 4 due to stability issues

other notes:
- looping does not increase knowledge capacity nor improve capacity scaling
- KV-cache can't be reused during pre-fill, but can be reused for decoding
- recurrent architectures require smaller learning rates compared to standard transformers of equivalent parameter count

153

13K
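The FLOPs-versus-memory point in the thread above can be checked with back-of-the-envelope arithmetic. The sketch below assumes the common rule of thumb of about 2 FLOPs per parameter per token for one forward pass (ignoring attention over the context and embedding lookups) and uses parameter count as a stand-in for weight memory; the function name is hypothetical and the figures are nominal model sizes, not measured numbers.

```python
def forward_flops_per_token(params, recurrent_steps=1):
    """Rough forward cost: ~2 FLOPs per parameter, once per recurrent step."""
    return 2 * params * recurrent_steps

OURO_PARAMS = 1.4e9   # looped model, weights shared across steps
QWEN_PARAMS = 4e9     # standard model of the size it is compared against

ouro_flops = forward_flops_per_token(OURO_PARAMS, recurrent_steps=4)
qwen_flops = forward_flops_per_token(QWEN_PARAMS)

flops_ratio = ouro_flops / qwen_flops     # compute: looped vs standard
memory_ratio = OURO_PARAMS / QWEN_PARAMS  # weights held in memory

print(f"FLOPs ratio: {flops_ratio:.2f}, weight-memory ratio: {memory_ratio:.2f}")
# FLOPs ratio: 1.40, weight-memory ratio: 0.35
```

Under these assumptions the looped 1.4B model spends about 1.4x the per-token compute of the 4B model while holding about a third of the weights, which matches the "more FLOPs, less memory" observation.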

Qingye Meng @hilbertmeng

9 months ago

@XiaohuaZhai As the second plot shows, we can keep ~baseline quality with p_s=0.5. In contrast, with p=0.8, RINS shows too much improvement over baseline, which counter-intuitively even matches the performance of the best AAAB model (p_s=0, 2x inference cost). Could you give some explanation?

Qingye Meng @hilbertmeng

9 months ago

@ibomohsin @XiaohuaZhai Excellent work! To reproduce RINS, I trained two 150M llama models (AB, AAAB) on the Pile dataset over 105B tokens and got a loss gap of 0.012, smaller than the ~0.04 in the paper. I also failed to reproduce the adapter, as training was unstable. Can I DM you for further help?

hilbertmeng retweeted

Aaditya Singh @Aaditya6284

11 months ago

Excited to present this work in Vancouver at #ICML2025 today 😀 Come by to hear about why in-context learning emerges and disappears: Talk: 10:30-10:45am, West Ballroom C Poster: 11am-1:30pm, East Exhibition Hall A-B # E-3409

hilbertmeng retweeted

Omar Sanseviero

@osanseviero

12 months ago

Want to learn about the research behind Gemma 3n? Altup - https://t.co/ngwMI7UfIw LAuReL - https://t.co/2KE997GDWV MatFormer - https://t.co/AnHhQktcZu Activation sparsity - https://t.co/CxoPEOMdkU Universal Speech Model - https://t.co/TuP8XMeYKS Blog - https://t.co/TKIO4yJVhk

657

122

787

53K

Qingye Meng @hilbertmeng

12 months ago

@GauravML Congratulations! We concurrently proposed MUDDFormer, with dynamic and multi-way connections to previous layers. Hope enhanced cross-layer connections can be adopted in more architectures. https://t.co/rTFP4oE9d7

152
