ざと @hoppiece_ - Twitter Profile

1/ We have been training RNNs wrong for decades. Backpropagation through time (BPTT) forces sequential updates, creating unstable O(T) gradient paths. What if we could train highly expressive, non-linear RNNs with flat, parallelized O(1) gradients? It is now possible. 🧵

Photo from che_shr_cat's tweet: https://t.co/1mBjRedxSE

ざと

@hoppiece_

about 18 hours ago

Distilling a Transformer into an RNN looks like it could be exciting for things like translation.

gakkaaka

@gakkaaka

2 days ago

The MiMo developers also said that window=128 works well. https://t.co/ngFReBk2Zu

Fuli Luo

@_LuoFuli

6 months ago

MiMo-V2-Flash is live. It’s just step 2 on our AGI roadmap, but I wanted to dump some notes on the engineering choices that actually moved the needle.

Architecture: We settled on a Hybrid SWA. It’s simple, elegant, and in our internal benchmarks, it outperformed other Linear Attention variants on long context reasoning. Plus, a fixed KV cache just plays way nicer with current infra. Note: Window size 128 turned out to be the magic number (512 actually degraded performance). Also, sink values are non-negotiable; don't skip them.

MTP (Multi-Token Prediction): This is underrated for efficient RL. Aside from the first layer, it needs surprisingly little fine-tuning to hit high accept length. With a 3-layer MTP, we're seeing >3 accept length and ~2.5x speedup in coding tasks. It effectively solves the GPU idle time from long-tail samples in small-batch On-Policy RL. We didn't get to squeeze it into the RL loop this time due to deadlines, but it’s a perfect fit. We open-sourced the 3-layer MTPs so you can develop with it.

Posttrain with MOPD: We adopted On-Policy-Distillation from Thinking Machine to merge multiple RL models, and the efficiency gains were wild. We matched the teacher model's performance using less than 1/50th the compute of a standard SFT+RL pipeline. There’s a clear path here for a self-reinforcing loop where the student evolves into a stronger teacher.

Huge props to my team. They sculpted these ideas from scratch into production in just a few months. Full breakdown is in the tech report. If this kind of pragmatic engineering resonates with you, we should talk.
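To make the sliding-window and sink ideas above concrete, here is a minimal pure-Python sketch. It is not MiMo's implementation: the function names `swa_mask` and `softmax_with_sink` are hypothetical, and treating "sink values" as one extra logit that enters only the softmax denominator is an assumption about what the tweet means.

```python
import math

def swa_mask(seq_len, window):
    """Causal sliding-window mask (hypothetical helper).

    Query i may attend to key j only if j <= i and i - j < window,
    so each query sees at most `window` keys, itself included.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def softmax_with_sink(scores, allowed, sink_logit):
    """Softmax over the allowed scores plus a sink logit (assumed form).

    The sink contributes to the denominator only, so the returned
    weights can sum to less than 1: the head may attend to "nothing".
    """
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept + [sink_logit])  # subtract the max for numerical stability
    denom = sum(math.exp(s - m) for s in kept) + math.exp(sink_logit - m)
    return [math.exp(s - m) / denom if ok else 0.0
            for s, ok in zip(scores, allowed)]

# Toy example: 6 positions, window of 3 (the tweet reports 128 for the real model).
mask = swa_mask(6, 3)
weights = softmax_with_sink([0.0] * 6, mask[5], sink_logit=0.0)
```

With this mask, the last query only ever looks at three keys, which is why the per-layer KV cache stays at a fixed size regardless of context length.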

hoppiece_ retweeted

Daisuke Okanohara / 岡野原大輔

@hillbig

2 days ago

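In recent LLMs, it has become increasingly common to combine full attention with efficient attention (SWA, Mamba, Gated DeltaNet, etc.) in order to handle long contexts efficiently. However, it is increasingly being confirmed that in such hybrid architectures, long-range retrieval is actually carried mainly by full attention. It has also already been pointed out that a shorter SWA window makes it easier to improve long-range capability. The present experiments show the same thing as an optimization problem: when the SWA window is widened, the local window alone can handle many of the dependencies, so the pressure on full attention to learn long-range retrieval weakens and performance becomes harder to improve.

Comment === The finding that a shorter SWA window is better came to be widely discussed after the release of gpt-oss last year (2025/8) and is now quite well established. It has also been pointed out that recurrent/linear sequence mixers such as Mamba and Gated DeltaNet struggle to deliver long-context capability on their own. As context lengths keep growing with agent use and similar workloads, no efficient architecture that can fully replace full attention has been established so far; if anything, its importance has been reaffirmed.

The number of full attention layers can now be reduced considerably, and techniques that treat attention sparsely have appeared, so the coefficient of the compute cost is steadily falling, but there are limits. Whether we get through this with engineering alone or a method that replaces full attention emerges is a major open problem.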
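As a rough illustration of why reducing the number of full attention layers lowers the coefficient but not the growth with context length, the sketch below counts cached key/value positions in a hybrid stack. The function `kv_cache_entries` and the layer counts are hypothetical and not taken from any model mentioned above; heads, head dimensions, and sinks are ignored.

```python
def kv_cache_entries(context_len, n_full_layers, n_swa_layers, window):
    """Cached key/value positions summed over layers (hypothetical helper).

    Full attention layers cache every position seen so far, while
    sliding-window layers cache at most `window` positions each.
    """
    full = n_full_layers * context_len
    swa = n_swa_layers * min(context_len, window)
    return full + swa

# Illustrative numbers only: a 48-layer stack at a 100,000-token context.
all_full = kv_cache_entries(100_000, n_full_layers=48, n_swa_layers=0, window=128)
hybrid = kv_cache_entries(100_000, n_full_layers=8, n_swa_layers=40, window=128)
```

In this toy setting the hybrid stack caches about one sixth as many positions, yet the remaining full attention layers still grow linearly with context length, which is the limit the post refers to.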
