1/ We have been training RNNs wrong for decades.
Backpropagation through time (BPTT) forces sequential updates, creating unstable O(T) gradient paths.
What if we could train highly expressive, non-linear RNNs with flat, parallelized O(1) gradients?
It is now possible. 🧵
MiMo-V2-Flash is live. It’s just step 2 on our AGI roadmap, but I wanted to dump some notes on the engineering choices that actually moved the needle.
Architecture: We settled on hybrid SWA (sliding window attention). It’s simple, elegant, and in our internal benchmarks it outperformed linear attention variants on long-context reasoning. Plus, a fixed-size KV cache just plays way nicer with current infra. Note: window size 128 turned out to be the magic number (512 actually degraded performance). Also, sink values are non-negotiable, so don't skip them.
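For anyone who wants the window and sink ideas in concrete form, here is a minimal sketch. It is not the model's code. It assumes "sink values" means an extra per-head logit that joins the softmax denominator without contributing a value vector, which is one common way to do it. The names `window_keys` and `softmax_with_sink` are made up for illustration.

```python
import math


def window_keys(q: int, window: int = 128) -> range:
    """Key positions that a causal sliding window lets query position q see."""
    return range(max(0, q - window + 1), q + 1)


def softmax_with_sink(scores: list[float], sink: float) -> list[float]:
    """Softmax over in-window scores with one extra sink logit.

    The sink only enlarges the denominator (assumed mechanism), so the
    returned weights can sum to less than 1 when no key deserves attention.
    """
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


# The cache for a windowed layer stops growing once the context passes the window.
print(len(window_keys(10)), len(window_keys(100_000)))
print(softmax_with_sink([0.0, 0.0], sink=0.0))
```

The bounded `window_keys` range is why the KV cache of a windowed layer stays fixed in size no matter how long the context gets.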
MTP (Multi-Token Prediction): This is underrated for efficient RL. Aside from the first layer, it needs surprisingly little fine-tuning to hit a high accept length. With a 3-layer MTP, we're seeing an accept length above 3 and ~2.5x speedup on coding tasks. It effectively solves the GPU idle time caused by long-tail samples in small-batch on-policy RL. We didn't get to squeeze it into the RL loop this time due to deadlines, but it’s a perfect fit. We open-sourced the 3-layer MTP so you can build on it.
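A quick sketch of what "accept length" measures, assuming greedy verification and the convention that the main model's own token counts toward the total. Conventions differ, and `accepted_length` is an illustrative name, not part of the release.

```python
def accepted_length(draft: list[int], target: list[int]) -> int:
    """Tokens emitted by one verify step under greedy matching.

    draft:  tokens proposed by the MTP layers (3 of them for a 3-layer MTP).
    target: the main model's greedy picks at the same positions, plus one
            extra pick after the last draft token, so len(draft) + 1 entries.
    """
    matched = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first mismatch ends the accepted prefix
        matched += 1
    # The main model's token at the stopping point is always kept:
    # a correction on mismatch, a bonus token if every draft matched.
    return matched + 1


print(accepted_length([11, 12, 13], [11, 12, 13, 14]))  # all drafts match
print(accepted_length([11, 12, 13], [11, 99, 13, 14]))  # second draft rejected
```

Under this convention a 3-layer MTP tops out at 4 tokens per step, so an average above 3 means most drafts are being accepted.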
Post-training with MOPD: We adopted On-Policy Distillation from Thinking Machines to merge multiple RL models, and the efficiency gains were wild. We matched the teacher models' performance using less than 1/50th the compute of a standard SFT+RL pipeline. There’s a clear path here to a self-reinforcing loop where the student evolves into a stronger teacher.
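A rough sketch of the per-token signal behind on-policy distillation, assuming the usual reverse-KL formulation: the student samples its own tokens and the teacher scores them. How the multiple teachers are combined is not covered here, and both function names are illustrative.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits: list[float], teacher_logits: list[float]) -> float:
    """KL(student || teacher) at a single token position."""
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(ls, lt))


def sampled_token_penalty(student_logits: list[float],
                          teacher_logits: list[float],
                          token: int) -> float:
    """One-sample estimate of the reverse KL for a token the student sampled.

    Positive when the student likes the token more than the teacher does.
    """
    return log_softmax(student_logits)[token] - log_softmax(teacher_logits)[token]


print(reverse_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # identical distributions
print(sampled_token_penalty([0.0, 5.0], [5.0, 0.0], token=1))
```

Because the tokens come from the student's own rollouts, every position gets a dense signal. That is one plausible reason this approach can be far cheaper than a sparse-reward RL run.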
Huge props to my team. They sculpted these ideas from scratch into production in just a few months.
Full breakdown is in the tech report. If this kind of pragmatic engineering resonates with you, we should talk.