#CVPR2026 is around the corner and we're excited to share Gated KalmaNet: A Fading Memory Layer through Test-Time Ridge Regression. Looking forward to meeting everyone who wants to learn more.
Gated KalmaNet (GKA, pronounced "gee-ka") generalizes Mamba-2 and Gated DeltaNet, and outperforms both under identical training conditions. It also works beyond language: swapping the Mamba layer in MambaVision for GKA improves ImageNet accuracy with no vision-specific tuning.
1/4
Had a great time presenting “Gated KalmaNet (GKA): A Fading Memory Layer Through Test-Time Ridge Regression” at #CVPR2026 with @achatto1994 and @PengLiangzu.
Thanks to everyone who stopped by to chat about long context, efficient reasoning, and the future of hybrid models.
Love the visualization! One thing I keep wondering with this family: all the decay/erase/write gates are computed from the current token only.
We tried the opposite in Gated KalmaNet, deriving an update using all tokens from the past via online ridge regression, still linear-time. Curious where it'd sit here.
https://t.co/eVQMEpohHo
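For readers who want a concrete picture of "online ridge regression over all past tokens, still linear-time", here is a minimal sketch of textbook exponentially weighted (fading-memory) ridge regression. It is not the GKA layer itself: the function name `fading_ridge_readout`, the scalar decay `gamma`, the regularizer `lam`, and the dense solve are all assumptions made for illustration, and the paper's actual gating and solver may differ.

```python
import numpy as np

def fading_ridge_readout(keys, values, queries, gamma=0.95, lam=1e-2):
    """Illustrative fading-memory ridge regression (not the GKA kernel).

    At step t it solves
        min_W  sum_{i<=t} gamma**(t-i) * ||W k_i - v_i||^2 + lam * ||W||_F^2
    and returns o_t = W_t q_t. Only two fixed-size statistics are carried
    forward, so the per-token cost does not grow with sequence length.
    """
    T, d_k = keys.shape
    d_v = values.shape[1]
    A = np.zeros((d_k, d_k))   # decayed sum of k_i k_i^T
    B = np.zeros((d_v, d_k))   # decayed sum of v_i k_i^T
    eye = np.eye(d_k)
    outputs = np.empty((T, d_v))
    for t in range(T):
        # Fade the old statistics, then add the current token.
        A = gamma * A + np.outer(keys[t], keys[t])
        B = gamma * B + np.outer(values[t], keys[t])
        # W_t = B (A + lam I)^{-1}, applied to q_t without forming W_t.
        outputs[t] = B @ np.linalg.solve(A + lam * eye, queries[t])
    return outputs
```

Every past token influences the readout through `A` and `B`, whereas a gate computed from the current token alone only rescales the state. With `gamma=1.0` the sketch reduces to ordinary recursive ridge regression with no forgetting.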
Results across three settings:
Pure SSM at 2.8B, head-to-head with Mamba-2, Gated DeltaNet, and Gated Linear Attention:
• Best average on LM Eval Harness across SSM baselines
• Strongest SSM on recall tasks (FDA, SWDE), closing the gap with Attention
Hybrid GKA at 8B:
• Beats Hybrid Mamba-2 and Hybrid Gated DeltaNet on long-context RULER @128K, well beyond the 8K context prior SSM hybrids typically report
Hybrid GKA at 32B:
• Up to 1.5× faster than Qwen3-32B on AIME 2025 while matching accuracy
Vision:
• As a drop-in replacement for the Mamba layer in MambaVision, GKA outperforms Mamba on ImageNet, is faster than ViT, and requires no vision-specific tuning
See you at #CVPR2026.
3/4
If pre-training cost has kept you from trying Hybrid (Attention + SSM) architectures at scale, Priming changes the math: 100× cheaper to train, matching Qwen3-32B accuracy at 2× decode speed.
Models, training code, vLLM plugin, and Triton kernels open today.
Introducing Priming
Hybrid models are faster and cheaper than Transformers to scale. But developing alternative architectures from scratch requires expensive pre-training runs.
Priming solves this by leveraging pre-trained Transformer weights to train equally performant Hybrid models with 2× faster throughput. Builders can now iterate on Hybrid architectures for under 150B tokens, 100× cheaper than pre-training.
1/12