Aditya Chattopadhyay @achatto1994 - Twitter Profile

Pinned Tweet

@achatto1994

about 1 month ago

#CVPR2026 is around the corner and we're excited to share Gated KalmaNet: A Fading Memory Layer through Test-Time Ridge Regression. Looking forward to meeting everyone who wants to learn more. Gated KalmaNet (GKA, pronounced "gee-ka") generalizes Mamba-2 and Gated DeltaNet, and outperforms both under identical training conditions. It also works beyond language: swapping the Mamba layer in MambaVision for GKA improves ImageNet accuracy with no vision-specific tuning. 1/4

achatto1994 retweeted

Luca Zancato @ZancatoLuca

15 days ago

Had a great time presenting “Gated KalmaNet (GKA): A Fading Memory Layer Through Test-Time Ridge Regression” at #CVPR2026 with @achatto1994 and @PengLiangzu. Thanks to everyone who stopped by to chat about long context, efficient reasoning, and the future of hybrid models.

https://t.co/xEeCShKb9f

Aditya Chattopadhyay

@achatto1994

20 days ago

Presenting this work at #CVPR2026 - Sat June 6, 11:45 AM–1:45 PM MDT, ExHall F (poster/booth 557) along with @ZancatoLuca and @PengLiangzu.

Aditya Chattopadhyay

@achatto1994

about 1 month ago

Love the visualization! One thing I keep wondering with this family: all the decay/erase/write gates are computed from the current token only. We tried the opposite in Gated KalmaNet, deriving an update using all tokens from the past via online ridge regression, still linear-time. Curious where it'd sit here. https://t.co/eVQMEpohHo
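
To make "online ridge regression, still linear-time" concrete, here is a minimal sketch of ridge regression with exponential forgetting. It is a generic textbook-style construction, not the GKA layer itself: the class name `FadingRidge`, the helper `solve`, and the parameters `gamma` (decay) and `lam` (ridge strength) are all hypothetical. It keeps two decayed running sums, so the cost per token does not depend on how many tokens came before, and the total cost grows linearly with sequence length.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


class FadingRidge:
    """Online ridge regression with exponential forgetting (illustrative only).

    After each update, weights() minimizes
        sum_i gamma^(t-i) * (v_i - w . k_i)^2 + lam * ||w||^2
    over all (k_i, v_i) pairs seen so far.
    """

    def __init__(self, dim, gamma=0.9, lam=0.1):
        self.gamma, self.lam = gamma, lam
        self.H = [[0.0] * dim for _ in range(dim)]  # decayed sum of k k^T
        self.b = [0.0] * dim                        # decayed sum of v * k

    def update(self, k, v):
        # Decay the old statistics, then add the new pair: cost O(dim^2).
        d = len(k)
        for i in range(d):
            self.b[i] = self.gamma * self.b[i] + v * k[i]
            for j in range(d):
                self.H[i][j] = self.gamma * self.H[i][j] + k[i] * k[j]

    def weights(self):
        # Solve the regularized normal equations (H + lam * I) w = b.
        d = len(self.b)
        A = [[self.H[i][j] + (self.lam if i == j else 0.0) for j in range(d)]
             for i in range(d)]
        return solve(A, self.b)
```

Every past pair influences the current weights through `H` and `b`, with older pairs discounted by powers of `gamma`; this is the sense in which the fit uses the whole past rather than only the current token.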

Aditya Chattopadhyay

@achatto1994

about 1 month ago

Work by: Liangzu Peng (@PengLiangzu), Aditya Chattopadhyay (@achatto1994), Luca Zancato (@ZancatoLuca), Elvis Nunez, Wei Xia (@wxhawaii) and Stefano Soatto (@soatto4)

Code: https://t.co/oKvNIbnZrJ
Paper: https://t.co/eVQMEpohHo
Models: https://t.co/JftDYt8obw
Talk: https://t.co/N9pYEjYfNZ

#StateSpaceModels #LinearAttention #LongContext #LLM #Mamba #GatedDeltaNet

4/4

Aditya Chattopadhyay

@achatto1994

about 1 month ago

Results across three settings:

Pure SSM at 2.8B, head-to-head with Mamba-2, Gated DeltaNet and Gated Linear Attention
• Best average on LM Eval Harness across SSM baselines
• Strongest SSM on recall tasks (FDA, SWDE), closing the gap with Attention

Hybrid GKA at 8B:
• Beats Hybrid Mamba-2 and Hybrid Gated DeltaNet on long-context RULER @128K, well beyond the 8K context prior SSM hybrids typically report

Hybrid GKA at 32B:
• Up to 1.5x faster than Qwen3 32B on AIME 2025 while matching accuracy.

Vision:
• GKA can be used as a drop-in replacement in MambaVision and outperforms Mamba on ImageNet, is faster than ViT, and requires no vision-specific tuning.

See you at #CVPR2026.

3/4

Aditya Chattopadhyay

@achatto1994

about 1 month ago

If pre-training cost has kept you from trying Hybrid (Attention + SSM) architectures at scale, Priming changes the math: 100× cheaper to train, matching Qwen3-32B accuracy at 2× decode speed. Models, training code, vLLM plugin, and Triton kernels open today.

Prannay Kaul

@PrannayKaul

about 1 month ago

Introducing Priming

Hybrid models are faster and cheaper than Transformers to scale. But developing alternative architectures from scratch requires expensive pre-training runs. Priming solves this by leveraging pre-trained Transformer weights to train equally performant Hybrid models with 2× faster throughput. Builders can now iterate on Hybrid architectures for under 150B tokens, 100× cheaper than pre-training. 1/12
