Atmadeep Banerjee @abanerjee99 - Twitter Profile

GANs will be solved when continual learning is solved. Disc learning is a continual learning problem. You start by allocating params to broad features and as G improves you have to move to finer feats. But you can't forget the old broad feats as the G can regress to fool you. D solves different problems at different parts of training. Fixing the loss landscape is one thing but it might need arch changes. Probably a growing MoE type continually learning D.

kalomaze

@kalomaze

7 days ago

classic/hinge GAN loss == binary cross entropy; rpGAN/relative GAN loss == Bradley-Terry. Mode collapse from the old GAN literature is primarily an artifact of the fact that pointwise scalars aren't grounded in the relative gap between the real and fake samples. Ranking > rating.
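A minimal sketch of the contrast drawn in the quoted tweet, assuming scalar discriminator logits and the usual textbook forms of the two losses (the function names are illustrative, not from any library). The pointwise loss scores each sample on its own; the pairwise loss is the Bradley-Terry negative log-likelihood that the real sample outranks the fake one, so it depends only on the gap between the two scores.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bce_discriminator_loss(real_logit, fake_logit):
    # Pointwise "rating": -log sigmoid(real) - log(1 - sigmoid(fake)).
    return softplus(-real_logit) + softplus(fake_logit)

def bradley_terry_loss(real_logit, fake_logit):
    # Pairwise "ranking": -log sigmoid(real - fake).
    return softplus(-(real_logit - fake_logit))

# Hypothetical scores; shifting both by the same constant keeps the gap fixed.
real, fake, shift = 2.0, 0.5, 3.0

print("BCE:          ", bce_discriminator_loss(real, fake),
      bce_discriminator_loss(real + shift, fake + shift))
print("Bradley-Terry:", bradley_terry_loss(real, fake),
      bradley_terry_loss(real + shift, fake + shift))
```

With these inputs the Bradley-Terry value is unchanged by the shift while the BCE value moves, which is one way to read "pointwise scalars aren't grounded in the relative gap".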

Atmadeep Banerjee

@abanerjee99

14 days ago

We called this bagging and boosting

ar0cket1

@ar0cket1

14 days ago

Has anyone tried a mixture of idiots instead of mixture of experts.

Atmadeep Banerjee

@abanerjee99

14 days ago

@MainzOnX People building the >2kW fp4 chips still need fp64 for simulating thermals.

Atmadeep Banerjee

@abanerjee99

14 days ago

This would go so hard on booktok

Ashton Hall

@AshtonHallofc

15 days ago

Missing my Indian brother

Atmadeep Banerjee

@abanerjee99

14 days ago

@xlr8harder By betting on CCP distillation machinery.

Atmadeep Banerjee

@abanerjee99

16 days ago

Welcome back Deep Belief Nets

Sakana AI

@SakanaAILabs

17 days ago

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation https://t.co/c9AvsRKybj

What if we didn’t have to hold an entire neural network in memory to train it?

Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.

In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.

How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code, to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E 🐟

Atmadeep Banerjee

@abanerjee99

17 days ago

Depending on how you implement your sampling, diffusion can have far worse ttft for sequential outputs like creative writing/audio/video. AR is inherently streamable. In agentic coding this does not really matter as you want to minimize the total time to generate the whole response
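A back-of-the-envelope sketch of that trade-off, under loudly simplified assumptions: the autoregressive model emits one token per forward pass, the diffusion sampler denoises the whole sequence jointly and shows nothing until the last step, and every number below is hypothetical rather than measured. Function names are illustrative only.

```python
def autoregressive_latency_ms(num_tokens, step_ms):
    # One forward pass per token: the first token is ready after one step,
    # the full response after num_tokens steps.
    ttft = step_ms
    total = num_tokens * step_ms
    return ttft, total

def full_sequence_diffusion_latency_ms(num_denoise_steps, step_ms):
    # Assumes the whole sequence is denoised jointly, so nothing can be
    # streamed before the final step: time to first token equals total time.
    total = num_denoise_steps * step_ms
    return total, total

# Hypothetical figures: 1000 tokens at 20 ms per AR step,
# versus 50 denoising steps at 50 ms each.
ar_ttft, ar_total = autoregressive_latency_ms(1000, 20)
diff_ttft, diff_total = full_sequence_diffusion_latency_ms(50, 50)

print(f"AR:        ttft={ar_ttft} ms, total={ar_total} ms")
print(f"Diffusion: ttft={diff_ttft} ms, total={diff_total} ms")
```

Under these assumed numbers the diffusion sampler finishes the whole response sooner but has a much worse time to first token. Block-wise or semi-autoregressive diffusion samplers would sit between the two cases, which is the "depending on how you implement your sampling" caveat.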

Atmadeep Banerjee

@abanerjee99

20 days ago

@MostlyMonkey @menhguin @DeepDishEnjoyer Out of college at 22 I was making 7x my monthly spend post tax. Not including investment income. This was at a tech startup at peak zirp but such salaries are sustainable in quant. 5 years of this would buy 30 years of retirement. Add investment incomes and 35 yrs easy.

Atmadeep Banerjee

@abanerjee99

21 days ago

Not weird if you are from a right hand drive country

lilly sharples

@lillysharples

23 days ago

Anyone who rides shotgun alone in a Waymo should be studied

Atmadeep Banerjee

@abanerjee99

21 days ago

Language people inventing FPNs from first principles.

Xiangdong Zhang

@aHapBean

21 days ago

Sorry for the late release. NITP (Next Implicit Token Prediction) is about to release, expected next Monday.😀 A new LLM pre-training paradigm that goes beyond next-token prediction by learning the next token’s implicit representation, fixing representation degeneration. https://t.co/bJcrfXPLDl
