bycloud

3 days ago

🚨This week's top AI/ML research papers:

- DiffusionBlocks
- A Bitter Lesson for Data Filtering
- Neural Weight Norm = Kolmogorov Complexity
- When Does LeJEPA Learn a World Model?
- Do Language Models Need Sleep?
- Parallax
- Gemini Embedding 2
- Qwen-VLA
- The MiniMax-M2 Series
- Looped Diffusion Language Models
- LocateAnything
- Learn from your own latents and not from tokens

overview for each + authors' explanations
read this in thread mode for the best experience

6 days ago

@teortaxesTex lost in translation? Tau in Chinese is pronounced Tao, and converting that back to pinyin gives Tao

6 days ago

@svgoiboi this aint a dllm

7 days ago

today is (potentially) a great day for the GPU poors

if DiffusionBlocks works on fine-tuning existing models, then literally any reasonable consumer GPU can do LLM fine-tuning

will make a video on this https://t.co/COycfLWKca

7 days ago

@mvidia84853 ur whole morning might be gone cuz this is a cool rabbit hole

7 days ago

https://t.co/D5cpDrnMHP

7 days ago

they were pretty conservative with their paper so here are some bold and cope potentials if it holds up at scale

> 3-4x memory reduction across the board without much quality loss
> train small/mid-sized LLMs on a single GPU
> if you can train each block independently without much comms: less all-reduce, fewer pipeline bubbles, and reduced comms overhead
> if it works on fine-tuning existing models: consumer GPUs/small clusters can fine-tune SoTA models
> if blocks are independent: partial fine-tuning gets cheaper, since you can update subsets of blocks instead of the whole model

feel free to shut me down

Sakana AI

@SakanaAILabs

7 days ago

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation https://t.co/c9AvsRKybj

What if we didn’t have to hold an entire neural network in memory to train it?

Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.

In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.

How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code, to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E 🐟
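A back-of-the-envelope sketch of the memory claim above, in Python. This is an idealized assumption, not code or figures from the paper: it treats every layer as costing the same amount of training memory and assumes a block-wise scheme only has to keep the largest block resident at once. The function name `peak_training_memory`, the `per_layer_cost` unit, and the 24-layer / 4-block numbers are all hypothetical.

```python
def peak_training_memory(num_layers: int, num_blocks: int, per_layer_cost: float = 1.0) -> float:
    """Peak training memory in arbitrary units when one block is trained at a time.

    Assumption (illustrative only): each layer costs `per_layer_cost` units and
    only the largest block must fit in memory. num_blocks=1 corresponds to
    ordinary end-to-end training, where memory grows linearly with depth.
    """
    if not 1 <= num_blocks <= num_layers:
        raise ValueError("need 1 <= num_blocks <= num_layers")
    largest_block = -(-num_layers // num_blocks)  # ceiling division
    return largest_block * per_layer_cost


end_to_end = peak_training_memory(num_layers=24, num_blocks=1)
block_wise = peak_training_memory(num_layers=24, num_blocks=4)
print(end_to_end / block_wise)  # 4.0
```

Real savings would depend on things this ignores, such as parameters, optimizer state, and any per-block overhead, so the ratio is an upper-bound intuition rather than a prediction.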

7 days ago

https://t.co/D5cpDrnMHP

bycloudai retweeted

9 days ago

🚨This week's top AI/ML research papers:

- HRM-Text
- Decoupling the Benefits of Subword Tokenization
- Generative Recursive Reasoning
- Probabilistic Tiny Recursive Model
- Vector Policy Optimization
- Linear-Time Looped Transformers
- Gated DeltaNet-2
- Steered LLM Activations are Non-Surjective
- Code as Agent Harness

overview for each + authors' explanations

bycloudai retweeted

15 days ago

🚨This week's top AI/ML research papers:

- Self-Distilled Agentic RL
- Long Context Pre-Training with Lighthouse Attention
- Embedded Language Flows
- Negation Neglect
- Efficient Pre-Training with Token Superposition
- Slicing and Dicing
- SlimQwen
- Registers Matter for Pixel-Space DiT
- Scaling Laws for Mixture Pretraining Under Data Constraints

overview for each + authors' explanations

bycloudai retweeted

alphaXiv

@askalphaxiv

22 days ago

Reinforcing Recursive Language Models

Can a 4B model learn to recursively call itself to answer hard long-context questions?

We RL fine-tuned a small model to behave as a native RLM.

On evidence selection across scientific papers, our 4B RLM matches Sonnet 4.6 in quality while running significantly faster and cheaper.

bycloudai retweeted

24 days ago

🚨This week's top AI/ML research papers:

- Model Spec Midtraining
- Sparser, Faster, Lighter Transformer LMs
- Continuous Latent Diffusion LM
- Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
- GLM-5V-Turbo
- Nonsense Helps
- TIDE
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

overview for each + authors' explanations

26 days ago

u can still read it here https://t.co/ralGUU8eLh

26 days ago

"Thinking With Visual Primitives" was taken down without explanation

after reading the paper, my take on why they did that: the current version shows that visual primitives can make reasoning much more efficient, but it doesn’t fully answer the big-picture question, which is

How much visual detail can you compress away before better referencing stops being enough?

basically a trade-off between perception and reference gap

They did something similar with engrams (vs MoE), so maybe they wanted to add some more ablation results?

which i hope is the case cuz i would love to see the comparison

bycloudai retweeted

about 1 month ago

🚨This week's top AI/ML research papers:

- The Last Human-Written Paper
- Thinking with Visual Primitives by DeepSeek
- SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
- Qwen-Scope
- Recursive Multi-Agent Systems
- Co-Evolving Policy Distillation
- Representation Fréchet Loss for Visual Generation
- Tuna-2
- Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
- DORA

overview for each + authors' explanations
read this in thread mode for the best experience

about 1 month ago

self plug https://t.co/Svx5uPQZia

about 1 month ago

be DeepSeek

>need to achieve batch invariance so bad
>split-k is the only optimal solution but is batch variant
>Thinking Machines ($50b valuation) could barely recover the performance for their solution, gets 1.6x slower
>hold_my_beer.jpg
>dual-kernel strategy
>"match or even surpass the perf of standard split-k in most major scenarios"
>DeepSeek strikes again
>$20b valuation btw
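A minimal Python sketch of why a split reduction can be "batch variant". Floating-point addition is not associative, so summing the same numbers as two partial sums can differ in the last bits from summing them left to right. If a kernel chooses how to split the reduction based on the batch it is given, the same input can then produce slightly different outputs in different batches. The helpers `sequential_sum` and `split_sum` and the three-value input are hypothetical illustrations, not DeepSeek's or Thinking Machines' kernels.

```python
def sequential_sum(values):
    """Add values strictly left to right (one fixed reduction order)."""
    total = 0.0
    for v in values:
        total += v
    return total


def split_sum(values, split_at):
    """Sum two slices separately, then add the partial sums (a 2-way split reduction)."""
    return sequential_sum(values[:split_at]) + sequential_sum(values[split_at:])


values = [0.1, 0.2, 0.3]
unsplit = sequential_sum(values)   # (0.1 + 0.2) + 0.3
split = split_sum(values, 1)       # 0.1 + (0.2 + 0.3)
print(unsplit == split)  # False
```

The two results differ only by rounding error, but bitwise equality is what batch invariance asks for, which is why the reduction order has to be pinned down regardless of batch shape.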
