when you do OPD with GRPO at group size = 1,
it is basically PPO:
you are replacing the value function in the advantage with the teacher model.
Looking at things through a new lens / a different view is a good thing,
but the "PPO is dead" take was always dumb.
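A rough sketch of that claim, reading OPD as on-policy distillation (an assumption) and using the per-token log-prob gap between teacher and student as the advantage, which is one common formulation and not something confirmed here. The names `gae_advantages`, `opd_advantages`, and `clipped_surrogate` are made up for illustration. The point is that both advantage sources feed the same PPO-style clipped objective:

```python
import math

def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """PPO-style advantages: a learned value function supplies the baseline."""
    padded = list(values) + [0.0]  # value after the final token is taken as 0
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * padded[t + 1] - padded[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def opd_advantages(student_logprobs, teacher_logprobs):
    """Assumed OPD signal: teacher minus student log-prob on each sampled token.
    No value function and no group baseline is involved."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

def clipped_surrogate(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Standard PPO clipped objective; it does not care where advantages came from."""
    total = 0.0
    for new, old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new - old)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Hypothetical per-token log-probs for one sampled completion (group size 1).
student = [-1.0, -2.0]
teacher = [-0.5, -3.0]
adv = opd_advantages(student, teacher)
print(adv, clipped_surrogate(student, student, adv))
```

Swapping `opd_advantages` for `gae_advantages` changes only the advantage source; the clipped update stays the same.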
I do like this paper.
It gives a proof that minimum neural weight norm matches minimum program length, aka Kolmogorov complexity, up to a log factor.
Weight decay works because small weights push neural nets toward simpler, more compressible explanations.
https://t.co/7x8JSFjqVa
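A toy illustration of the weight decay intuition, not the paper's construction: with two identical features, many weight vectors fit the data equally well, and weight decay steers gradient descent to the smallest-norm one. The data, the starting point, and the `train` helper are all hypothetical.

```python
import math

# Toy data: two identical features, so any w1 + w2 = 2 fits exactly
# and infinitely many weight vectors give zero training loss.
XS = [1.0, 2.0, 3.0]
YS = [2.0 * x for x in XS]

def train(weight_decay, steps=5000, lr=0.05):
    w1, w2 = 3.0, -1.0  # a zero-loss but needlessly large solution
    for _ in range(steps):
        # Both features equal x, so both weights share the same loss gradient.
        g = 2.0 * sum(x * ((w1 + w2) * x - y) for x, y in zip(XS, YS)) / len(XS)
        w1 -= lr * (g + weight_decay * w1)
        w2 -= lr * (g + weight_decay * w2)
    return w1, w2

def norm(w):
    return math.hypot(*w)

plain = train(weight_decay=0.0)
decayed = train(weight_decay=0.1)
print(plain, norm(plain))
print(decayed, norm(decayed))
```

Without decay the run stays at the large zero-loss solution; with decay the two weights converge to nearly equal values just under 1, trading a little fit for a much smaller norm.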
some recent reads from this month that I learned from and thought were pretty cool:
1. Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
https://t.co/p0o0ktBe90
2. Triton Flash Attention Kernel Walkthrough: The Forward Pass
https://t.co/CG4ieEldWb
3. This guy's Substack
https://t.co/y89yjEM9x3
4. Deep Dive into Triton Internals (3 Parts)
https://t.co/JiX9jhN6pg
5. HunyuanWorld-Mirror: Technical Report
https://t.co/u8rZf5Whsl
6. Understanding the CUDA Compiler & PTX with a Top-K Kernel
https://t.co/YiBQB8tHHa
7. Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields
https://t.co/SEfb3AxULE