Currently doing a write up on scaling laws for RL. Here are the papers I'm covering so far:
1. The Art of Scaling Reinforcement Learning Compute for LLMs (https://t.co/PGjI6Gwgv0)
2. Scaling Behaviors of LLM Reinforcement Learning Post-Training (https://t.co/2u2saB3C0h)
3. Optimally Scaling Sampling Compute for LLM RL (https://t.co/rUSdUvJyNH)
What am I missing? Please share other papers I should include!
Stanford and Caltech researchers just published the first comprehensive taxonomy of how llms fail at reasoning
not a list of cherry-picked gotchas. a 2-axis framework that finally lets you compare failure modes across tasks instead of treating each one as a random anecdote
the findings are uncomfortable
Let's get into the depth of why
Let's take Qwen 2.5 (32B parameters) running in BF16 on a single NVIDIA A100 80 GB (SXM) GPU.
--> Parameters: 32B
--> BF16 storage: 32×2=64 ; 64 GB
--> GPU memory: 80 GB
--> Free memory: 80−64=16 GB
perfect right ?
No
1. There are two major bottlenecks in LLM inference:
a. Memory capacity & memory bandwidth
b. Compute throughput
2. The generation process primarily has two stages:
a. Prefill
--> It computes single forward pass over the entire prompt and is compute bound.
--> Dominated by matrix multiplications and attention
Time to first token (TTFT)
b. Decoding
This is the phase where token by token generation happens. It's memory bandwidth bound and dominated by loading model weights repeatedly.
3. For A100 80 GB SXM:
BF16 peak compute: ~312 TFLOPS
HBM2e bandwidth: ~2,039 GB/s
To fully utilize compute, the GPU must perform:
2,039 (GB/s)/312 TFLOPS ~~ 153 FLOPs per byte
If your workload performs fewer than ~153 math operations per byte loaded, the GPU is memory bound and compute units idle.
4. During token by token generation GPU must load almost the entire model for every token, The model cannot stay on chip L2 cache (~40 MB) is far too small for a ~65 GB model. The decoding speed for this model is ~ 31 token/sec. *(This is an upper bound, real systems are usually slower due to kernel inefficiencies and synchronization.)
5. Qwen takes around ~0.7–0.8 ms(1.55GB/2035 GB/s) minimum latency.
with GQA we will need kv cache 262 KB/token.
6. Single sequence KV cache:
2K context → 1.34 GB
8K context → 5.37 GB
32K context → 21.5 GB
131K context → 85.9 GB
--> At long context, KV cache > model size.
--> Parameter count becomes almost irrelevant.
You can look at the config : https://t.co/hiXH1Yvnqt
*(Because real inference engines upcast and pad the KV cache (FP32 + paging/workspaces), so the actual memory moved per token is ~2–2.5× the theoretical 262 KB/token.)
7. Total A100 80GB HBM: 80 GB
Usable (95%): 76 GB
Model Weights: 65 GB
Remaining for KV Cache: 11 GB
Maximum supported:
- At 2K context: ~8 concurrent sequences
- At 8K context: ~2 concurrent sequences
- At 32K context: Unable to serve on single GPU
8. Prefill is compute bound and quadratic attention scaling means longer contexts dramatically increase time to first token.
Although Qwen 2.5 32B fits in 80 GB by parameter count (65 GB), decoding each token streams ~66–86 GB from HBM, costing ~46–61 ms while compute takes <0.3 ms
simultaneously, the KV cache grows from ~1.34 GB at 2K to ~85.9 GB at 131K context, collapsing batching therefore inference latency and cost are determined by memory bandwidth, KV cache growth, and context length, not parameter count.
🚨 Microsoft Research just launched something that might define the next era of AI systems.
They call it 'Agentic Organization' and it’s not just a new model. It’s a new way for intelligence itself to organize.
Here’s what’s wild:
Most large language models still “think” like a single brain.
Step-by-step. Linear. Slow. Even “parallel thinking” just runs the same process twice and merges answers later.
Agentic Organization changes the entire game.
They built a new reasoning protocol called AsyncThink, where a model plays both roles an Organizer that breaks a complex problem into sub-queries, and Workers that solve those sub-parts at the same time.
Think of it like this:
Instead of one mind grinding through steps, AsyncThink forms a mini civilization of minds delegating, merging, adapting in real time.
And it learns this behavior through reinforcement learning literally learning how to organize its own thoughts.
The results are insane:
→ 28% lower inference latency than parallel thinking
→ Higher accuracy on math reasoning tasks
→ Zero-shot generalization to unseen problems like Sudoku
→ Learned organizational policies that evolve dynamically during reasoning
It’s like scaling from “an intelligent agent” → to “an intelligent organization.”
AsyncThink models don’t just reason faster they reason like teams do.
Fork. Think. Join. Verify. Iterate.
This is a glimpse of post-LLM intelligence systems that don’t just think, they coordinate thought.
And if that holds, the future of AI might look less like a single brain… and more like a company of minds.
Paper: The Era of Agentic Organization: Learning to Organize with Language Models