Independently verified Theorem 2 from the EGGROLL paper by @bidiptas13 @j_foerst @shimon8282 @AaronCourville (arXiv:2511.16652) on a 4GB RTX 3050.
slope -0.898 (toy), -0.755 (MNIST MLP). R² = 0.998 and 0.9995. the toy slope is firmly inside [-1.2, -0.8]; the MNIST slope sits just outside it at -0.755.
but the methodology is the actual contribution: naive replications will get slope ≈ -0.004 and walk away thinking the theorem fails.
Three things have to be right:
1. nonlinear fitness. the obvious quadratic (-‖W − W*‖²) gives EGGROLL and Gaussian ES identical bias by construction: its 3rd and higher derivatives are zero, so the theorem is invisible.
2. σ-tuning. bias ∝ σ², SEM is roughly σ-independent → SNR ∝ σ². at σ=0.05 slope is flat. at σ=0.3 it’s -0.755. same theorem, same code.
3. proper bias estimator. average gradients across trials first, then: bias(r) = √(max(err² − SEM_eg² − SEM_ref², 0)). per-trial error is just MC noise.
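a rough sketch of the estimator in point 3, assuming the per-trial gradient estimates are stacked as arrays of shape (trials, dim) and that err is the Euclidean distance between the two trial-averaged gradients. the function name, the array layout, and summing SEM² over coordinates are illustrative assumptions, not taken from the writeup or the repo.

```python
import numpy as np


def debiased_bias(grads_eg, grads_ref):
    """Noise-corrected distance between two trial-averaged gradient estimates.

    grads_eg, grads_ref: arrays of shape (trials, dim) with per-trial
    gradient estimates (hypothetical layout, assumed for this sketch).
    """
    grads_eg = np.asarray(grads_eg, dtype=np.float64)
    grads_ref = np.asarray(grads_ref, dtype=np.float64)

    # Average across trials first; a single trial's error is mostly MC noise.
    mean_eg = grads_eg.mean(axis=0)
    mean_ref = grads_ref.mean(axis=0)
    err_sq = np.sum((mean_eg - mean_ref) ** 2)

    # Squared standard error of each mean, summed over coordinates (assumption).
    sem_eg_sq = np.sum(grads_eg.var(axis=0, ddof=1)) / grads_eg.shape[0]
    sem_ref_sq = np.sum(grads_ref.var(axis=0, ddof=1)) / grads_ref.shape[0]

    # bias = sqrt(max(err^2 - SEM_eg^2 - SEM_ref^2, 0))
    return float(np.sqrt(max(err_sq - sem_eg_sq - sem_ref_sq, 0.0)))


# Synthetic check: unit-variance noise plus a known offset of norm 0.5.
rng = np.random.default_rng(0)
offset = np.full(8, 0.5 / np.sqrt(8))
ref = rng.normal(size=(20000, 8))
eg = rng.normal(size=(20000, 8)) + offset
estimated = debiased_bias(eg, ref)
```

on this synthetic data the estimate should land close to the injected 0.5, and comparing an array with itself clamps to exactly zero because the SEM terms exceed the (zero) squared error.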
Theorem 2 holds. the math is clean. the setup just has to be.
think @yacinelearning would find the bias-isolation angle interesting.
full writeup → https://t.co/4VK3RdUXM8
code → https://t.co/HtuWmrPwmD
Built the complete mathematical picture of EGGROLL from scratch, filling in everything the paper skips.
Walked through every proof and insight step by step.
Working on an implementation that runs on a consumer GPU like an RTX 3050 laptop, to actually show what it looks like when memory and compute stop being the bottleneck.
Great work by @bidiptas13 and the team at @j_foerst's Oxford FLAIR lab.
While going deep on this paper, I came across @yacinelearning's video, which is a solid high-level view of the paper if you want to start there.
Check it out: https://t.co/4VK3RdUXM8
one of our papers, 'alphadesign: a hybrid reinforcement learning and genetic algorithm approach for f1 front wing optimization compliant with fia 2026 regulations', got accepted at flins-iske 2025
couldn't afford the registration so we won't be attending. trying to put it up on arxiv instead -- if anyone can help with an endorsement for cs.LG or https://t.co/EEB4lVUWIM, would really appreciate it
huge credit to @scriptosis, @harish20205
I've fully covered the mathematical foundation of IceCache that was discussed in the paper, and parts that weren't detailed there.
IceCache is a novel approach to managing KV caches: it uses Dynamic Continuous Indexing (DCI) to organize tokens by their semantic relationships and retrieve them more efficiently.
I walked through the complete sparse-retrieval theory step by step: every formula explained from first principles, every design choice motivated, every minute mathematical detail laid out. Implementation is in the next post. Check it out:
https://t.co/pdzo5YX0Ka
Thank you for this wonderful paper, would love any feedback or guidance
@KL_Div @Mao_Yuzhen @q1tong
A few weeks ago @anirudhbv_ce shipped TurboQuant in cuTile on a Blackwell B200. Beautiful work - genuinely.
I wanted to see what the same algorithm looks like at the opposite end of the stack: raw CUDA + hand-written PTX on a 4GB laptop GPU. No cuTile. No B200. Just nvcc, shared memory, and a lot of nsight-compute.
Quick context: the TurboQuant paper proposed compressing KV caches and vector-search embeddings way harder than anything before it. I covered it earlier on my blog; please go through that first for a clearer understanding of the algorithm:
https://t.co/uvIpEaCg4L
I implemented it three ways and benchmarked them against each other:
1. Vanilla CUDA: two kernels. Rotate (FWHT) in kernel 1, quantize + bit-pack in kernel 2. Clean, but the intermediate rotated tensor gets written out to HBM and read back: 64 MB of wasted memory traffic per call at N=65k, d=128. Two launches too.
2. Fused — one kernel. Rotated vector stays in shared memory, quantize reads it from there directly. HBM round-trip gone. 1.10–1.25× faster on quantize, 1.17–1.41× on dequant. Dequant benefits more because its per-coord compute is smaller, so saving the HBM trip is a bigger fraction of runtime.
Hit a fun bug here. The fused kernel agreed bit-for-bit with the unfused version at b=1 and b=2, but differed in exactly 1 coord out of ~4 million at b=4. Cause: --use_fast_math was fusing (smem * scale) - codebook[k] into a single FMA, which rounds once instead of twice. At midpoint ties, that's enough to flip which centroid wins. Fix: pin the scale multiply with __fmul_rn. Bit-exact parity restored.
(cc @tri_dao — the __fmul_rn / FMA rounding thing felt like exactly the kind of footgun you run into in FlashAttention territory. Curious whether you pin rounding explicitly at ops that matter or just test against a tolerance.)
3. Fused + inline PTX. Two experiments. One paid off massively, one did nothing:
- pack_signs with warp ballot (vote.sync.ballot.b32): ~2.0× across every config. 32 threads each contribute one bit in unison via a warp-level primitive. No clean C++ form.
- bfi.b32 for bit-packing the quantized indices: zero speedup, within noise. I checked the SASS and nvcc already emits BFI from the C++ pattern word |= (idx & MASK) << shift. The inline PTX was cosmetic.
Takeaway: inline PTX only pays off when it exposes a hardware primitive C++ can't express.
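The FMA rounding bug from the fused kernel can be reproduced on the CPU. Below is a minimal sketch, not the kernel code: the values of x, scale and the two-entry codebook are hand-picked for illustration, and the single-rounding FMA is emulated by doing the arithmetic in float64 (which is exact for these particular inputs) and rounding to float32 once.

```python
import numpy as np

# Hand-picked float32 inputs (illustrative, not from the repo): the rounded
# product x * scale lands exactly on the midpoint between the two centroids.
x = np.float32(1 + 2**-23)
scale = np.float32(1 + 2**-22)
codebook = np.array([1.0, 1 + 6 * 2**-23], dtype=np.float32)

# Two roundings: the product is rounded to float32, then the subtraction is.
prod_rounded = x * scale  # float32 * float32 -> float32
dist_two_step = np.abs(prod_rounded - codebook)

# One rounding (FMA-like): x * scale - c is formed exactly in float64 for
# these inputs, then rounded to float32 a single time.
exact = np.float64(x) * np.float64(scale) - codebook.astype(np.float64)
dist_fused = np.abs(exact.astype(np.float32))

# argmin returns the first index on an exact tie.
k_two_step = int(np.argmin(dist_two_step))
k_fused = int(np.argmin(dist_fused))
print(k_two_step, k_fused)
```

With two roundings the distances tie exactly and the first centroid wins; with one rounding the 2^-45 residue of the product survives and tips the choice to the second centroid. That is the same mechanism as the 1-in-4-million mismatch, which is why pinning the multiply with __fmul_rn restores parity.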
End-to-end on SIFT-1M (1M × 128 vectors, standard ANN benchmark):
- 93% Recall@10 at 8× compression with fp32 rerank
- Naive scalar quantization at same bits: 68%
- At 16× compression, naive is essentially random (6%); TurboQuant still preserves 52% of true top-10
(@vikhyatk @StasBekman — tried to be careful separating the three "baselines" here: fp32 ceiling vs fp16 cast vs naive scalar at matched bits. If anyone spots methodology holes, I'd take the feedback.)
Paper's predicted distortion for b={1,2,4}: {0.36, 0.117, 0.009}. I measured {0.361, 0.116, 0.00933} over 10 seeds. Sits right on the predicted curve.
Full repo, all three implementations, reproducible benchmarks: → https://t.co/u5dwdAwcWf
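For readers who want the two packing steps from the PTX experiments in plain form, here is a CPU-side sketch. It only mirrors the bit layout: pack_signs emulates what a warp ballot produces (bit i comes from lane i), and pack_indices uses the word |= (idx & MASK) << shift pattern quoted above. The function names, the choice of "negative means bit set", and the 32-bit word layout are assumptions for illustration, not the repo's actual conventions.

```python
def pack_signs(values):
    """Emulate a 32-lane ballot: bit i is set when lane i's value is negative."""
    assert len(values) == 32
    word = 0
    for lane, v in enumerate(values):
        word |= (1 if v < 0 else 0) << lane
    return word


def pack_indices(indices, bits):
    """Pack b-bit codebook indices into 32-bit words, lowest slot first."""
    mask = (1 << bits) - 1
    per_word = 32 // bits
    words = []
    for start in range(0, len(indices), per_word):
        word = 0
        for slot, idx in enumerate(indices[start:start + per_word]):
            word |= (idx & mask) << (slot * bits)
        words.append(word)
    return words


def unpack_indices(words, bits, count):
    """Inverse of pack_indices: recover the first `count` indices."""
    mask = (1 << bits) - 1
    per_word = 32 // bits
    return [(words[i // per_word] >> ((i % per_word) * bits)) & mask
            for i in range(count)]


indices = list(range(16))           # sixteen 4-bit indices -> two 32-bit words
words = pack_indices(indices, 4)
restored = unpack_indices(words, 4, len(indices))
```

On a GPU the sign packing is a single warp-wide instruction, which is where the speedup comes from; the index packing is ordinary shift-and-mask code that the compiler already lowers well.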
@mirrokni
TurboQuant paper's author, @daliri__majid, just liked my post ❤️
It really means a lot that the person behind the work took a moment to check out my deep dive.
Feeling grateful—and even more motivated to keep going. More decodes coming soon…
Thank you so much!
I’ve fully covered the mathematical foundation of TurboQuant that was not detailed in the original paper.
I derived and proved the complete quantization theory behind it, showing that TurboQuant achieves distortion only a factor of ~2.7 away from the theoretical ideal quantizer.
I’ve worked through all the minute mathematical details and proofs step by step.
Implementation is coming soon. Check it out.
Thank you for this wonderful paper @daliri__majid @mirrokni
https://t.co/PXHLBlbqu2
Just made understanding any codebase way simpler.
DGAT is live:
uv pip install dgat
dgat config init
dgat scan /path/to/your/project
One command and you get:
- file_tree.json
- dep_graph.json
- dgat_blueprint.md (LLM-synthesized architectural overview)
Point it at any repo → instant annotated dependency graph + blueprint. Trying with different harnesses like opencode, claude code. Benchmarks coming soon!!
The author of the MaxRL paper @FahimTajwar10 just liked my post ❤️
I’m genuinely smiling right now.
It means a lot that the person who wrote the paper took a moment to check out my deep dive. Feeling incredibly grateful and motivated.
More decodes coming up....
Thank you so much!
Just dropped a full mathematical deep-dive on the MaxRL paper
I went way beyond the original paper and derived every single step they skipped — the hidden connections between maximum likelihood and RL, the exact gradient mismatch, why the simple "average over successes" estimator is actually unbiased for the truncated MaxRL objective, the full binomial expansion proofs, and how it all smoothly interpolates from standard RL (T=1) to true MLE as compute → ∞.
No hand-wavy explanations. Pure math. All the derivations you actually need to understand what's going on under the hood.
Check out the blog:
https://t.co/joitmurbO4
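A small numeric check of the interpolation claim above. It assumes the truncated objective is the order-T truncation of the series log p = −Σ_{k≥1} (1 − p)^k / k, with p the success probability; that reading of "truncated" is my assumption for this sketch, while the series identity itself is standard. The helper name is made up for illustration.

```python
import math


def truncated_log(p, T):
    """Order-T truncation of log(p) = -sum_{k>=1} (1 - p)**k / k, for 0 < p <= 1."""
    return -sum((1.0 - p) ** k / k for k in range(1, T + 1))


p = 0.3

# T = 1 keeps only the linear term: -(1 - p) = p - 1, i.e. the expected
# success rate up to an additive constant (the standard RL objective).
rl_like = truncated_log(p, 1)

# As T grows, the truncation approaches log(p), the maximum-likelihood objective.
gaps = [abs(truncated_log(p, T) - math.log(p)) for T in (1, 4, 16, 64)]
```

The gap to log p shrinks monotonically with T, which matches the picture of moving from standard RL at T=1 toward true MLE as the budget grows.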
One of our papers, 'SpecQuant: Speculative Decoding with Multi-Parent Quantization for Adaptive LLM Inference', just won the Best Paper Award at a conference! Huge credit goes to @harish20205 and @scriptosis!
project website: https://t.co/LSZyCdg3Ur
I wouldn't agree at all that this EsoLang-Bench drop is some profound revelation -- frontier models tanking on brainfuck etc. isn't 'shocking', it's expected when you deliberately pick syntax torture tests with near-zero training data. Many replies nail it: no human baseline, unfair apples-to-oranges (humans suck at BF too), ignores that agentic setups (tools/iterations) crush it anyway. Feels like classic visibility-farming clickbait ('🚨 Shocking') while poking at memorized patterns.
Real optimization isn't syntax games—it's efficient binaries/kernels, which is why KernelBench matters.
@xai's recent reorg/talks (@elonmusk) push exactly that: models discovering optimal machine binaries directly, skipping traditional compile paths for massive efficiency gains.
Not what I'd expect serious labs (including Indian ones) to hype, but here we are.
🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%.
Presenting EsoLang-Bench.
Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵
It’s finally here .....
My deep dive into NVIDIA’s new “alien” Rubin GPUs — faster than my WiFi when I’m not downloading anything.
If Blackwell was strong, Rubin looks like it bench pressed a data center.
Read before it achieves AGI without us:
https://t.co/QigA3stVb8
Core NVL72 performance benchmarks are undeniably imposing, but synthesizing 25x H100-level inference density inside a Rubin Space-1 orbital module is a quintessential architectural masterstroke. I channeled my extensive Rubin blog into demystifying the ground-state infrastructure—mapping out exactly how ConnectX-9 bypasses MoE bottlenecks, BlueField-4 optimizes packet flow, and the absolute interconnect superiority of NVLink 6. Even after releasing that meticulous teardown a month prior to GTC, its hardware forecasts remain profoundly unassailable.
Still, contending with those zero-G thermal envelopes radically rewrites the deployment playbook. The energy-conservation architectures I spotlighted surpass trivial optimization; they function as absolute existential prerequisites in the cosmos. The vault from terrestrial server grids to microgravity operationalization signifies a tectonic inflection point in systems engineering.
Which specific disclosure from the presentation genuinely left you speechless? https://t.co/EvpoPNkmQS
Foundational NVL72 throughput statistics are awe-inspiring, yet integrating 25x H100-level inference density aboard a Rubin Space-1 orbital module stands as an unparalleled engineering triumph. I dedicated my comprehensive Rubin blog to decoding the foundational architecture—demonstrating precisely how ConnectX-9 neutralizes MoE bottlenecks, BlueField-4 streamlines data-path processing, and the sheer communicative supremacy of NVLink 6. Even after publishing that granular breakdown four weeks ahead of GTC, its architectural predictions remain remarkably unassailable.
Regardless, navigating those aerospace-grade thermal limitations completely redefines the infrastructural blueprint. The power-efficiency protocols I outlined eclipse mere iterative upgrades; they operate as non-negotiable survival mandates in deep space. The leap from conventional data centers to orbital execution represents a seismic shift in computing evolution.
Which particular revelation from the showcase unequivocally blew your mind? https://t.co/EvpoPNkmQS
Base-level NVL72 bandwidth metrics are staggering, but embedding 25x H100-level inference density into a Rubin Space-1 orbital module is a supreme architectural feat. I devoted my exhaustive Rubin blog to parsing the underlying topology—illustrating how ConnectX-9 circumvents MoE bottlenecks, BlueField-4 mitigates computational overhead, and the absolute interconnect dominance of NVLink 6. Despite launching that intricate analysis a full month preceding GTC, its technical foresight remains flawlessly intact.
Nevertheless, contending with those orbital thermal constraints irrevocably alters the hardware paradigm. The energy optimizations I documented transcend superficial enhancements; they function as absolute existential imperatives in the exosphere. The pivot from standard server farms to microgravity deployment signifies a monumental leap in systems engineering.
Which specific disclosure from the keynote left you genuinely astounded? https://t.co/EvpoPNkmQS