Guilherme Favaron

@guifav

The Tech Behind The Tech

São Paulo, Brasil

Joined January 2009

975 Following

303 Followers

9.1K Posts

Guilherme Favaron

@guifav

about 9 hours ago

There is only Hermes in my timeline. Where is openclaw stuff?? What's going on?

Guilherme Favaron

@guifav

about 15 hours ago

Deploying MoE LLMs? Your memory bill isn't set by the experts that compute — it's set by the ones that just sit in HBM. Mixed-precision quantization fixes that, but today's bit allocators ask calibration data which experts matter. That's the trap. New from @MPI_IS + @ELLISInst_Tue (AlphaQ, arXiv:2606.04980) by @Shiwei_Liu66, @tk_rusch (@LiquidAI_), Michael Mahoney et al. — the importance signal was in the weights all along. The problem: frontier MoE training data is proprietary, so every calibration set is a biased surrogate. Calibrate Mixtral's bit map on MATH: 55.8 on GSM8K, 29.6 on code. Calibrate on GitHub-Code: 37.7 on code, 13.7 on math. Same model, same budget. AlphaQ throws the data away. Random Matrix Theory (HT-SR): well-trained layers develop heavy-tailed weight spectra. Fit the power-law tail exponent per layer (Hill estimator, FARMS-debiased) — heavier tail = better trained = more bits. Then one global ILP (a knapsack) assigns 1–4 bits per projection layer under a single budget. Spectra: 3–9 min, once, offline. Solver: seconds. Tokens: zero. Results: - Qwen1.5-MoE @3.5-bit ≈ BF16 accuracy at 1/4 the memory (7.6GB) - Mixtral-8x7B: 96.8 → 13.4GB @2-bit, 1.68x faster - beats calibration-based PMQ/Hessian/AFG at matched budgets The lesson for prod: importance doesn't have to be measured on data you don't have. The checkpoint already knows. Code: https://t.co/KbuTe8Rbwf

guifav's tweet photo. Deploying MoE LLMs? Your memory bill isn't set by the experts that compute — it's set by the ones that just sit in HBM. Mixed-precision quantization fixes that, but today's bit allocators ask calibration data which experts matter. That's the trap.

New from @MPI_IS + @ELLISInst_Tue (AlphaQ, arXiv:2606.04980) by @Shiwei_Liu66, @tk_rusch (@LiquidAI_), Michael Mahoney et al. — the importance signal was in the weights all along.

The problem: frontier MoE training data is proprietary, so every calibration set is a biased surrogate. Calibrate Mixtral's bit map on MATH: 55.8 on GSM8K, 29.6 on code. Calibrate on GitHub-Code: 37.7 on code, 13.7 on math. Same model, same budget.

AlphaQ throws the data away. Random Matrix Theory (HT-SR): well-trained layers develop heavy-tailed weight spectra. Fit the power-law tail exponent per layer (Hill estimator, FARMS-debiased) — heavier tail = better trained = more bits. Then one global ILP (a knapsack) assigns 1–4 bits per projection layer under a single budget. Spectra: 3–9 min, once, offline. Solver: seconds. Tokens: zero.

Results:
- Qwen1.5-MoE @3.5-bit ≈ BF16 accuracy at 1/4 the memory (7.6GB)
- Mixtral-8x7B: 96.8 → 13.4GB @2-bit, 1.68x faster
- beats calibration-based PMQ/Hessian/AFG at matched budgets

The lesson for prod: importance doesn't have to be measured on data you don't have. The checkpoint already knows.

Code: https://t.co/KbuTe8Rbwf

Guilherme Favaron

@guifav

1 day ago

Most long-context LLM serving stacks already use sparse attention + KV-cache offload to CPU. But two costs survive — and at 128K they ARE your decode bottleneck. New from @NVIDIAAIDev (SparDA, arXiv:2606.04511) by @Guangxuan_Xiao, @songhan_mit et al. — the bottleneck isn't FLOPs. It's data movement in time. 1) Offloaded KV must be fetched back over PCIe every step — synchronous, on the critical path. 2) Sparse attention is O(T), but the top-k SELECTION step stays O(T²) and overtakes attention as context grows. SparDA adds a 4th per-layer projection beside Q/K/V — the Forecast. At layer L it predicts the KV blocks layer L+1 will need, ONE step ahead. So the CPU→GPU prefetch overlaps the current layer's compute, and selection uses 1 head per GQA group (no softmax) instead of one per query head. It's a 0.41% (33.5M) add-on: train only the Forecast via KL-distillation against the original selector. Base model frozen. Now at @thinkymachines. Results on H100, MiniCPM4.1-8B & NOSA-8B @128K: - 1.69x decode, 1.25x prefill vs sparse-offload - up to 5.3x decode throughput vs non-offload (bigger batches) - selection cost cut >2x; accuracy matches/beats sparse (+2.3 avg), gap widening with length The lesson for prod: profile data movement, not just FLOPs — and schedule the fetch one step ahead. Code: https://t.co/t15P0dBMZG

Guilherme Favaron

@guifav

3 days ago

Speculative decoding is how most production LLM stacks hit throughput today. The frontier moved twice: autoregressive drafts -> block diffusion drafts (one forward pass predicts the whole block). DFlash set the SOTA. Then block-diffusion drafts hit a ceiling: scaling draft depth stopped helping. New from @PKU1898 + @TencentAI_News (DFlare, arXiv:2606.02091, Jun 1 2026): the ceiling wasn't depth. It was conditioning. DFlash fuses 5 target-model hidden states through ONE FC layer, then injects that same vector into every draft layer. Identical input across layers -> no specialization -> depth saturates. DFlare's fix is almost free: 63 learnable scalars (D=7 draft layers x T=9 target states). Each draft layer attends to its OWN softmax-weighted combination of target hidden states. Plus heterogeneous KV projections so target-context and draft-positions stop sharing Wk/Wv. No new modules. Inference adds virtually zero latency. Results on @Alibaba_Qwen 4B/8B and GPT-OSS-20B (greedy, wall-clock): - 5.52x on Qwen3-4B (+10.6% vs DFlash) - 5.46x on Qwen3-8B (+8.1%) - 3.91x on GPT-OSS-20B (+5.4%) - Acceptance length 7.47 / 7.33 / 4.93 (DFlash plateaued; DFlare keeps scaling with depth and data 800K -> 2.4M) For tech leaders: when your speculative-decoding speedup plateaus, the bottleneck is usually conditioning topology, not draft params. The cheap fix is per-layer. Code (Tencent/AngelSlim): https://t.co/HIZEeCWbzI

guifav's tweet photo. Speculative decoding is how most production LLM stacks hit throughput today. The frontier moved twice: autoregressive drafts -> block diffusion drafts (one forward pass predicts the whole block). DFlash set the SOTA. Then block-diffusion drafts hit a ceiling: scaling draft depth stopped helping.

New from @PKU1898 + @TencentAI_News (DFlare, arXiv:2606.02091, Jun 1 2026): the ceiling wasn't depth. It was conditioning.

DFlash fuses 5 target-model hidden states through ONE FC layer, then injects that same vector into every draft layer. Identical input across layers -> no specialization -> depth saturates.

DFlare's fix is almost free: 63 learnable scalars (D=7 draft layers x T=9 target states). Each draft layer attends to its OWN softmax-weighted combination of target hidden states. Plus heterogeneous KV projections so target-context and draft-positions stop sharing Wk/Wv. No new modules. Inference adds virtually zero latency.

Results on @Alibaba_Qwen 4B/8B and GPT-OSS-20B (greedy, wall-clock):
- 5.52x on Qwen3-4B (+10.6% vs DFlash)
- 5.46x on Qwen3-8B (+8.1%)
- 3.91x on GPT-OSS-20B (+5.4%)
- Acceptance length 7.47 / 7.33 / 4.93 (DFlash plateaued; DFlare keeps scaling with depth and data 800K -> 2.4M)

For tech leaders: when your speculative-decoding speedup plateaus, the bottleneck is usually conditioning topology, not draft params. The cheap fix is per-layer.

Code (Tencent/AngelSlim): https://t.co/HIZEeCWbzI

Who to follow

I.O. MITITI 🇰🇪

@IO_Mititi

Entrepreneur; Technology | Finance | Agribusiness | Energy | Travel & Logistics; Software Engineer @alx_africa https://t.co/DsDa6LifWP

janio garcia

@janiogarcia6

Deus,pátria família e liberdade🇧🇷🇧🇷 servidor público conservador2️⃣2️⃣🇧🇷🇧🇷🇧🇷 torcedor do clube do remo fanático 🦁👍🏽🫡

3 days ago

@NousResearch I want a mobile app

Guilherme Favaron

@guifav

4 days ago

The agentic apps you ship are stateful programs. The serving system underneath them was built for single-shot requests. That mismatch is the real cost. New from @UChicagoCS (Ding, Hosseini, Gholami, Xiang, Hoffmann -- ConServe, arXiv:2606.01839, Jun 1 2026): raise the scheduling unit from the turn to the conversation, and per-turn brittleness disappears. Per-turn schedulers must PREDICT what they cannot see: decode length, tool behavior, KV growth. Any predictor will eventually misjudge -- SLO miss grows linearly in error rate. The brittleness is in the unit, not the predictor. Zoom out one level. A conversation = one heavy turn-1 prefill (compute-bound) + a long memory-bound tail of decode + short append-prefills. The original prefill-decode abstraction, at the right granularity. ConServe (on @vllm_project + @lmcache): - Route turn-1 prefill to a high-throughput prefiller - Transfer KV exactly once - Pin the conversation to one decoder for its entire tail - Two observables only: input length + active KV occupancy. No learned cost model. vs the AMPD per-turn predictor on SWE-bench (4x A40, Qwen3-0.6B): - p95 TTFET -51.08% (time-to-first-effective-token) - Energy +7.51%; +22.75% more on heterogeneous tiers - Zero SLO violations at saturation For tech leaders: per-turn TTFT lies about what your users wait on. Measure TTFET, pin conversations to decoders. https://t.co/CewAi37dta

guifav's tweet photo. The agentic apps you ship are stateful programs. The serving system underneath them was built for single-shot requests. That mismatch is the real cost.

New from @UChicagoCS (Ding, Hosseini, Gholami, Xiang, Hoffmann -- ConServe, arXiv:2606.01839, Jun 1 2026): raise the scheduling unit from the turn to the conversation, and per-turn brittleness disappears.

Per-turn schedulers must PREDICT what they cannot see: decode length, tool behavior, KV growth. Any predictor will eventually misjudge -- SLO miss grows linearly in error rate. The brittleness is in the unit, not the predictor.

Zoom out one level. A conversation = one heavy turn-1 prefill (compute-bound) + a long memory-bound tail of decode + short append-prefills. The original prefill-decode abstraction, at the right granularity.

ConServe (on @vllm_project + @lmcache):
- Route turn-1 prefill to a high-throughput prefiller
- Transfer KV exactly once
- Pin the conversation to one decoder for its entire tail
- Two observables only: input length + active KV occupancy. No learned cost model.

vs the AMPD per-turn predictor on SWE-bench (4x A40, Qwen3-0.6B):
- p95 TTFET -51.08% (time-to-first-effective-token)
- Energy +7.51%; +22.75% more on heterogeneous tiers
- Zero SLO violations at saturation

For tech leaders: per-turn TTFT lies about what your users wait on. Measure TTFET, pin conversations to decoders.

https://t.co/CewAi37dta

Guilherme Favaron

@guifav

4 days ago

@nahkhsinad @joaomdmoura You don't need AI for deterministic problems

Guilherme Favaron

@guifav

5 days ago

Your MoE training stack was built for humans. The coding agents that will extend it pay a hidden tax for that. New from @ruihanglai + @XiongChenyan + @tqchenml at @CarnegieMellon (PithTrain, arXiv:2605.31463, May 29 2026): a compact, agent-native MoE training framework + a new metric production stacks were missing — agent-task efficiency (ATE). The hidden tax. Megatron-LM = 149K LoC. DeepSpeed = 167K. Plugin systems, registry indirection, in-tree CUDA. Great for human engineers; expensive for an agent that has to locate code, trace what runs at a call site, and verify a change. Four agent-native principles: • Compact codebase (PithTrain: 11K LoC) • Python-native (Triton DSL for kernels only) • No implicit indirection — what runs is what you read • Agent skills in-repo (SKILL.md + scripts) Throughput parity. On Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite across H100/B200, PithTrain matches @nvidia Megatron-LM on 4/5 configs via DualPipeV + torch.compile(fullgraph=True). ATE-Bench, fixed agent = @AnthropicAI Claude Code Opus 4.7: • Q&A: up to 67% fewer turns vs Megatron • Train & Evaluate: 38.5 vs 55.5 min, 92 vs 163 turns • MoBA integration editing: 4.7K vs 13.1K vs 22.2K tokens • Skills ablation, validate-correctness: 114 → 34 turns For tech leaders: if a coding agent is going to evolve your training stack, framework structure is the new bottleneck — not just FLOPS. Repo: https://t.co/aozpUV8ftn https://t.co/EGOQvbJkdH

guifav's tweet photo. Your MoE training stack was built for humans. The coding agents that will extend it pay a hidden tax for that.

New from @ruihanglai + @XiongChenyan + @tqchenml at @CarnegieMellon (PithTrain, arXiv:2605.31463, May 29 2026): a compact, agent-native MoE training framework + a new metric production stacks were missing — agent-task efficiency (ATE).

The hidden tax. Megatron-LM = 149K LoC. DeepSpeed = 167K. Plugin systems, registry indirection, in-tree CUDA. Great for human engineers; expensive for an agent that has to locate code, trace what runs at a call site, and verify a change.

Four agent-native principles:
• Compact codebase (PithTrain: 11K LoC)
• Python-native (Triton DSL for kernels only)
• No implicit indirection — what runs is what you read
• Agent skills in-repo (SKILL.md + scripts)

Throughput parity. On Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite across H100/B200, PithTrain matches @nvidia Megatron-LM on 4/5 configs via DualPipeV + torch.compile(fullgraph=True).

ATE-Bench, fixed agent = @AnthropicAI Claude Code Opus 4.7:
• Q&A: up to 67% fewer turns vs Megatron
• Train & Evaluate: 38.5 vs 55.5 min, 92 vs 163 turns
• MoBA integration editing: 4.7K vs 13.1K vs 22.2K tokens
• Skills ablation, validate-correctness: 114 → 34 turns

For tech leaders: if a coding agent is going to evolve your training stack, framework structure is the new bottleneck — not just FLOPS.

Repo: https://t.co/aozpUV8ftn
https://t.co/EGOQvbJkdH

Guilherme Favaron

@guifav

6 days ago

Your "RL-trained reasoner" is mostly a sampler. RL post-training doesn't teach base LLMs new reasoning — it concentrates mass on traces the base already rates plausible. @aakaran31 + @du_yilun (Harvard, arXiv:2510.14901) showed you can match RL by sampling x ∝ p(x)^α — no training, no data, no verifier. Catch: p^α is intractable. Their MH sampler picks a uniform cut and resamples the suffix. But reasoning traces hide a few consequential decisions in thousands of mundane tokens — a uniform cut mostly rewrites the tail. New from @felix_zhou_cfz + @AnayMehrotra (@Yale + @Stanford, arXiv:2605.30327, May 28 2026): Entropy-Cut Metropolis-Hastings. Same MH frame, same target. Cut where next-token entropy jumps: Δt = max(0, h_t − h_{t−1}). Validated: top-decile Δt cuts give 1.33× more edit distance + 1.36× more distinct answers than bottom-decile. Entropy spikes mark decisions; the tail executes. Theory: mixing O(k) in semantic decisions vs Ω(T/b₁) in tokens — exponential separation when k ≪ T. Across MATH500 / HumanEval / GPQA / AIME26 on Qwen2.5-7B, Qwen3-8B-Base, Phi-4-mini: • Qwen2.5-7B MATH500: 35.9 → 71.9 • Qwen3-8B HumanEval: 47.6 → 79.3 • Qwen2.5-Math-7B AIME26: 4.8 → 13.1 Beats Standard, Low-Temp, SMC, TMC, Uniform-Cut MH. pass@k diversity preserved. Cost: zero. Entropies already in the forward pass. Drop-in inside @vllm_project. For tech leaders: reasoning gains aren't gated by RL budget — they're gated by where your MCMC cuts. https://t.co/QcsFqZ8lAQ

$guifav's tweet photo. Your "RL-trained reasoner" is mostly a sampler. RL post-training doesn't teach base LLMs new reasoning — it concentrates mass on traces the base already rates plausible. @aakaran31 + @du_yilun (Harvard, arXiv:2510.14901) showed you can match RL by sampling x ∝ p(x)^α — no training, no data, no verifier. Catch: p^α is intractable. Their MH sampler picks a uniform cut and resamples the suffix. But reasoning traces hide a few consequential decisions in thousands of mundane tokens — a uniform cut mostly rewrites the tail. New from @felix_zhou_cfz + @AnayMehrotra (@Yale + @Stanford, arXiv:2605.30327, May 28 2026): Entropy-Cut Metropolis-Hastings. Same MH frame, same target. Cut where next-token entropy jumps: Δt = max(0, h_t − h_{t−1}). Validated: top-decile Δt cuts give 1.33× more edit distance + 1.36× more distinct answers than bottom-decile. Entropy spikes mark decisions; the tail executes. Theory: mixing O(k) in semantic decisions vs Ω(T/b₁) in tokens — exponential separation when k ≪ T. Across MATH500 / HumanEval / GPQA / AIME26 on Qwen2.5-7B, Qwen3-8B-Base, Phi-4-mini: • Qwen2.5-7B MATH500: 35.9 → 71.9 • Qwen3-8B HumanEval: 47.6 → 79.3 • Qwen2.5-Math-7B AIME26: 4.8 → 13.1 Beats Standard, Low-Temp, SMC, TMC, Uniform-Cut MH. pass@k diversity preserved. Cost: zero. Entropies already in the forward pass. Drop-in inside @vllm_project. For tech leaders: reasoning gains aren't gated by RL budget — they're gated by where your MCMC cuts. https://t.co/QcsFqZ8lAQ$

189

Guilherme Favaron

@guifav

7 days ago

Your LLM API is leaking its stack — not the model, the stack underneath. @mlsec at @bifoldberlin / @TUBerlin (arXiv:2605.29979, May 28 2026): floating-point non-associativity is a side channel. Same model, same prompt, different engine / attention / GPU → different tokens. From chat output only — no logits. Example. Qwen2.5-7B, greedy, "Did England lose a 1966 World Cup game?": • SGLang + FlashAttention + A100 → P(Yes)=0.378 • SGLang + FlashInfer + H100 → P(Yes)=0.500 • TensorRT-LLM + FlashInfer + H100 → P(Yes)=0.562 The attack: four prompt sets, each probing one stage: 1) Rare-token recall → autoregressive decode 2) Yes/No → prefill 3) Numeric ID in long context → chunked prefill 4) "Repeat N times" → KV-cache reuse Score responses, train a random forest. Across 4 engines (@vllm_project, @sgl_project, TensorRT-LLM, LMDeploy), 6 attention backends, 3 GPUs (@nvidia H100/A100/L4) on Llama-3.2 (@AIatMeta) + Qwen (@Alibaba_Qwen): • T=0: 100% ID on engine, attention, GPU • T=0.6: 76% engine / 58% attention / 59% GPU • 113 prompts is enough • Survives unseen batch sizes + app prompts (99.7%) Why prod leaders should care: pin your engine and an attacker aims CVE-2026-22778 (vLLM RCE) or CVE-2026-5760 (SGLang RCE) at it. "Just unify the kernels" isn't a defense — unified stacks add 100%+ latency. Real mitigations cost utility: noise breaks deterministic decode, rate-limits are Sybil-evadable. Determinism is now a security property. https://t.co/rclQJfiaed

guifav's tweet photo. Your LLM API is leaking its stack — not the model, the stack underneath.

@mlsec at @bifoldberlin / @TUBerlin (arXiv:2605.29979, May 28 2026): floating-point non-associativity is a side channel. Same model, same prompt, different engine / attention / GPU → different tokens. From chat output only — no logits.

Example. Qwen2.5-7B, greedy, "Did England lose a 1966 World Cup game?":
• SGLang + FlashAttention + A100 → P(Yes)=0.378
• SGLang + FlashInfer + H100 → P(Yes)=0.500
• TensorRT-LLM + FlashInfer + H100 → P(Yes)=0.562

The attack: four prompt sets, each probing one stage:
1) Rare-token recall → autoregressive decode
2) Yes/No → prefill
3) Numeric ID in long context → chunked prefill
4) "Repeat N times" → KV-cache reuse

Score responses, train a random forest.

Across 4 engines (@vllm_project, @sgl_project, TensorRT-LLM, LMDeploy), 6 attention backends, 3 GPUs (@nvidia H100/A100/L4) on Llama-3.2 (@AIatMeta) + Qwen (@Alibaba_Qwen):

• T=0: 100% ID on engine, attention, GPU
• T=0.6: 76% engine / 58% attention / 59% GPU
• 113 prompts is enough
• Survives unseen batch sizes + app prompts (99.7%)

Why prod leaders should care: pin your engine and an attacker aims CVE-2026-22778 (vLLM RCE) or CVE-2026-5760 (SGLang RCE) at it. "Just unify the kernels" isn't a defense — unified stacks add 100%+ latency.

Real mitigations cost utility: noise breaks deterministic decode, rate-limits are Sybil-evadable.

Determinism is now a security property.

https://t.co/rclQJfiaed

Guilherme Favaron

@guifav

8 days ago

Your "temperature 0" LLM service is not deterministic. Same prompt, different batch, different tokens. @thinkymachines's fix: rewrite every matmul/attention/RMSNorm kernel as batch-invariant. Always correct. Always slow — +119% to +143% over BF16. LLM-42's fix: re-decode every token under a deterministic verifier. Also correct. Still slow — +50% to +70%. New from @uconn_cse + @ucdavis (arXiv:2605.30218, May 28 2026): MarginGate. Three measured facts: 1) Across 5 open-weight LLMs + MATH500 / GSM8K / HumanEval — only 0.3% to 1.3% of greedy decode steps actually flip when the batch changes. Over 98.7% are already stable. 2) Before a flip, the K/V cache shows no drift. The branch point is local to the current column. 3) Flipped steps share a fingerprint: top-1 / top-2 logit margin near zero. Stable steps have N(Δ)≈1; divergent steps have N(Δ)≈2. Policy: stay on the BF16 fast path; invoke the @vllm_project-style deterministic verifier only when the margin gate fires; on a confirmed mismatch, overwrite a single K/V column. Results on Llama-3.1-8B (@AIatMeta) + Qwen2.5-14B: • 100% sequence-level determinism • 18.56% / 15.05% verifier triggers • 2.23× / 1.99× lower latency overhead than always-on LLM-42 • MATH500-calibrated, transfers to GSM8K / SharedGPT / HumanEval For tech leaders: determinism in prod isn't a kernel rewrite, it's a 1% problem. Pay verifier cost only on steps the model already flags as risky. https://t.co/lwOmofHq03

guifav's tweet photo. Your "temperature 0" LLM service is not deterministic. Same prompt, different batch, different tokens.

@thinkymachines's fix: rewrite every matmul/attention/RMSNorm kernel as batch-invariant. Always correct. Always slow — +119% to +143% over BF16.

LLM-42's fix: re-decode every token under a deterministic verifier. Also correct. Still slow — +50% to +70%.

New from @uconn_cse + @ucdavis (arXiv:2605.30218, May 28 2026): MarginGate. Three measured facts:

1) Across 5 open-weight LLMs + MATH500 / GSM8K / HumanEval — only 0.3% to 1.3% of greedy decode steps actually flip when the batch changes. Over 98.7% are already stable.

2) Before a flip, the K/V cache shows no drift. The branch point is local to the current column.

3) Flipped steps share a fingerprint: top-1 / top-2 logit margin near zero. Stable steps have N(Δ)≈1; divergent steps have N(Δ)≈2.

Policy: stay on the BF16 fast path; invoke the @vllm_project-style deterministic verifier only when the margin gate fires; on a confirmed mismatch, overwrite a single K/V column.

Results on Llama-3.1-8B (@AIatMeta) + Qwen2.5-14B:
• 100% sequence-level determinism
• 18.56% / 15.05% verifier triggers
• 2.23× / 1.99× lower latency overhead than always-on LLM-42
• MATH500-calibrated, transfers to GSM8K / SharedGPT / HumanEval

For tech leaders: determinism in prod isn't a kernel rewrite, it's a 1% problem. Pay verifier cost only on steps the model already flags as risky.

https://t.co/lwOmofHq03

Guilherme Favaron

@guifav

8 days ago

@claudeai @grok what's the difference between Agentic coding and Agentic terminal coding?

Guilherme Favaron

@guifav

9 days ago

The thing slowing down RLVR for reasoning isn't the reward. It's the baseline. New from @USTCGlobal + @LSEStatistics + @oxengsci (arXiv:2605.27293, 26 May 2026): BASIS — Batchwise Advantage estimation from Single-rollout Information Sharing. Tech behind the tech: PPO pays a critic network. GRPO pays 8 rollouts per prompt to estimate one value. REINFORCE++ shares ONE global mean across the batch — cheap, but noisy. BASIS keeps the single-rollout budget and replaces the baseline with a per-prompt weighted average of OTHER prompts' rewards in the same batch. Weights from a Best Linear Unbiased Estimator (Prop. 1): w_ij ∝ V_i V_j / σ_j², w_ii = 0. Seed V_i with an offline KL-regularized estimate from @deepseek_ai-style reference rollouts, recalibrate β online. Value estimation on @Alibaba_Qwen Qwen2.5-Math-7B: • −69% MSE vs REINFORCE++ with G=1 • lower MSE than GRPO with G=8 Policy optimization on Qwen3-4B / DAPO-Math-17K, avg of AIME 24/25, AMC, MATH-500, Minerva, Olympiad, HMMT: • GRPO+BASIS G=1, 150 steps: 47.3% in 8.3h • GRPO G=8, 300 steps: 47.1% in 15.5h • Matches the 8-rollout baseline at ~half the wall clock. • +4.2 pp over GRPO+REINFORCE++ at −9.4h. For tech leaders: cheaper RLVR isn't a rollout problem — it's a baseline problem. And the information you need is already in the batch. https://t.co/EJqm1csXL0

guifav's tweet photo. The thing slowing down RLVR for reasoning isn't the reward. It's the baseline.

New from @USTCGlobal + @LSEStatistics + @oxengsci (arXiv:2605.27293, 26 May 2026): BASIS — Batchwise Advantage estimation from Single-rollout Information Sharing.

Tech behind the tech: PPO pays a critic network. GRPO pays 8 rollouts per prompt to estimate one value. REINFORCE++ shares ONE global mean across the batch — cheap, but noisy.

BASIS keeps the single-rollout budget and replaces the baseline with a per-prompt weighted average of OTHER prompts' rewards in the same batch. Weights from a Best Linear Unbiased Estimator (Prop. 1): w_ij ∝ V_i V_j / σ_j², w_ii = 0. Seed V_i with an offline KL-regularized estimate from @deepseek_ai-style reference rollouts, recalibrate β online.

Value estimation on @Alibaba_Qwen Qwen2.5-Math-7B:
• −69% MSE vs REINFORCE++ with G=1
• lower MSE than GRPO with G=8

Policy optimization on Qwen3-4B / DAPO-Math-17K, avg of AIME 24/25, AMC, MATH-500, Minerva, Olympiad, HMMT:
• GRPO+BASIS G=1, 150 steps: 47.3% in 8.3h
• GRPO G=8, 300 steps: 47.1% in 15.5h
• Matches the 8-rollout baseline at ~half the wall clock.
• +4.2 pp over GRPO+REINFORCE++ at −9.4h.

For tech leaders: cheaper RLVR isn't a rollout problem — it's a baseline problem. And the information you need is already in the batch.

https://t.co/EJqm1csXL0

Guilherme Favaron

@guifav

10 days ago

The training objective that makes your MoE great at training is silently slowing it down at inference. New ICML 2026 from @Beihang1952 + @Huawei (arXiv:2605.27081, 26 May 2026): ReMoE — Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference. Tech behind the tech: standard MoE training adds a load-balancing loss to spread tokens across experts. Great for expert-parallel training. Bad for single-request decoding on a phone or Jetson, where few experts fit in fast memory and the rest live on UFS / SSD. Every step pulls a different expert set; every step thrashes the cache. ReMoE fixes this without touching architecture, weights, or kernels. Freeze everything except the gate; fine-tune the router with two losses: a temporal-locality term (reuse mass over the previous step's Top-K) and a Trust-KL anchor to the pretrained router so semantics don't drift. Top-K is non-differentiable, so the surrogate is the mass Pt puts on last step's set, stop_grad on it. Bound: under recency cache with C ≥ K, avg fetches ≤ K(1 − EOR). Results on @deepseek_ai V2-Lite and @Alibaba_Qwen 1.5-MoE-A2.7B: • +26–27% expert overlap, no downstream regression • @vllm_project GPU-CPU offload: +8.4% throughput • @ggerganov llama.cpp on Jetson Orin NX: −44–50% TPOT, 1.77–1.99× decode speedup For tech leaders: edge MoE serving is no longer a FLOPS problem. It is an expert-I/O problem, and the upstream lever is the router itself. https://t.co/yNooGUOpMi

guifav's tweet photo. The training objective that makes your MoE great at training is silently slowing it down at inference.

New ICML 2026 from @Beihang1952 + @Huawei (arXiv:2605.27081, 26 May 2026): ReMoE — Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference.

Tech behind the tech: standard MoE training adds a load-balancing loss to spread tokens across experts. Great for expert-parallel training. Bad for single-request decoding on a phone or Jetson, where few experts fit in fast memory and the rest live on UFS / SSD. Every step pulls a different expert set; every step thrashes the cache.

ReMoE fixes this without touching architecture, weights, or kernels. Freeze everything except the gate; fine-tune the router with two losses: a temporal-locality term (reuse mass over the previous step's Top-K) and a Trust-KL anchor to the pretrained router so semantics don't drift. Top-K is non-differentiable, so the surrogate is the mass Pt puts on last step's set, stop_grad on it.

Bound: under recency cache with C ≥ K, avg fetches ≤ K(1 − EOR).

Results on @deepseek_ai V2-Lite and @Alibaba_Qwen 1.5-MoE-A2.7B:
• +26–27% expert overlap, no downstream regression
• @vllm_project GPU-CPU offload: +8.4% throughput
• @ggerganov llama.cpp on Jetson Orin NX: −44–50% TPOT, 1.77–1.99× decode speedup

For tech leaders: edge MoE serving is no longer a FLOPS problem. It is an expert-I/O problem, and the upstream lever is the router itself.

https://t.co/yNooGUOpMi

Guilherme Favaron

@guifav

11 days ago

The hidden bottleneck in agentic AI isn't the model. It's the harness around the model. New from @ShangdingG95714 at @berkeley_ai (arXiv:2605.26112, 25 May 2026): once your foundation model is wired to tools, terminals, repos and memory, performance is no longer set by next-token prediction. It is set by a system of six components: R reasoning, M memory, C context, S skill router, O orchestration, G verification & governance. P_H = Φ(R, M, C, S, O, G). Model scaling improves R. System scaling improves M, C, S, O, G. Once R clears a capability threshold, leverage moves down the stack. Three bottlenecks — and three failure modes already familiar in production: • Context governance: bigger windows do not buy you better attention. Failure: exposure without access. Move: per-turn assembly as a selection policy. • Trustworthy memory: storage is not the hard part. Failure: stale-but-confident. Move: per-entry confidence + recency, verified vs the environment. • Dynamic skill routing: confident-but-unchecked subagents accumulate as silent debt. Move: learned routing + post-condition checks per skill. Empirical anchor: in BrowseComp multi-agent runs, token usage alone explains 80% of variance; +tool calls +model choice → 95%. Compute allocation, not model identity, dominates. For tech leaders: stop reporting one-shot accuracy. Measure trajectory quality, memory hygiene, drift across sessions. https://t.co/7K7qsc6KHi

guifav's tweet photo. The hidden bottleneck in agentic AI isn't the model. It's the harness around the model.

New from @ShangdingG95714 at @berkeley_ai (arXiv:2605.26112, 25 May 2026): once your foundation model is wired to tools, terminals, repos and memory, performance is no longer set by next-token prediction. It is set by a system of six components: R reasoning, M memory, C context, S skill router, O orchestration, G verification & governance.

P_H = Φ(R, M, C, S, O, G).

Model scaling improves R. System scaling improves M, C, S, O, G. Once R clears a capability threshold, leverage moves down the stack.

Three bottlenecks — and three failure modes already familiar in production:

• Context governance: bigger windows do not buy you better attention. Failure: exposure without access. Move: per-turn assembly as a selection policy.
• Trustworthy memory: storage is not the hard part. Failure: stale-but-confident. Move: per-entry confidence + recency, verified vs the environment.
• Dynamic skill routing: confident-but-unchecked subagents accumulate as silent debt. Move: learned routing + post-condition checks per skill.

Empirical anchor: in BrowseComp multi-agent runs, token usage alone explains 80% of variance; +tool calls +model choice → 95%. Compute allocation, not model identity, dominates.

For tech leaders: stop reporting one-shot accuracy. Measure trajectory quality, memory hygiene, drift across sessions.

https://t.co/7K7qsc6KHi

Guilherme Favaron

@guifav

11 days ago

@elonmusk Ok. Thanks for sharing. Nothing to do until it's publicly released. Not sure about the relationship between grok and cursor. Please detail it.

Guilherme Favaron

@guifav

12 days ago

The reason FlashAttention beat naive attention was never about FLOPs. It was about I/O — bytes moved between HBM and SRAM. That insight rewired every modern serving stack. New from @AleksandrosSob1, Pál András Papp and @zouzias at Huawei Zurich (arXiv:2605.23751, 22 May 2026): FlashAttention is not the floor. Its I/O cost is Θ(n²·d²/M) — quadratic in sequence length. The trivial lower bound only needs Ω(n·d): just read the inputs, write the output. That gap is not closed by smarter tiling. It is closed by *what* you compute. Building on Alman & Song's polynomial-approximation algorithm (NeurIPS'23), the authors prove tight upper *and* lower I/O bounds for approximate attention in the red-blue pebble model. The mechanism: a degree-g polynomial replaces softmax, so A ≈ D⁻¹·U₁(U₂ᵀV). The non-linearity disappears, the two matmuls reorder, and the cost becomes almost-linear in n across most cache regimes. • Case I (M = Ω(d·r), r = n^o(1)): Θ(n·d) — matches the trivial lower bound. FlashAttention is Θ(n²·d²/M). The gap grows with n. • Cases II–IV: still almost-linear in n; lower bounds tight up to constants when g = O(1). Honest caveat: r = C(d+g, g) is combinatorial, so cache pressure is real and a hardware-grade kernel is future work. For tech leaders running long-context inference: the lesson isn't "swap FlashAttention". The next-decade bottleneck is the I/O cost of exactness — and there is now a proven near-linear floor below it. https://t.co/hbmIcQKZzg

guifav's tweet photo. The reason FlashAttention beat naive attention was never about FLOPs. It was about I/O — bytes moved between HBM and SRAM. That insight rewired every modern serving stack.

New from @AleksandrosSob1, Pál András Papp and @zouzias at Huawei Zurich (arXiv:2605.23751, 22 May 2026): FlashAttention is not the floor. Its I/O cost is Θ(n²·d²/M) — quadratic in sequence length. The trivial lower bound only needs Ω(n·d): just read the inputs, write the output.

That gap is not closed by smarter tiling. It is closed by *what* you compute.

Building on Alman & Song's polynomial-approximation algorithm (NeurIPS'23), the authors prove tight upper *and* lower I/O bounds for approximate attention in the red-blue pebble model. The mechanism: a degree-g polynomial replaces softmax, so A ≈ D⁻¹·U₁(U₂ᵀV). The non-linearity disappears, the two matmuls reorder, and the cost becomes almost-linear in n across most cache regimes.

• Case I (M = Ω(d·r), r = n^o(1)): Θ(n·d) — matches the trivial lower bound. FlashAttention is Θ(n²·d²/M). The gap grows with n.
• Cases II–IV: still almost-linear in n; lower bounds tight up to constants when g = O(1).

Honest caveat: r = C(d+g, g) is combinatorial, so cache pressure is real and a hardware-grade kernel is future work.

For tech leaders running long-context inference: the lesson isn't "swap FlashAttention". The next-decade bottleneck is the I/O cost of exactness — and there is now a proven near-linear floor below it.

https://t.co/hbmIcQKZzg

165

guifav retweeted

Linas Beliūnas

@linasbeliunas

12 days ago

AI generated episode of The Office with Karpathy in it is the best thing we’ve seen this year.

Guilherme Favaron

@guifav

13 days ago

The frontier of LLM serving is not the model — it is the memory the model edits while it decodes. Softmax attention keeps an unbounded KV cache. Linear attention compresses it into a fixed-size recurrent state — constant-memory decoding, linear-time prefill. The price: every association shares that finite space, so what gets overwritten matters more than what gets written. New from @ahatamiz1, @YejinChoinka and @jankautz at @NVIDIAAI (arXiv:2605.22791, 21 May 2026): Gated DeltaNet-2 unties a quiet bottleneck shared by Gated DeltaNet, KDA and Mamba-3 — one scalar β_t controlling two different decisions every step. The tie: how much old content to erase on the key side, and how much new content to commit on the value side. β_t was doing both. GDN-2 splits them into a channel-wise erase gate b_t and a channel-wise write gate w_t. Each coordinate of compressed memory negotiates its own overwrite policy. Recovers KDA when b_t = w_t = β_t·1. Same chunkwise WY kernel, gate-aware backward, no throughput tax. 1.3B params, 100B FineWeb-Edu tokens, vs Mamba-2 / GDN / KDA / Mamba-3: • Commonsense avg: 53.11 recurrent / 53.97 hybrid — best in class • MK-NIAH-1 @4K (multi-key retrieval): 37.8 recurrent / 48.0 hybrid — top • S-NIAH-3 @4K: 99.6 recurrent If your serving stack lives or dies by long-context recall on a fixed memory budget, the lesson is operational: stop sharing one knob between erase and write. https://t.co/hD7PNkjYMH code: https://t.co/wYcYN0IShn

guifav's tweet photo. The frontier of LLM serving is not the model — it is the memory the model edits while it decodes.

Softmax attention keeps an unbounded KV cache. Linear attention compresses it into a fixed-size recurrent state — constant-memory decoding, linear-time prefill. The price: every association shares that finite space, so what gets overwritten matters more than what gets written.

New from @ahatamiz1, @YejinChoinka and @jankautz at @NVIDIAAI (arXiv:2605.22791, 21 May 2026): Gated DeltaNet-2 unties a quiet bottleneck shared by Gated DeltaNet, KDA and Mamba-3 — one scalar β_t controlling two different decisions every step.

The tie: how much old content to erase on the key side, and how much new content to commit on the value side. β_t was doing both. GDN-2 splits them into a channel-wise erase gate b_t and a channel-wise write gate w_t. Each coordinate of compressed memory negotiates its own overwrite policy. Recovers KDA when b_t = w_t = β_t·1. Same chunkwise WY kernel, gate-aware backward, no throughput tax.

1.3B params, 100B FineWeb-Edu tokens, vs Mamba-2 / GDN / KDA / Mamba-3:
• Commonsense avg: 53.11 recurrent / 53.97 hybrid — best in class
• MK-NIAH-1 @4K (multi-key retrieval): 37.8 recurrent / 48.0 hybrid — top
• S-NIAH-3 @4K: 99.6 recurrent

If your serving stack lives or dies by long-context recall on a fixed memory budget, the lesson is operational: stop sharing one knob between erase and write.

https://t.co/hD7PNkjYMH
code: https://t.co/wYcYN0IShn

Guilherme Favaron

@guifav

14 days ago

Test-time compute only makes a model smarter when its inner dynamics actually converge to the right basin. Otherwise you are just paying for the same wrong answer, faster. New paper from @ZhengyangGeng, Benhao Huang and @zicokolter at @SCSatCMU (arXiv:2605.21488, 20 May 2026, ICML 2026) makes this concrete. Equilibrium Reasoners reframe iterative reasoning as a learned fixed-point system z_{k+1} = f_θ(z_k; x). Training shapes an attractor landscape; inference descends it. Stable fixed points = valid solutions. Two task-agnostic interventions on top of a weight-tied iterative backbone: 1) Randomized initialization (sample z₀ per trajectory, not fixed) so breadth at test time hits diverse basins. 2) Noise injection (β·ε at each step) so depth at test time escapes narrow / spurious attractors. One 5M-param block, trained at 16 iterations, unrolled to ~1,024 iters at test time. That is ~40,000 effective layers. Sudoku-Extreme: feedforward 2.6 → HRM 55.0 → TRM 84.8 → EqR 99.8 Maze-Unique: feedforward 0.0 → TRM 44.9 → URM 51.4 → EqR 93.0 Operational gift for production: residual ‖f_θ(z;x) − z‖ tracks accuracy. Restarts can be ranked without a verifier. NFE = D × B is the real compute knob. If the inner loop is your product (planners, code agents, multi-hop QA), train the dynamics — not just the output. https://t.co/pnzYoXx61Z code: https://t.co/bBGos2O3Bk

$guifav's tweet photo. Test-time compute only makes a model smarter when its inner dynamics actually converge to the right basin. Otherwise you are just paying for the same wrong answer, faster. New paper from @ZhengyangGeng, Benhao Huang and @zicokolter at @SCSatCMU (arXiv:2605.21488, 20 May 2026, ICML 2026) makes this concrete. Equilibrium Reasoners reframe iterative reasoning as a learned fixed-point system z_{k+1} = f_θ(z_k; x). Training shapes an attractor landscape; inference descends it. Stable fixed points = valid solutions. Two task-agnostic interventions on top of a weight-tied iterative backbone: 1) Randomized initialization (sample z₀ per trajectory, not fixed) so breadth at test time hits diverse basins. 2) Noise injection (β·ε at each step) so depth at test time escapes narrow / spurious attractors. One 5M-param block, trained at 16 iterations, unrolled to ~1,024 iters at test time. That is ~40,000 effective layers. Sudoku-Extreme: feedforward 2.6 → HRM 55.0 → TRM 84.8 → EqR 99.8 Maze-Unique: feedforward 0.0 → TRM 44.9 → URM 51.4 → EqR 93.0 Operational gift for production: residual ‖f_θ(z;x) − z‖ tracks accuracy. Restarts can be ranked without a verifier. NFE = D × B is the real compute knob. If the inner loop is your product (planners, code agents, multi-hop QA), train the dynamics — not just the output. https://t.co/pnzYoXx61Z code: https://t.co/bBGos2O3Bk$

Guilherme Favaron

@guifav

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users