Geeta Chauhan

@geeta4c

Applied AI, PyTorch

Joined March 2009

382 Following

295 Followers

105 Posts

Geeta Chauhan

@geeta4c

about 16 hours ago

FrontierCode from @cognition isn't just another coding benchmark—it's a maintainer reality check. Instead of "does it pass tests?", it asks: "Would a senior dev actually merge this PR?" Real tasks from 20+ OSS maintainers across 36 repos. Multi-axis grading on quality, scope discipline, test rigor, regressions, and style. Result? Even Claude Opus 4.8 scrapes only ~13.4% on the hardest Diamond set. GPT-5.5 trails at 6.3%. https://t.co/urGZoiVElF Strong work exposing how far we still are.

Cognition @cognition

1 day ago

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

Geeta Chauhan

@geeta4c

about 16 hours ago

@cognition Excellent work pulling together real tasks from 20+ OSS maintainers and defining "success" as actual mergeability (quality, scope, tests, regressions, taste). A step in the right direction for advancing the field with real-world production-code benchmarks.

Geeta Chauhan

@geeta4c

2 days ago

The DGX Spark's 128GB unified memory is its strength and its risk: with no separate VRAM, a GPU out-of-memory event can take down the whole system. A practical guide to hardening it for stable AI workloads - kernel tuning, OOM scoring, earlyoom, and emergency SSH: https://t.co/LpfE0fKmpZ

Geeta Chauhan

@geeta4c

11 days ago

Async-RL becomes practical with DeltaSync in TRL + HF Buckets.

In bf16 RL at low LR, ~99% of weights don't change per step: most Adam updates are invisible below the mantissa threshold. So why ship entire 1T-param checkpoints? Sparse deltas via safetensors using HF Hub Buckets → 50-65x smaller payloads.

Trainer on one box, vLLM rollouts on another (different regions even), synced through cheap object storage. No more NCCL tax or co-location hell. This kills the synchronization bottleneck that made async online RL painful at scale.

@FireworksAI_HQ / @cursor_ai proved the closed version; @huggingface just open-sourced the practical one. https://t.co/oLNtdDytS4

clem 🤗

@ClementDelangue

12 days ago

The HF science team just made async RL weight sync ~100x cheaper on bandwidth, and you don't need a shared cluster anymore.

The problem: every RL step, the trainer typically has to sync fresh weights to the inference engine. for a 7B in bf16 that's ~14GB. for a frontier 1T fp8 checkpoint, that's ~1TB; in bf16 it would be ~2TB. per sync.

The insight: between two RL steps, ~99% of bf16 weights are bit-identical. at RL learning rates, the optimizer is whispering and bf16 literally cannot hear most of it. the stored bf16 bits don't change.

What they shipped in TRL: only the changed elements get encoded as a sparse safetensors file, dropped into a Hugging Face Bucket, and fetched by vLLM. on Qwen3-0.6B, per-step payload goes from 1.2 GB to 20 to 35 MB. This is exactly what we built Buckets for: S3-like object storage on the Hub, Xet-backed (so even full snapshots only transfer the changed chunks).

The cherry on top: we ran a FULL disaggregated training where:
- the trainer lived on one box
- vLLM ran inside a Hugging Face Space
- the Wordle environment ran in another Space
- weights flowed through one Hub bucket

no shared cluster. no RDMA. no VPN. no NCCL across clouds. just HTTPS and a bucket.

one GPU + a Hugging Face account is now enough to do real disaggregated RL. multi-replica inference fleets across regions become a small devops exercise, not a research project.

Full write-up: https://t.co/CG115IjT0q

Open source RL keeps eating the moat!
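To make the "bf16 cannot hear most of it" point concrete, here is a minimal pure-Python sketch. It is not the TRL or safetensors implementation: the helper names (`to_bf16_bits`, `sparse_delta`, `apply_delta`, `checkpoint_gb`) are hypothetical, the weights are toy values, and bf16 is simulated by rounding float32 bit patterns to their top 16 bits. It shows that a tiny update leaves the stored bits unchanged, so only the elements whose bits differ need to be shipped.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Round a float to bfloat16 (round-to-nearest-even) and return its 16-bit pattern.

    Illustrative only: NaN/inf edge cases are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)           # ties go to the even neighbour
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def sparse_delta(old_bits, new_bits):
    """Map index -> new bit pattern for every element whose stored bits changed."""
    return {i: n for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n}


def apply_delta(old_bits, delta):
    """Rebuild the new weights on the receiving side from old weights plus the delta."""
    out = list(old_bits)
    for i, bits in delta.items():
        out[i] = bits
    return out


def checkpoint_gb(num_params: float, bytes_per_param: int) -> float:
    """Size of a dense checkpoint in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Toy weights: three receive a 1e-6 update, one receives a large 1e-2 update.
before = [0.5, -1.25, 0.03125, 2.0]
after = [0.5 + 1e-6, -1.25 + 1e-6, 0.03125 + 1e-2, 2.0 + 1e-6]

old_bits = [to_bf16_bits(w) for w in before]
new_bits = [to_bf16_bits(w) for w in after]
delta = sparse_delta(old_bits, new_bits)

print(sorted(delta))            # indices whose bf16 bits actually changed
print(checkpoint_gb(7e9, 2))    # dense 7B bf16 checkpoint, in GB
```

In this toy case only the element with the large update appears in the delta; the three tiny updates round back to the same bf16 bit patterns. A real pipeline would also need an index encoding and a file format for the delta, which this sketch leaves out.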

Geeta Chauhan

@geeta4c

11 days ago

580 TPS on Qwen3.5-397B-A17B for agentic workloads on B200. Instead of optimizing a general-purpose engine, LightSeek built TokenSpeed from first principles around how agents actually use models — hybrid attention with safe KV + linear state management, high hit-rate prefix caching across turns, and heavy focus on decode efficiency. They also split CPU work into SMG so the GPU isn’t blocked by tokenization and tool calling. Clean architectural separation. This is what a purpose-built agentic inference stack looks like. https://t.co/cogRsx06Nx

PyTorch

@PyTorch

13 days ago

The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs. In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 https://t.co/Qr1PTIhqok This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI
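The "prefix caching across turns" idea mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not TokenSpeed's implementation: the function names, the block size of 4, and the use of a plain Python set are all hypothetical. In an agent loop each turn usually resends the whole conversation so far, so a block-aligned prefix of the new prompt is already cached and only the tail needs fresh prefill.

```python
def insert_prefixes(cache: set, tokens: list, block: int = 4) -> None:
    """Record every block-aligned prefix of `tokens` as cached.

    Storing whole prefixes as tuples is simple but memory-hungry; production
    systems typically store chained per-block hashes instead.
    """
    for end in range(block, len(tokens) + 1, block):
        cache.add(tuple(tokens[:end]))


def cached_prefix_len(cache: set, tokens: list, block: int = 4) -> int:
    """Return how many leading tokens of `tokens` are covered by cached blocks."""
    hit = 0
    for end in range(block, len(tokens) + 1, block):
        if tuple(tokens[:end]) in cache:
            hit = end
        else:
            break  # the first miss ends the reusable prefix
    return hit


cache: set = set()

# Turn 1: a 10-token prompt is processed and its prefixes are cached.
turn1 = list(range(10))
insert_prefixes(cache, turn1)

# Turn 2: the agent resends turn 1 plus 6 new tokens (e.g. a tool result).
turn2 = turn1 + [100, 101, 102, 103, 104, 105]
reused = cached_prefix_len(cache, turn2)
print(f"reused {reused} of {len(turn2)} tokens")
```

Only full blocks are reusable here, so the last partial block of turn 1 is recomputed along with the new tokens.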

Geeta Chauhan

@geeta4c

18 days ago

Current LLM scaling laws treat inference cost like an afterthought. This paper destroys that blind spot with conditional scaling laws that bake in hidden size and MLP/attn ratio.

Train 200+ models → Panda/Surefire beat LLaMA-3.2 by 2.1% accuracy and 42% faster inference.

Time to stop training suboptimal architectures. For practical inference-efficient LLMs, check out: https://t.co/il631Hms1b

Geeta Chauhan

@geeta4c

19 days ago

@tri_dao Whole transformer = GEMM + epilogue. LLMs writing their own speed-of-light kernels? We just got upgraded. 🔥 Great work @HanGuo97, @tri_dao and team

Geeta Chauhan

@geeta4c

19 days ago

State-of-the-art MoE inference meets production orchestration. 🚀

vLLM’s WideEP + llm-d’s intelligent routing is a powerful combo:
• Super-linear KV cache scaling
• Prefix-aware routing (no more decoder starvation)
• PD Disaggregation with zero-copy NIXL transfers

The well-lit path to high-concurrency, massive-context serving.

Watch the PyTorch Conf talk with Tyler Smith (@tms_jr) & Maroon Ayoub: https://t.co/b6tg0FYx8q

#vLLM #llm_d #MoE
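As a rough illustration of what "prefix-aware routing" means, the sketch below sends each request to the replica that already holds the longest matching token prefix, and breaks ties by current load. It is not llm-d's scheduler: the replica dictionary layout, the `route` function, and the tie-break rule are assumptions made for this example.

```python
def common_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(request_tokens: list, replicas: dict) -> str:
    """Pick the replica with the longest cached prefix; break ties by lowest load.

    `replicas` maps a replica name to {"cached": [token lists], "load": int}.
    This layout is hypothetical and chosen only for readability.
    """
    def score(name: str):
        replica = replicas[name]
        best = max(
            (common_prefix_len(request_tokens, c) for c in replica["cached"]),
            default=0,
        )
        # Sort key: longest prefix first, then lightest load, then name for stability.
        return (-best, replica["load"], name)

    return min(replicas, key=score)


replicas = {
    "a": {"cached": [[1, 2, 3]], "load": 5},
    "b": {"cached": [[1, 2, 9]], "load": 0},
}

print(route([1, 2, 3, 4], replicas))  # replica "a" holds the longer matching prefix
print(route([7, 8], replicas))        # no prefix match anywhere, so load decides
```

A production router would also have to weigh cache hits against queue depth instead of treating prefix length as strictly dominant, which this sketch does not attempt.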
