FrontierCode from @cognition isn't just another coding benchmark—it's a maintainer reality check.
Instead of "does it pass tests?", it asks: "Would a senior dev actually merge this PR?"
Real tasks from 20+ OSS maintainers across 36 repos. Multi-axis grading on quality, scope discipline, test rigor, regressions, and style.
Result? Even Claude Opus 4.8 scrapes only ~13.4% on the hardest Diamond set. GPT-5.5 trails at 6.3%.
https://t.co/urGZoiVElF
Strong work exposing how far we still are.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is the first to measure: would you actually merge this code?
@cognition Excellent work pulling together real tasks from 20+ OSS maintainers and defining "success" as actual mergeability (quality, scope, tests, regressions, taste). A step in the right direction for advancing the field with real-world production-code benchmarks.
The DGX Spark's 128GB unified memory is its strength and its risk: with no separate VRAM, a GPU out-of-memory event can take down the whole system.
A practical guide to hardening it for stable AI workloads - kernel tuning, OOM scoring, earlyoom, and emergency SSH: https://t.co/LpfE0fKmpZ
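For a feel of what "OOM scoring" means in practice, here is a minimal sketch, not taken from the linked guide. It relies on the standard Linux `/proc/<pid>/oom_score_adj` interface (range -1000 to 1000, higher means the kernel OOM killer picks that process first). The helper name `set_oom_score_adj` and the value 500 are illustrative assumptions.

```python
from pathlib import Path

# Valid range for oom_score_adj on Linux.
OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000


def set_oom_score_adj(value: int, path: str = "/proc/self/oom_score_adj") -> int:
    """Clamp `value` to the valid range, write it to `path`, and return it.

    A higher value makes the current process a more likely OOM-kill target,
    which helps keep lighter services (such as sshd) alive under pressure.
    """
    clamped = max(OOM_SCORE_ADJ_MIN, min(OOM_SCORE_ADJ_MAX, value))
    Path(path).write_text(f"{clamped}\n")
    return clamped


# Hypothetical usage at the top of a training or inference script (Linux only):
# set_oom_score_adj(500)
```

Raising the value for your own process normally works without extra privileges; lowering it generally needs elevated ones, so protecting sshd itself is usually done from the service configuration rather than from inside the workload.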
Async-RL becomes practical with DeltaSync in TRL + HF Buckets
In bf16 RL at low LR, ~99% of weights don't change per step — most Adam updates are invisible below the mantissa threshold. So why ship entire 1T-param checkpoints?
Sparse safetensors deltas shipped through HF Hub Buckets → 50-65x smaller payloads. Trainer on one box, vLLM rollouts on another (different regions even), synced through cheap object storage. No more NCCL tax or co-location hell.
This kills the synchronization bottleneck that made async online RL painful at scale. @FireworksAI_HQ / @cursor_ai proved the closed version; @huggingface just open-sourced the practical one.
https://t.co/oLNtdDytS4
The HF science team just made async RL weight sync ~100x cheaper on bandwidth, and you don't need a shared cluster anymore.
The problem: every RL step, the trainer typically has to sync fresh weights to the inference engine. for a 7B in bf16 that's ~14GB. for a frontier 1T fp8 checkpoint, that's ~1TB; in bf16 it would be ~2TB. per sync.
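The numbers above are just parameter count times bytes per parameter. A quick sketch of that arithmetic, using decimal gigabytes (the function name `sync_payload_gb` is made up for illustration):

```python
# Bytes per parameter for the two dtypes mentioned above.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def sync_payload_gb(num_params: float, dtype: str) -> float:
    """Size of one full-checkpoint weight sync, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(sync_payload_gb(7e9, "bf16"))   # 14.0   -> ~14 GB
print(sync_payload_gb(1e12, "fp8"))   # 1000.0 -> ~1 TB
print(sync_payload_gb(1e12, "bf16"))  # 2000.0 -> ~2 TB
```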
The insight: between two RL steps, ~99% of bf16 weights are bit-identical. at RL learning rates, the optimizer is whispering and bf16 literally cannot hear most of it. the stored bf16 bits don't change.
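To see why bf16 "cannot hear" a small update, here is a self-contained sketch that rounds a value to bf16 (round-to-nearest-even) using only the standard library. The weight 0.5 and the update sizes are illustrative assumptions, not numbers from the write-up, and the helper handles finite inputs only.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Round a finite float to bfloat16 and return its 16-bit pattern."""
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even: add 0x7FFF plus the lowest bit that will be kept.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # bf16 is the top 16 bits of the float32 encoding.
    return (bits >> 16) & 0xFFFF


w = 0.5
tiny_update = 1e-5   # far below the bf16 spacing near 0.5 (2**-8)
large_update = 1e-2  # above that spacing

print(to_bf16_bits(w + tiny_update) == to_bf16_bits(w))   # True: stored bits unchanged
print(to_bf16_bits(w + large_update) == to_bf16_bits(w))  # False: stored bits change
```

bf16 keeps 8 significant bits, so near 0.5 adjacent representable values are about 0.004 apart; an update that is orders of magnitude smaller rounds back to the same bit pattern.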
What they shipped in TRL: only the changed elements get encoded as a sparse safetensors file, dropped into a Hugging Face Bucket, and fetched by vLLM. on Qwen3-0.6B, per-step payload goes from 1.2 GB to 20-35 MB. This is exactly what we built Buckets for: S3-like object storage on the Hub, Xet-backed (so even full snapshots only transfer the changed chunks).
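The core idea of a sparse delta can be sketched in a few lines. This is a toy version over flat lists of 16-bit patterns, not the TRL implementation and not the safetensors layout it uses; `encode_delta` and `apply_delta` are hypothetical names.

```python
from typing import List, Tuple


def encode_delta(old: List[int], new: List[int]) -> Tuple[List[int], List[int]]:
    """Return (indices, values) for elements whose 16-bit pattern changed."""
    if len(old) != len(new):
        raise ValueError("old and new must have the same number of elements")
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    values = [new[i] for i in indices]
    return indices, values


def apply_delta(old: List[int], indices: List[int], values: List[int]) -> List[int]:
    """Rebuild the new weights on the inference side from old weights plus delta."""
    patched = list(old)
    for i, v in zip(indices, values):
        patched[i] = v
    return patched


# Four bf16 bit patterns before and after a step; only two of them change.
old = [0x3F00, 0x3F80, 0xBF00, 0x0000]
new = [0x3F00, 0x3F81, 0xBF00, 0x8000]

indices, values = encode_delta(old, new)
print(indices)                                   # [1, 3]
print(apply_delta(old, indices, values) == new)  # True
```

The trainer ships only `indices` and `values`; the inference side patches its local copy. When roughly 99% of elements are unchanged, the delta is a small fraction of the full tensor even after paying for the indices.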
The cherry on top: we ran a FULL disaggregated training where:
- the trainer lived on one box
- vLLM ran inside a Hugging Face Space
- the Wordle environment ran in another Space
- weights flowed through one Hub bucket
no shared cluster. no RDMA. no VPN. no NCCL across clouds. just HTTPS and a bucket.
one GPU + a Hugging Face account is now enough to do real disaggregated RL. multi-replica inference fleets across regions become a small devops exercise, not a research project.
Full write-up: https://t.co/CG115IjT0q
Open source RL keeps eating the moat!
580 TPS on Qwen3.5-397B-A17B for agentic workloads on B200.
Instead of optimizing a general-purpose engine, LightSeek built TokenSpeed from first principles around how agents actually use models — hybrid attention with safe KV + linear state management, high hit-rate prefix caching across turns, and heavy focus on decode efficiency.
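Why prefix caching pays off across agent turns: each turn resends the whole conversation so far, so most of the prompt was already processed. A minimal sketch of block-level prefix matching with a hash chain follows. This is a generic illustration, not TokenSpeed's design; `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are all hypothetical.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4  # illustrative only; real engines use larger blocks


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with the hash of everything before it."""
    hashes: List[str] = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Tracks which block-prefix hashes have been seen (no real KV storage)."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens hit the cache, then cache this request."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self._seen:
                break
            hit_blocks += 1
        self._seen.update(hashes)
        return hit_blocks * BLOCK_SIZE


cache = PrefixCache()
turn1 = list(range(10))                         # first turn: nothing cached yet
turn2 = turn1 + [100, 101, 102, 103, 104, 105]  # second turn extends the first
print(cache.lookup_and_insert(turn1))  # 0
print(cache.lookup_and_insert(turn2))  # 8 (two full blocks of turn1 are reused)
```

Because each hash covers the full prefix, a block only counts as a hit when every token before it also matches, which is what makes reusing its cached state safe.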
They also split CPU work into SMG so the GPU isn’t blocked by tokenization and tool calling.
Clean architectural separation. This is what a purpose-built agentic inference stack looks like.
https://t.co/cogRsx06Nx
The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs.
In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 https://t.co/Qr1PTIhqok
This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI, and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI
Current LLM scaling laws treat inference cost like an afterthought. This paper destroys that blind spot with conditional scaling laws that bake in hidden size and MLP/attn ratio.
Train 200+ models → Panda/Surefire beat LLaMA-3.2 with 2.1% higher accuracy and 42% faster inference.
Time to stop training suboptimal architectures. For practical inference-efficient LLMs, check out: https://t.co/il631Hms1b
@tri_dao Whole transformer = GEMM + epilogue.
LLMs writing their own speed-of-light kernels?
We just got upgraded. 🔥
Great work @HanGuo97, @tri_dao and team