We are offering grants of $100,000 + Tinker credits to researchers advancing the field of human-AI interactivity. Submit your proposals by June 19th!
https://t.co/907HfBy7g3
@amritwt It's really good for hard, technical problems. Does planning and iteration very well. Can get stuck and take time, but definitely delivers working solutions most of the time.
Claude/Opus is great for execution speed.
Towards Building efficient Routed systems for Retrieval
Introduces a routing-based approach that dynamically selects the most informative query representation in late-interaction models, achieving up to 30x speedup while maintaining performance.
📝 https://t.co/QZhT1iPWnY
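For readers new to late interaction, here is a minimal sketch of the standard MaxSim scoring used by ColBERT-style retrievers. It is not the paper's routing method, which the post does not spell out. It only shows why the number of query vectors matters: cost grows with (query vectors) x (document vectors), so scoring with fewer query representations does less work. The function names and toy vectors are made up for illustration.

```python
from math import sqrt


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: each query vector takes its best-matching
    document vector, and the per-query maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D "token embeddings" (hypothetical data).
query = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
doc_a = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
doc_b = [normalize(v) for v in [[1.0, 1.0]]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here doc_a contains an exact match for each query vector, so it scores higher than doc_b, which only partially matches both.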
🔉 Introducing SAM Audio, the first unified model that isolates any sound from complex audio mixtures using text, visual, or span prompts.
We’re sharing SAM Audio with the community, along with a perception encoder model, benchmarks and research papers, to empower others to explore new forms of expression and build applications that were previously out of reach.
🔗 Learn more: https://t.co/FPnfv66UCP
🏆 We are incredibly honored to announce that our paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" has received the NeurIPS 2025 Best Paper Award!
A huge congratulations to our dedicated research team for pushing the boundaries of AI.
Read more: https://t.co/qu3ERa3pH5
You don't need to buy a GPU to master CUDA.
> Make an account on Tensara.
> Get access to GPUs like H100, A100 for free.
> Practice some of the impactful kernels on the platform.
All you need is a learning mindset.
Speed and quality can finally coexist in diffusion-based language generation.
Introducing DiDi-Instruct, a Discrete Diffusion Divergence Instruct method that distills a pre-trained discrete diffusion language model (dLLM) into a few-step student for ultra-fast generation.
Built on integral KL-divergence minimization, DiDi-Instruct achieves up to 64× faster decoding, surpasses both its teacher and GPT-2, and cuts training time by 20×.
Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct
Paper: https://t.co/fHeTZ3ThK2
Code: https://t.co/x5NNw2F0F2
Project: https://t.co/v5IsqghdDz
Our report: https://t.co/LHL3BHCWij
📬 #PapersAccepted by Jiqizhixin
upcoming @GPU_MODE talk on how flash attention 4 was optimized for Blackwell GPUs, e.g. B200s! this is super timely given all the new DSLs and features NVIDIA has been releasing
charles will be live Wednesday, 1pm PST on the YT channel, so make sure to be there to ask any Qs!
“What I cannot create I do not understand” - This is why I started Penny, my own version of NCCL.
Today I'm releasing the first part of a worklog on building it. It explains GPU communication and shows progress on coding a fast AllReduce (inter- and intranode) algorithm using NVSHMEM 🧵
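As background on what an AllReduce computes, here is a small pure-Python simulation of the textbook ring all-reduce (reduce-scatter followed by all-gather). This is an assumption-laden sketch, not Penny's code and not NVSHMEM: each "rank" is just a Python list, the "send" is a copy, and the function name is hypothetical.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) ranks.

    Every rank starts with its own buffer and ends with the element-wise
    sum of all buffers. Buffer length must be divisible by the rank count.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must split evenly into n chunks"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At each step rank r sends one chunk to its
    # right neighbour, which adds it to its own copy. After n-1 steps,
    # rank r holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [((r - step) % n, None) for r in range(n)]
        # Snapshot payloads first so all sends happen "simultaneously".
        sends = [(c, [data[r][i] for i in span(c)])
                 for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Phase 2: all-gather. Reduced chunks circulate around the ring and
    # overwrite the stale partial sums on each receiver.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in span(c)])
                 for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v

    return data


result3 = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
result2 = ring_all_reduce([[1, 2, 3, 4], [5, 6, 7, 8]])
print(result3)
print(result2)
```

Each rank sends and receives 2 * (n - 1) chunks in total, which is why the ring variant is popular when bandwidth is the bottleneck. Real libraries choose between several algorithms depending on message size and topology.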
Happens at least once every single week..
Person X: We’re doing $4M in ARR
(I go to the website, check pricing page, see they have no subscription options 🤔)
Me: What part of your revenue is recurring?
X: None - we’re usage based.
Me: Ok. So what was your revenue last month? Money that *actually* came to your bank account.
X: $30k
Me: Wait, so how is that $4M in annualized revenue run rate - which is not ARR btw..
X shares some absurd math on how it translates and upsell opportunities in pilots.
Me: Hmm, fine, what are your margins? How much of it is pass through to the large labs if you’re not hosting your own?
X (evasively): Umm, we’re not at liberty to share that today.
The end.
Founders, please please don’t do this today. I’m not sure who’s training you but just know that investors are NOT that gullible.
Or maybe they are 🤷‍♂️
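The arithmetic behind the exchange above, as a quick sketch. It assumes the simplest run-rate convention (last month's revenue times 12); the function name is made up for illustration, and the two dollar figures are the ones quoted in the conversation.

```python
def annualized_run_rate(monthly_revenue):
    """Naive annualized run rate: last month's revenue times 12."""
    return monthly_revenue * 12


last_month_revenue = 30_000   # cash that actually came in, per the founder
claimed_arr = 4_000_000       # the number in the pitch

run_rate = annualized_run_rate(last_month_revenue)
overstatement = claimed_arr / run_rate

print(f"Run rate: ${run_rate:,}")
print(f"Claim is {overstatement:.1f}x the run rate")
```

Under that assumption, $30k a month annualizes to $360k, so the $4M claim is roughly 11x what last month's revenue supports. And since none of it is contractually recurring, even the $360k is run rate, not ARR.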
Traditional QoS-based streaming optimization is hitting its limits.
Enter LingXi(灵犀)—the first large-scale system for personalized adaptive video streaming that optimizes directly for user experience (QoE).
✨ Key ideas:
- Uses exit rate to link QoS metrics with engagement
- Builds a personalized exit rate predictor
- Optimizes via Monte Carlo sampling + Bayesian methods
Deployed on Kuaishou (8% traffic A/B test):
+0.15% total viewing time
+0.1% bitrate
–1.3% stall time (–15% for low-bandwidth users)
How to build vision-language-action models that train fast, run fast & generalize? In our new paper, we formalize & analyze the approach of our π-0.5 model & further improve it with a single-stage recipe.
Blog: https://t.co/IihKEmmxSB
Paper: https://t.co/JfEU7pcoZk
All the recordings for the @GPU_MODE x @scaleml series are up as a playlist in case you missed it 😁
There's so much value in these ~8 hours of lectures, from proving quantization error bounds on a whiteboard to a deep-dive into GPU warp schedulers!
Plz take advantage of it!