Exciting to see this work out in the wild! 🚀 It’s been great working with such a talented group. My contribution (so far): writing optimized GKA decoding fused kernels exploiting state matrix symmetry for extra speedup⚡. Check out the details in the paper. More to come! 🛠️
Introducing Priming
Hybrid models are faster and cheaper than Transformers to scale. But developing alternative architectures from scratch requires expensive pre-training runs.
Priming solves this by leveraging pre-trained Transformer weights to train equally performant Hybrid models with 2× faster throughput. Builders can now iterate on Hybrid architectures for under 150B tokens, 100× cheaper than pre-training.
1/12
💾🚀 Run Llama-3.1-405B FP8 (410GB) on a single 180GB GPU
#NVIDIA
Introducing FlexTensor — NVIDIA's new library that makes host RAM a transparent extension of your GPU memory. One call: flextensor.offload(model). No model rewrites, no framework changes. Works with vLLM, HuggingFace, and any PyTorch model.
Traditional offloading is reactive — move data when you run out of memory, stall the GPU while you wait. FlexTensor instead profiles your model's layer access patterns, then solves a knapsack optimization to schedule prefetches that overlap with compute. By the time a layer needs its weights, they're already there.
The freed VRAM gives vLLM more room for KV cache — enabling 4x longer contexts (8K→32K) or 4x larger batches. For video generation (Wan2.2-T2V-A14B on GB200): +0.1% overhead. Handles FP8, custom Triton kernels, and multi-GPU. Profiles saved to disk — no warmup on repeated runs.
Check it out: https://t.co/VXmou7AaiO
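To make the knapsack idea in the FlexTensor post concrete, here is a minimal Python sketch. It is not FlexTensor's actual formulation or API: the function name `choose_resident_layers`, the integer layer sizes, and the per-layer stall estimates are all hypothetical. It assumes a profiler has already estimated how much GPU stall each layer would cause if its weights lived in host RAM and arrived late, and it picks which layers to keep in VRAM with a standard 0/1 knapsack.

```python
def choose_resident_layers(sizes_gb, stall_ms, budget_gb):
    """0/1 knapsack over layers (hypothetical helper, not a FlexTensor call).

    sizes_gb  -- integer VRAM footprint of each layer
    stall_ms  -- assumed stall avoided by keeping that layer resident
    budget_gb -- integer VRAM budget for resident weights
    Returns (total stall avoided, tuple of chosen layer indices).
    """
    # best[c] = best (value, chosen layers) using at most c GB of VRAM
    best = [(0.0, ())] * (budget_gb + 1)
    for i, (size, value) in enumerate(zip(sizes_gb, stall_ms)):
        # Walk capacities downward so each layer is picked at most once.
        for c in range(budget_gb, size - 1, -1):
            candidate = best[c - size][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (i,))
    return best[budget_gb]


# Made-up profile: four layers, 7 GB of VRAM left for resident weights.
avoided, resident = choose_resident_layers(
    sizes_gb=[4, 3, 2, 5],
    stall_ms=[40.0, 50.0, 20.0, 60.0],
    budget_gb=7,
)
print(avoided, resident)
```

Layers that are not chosen would be the ones streamed from host RAM, ideally prefetched while earlier layers are still computing. A real scheduler would also have to model transfer bandwidth and overlap, which this sketch leaves out.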
@RickT9900022 @SebAaltonen And my points in bringing up who @SebAaltonen is are:
- He is saying that (2) above is true, and if he is saying that of all people, believe it.
- He is NOT saying that (1) isn’t true. He (and I) would agree with you there, to a point. If you look at his posts, he pushes for better!
@RickT9900022 @SebAaltonen Yep, go ahead, you can voice whatever opinion you want, even if you have no idea what you are talking about. You could yell that game devs should just make amazing games run in 256KB of RAM at 900 FPS. Why not? Obviously they are just lazy, right? lol
@RickT9900022 @SebAaltonen Dude, you have no idea who you are talking to lol. He is the most outspoken game dev I can think of about optimizing games to run on low spec devices. 8GB (shared between CPU and GPU on Macs) is not enough for the resolution and level of detail of modern AAA games.
@danoboltup @ChrisO_wiki @HumanistQuaker Their they’re, its all rite, if there so worried im shure their going to come help, your mispelling all you’re words acting like a jerk online for nuthin
@FPupusas @ThePrimeagen And yes, the same may very well happen with agents. I also said it’s complicated, and to some degree I agree with you. Still, while beginners can immediately see the benefits, what people like Prime immediately feel are the pain points, the drop in productivity, and the loss of control.
@FPupusas @ThePrimeagen His complaint about inline autocomplete was primarily based on his use of Copilot well over 6 months ago, as he made clear in his responses to you in this thread.
@FPupusas @ThePrimeagen I’m oversimplifying. It depends on what you are doing, how much effort you put into setting everything up, how much you care about being hands-on vs. hands-off, and many other things.
@FPupusas @ThePrimeagen For people who didn’t know what they were doing, inline autocomplete was amazing, but for someone like Prime it was not.
Then things got better. Now autocomplete is great for Prime, and agents are amazing for folks who don’t know what they are doing, but not for people like Prime.
I'm pretty annoyed that Hypersteer (a work by some of my friends applying hypernetworks to produce very effective steering vectors from text descriptions) has not received the appropriate amount of credit in later work pursuing basically the same idea https://t.co/8aGNgrvfcs
We’re excited to introduce Doc-to-LoRA and Text-to-LoRA, two related research projects exploring how to make LLM customization faster and more accessible.
https://t.co/ApVzVsBuv1
By training a Hypernetwork to generate LoRA adapters on the fly, these methods allow models to instantly internalize new information or adapt to new tasks.
Biological systems naturally rely on two key cognitive abilities: durable long-term memory to store facts, and rapid adaptation to handle new tasks given limited sensory cues. While modern LLMs are highly capable, they still lack this flexibility. Traditionally, adding long-term memory or adapting an LLM to a specific downstream task requires an expensive and time-consuming model update, such as fine-tuning or context distillation, or relies on memory-intensive long prompts.
To bypass these limitations, our work focuses on the concept of cost amortization. We pay the meta-training cost once to train a hypernetwork capable of producing task- or document-specific LoRAs on demand. This turns what used to be a heavy engineering pipeline into a single, inexpensive forward pass. Instead of performing per-task optimization, the hypernetwork meta-learns update rules to instantly modify an LLM given a new task description or a long document.
In our experiments, Text-to-LoRA successfully specializes models to unseen tasks using just a natural language description. Building on this, Doc-to-LoRA is able to internalize factual documents. On a needle-in-a-haystack task, Doc-to-LoRA achieves near-perfect accuracy on instances five times longer than the base model's context window. It can even generalize to transfer visual information from a vision-language model into a text-only LLM, allowing it to classify images purely through internalized weights.
Importantly, both methods run with sub-second latency, enabling rapid experimentation while avoiding the overhead of traditional model updates. This approach is a step towards lowering the technical barriers of model customization, allowing end-users to specialize foundation models via simple text inputs. We have released our code and papers for the community to explore.
Doc-to-LoRA
Paper: https://t.co/87xEEpf0GN
Code: https://t.co/zBfQi2L9LW
Text-to-LoRA
Paper: https://t.co/emLRZ4Vdvo
Code: https://t.co/b9mrdoWWRB
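The "one forward pass produces a LoRA" idea from the post above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions, not the released Doc-to-LoRA or Text-to-LoRA code: the hypernetwork here is a single untrained linear map, the names `make_hypernet` and `apply_lora` are hypothetical, and the task embedding is just a short list of floats standing in for an encoded description or document. Meta-training, which is where the real cost is paid, is omitted entirely.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def make_hypernet(embed_dim, d_out, d_in, rank, seed=0):
    """Return a function mapping a task embedding to LoRA factors (A, B)."""
    rng = random.Random(seed)
    n_params = rank * (d_out + d_in)
    # Hypothetical hypernetwork: one linear layer, randomly initialized.
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_params)]
               for _ in range(embed_dim)]

    def generate(task_embedding):
        flat = matmul([task_embedding], weights)[0]  # the single forward pass
        split = d_out * rank
        B = [flat[i * rank:(i + 1) * rank] for i in range(d_out)]  # d_out x rank
        A = [flat[split + i * d_in: split + (i + 1) * d_in]
             for i in range(rank)]                                  # rank x d_in
        return A, B

    return generate


def apply_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A), the usual low-rank LoRA update."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


generate = make_hypernet(embed_dim=2, d_out=4, d_in=3, rank=1)
A, B = generate([1.0, 0.5])            # stand-in for an encoded task description
W = [[0.0] * 3 for _ in range(4)]      # stand-in for one frozen base weight
W_adapted = apply_lora(W, A, B)
```

The base weight `W` is never trained here. Only the hypernetwork would be, once, so that each new task or document costs a single call to `generate`.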
@_arohan_ How did they invent terminology? “expert parallelism” parallelizes along the expert dimension.
By the way, IMO it’s not just sharding, but also includes how to reorganize the computation to minimize communication. TP combines sharding different tensors along two different dims.
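Here is a small sketch of what "sharding different tensors along two different dims" can look like for a two-layer MLP, y = relu(x W1) W2. It assumes the common column-then-row split (my reading of the point above, not something stated in the thread): W1 is split by columns and W2 by rows, so each simulated device applies the ReLU locally and a single sum at the end, standing in for an all-reduce, is the only communication. All function names are hypothetical, and the "devices" are just list entries.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks (assumes divisibility)."""
    step = len(m[0]) // parts
    return [[row[k * step:(k + 1) * step] for row in m] for k in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks (assumes divisibility)."""
    step = len(m) // parts
    return [m[k * step:(k + 1) * step] for k in range(parts)]


def mlp(x, w1, w2):
    """Unsharded reference: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def tp_mlp(x, w1, w2, n_dev):
    """Same MLP with w1 column-sharded and w2 row-sharded over n_dev devices."""
    w1_shards = split_cols(w1, n_dev)   # shard along the hidden (output) dim
    w2_shards = split_rows(w2, n_dev)   # shard along the hidden (input) dim
    # Each device computes its partial output with no communication.
    partials = [matmul(relu(matmul(x, a)), b)
                for a, b in zip(w1_shards, w2_shards)]
    # One element-wise sum across devices, standing in for an all-reduce.
    out = partials[0]
    for p in partials[1:]:
        out = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(out, p)]
    return out


x = [[1, -2, 3]]
w1 = [[1, 0, -1, 2], [2, 1, 0, -1], [0, -1, 1, 1]]
w2 = [[1, 2], [0, -1], [3, 1], [-2, 0]]
assert tp_mlp(x, w1, w2, n_dev=2) == mlp(x, w1, w2)
```

Splitting W1 by rows instead would force a reduction before the ReLU, which is the kind of extra communication the reorganization avoids.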