David Thomas

@davidthomas426

Joined October 2011

360 Following

71 Followers

1.8K Posts

David Thomas @davidthomas426

22 days ago

Exciting to see this work out in the wild! 🚀 It’s been great working with such a talented group. My contribution (so far): writing optimized GKA decoding fused kernels exploiting state matrix symmetry for extra speedup⚡. Check out the details in the paper. More to come! 🛠️

Prannay Kaul

@PrannayKaul

22 days ago

Introducing Priming

Hybrid models are faster and cheaper than Transformers to scale. But developing alternative architectures from scratch requires expensive pre-training runs. Priming solves this by leveraging pre-trained Transformer weights to train equally performant Hybrid models with 2× faster throughput. Builders can now iterate on Hybrid architectures for under 150B tokens, 100× cheaper than pre-training. 1/12

davidthomas426 retweeted

Piotr Nawrot

@p_nawrot

about 2 months ago

💾🚀 Run Llama-3.1-405B FP8 (410GB) on a single 180GB GPU #NVIDIA

Introducing FlexTensor, NVIDIA's new library that makes host RAM a transparent extension of your GPU memory. One call: flextensor.offload(model). No model rewrites, no framework changes. Works with vLLM, HuggingFace, and any PyTorch model.

Traditional offloading is reactive: move data when you run out of memory, stall the GPU while you wait. FlexTensor instead profiles your model's layer access patterns, then solves a knapsack optimization to schedule prefetches that overlap with compute. By the time a layer needs its weights, they're already there.

The freed VRAM gives vLLM more room for KV cache, enabling 4x longer contexts (8K→32K) or 4x larger batches. For video generation (Wan2.2-T2V-A14B on GB200): +0.1% overhead. Handles FP8, custom Triton kernels, and multi-GPU. Profiles saved to disk, so no warmup on repeated runs.

Check it out: https://t.co/VXmou7AaiO

David Thomas @davidthomas426

3 months ago

@RickT9900022 @SebAaltonen He pushes for better, and educates game devs on HOW to do better, as much as anyone I’m aware of.

David Thomas @davidthomas426

3 months ago

@RickT9900022 @SebAaltonen And my points in bringing up who @SebAaltonen is:
- He is saying that (2) above is true, and if he of all people is saying that, believe it.
- He is NOT saying that (1) isn’t true. He (and I) would agree with you there, to a point. If you look at his posts, he pushes for better!

David Thomas @davidthomas426

3 months ago

@RickT9900022 @SebAaltonen Omg I just saw YOUR bio, and of course lol. 🤦‍♂️😂

David Thomas @davidthomas426

3 months ago

@RickT9900022 @SebAaltonen Yep, go ahead, you can voice whatever opinion you want, even if you have no idea what you are talking about. You could yell that game devs should just make amazing games run in 256KB of RAM at 900 FPS. Why not? Obviously they are just lazy, right? lol

David Thomas @davidthomas426

3 months ago

@RickT9900022 @SebAaltonen Dude, you have no idea who you are talking to lol. He is the most outspoken game dev I can think of about optimizing games to run on low spec devices. 8GB (shared between CPU and GPU on Macs) is not enough for the resolution and level of detail of modern AAA games.

David Thomas @davidthomas426

3 months ago

@danoboltup @ChrisO_wiki @HumanistQuaker Ok, “redditor energy”, “acting like a jerk online”, whatever you want to call your behavior is fine with me

David Thomas @davidthomas426

3 months ago

@danoboltup @ChrisO_wiki @HumanistQuaker Their they’re, its all rite, if there so worried im shure their going to come help, your mispelling all you’re words acting like a jerk online for nuthin

David Thomas @davidthomas426

3 months ago

@YouJiacheng @SkyLi0n I assumed they meant the linear algebra library: https://t.co/pccuLPUrIb

David Thomas @davidthomas426

3 months ago

@FPupusas @ThePrimeagen And yes, the same may very well happen with agents. I also said it’s complicated, and to some degree I agree with you. Still, while beginners can immediately see the benefits, what people like Prime immediately feel are the pain points, drop in productivity, and loss of control.

David Thomas @davidthomas426

3 months ago

@FPupusas @ThePrimeagen His complaint about inline autocomplete was primarily based on his use of Copilot well over 6 months ago, as he made clear in his responses to you in this thread.

David Thomas @davidthomas426

3 months ago

@FPupusas @ThePrimeagen I’m oversimplifying. It depends on what you are doing, how much effort you put into setting everything up, and how much you care about being hands-on vs. hands-off, and many other things.

David Thomas @davidthomas426

3 months ago

@FPupusas @ThePrimeagen For people who didn’t know what they were doing, inline autocomplete was amazing, but for someone like Prime it was not. Then things got better. Now autocomplete is great for Prime, and agents are amazing for folks who don’t know what they are doing but not for people like Prime.

davidthomas426 retweeted

Aryaman Arora

@aryaman2020

3 months ago

I'm pretty annoyed that Hypersteer (a work by some of my friends applying hypernetworks to produce very effective steering vectors from text descriptions) has not received the appropriate amount of credit in later work pursuing basically the same idea https://t.co/8aGNgrvfcs

davidthomas426 retweeted

Sakana AI

@SakanaAILabs

3 months ago

We’re excited to introduce Doc-to-LoRA and Text-to-LoRA, two related research projects exploring how to make LLM customization faster and more accessible. https://t.co/ApVzVsBuv1

By training a Hypernetwork to generate LoRA adapters on the fly, these methods allow models to instantly internalize new information or adapt to new tasks.

Biological systems naturally rely on two key cognitive abilities: durable long-term memory to store facts, and rapid adaptation to handle new tasks given limited sensory cues. While modern LLMs are highly capable, they still lack this flexibility. Traditionally, adding long-term memory or adapting an LLM to a specific downstream task requires an expensive and time-consuming model update, such as fine-tuning or context distillation, or relies on memory-intensive long prompts.

To bypass these limitations, our work focuses on the concept of cost amortization. We pay the meta-training cost once to train a hypernetwork capable of producing task- or document-specific LoRAs on demand. This turns what used to be a heavy engineering pipeline into a single, inexpensive forward pass. Instead of performing per-task optimization, the hypernetwork meta-learns update rules to instantly modify an LLM given a new task description or a long document.

In our experiments, Text-to-LoRA successfully specializes models to unseen tasks using just a natural language description. Building on this, Doc-to-LoRA is able to internalize factual documents. On a needle-in-a-haystack task, Doc-to-LoRA achieves near-perfect accuracy on instances five times longer than the base model's context window. It can even generalize to transfer visual information from a vision-language model into a text-only LLM, allowing it to classify images purely through internalized weights. Importantly, both methods run with sub-second latency, enabling rapid experimentation while avoiding the overhead of traditional model updates.

This approach is a step towards lowering the technical barriers of model customization, allowing end-users to specialize foundation models via simple text inputs. We have released our code and papers for the community to explore.

Doc-to-LoRA Paper: https://t.co/87xEEpf0GN Code: https://t.co/zBfQi2L9LW
Text-to-LoRA Paper: https://t.co/emLRZ4Vdvo Code: https://t.co/b9mrdoWWRB
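
For readers who want a concrete picture of "a hypernetwork that produces LoRA adapters in a single forward pass", here is a toy sketch. It is not the architecture from the Doc-to-LoRA or Text-to-LoRA papers: `ToyHyperNetwork`, `adapted_forward`, the single linear layer, and all dimensions are assumptions made up for illustration. It only shows the general shape of the idea: a task embedding goes in, low-rank factors A and B come out, and the base layer computes W x + B (A x).

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def reshape(flat, rows, cols):
    """Reshape a flat list into a rows x cols matrix."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


class ToyHyperNetwork:
    """Hypothetical one-layer hypernetwork: task embedding -> LoRA factors."""

    def __init__(self, emb_dim, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        n_out = rank * d_in + d_out * rank  # entries of A plus entries of B
        self.weight = [[rng.gauss(0.0, 0.1) for _ in range(emb_dim)]
                       for _ in range(n_out)]

    def generate(self, task_embedding):
        """One forward pass producing A (rank x d_in) and B (d_out x rank)."""
        flat = matvec(self.weight, task_embedding)
        split = self.rank * self.d_in
        a = reshape(flat[:split], self.rank, self.d_in)
        b = reshape(flat[split:], self.d_out, self.rank)
        return a, b


def adapted_forward(w_base, a, b, x):
    """Base layer plus the generated low-rank update: W x + B (A x)."""
    base = matvec(w_base, x)
    update = matvec(b, matvec(a, x))
    return [u + v for u, v in zip(base, update)]


# Usage: generate an adapter for one (made-up) task embedding and apply it.
hyper = ToyHyperNetwork(emb_dim=3, d_in=4, d_out=2, rank=1)
lora_a, lora_b = hyper.generate([1.0, 0.0, -1.0])
w = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
y = adapted_forward(w, lora_a, lora_b, [1.0, 2.0, 3.0, 4.0])
```

The base weights `w` are never modified; only the small generated factors change per task, which is what makes per-task specialization cheap in this kind of setup.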

David Thomas @davidthomas426

3 months ago

@_arohan_ How did they invent terminology? “expert parallelism” parallelizes along the expert dimension. By the way, IMO it’s not just sharding, but also includes how to reorganize the computation to minimize communication. TP combines sharding different tensors along two different dims.
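
One way to read the "two different dims" remark is the common tensor-parallel layout for a two-layer MLP: the first weight matrix is split by columns, the second by rows, and the per-shard partial outputs are summed. The sketch below is an illustration under that assumption, not something taken from the thread. The helper names are hypothetical, shards are simulated in a loop on one process, and the final sum stands in for an all-reduce.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(m):
    """Elementwise ReLU."""
    return [[max(v, 0.0) for v in row] for row in m]


def mlp(x, w1, w2):
    """Unsharded reference: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def tensor_parallel_mlp(x, w1, w2, n_shards):
    """Same result with w1 split by columns and w2 split by rows.

    Assumes the hidden size (number of rows of w2) is divisible by n_shards.
    """
    step = len(w2) // n_shards
    total = None
    for s in range(n_shards):
        lo, hi = s * step, (s + 1) * step
        w1_cols = [row[lo:hi] for row in w1]  # shard along w1's column dim
        w2_rows = w2[lo:hi]                   # shard along w2's row dim
        partial = matmul(relu(matmul(x, w1_cols)), w2_rows)
        if total is None:
            total = partial
        else:  # summing partials plays the role of the all-reduce
            total = [[p + q for p, q in zip(r1, r2)]
                     for r1, r2 in zip(total, partial)]
    return total


# Usage: both paths give the same output up to floating-point rounding.
x = [[1.0, -2.0, 3.0], [0.5, 1.5, -1.0]]
w1 = [[1.0, -1.0, 0.5, 2.0], [0.0, 1.0, -0.5, 1.0], [2.0, 0.5, 1.0, -1.0]]
w2 = [[1.0, 0.0], [0.5, -1.0], [-1.0, 2.0], [0.0, 1.0]]
reference = mlp(x, w1, w2)
sharded = tensor_parallel_mlp(x, w1, w2, n_shards=2)
```

Because ReLU is elementwise, each shard can apply it to its own slice of the hidden activations without communicating, so only the final sum needs cross-shard traffic. That is the "reorganize the computation to minimize communication" part.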

David Thomas @davidthomas426

4 months ago

@turbofish_pk @TheGingerBill @aramh @effectfully lol nice typo

David Thomas @davidthomas426

5 months ago

@marksaroufim @a1zhang this is perfect 😂. My mind could not figure out whether to crack up or be infuriated at it.
