James Dborin

@JDborin

Co-founder of Doubleword. We solve hard LLM inference problems.

Joined January 2021

486 Following

52 Followers

8 Posts

JDborin retweeted

Sakura Yuki

@sakurayukiai

24 days ago

Can we talk about speculative KV coding? You run an FP8 model to predict the BF16 cache, then just arithmetic-code the residual. We are literally burning extra forward passes purely to shrink VRAM footprints by 4x. Compute is officially cheaper than memory ✨
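
A toy sketch of the "predict, then code the residual" idea in the post above, in plain Python. Everything here is an assumption for illustration: `exact_model` and `cheap_model` are hypothetical stand-ins for the BF16 and FP8 forward passes, the values are small integers rather than real cache tensors, and empirical entropy is used as a proxy for what an arithmetic coder could approach. It does not reproduce the 4x figure from the post.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol.

    This is the rate an ideal entropy coder (for example an arithmetic
    coder) could approach for these symbol frequencies.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def exact_model(position):
    """Hypothetical full-precision value for a cache position."""
    return (position * 37) % 256


def cheap_model(position, step=16):
    """Hypothetical low-precision prediction of the same value.

    The decoder can rerun this on its own, at the cost of extra compute.
    """
    return (exact_model(position) // step) * step


positions = range(256)
cache = [exact_model(p) for p in positions]

# Encoder: store only the difference between the true value and the prediction.
residuals = [cache[p] - cheap_model(p) for p in positions]

# Decoder: rerun the cheap model and add the stored residual back.
decoded = [cheap_model(p) + residuals[p] for p in positions]

raw_bits = entropy_bits(cache)
residual_bits = entropy_bits(residuals)
```

In this toy setup the residuals take far fewer distinct values than the raw cache, so `residual_bits` comes out lower than `raw_bits`, and `decoded` matches `cache` exactly. How much a real cache would shrink depends entirely on how well the cheap model predicts it.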

James Dborin @JDborin

about 2 years ago

@MeryemArik9 @sidfix yi and zephyr for y and z if i have understood the game?

James Dborin @JDborin

over 2 years ago

@JeffreyUrban_ @MLOpsWorld Just saw this paper pop up: https://t.co/mxLsPqy6g2 This is the sort of thing that would power these networks of resource sharing models.

@_akhaliq

over 2 years ago

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

paper page: https://t.co/ONdIQz52dl

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.
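
A minimal sketch of the shared-pool idea the abstract calls Unified Paging, in plain Python. This is not S-LoRA's implementation: the `UnifiedPagePool` class, the owner names, and the page counts per adapter rank or request are all hypothetical, chosen only to show adapter weights and KV cache entries drawing fixed-size pages from one pool.

```python
class UnifiedPagePool:
    """Toy pool of fixed-size pages shared by adapter weights and KV cache."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.owned = {}  # owner name -> list of page ids it holds

    def allocate(self, owner, num_pages):
        """Hand out any free pages; they do not need to be contiguous."""
        if num_pages > len(self.free_pages):
            raise MemoryError(
                f"need {num_pages} pages, only {len(self.free_pages)} free"
            )
        pages = [self.free_pages.pop() for _ in range(num_pages)]
        self.owned.setdefault(owner, []).extend(pages)
        return pages

    def release(self, owner):
        """Return all pages held by an owner to the free list."""
        self.free_pages.extend(self.owned.pop(owner, []))


pool = UnifiedPagePool(num_pages=16)
pool.allocate("adapter:rank8", 2)    # small adapter, few pages (assumed sizes)
pool.allocate("adapter:rank64", 8)   # larger rank, more pages
pool.allocate("kv:request-1", 5)     # KV cache for one sequence

# Evicting an adapter frees pages that a new request's KV cache can reuse.
pool.release("adapter:rank64")
pool.allocate("kv:request-2", 7)
```

Because every allocation is a set of same-sized pages, freeing a large adapter leaves space that a KV cache of any length can use, which is the fragmentation benefit the abstract describes.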

James Dborin @JDborin

about 3 years ago

@pommedeterre33 I think you are right - are you trying to quantize the kv cache to get longer sequence lengths without OOM?

James Dborin @JDborin

about 3 years ago

@pommedeterre33 I think I understand, really cool idea! Loving the larger kernl project as well.

James Dborin @JDborin

about 3 years ago

@pommedeterre33 so you are splitting the computation into blocks indexed by the program id, doing the pytorch ops, and then combining them again at the end, using something like https://t.co/EY3PffmV6L?

James Dborin @JDborin

about 3 years ago

@pommedeterre33 is the idea that triton ops like tl.arange are replaced with pytorch equivalents?
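
The two replies above ask about splitting work into blocks by program id and swapping Triton ops for ordinary equivalents. A rough sketch of that pattern, assuming a hypothetical element-wise add and using plain Python lists in place of both Triton and PyTorch tensors; it is not the approach of the project being discussed, only an illustration of the block decomposition.

```python
def add_kernel(pid, x, y, out, n, block_size):
    """One 'program instance': processes the block of indices chosen by pid."""
    # Stand-in for: offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offsets = [pid * block_size + i for i in range(block_size)]
    for off in offsets:
        # Stand-in for the mask that guards the ragged final block.
        if off < n:
            out[off] = x[off] + y[off]


def launch(x, y, block_size=4):
    """Run every program id in turn; all instances write into one output."""
    n = len(x)
    out = [0] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):
        add_kernel(pid, x, y, out, n, block_size)
    return out


x = list(range(10))
y = [10 * v for v in x]
result = launch(x, y)
```

With 10 elements and a block size of 4, three program instances run and the last one handles only two valid offsets, which is why the bounds check is needed.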

JDborin retweeted

Conception X @conceptionxtech

over 4 years ago

We're on the @PaCCSResearch @UKRI_News blog today with a story about another Cohort 1 team that's making waves – @AstroscreenHQ by @Tehranix

@rahkoAI @oxiapalus @JDborin also mentioned

More about how to apply for Cohort V in there👇 https://t.co/HsIrS8A2Su
