You see things; and you say: Why? But we designers dream things that never were; and say: Why not?
George Bernard Shaw, adapted by Dick Powell
New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with @RedHat and taught by @cedricclyburn.
Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just to load the weights at 16-bit precision (2 bytes per parameter). On top of that, every active request needs its own chunk of GPU memory, the KV cache, to store the token context it has built up so far. In this course, you'll learn to reduce a model's memory footprint with quantization and serve it using vLLM, which handles many concurrent requests efficiently through smart memory management.
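The ~140 GB figure can be checked with quick arithmetic. This is a rough sketch, not course material. The weight estimate assumes 16-bit weights and decimal gigabytes. The KV-cache estimate uses a hypothetical model shape (80 layers, 8 KV heads, head dimension 128) chosen only for illustration; real models and serving engines differ.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Weights of a 70B-parameter model at different precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(70e9, bits):.0f} GB")

# Hypothetical shape (an assumption for illustration, not from the post).
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
context_tokens = 4096
print(f"KV cache per token: {per_token} bytes")
print(f"KV cache for one {context_tokens}-token request: "
      f"{per_token * context_tokens / 1e9:.2f} GB")
```

Halving the bits per weight halves the weight footprint, which is the motivation for quantization. The per-request KV cache is why memory management matters once many users are connected at the same time.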
Skills you'll gain:
- Quantize a model and measure the accuracy tradeoff
- Serve a model with vLLM and watch it handle concurrent requests efficiently
- Benchmark your deployment and make informed tradeoffs between speed, cost, and accuracy
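To make the first skill concrete, here is a toy sketch of symmetric int8 quantization in plain Python. It is an assumption about the general idea, not the method or tooling the course uses. The weight values are made up, and round-trip error is only a simple proxy for the accuracy tradeoff, which in practice is measured on real evaluation tasks.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Fall back to 1.0 if every value is zero, to avoid dividing by zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 codes."""
    return [q * scale for q in quantized]


# Made-up weights, for illustration only.
weights = [0.12, -0.53, 0.98, -1.27, 0.04]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", codes)
print("scale:", scale)
print("max round-trip error:", max_err)
```

Each value is stored in 8 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale per weight.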
Join and learn to serve LLMs efficiently:
https://t.co/x04xMbFlkO
Just a reminder that Dallas hasn't won on the road in… #NFLnaESPN. That's no basis for saying the Bills have improved. We can only say that once they face the Dolphins in the final week. Anything else is nonsense.
Getting back to some local events. Dropping by Women Techmakers today for International Women’s Day 2023, organized by GDG Lauro de Freitas #WTMDareToBe
Developers love the command line, and there are plenty of amazing tools out there.
Here's a collection of my favorites:
[🧵 thread] 👇
Cc: @sseraphini