I don't know if this was ever released but 10M is actually doable now
If you're interested I'm sure it would help with code generation and many other things @spawn @cline @replit @cursor_ai @windsurf_ai @boltdotnew @scoutdotnew
lmk my dms are open
LTM-2-Mini is our first model with a 100 million token context window. That’s 10 million lines of code, or 750 novels.
Full blog: https://t.co/oFz4A9ynVZ
Evals, efficiency, and more ↓
[blog]
So I was exploring some very influential vision-language models, and while making notes along the way, it kind of turned into a mega blog.
In this blog, I’ve covered the novelties and interesting aspects of models like Flamingo, BLIP, BLIP-2, and LLaVA. (There’s even a mini-blog inside this one about Perceiver by Google DeepMind).
Some of the common ideas I noticed across these papers were:
- The use of cross-attention to make visual and language information interact.
- The idea of using a mapping network to project from one embedding space into the LLM’s embedding space.
I’ll drop the link in the comments - do check it out, and I really hope you all will like it!!
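For readers who want something concrete, the mapping-network idea above can be sketched in a few lines. This is only an illustration in the spirit of the single linear projection LLaVA uses, not code from any of the papers: the names `project` and `build_llm_input`, the toy dimensions, and the choice to prepend visual tokens to the text tokens are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def project(features: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Map each vision feature (length d_vision) to the LLM width (length d_llm).

    weight has shape [d_llm][d_vision]; bias has length d_llm.
    Hypothetical helper: a plain linear layer written out by hand.
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


def build_llm_input(visual_tokens: List[Vector], text_tokens: List[Vector]) -> List[Vector]:
    """Prepend projected visual tokens to the text embeddings (an assumed layout)."""
    return visual_tokens + text_tokens


# Toy example: 2-dim vision features projected into a 3-dim "LLM" embedding space.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
patches = [[1.0, 2.0], [3.0, 4.0]]          # two image patch features
text = [[0.1, 0.2, 0.3]]                    # one text token embedding

visual = project(patches, weight, bias)
sequence = build_llm_input(visual, text)    # 2 visual tokens followed by 1 text token
```

In a real model the weights are learned and the vision features come from a frozen image encoder; the point here is just that the projection makes the two embedding spaces line up so the LLM can consume both.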
.@vllm_project has quickly become a go-to open source engine for efficient large language model inference, balancing performance with a strong developer experience. At NVIDIA, direct contributions to projects like vLLM reflect a commitment to advancing open source AI infrastructure for everyone.
In this Q&A, Benjamin Chislett, Senior Systems Software Engineer at NVIDIA and Committer for vLLM, shares his perspective on shaping the project’s future, his work on speculative decoding, and why open source collaboration matters for AI at scale.
🔗 https://t.co/Jg7XjhUs34
New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in-depth explanation of how LLM inference engines, and vLLM in particular, work!
Took me a while to get this level of understanding of the codebase and then to write this one up - i quickly realized i underestimated the effort. 😅 It could have easily been a book/booklet (lol).
I covered:
* Basics of inference engine flow (input/output request processing, scheduling, paged attention, continuous batching)
* "Advanced" stuff: chunked prefill, prefix caching, guided decoding (grammar-constrained FSM), speculative decoding, disaggregated P/D
* Scaling up: going from smaller LMs that can be hosted on a single GPU all the way to trillion+ params (via TP/PP/SP) -> multi-GPU, multi-node setup
* Serving the model on the web: going from offline deployment to multiple API servers, load balancing, DP coordinator, multiple engines setup :)
* Measuring perf of inference systems (latency (ttft, itl, e2e, tpot), throughput) and GPU perf roofline model
Lots of examples, lots of visuals!
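As a quick illustration of the latency metrics listed above, here is a minimal sketch that derives them from per-token timestamps of a single request. It is not taken from the blog post or from vLLM; the function name `latency_metrics`, the `LatencyMetrics` container, and the convention that TPOT excludes the first token are assumptions for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatencyMetrics:
    ttft: float        # time to first token
    itl: List[float]   # inter-token latencies (gaps between consecutive tokens)
    tpot: float        # average time per output token after the first
    e2e: float         # end-to-end latency for the whole request


def latency_metrics(request_start: float, token_times: List[float]) -> LatencyMetrics:
    """Compute latency metrics from the request start time and token arrival times (seconds)."""
    if not token_times:
        raise ValueError("need at least one output token timestamp")
    ttft = token_times[0] - request_start
    e2e = token_times[-1] - request_start
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # With a single output token there are no decode steps to average over.
    tpot = (e2e - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    return LatencyMetrics(ttft=ttft, itl=itl, tpot=tpot, e2e=e2e)


# Toy example: request sent at t=0.0, four tokens arrive at these times.
m = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
throughput = 4 / m.e2e  # output tokens per second for this one request
```

TTFT mostly reflects prefill (plus queueing), while ITL and TPOT reflect the decode loop, which is why the two are usually reported separately.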
---
I realize i've been silent on social - many of you noticed and thanks for reaching out! :) --> I'm so back! lots of things happened.
Also, in general, I'm a bit sick of superficial content, it really is an equivalent of junk food (h/t @karpathy).
I want to do the best/deepest technical work of my life over the next years and write much more in depth (high quality organic food ;)) so I might not be as frequent around here as i used to be (? we'll see). I'll make it a goal to share a few paper summaries a week or stuff that's relevant / in the zeitgeist.
If you have any topics that came up over the past few weeks/months, drop them in the comments; i might focus on some of those in my next posts.
---
Huge thank you to @Hyperstackcloud for giving me an H100 node to run some of the experiments and analysis that i needed to write this up. The team there led by Christopher Starkey is amazing!
Also a big thank you to Nick Hill (who did a very thorough review of the post - basically a code review lol; Nick's a core vLLM contributor and principal SWE at Red Hat) and to my friends Kyle Krannen (NVIDIA Dynamo), @marksaroufim (PyTorch), and @ashVaswani (goat) for taking the time over the weekend when they didn't have to!
Part 1 of my article series on fine-tuning an LLM for analysis of massive amounts of Intel Processor Trace is up. Use cases: codebase vulnerability scan, at-scale bug triage, etc. With thanks to @33y0re, @ivanrouzanov, and @vGPUArthur: https://t.co/fx5AdiQR4M