🥳Check out: Token-Explorer! 🤖
Interact with and explore LLM token generation!
Features:
- Step through token selection
- Remove tokens to explore alt paths
- Fork prompt and quickly switch between them
- Visualize all token probabilities and entropy!
- OSS (github in replies)
The new White House policy requiring green card applicants to apply from outside the US is a capricious attack on legal immigration. It will hurt families, leave us with fewer doctors, teachers and scientists, and hurt American competitiveness in AI.
@mr_r0b0t@gajesh@NVIDIAAI Concurrency only matters for throughput, it's not going to impact tokens/s per request/user and you're still fundamentally limited on the size of model. Yes, the DGX at that pricepoint might be the best option, but at that bandwidth you're still very limited.
Reading two wildly different books (Žižek’s “Too late to awaken” and Smullyan’s “To mock a mockingbird”) and each makes reference to the exact same joke (with different names/attributions)!
@oprydai How many GPUs do you have and how many requests do you expect to be serving concurrently?
If the answer to both is roughly 1 then llama.cpp is a good place to start (esp if you have < 1 GPU).
Otherwise you'll probably get more value out of vLLM
@TheAhmadOsman Heat and electricity bills? I run all my image models through my RTX 4090, but I keep any long running LLMs running on my M3 max MBP.
Ultimately it boils down to bandwidth vs wattage
That said, I've not seen a compelling argument in favor of the DGX
A huge distinction in success with agents really boils down to whether or not you're solving a problem with a directly measurable outcome.
I also believe people will increasingly question why anyone is spending significant time writing software without a measurable outcome
Weirdly starting to enjoy X again, you just have to recognize that there are only a handful of real/interesting people left.
A few likes from the right people means what 100s from randos did a few years ago.
@jun_song@LottoLabs People really underestimate how much everything really boils down to memory bandwidth and power consumption.
128GB of memory isn’t much better than 24GB if your bandwidth is still ~250 GB/s and your local model isn’t really “free” if you need 1200 watts to do inference.
@nptacek Seriously, the last chapter of my soon-to-be released book on Stable Diffusion is all about using proprietary models to improve base SD. As long as new information can be added a model can be improved
@jun_song So apparently cats do this because they are trying to imitate you, so if you have a old, small laptop for your cat it *might* use that instead!
Yes you should host your own LLM, and yes you should host your own private git server, but you should also ferment you own booze! Local apples, home fermented cider!
@LottoLabs Exactly. I don’t need local models to be better than proprietary SotA *today*, I just need them to be as good as proprietary models where when I started to be able to reliably trust agents to do their thing.