@bnjmn_marie @WaleedAhmad1a10 any plans to do larger models (122b mtp, stepfun 3.7, v4 flash, 397b nex, mimo etc.) at low (iq1-q3) quants?
would be nice to be able to squeeze more intelligence into commodity hardware and still have it be usable.
@RyanNg26101 @Montrey82631182 donbas "locals" only joined after russia had already occupied crimea and RU started the same shit there. video clearly shows well-trained, professional soldiers in matching uniforms.
also, finding 1000 local bandits in a region with millions of people doesn't justify an occupation.
@0xSero it would be much easier/better to figure out layered caching instead:
L1 - GPUs, ranked separately by compute (for prefill) & by vram speed (for generation)
L2 - ram+cpu
L3 - ssd
there's been quite a bit of research & some implementations too, but vllm and llama wouldn't let the PRs in
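a rough sketch of what that layering could look like, not how vllm or llama actually do it: every name here (`Gpu`, `Tier`, `rank_gpus`, `place_layers`) and every number is made up for illustration. GPUs get ranked per phase, and layers fill the fastest tier first, spilling down to ram and then ssd.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    tflops: float      # hypothetical compute figure, matters for prefill
    mem_bw_gbs: float  # hypothetical VRAM bandwidth, matters for generation


@dataclass
class Tier:
    name: str
    capacity_gb: float
    used_gb: float = 0.0


def rank_gpus(gpus, phase):
    """Rank GPUs by compute for 'prefill', by memory bandwidth otherwise."""
    if phase == "prefill":
        key = lambda g: g.tflops
    else:
        key = lambda g: g.mem_bw_gbs
    return [g.name for g in sorted(gpus, key=key, reverse=True)]


def place_layers(layer_sizes_gb, tiers):
    """Greedy placement: fill the fastest tier, then spill to the next one."""
    placement = []
    t = 0
    for size in layer_sizes_gb:
        # Move down a tier while the current one cannot hold this layer.
        while t < len(tiers) and tiers[t].used_gb + size > tiers[t].capacity_gb:
            t += 1
        if t == len(tiers):
            raise MemoryError("model does not fit in any tier")
        tiers[t].used_gb += size
        placement.append(tiers[t].name)
    return placement
```

e.g. `place_layers([1.0] * 5, [Tier("gpu", 2.0), Tier("ram", 1.0), Tier("ssd", 10.0)])` puts two layers on gpu, one in ram and the rest on ssd. a real engine would also have to move KV cache and activations between tiers, which this ignores.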
@bnjmn_marie even with -np X and perhaps several server copies at the same time?
maybe it would be nice to have a proper comparison of the engines too.
@bnjmn_marie generally the most interesting question is what can you fit into 11GB, 15GB, 23GB, 31GB... past that, it's just macs and rtx pros, and those can run almost anything anyway.
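back-of-envelope version of the "what fits" question, as a sketch only: weight size is roughly params × bits-per-weight / 8. the bits-per-weight value and the `overhead_gb` number below are placeholders (assumptions, not measured), and real quant files, KV cache and runtime buffers will shift the result.

```python
def weights_gb(params_b, bits_per_weight):
    """Approximate weight size in GB (1e9 bytes) for params_b billion params."""
    return params_b * bits_per_weight / 8


def fits(params_b, bits_per_weight, budget_gb, overhead_gb=0.0):
    """True if weights plus an assumed overhead fit in the memory budget."""
    return weights_gb(params_b, bits_per_weight) + overhead_gb <= budget_gb


if __name__ == "__main__":
    # 27B at a placeholder 3.0 bits/weight, with an assumed 1.5 GB of overhead
    for budget in (11, 15, 23, 31):
        print(budget, fits(27, 3.0, budget, overhead_gb=1.5))
```

the overhead term is the part that varies most in practice (context length, KV cache quant), so it's left as an explicit knob instead of being baked in.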
@bnjmn_marie how so? llama with quants is consistently faster than vllm, at least every time I tried.
also, maybe the battery could be reduced: small models - Q4_K_M, maybe IQ4_NL & IQ3_XXS, + some smarter Q2s on 200B+ models?
those are probably the only ones that need to be tested really
@LenSeaside @stevibe 27B UD-IQ3_XXS [-ngl 65 + 36K Q4 kv cache] 1100-1200 pp, 36-37 t/s
but only if you connect the display to the motherboard (the CPU's iGPU), which gets you 1-3GB of VRAM back.