People running local models every day: what are you optimizing for right now, tokens/sec, long-context prefill, or just not turning the room into a toaster?
Qwen3.6-27B has 5M downloads and supposedly runs on budget GPUs now. But every runtime handles it differently -- different quantization paths, different MoE support, different prompt formats. Is there actually a "just works" local setup for it today, or is that still aspirational?
@loktar00@LottoLabs Same! Two 3090s locally in Vegas for like $1300. Not sure where the best value is now in compute accumulation Phase II. Started looking at these unified memory boxes but just not sold yet.
Curious where local AI people draw the line on hardware now. Is a $4k Strix Halo box useful enough to replace a small CUDA rig, or is this still mostly laptop-brained marketing?
Ollama v0.30.0 swaps GGML for direct llama.cpp. Pre-release is out. NousResearch also dropped agent-self-evolution, neural-steering, and paperclip-adapter this week. Hermes Agent crossed 162k stars. Local agent tooling is accelerating.
@mr_r0b0t@NVIDIAAI@GIGABYTEUSA@Acer Hell yea. This is mostly what I had in mind. Larger models on unified ram boxes to orchestrate smaller subagents running on 3090s, 4060s, and even some 3060s.
Local AI has a portability problem.
If “FFmpeg for LLM inference” existed, what should it solve first: model formats, runtime switching, quantization, serving APIs, or hardware configs?
My guess is runtime switching, but I’m not sure builders would agree.