Benchmarking rule I’m trying to follow:
If a result is inside run-to-run noise, call it noise.
Not “breakthrough.”
Not “secret flag.”
Not “RDMA is worse.”
Noise.
Local LLM work needs more boring honesty and fewer victory laps.
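A minimal sketch of how that rule could be applied in practice, assuming you have several repeated throughput measurements per configuration. The function name `compare_runs`, the threshold `k`, and the tokens/sec numbers are all made up for illustration; the "k times the standard deviation" cutoff is one simple heuristic, not a formal significance test.

```python
from statistics import mean, stdev

def compare_runs(baseline, candidate, k=2.0):
    """Call a difference 'noise' unless it exceeds k times the larger run-to-run stdev."""
    diff = mean(candidate) - mean(baseline)
    # Use the noisier of the two configurations as the yardstick.
    noise = max(stdev(baseline), stdev(candidate))
    verdict = "noise" if abs(diff) <= k * noise else "real difference"
    return diff, noise, verdict

# Hypothetical tokens/sec from five repeated runs of each config (illustrative only).
baseline = [41.2, 40.8, 41.9, 40.5, 41.4]
candidate = [41.8, 41.1, 42.0, 40.9, 41.6]

diff, noise, verdict = compare_runs(baseline, candidate)
print(f"diff={diff:+.2f} tok/s, run-to-run stdev={noise:.2f}, verdict={verdict}")
```

The point is only that the comparison needs repeated runs per config; a single run per side cannot tell you what the noise floor is.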
If you run local LLMs: what is your actual setup?
Not the dream build. The daily driver.
GPU/APU?
Memory/VRAM?
Model size?
Serving stack?
What breaks most often?
I’m trying to compare practical local AI systems, not leaderboard screenshots.
DeepSeek-V4-Flash on one Strix Halo box was more interesting than I expected.
Not because it was the fastest thing in the world.
Because it made 256K context feel plausible on a local machine.
I want to push local hardware to be as capable as possible.
And I want to push my harness to be productive with the smallest/dumbest models possible.
No leaning on the model for my harness.
Caveat: there is a minimum model quality floor that I simply can’t get around. Qwen3.5-9b breaks at q4 but not at q8 (for now).
I actually want to try baking RL into my harness lol.
Small model gives two options, big model chooses the better one, and over time the small model makes better choices?
I already have multi-model requests built-in so you can compare the decisions various models make (one main “real” model, and many “ghost” models which don’t actually have the ability to run tool calls).
It’s proven somewhat effective at sussing out the strengths/weaknesses of models. But uhh… can be expensive.
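A rough sketch of what the data-collection half of that idea might look like: the small model proposes two options, the big model picks one, and the result is logged as a chosen/rejected pair that a later preference-tuning step could consume. Everything here is hypothetical: `collect_preference`, `small_model`, and `big_model` are stand-in names, and the two model functions are stubs rather than real model calls. The training step itself is not shown.

```python
import json
from typing import Callable, Dict, List

def collect_preference(
    prompt: str,
    propose: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], int],
) -> Dict[str, str]:
    """Ask the proposer for two options and record which one the judge prefers."""
    options = propose(prompt)
    if len(options) != 2:
        raise ValueError("proposer must return exactly two options")
    best = judge(prompt, options)
    if best not in (0, 1):
        raise ValueError("judge must return index 0 or 1")
    return {"prompt": prompt, "chosen": options[best], "rejected": options[1 - best]}

# Stub models (assumptions): replace with real calls to a small and a large model.
def small_model(prompt: str) -> List[str]:
    return ["read the file first, then edit", "edit the file blindly"]

def big_model(prompt: str, options: List[str]) -> int:
    # Placeholder judgement: prefer the option that mentions reading first.
    return 0 if "read" in options[0] else 1

record = collect_preference("Fix the failing test", small_model, big_model)
print(json.dumps(record))  # one JSON line per decision, ready to append to a log
```

Since every record costs one big-model call, sampling only a fraction of decisions would be one way to keep the expense down.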
RL is a mistake, thinking is a mistake, and if we just put all the money into crafting an astronomically good, massive dataset, we'd pretrain a model that outperforms everything that exists by a considerable margin.
source: my ass (I have no idea what I'm talking about)
@LottoLabs If someone has the bandwidth, I’d be curious to see if it could be a potential runtime optimization.
Also:
- Tiered KV cache (hottest in VRAM, warm in RAM, cold on SSD)
- The CPU inference optimizations noted by Sakura Yuki
I think there is still meat on the bone for local AI.
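To make the tiering idea concrete, here is a toy sketch of the eviction logic only. It assumes a simple least-recently-used policy and uses plain in-memory dicts as stand-ins for VRAM, RAM, and SSD; the class name `TieredCache` and its methods are invented for illustration, and nothing here moves real tensors between devices.

```python
from collections import OrderedDict

class TieredCache:
    """Toy three-tier LRU: hot and warm are bounded, cold takes the overflow."""

    def __init__(self, hot_capacity: int, warm_capacity: int):
        self.hot = OrderedDict()   # stand-in for VRAM
        self.warm = OrderedDict()  # stand-in for system RAM
        self.cold = {}             # stand-in for SSD
        self.hot_capacity = hot_capacity
        self.warm_capacity = warm_capacity

    def put(self, key, block):
        # A block lives in exactly one tier; new or touched blocks go to hot.
        self.warm.pop(key, None)
        self.cold.pop(key, None)
        self.hot[key] = block
        self.hot.move_to_end(key)
        self._demote()

    def get(self, key):
        for tier in (self.hot, self.warm, self.cold):
            if key in tier:
                block = tier.pop(key)
                self.put(key, block)  # promote on access
                return block
        raise KeyError(key)

    def tier_of(self, key) -> str:
        for name in ("hot", "warm", "cold"):
            if key in getattr(self, name):
                return name
        raise KeyError(key)

    def _demote(self):
        # Push least-recently-used blocks down one tier when a tier overflows.
        while len(self.hot) > self.hot_capacity:
            k, v = self.hot.popitem(last=False)
            self.warm[k] = v
        while len(self.warm) > self.warm_capacity:
            k, v = self.warm.popitem(last=False)
            self.cold[k] = v

cache = TieredCache(hot_capacity=2, warm_capacity=2)
for name in ["a", "b", "c", "d", "e"]:
    cache.put(name, f"kv-block-{name}")
print({k: cache.tier_of(k) for k in ["a", "b", "c", "d", "e"]})
```

Whether this pays off at runtime would depend on how often cold blocks get pulled back, since an SSD read in the decode path is far slower than a VRAM hit.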
Thinking out loud.
Could domain-specific small dense models (SLMs) be trained and then combined together as a MoE?
Almost certainly you could put a manual router model in front, but it would be kind of cool if the whole thing could be consolidated into one model.
Imagine cherry-picking the experts you need for your use case.
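The manual-router version is easy to sketch, which is part of why the consolidated version is the interesting one. Below is a toy front-end that picks an expert by keyword overlap and falls back to a default. All names (`ManualRouter`, the expert labels, the keyword sets) are hypothetical, the experts are stub functions rather than real models, and keyword matching is a placeholder for whatever classifier a real router would use.

```python
from typing import Callable, Dict, Iterable, Set

class ManualRouter:
    """Dispatch a prompt to one registered expert based on keyword overlap."""

    def __init__(self, default: str):
        self.default = default  # must also be registered before routing
        self.experts: Dict[str, Callable[[str], str]] = {}
        self.keywords: Dict[str, Set[str]] = {}

    def register(self, name: str, keywords: Iterable[str], expert: Callable[[str], str]):
        self.experts[name] = expert
        self.keywords[name] = {k.lower() for k in keywords}

    def pick(self, prompt: str) -> str:
        words = set(prompt.lower().split())
        scores = {name: len(words & kws) for name, kws in self.keywords.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else self.default

    def __call__(self, prompt: str) -> str:
        return self.experts[self.pick(prompt)](prompt)

# Cherry-pick only the experts this use case needs (stubs standing in for SLMs).
router = ManualRouter(default="general")
router.register("general", [], lambda p: f"[general] {p}")
router.register("code", ["python", "bug", "function"], lambda p: f"[code] {p}")
router.register("sql", ["select", "table", "query"], lambda p: f"[sql] {p}")

print(router("fix this python bug"))
print(router("write a query over the orders table"))
```

Unlike a learned MoE gate, this routes whole requests rather than individual tokens, so it is closer to a model switcher than to a single consolidated model.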
Strix Halo is vindicated on price/performance. 4x 256GB Mac Studios barely beat it. You can get essentially the same speed on a single Strix Halo at q5 with chadrock.
@0xRaghuboi Idk if this applies to NVIDIA hardware (I have a 3090 but have experimented more with Strix Halo), but I find that compressing the KV cache less is actually faster. If you have the headroom to skip compression, it is often better not to compress at all.
Results may vary fwiw
I am currently running two separate models, split across two GPUs: Qwopus3.6-27B-v2-Q4_K_M with Vulkan. I get better results with Vulkan, but I need to revisit ROCm. Both 7900 XTXs hold their own for developing my projects. MTP is a big help. Not quite the speed of a 3090, but about $400-500 cheaper than a 3090 currently goes for. They are solid GPUs for local models.