Good grief, after weeks of clean room reverse engineering I finally matched IPEX speeds without their binaries, fully local. 55.8 t/s, correct output, 103% of IPEX target with Qwen3 32B. Now I can finally start plugging in Gemma 4 / Qwen3.5.
An update on my 3×Arc A770 build. I was hoping to post about this in celebration right after building it, but I did not understand how fundamentally limited these cards were. I thought "hey 16GB of VRAM is 16GB of VRAM am I right???" while at the same time Claude and Codex are gassing me up, "what a genius PC build Ryan, you are a genius."
TL;DR: I can get up to 54 tps in aggregate with 8 parallel slots at 8-16k context per slot on a 32B dense model. Not bad, right? Well, IPEX dropped support in January (at least for A770s) and Intel never open-sourced their kernels, so while I can technically justify the PC with the Qwen3 pure-transformer dense models, I'm sitting on the sidelines drooling while people bench the newer flash attention GDN models, not to mention the newer architectures that will come later.
The problem with using non-NVIDIA GPUs is they can't efficiently batch kernel dispatches. NVIDIA has CUDA Graphs (record once, replay thousands of launches as a single dispatch at 5-10µs overhead) and CUDA Streams for overlapping compute with memory transfers across multiple GPUs via NVLink. Intel's Level Zero dispatch costs ~86µs per kernel launch. With ~1,100 launches per decode token, 65% of your token time is just the CPU telling the GPU what to do, not actual compute. Both Intel and AMD work around this by fusing multiple operations into monolithic kernels, getting it down to ~250 launches per token. AMD also has HIP Graphs (same concept as CUDA Graphs) and lower base dispatch overhead (~20-30µs). Intel solved it in IPEX with closed-source fused kernels, then archived the repo and the team left Intel. So here I am.
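To make the dispatch math concrete, here is a rough back-of-envelope sketch. It only plugs in the approximate figures quoted above (~86µs per launch, ~1,100 vs ~250 launches per token, 65% overhead share). The function and constant names are made up for illustration, and none of this is a measurement.

```python
# Back-of-envelope check of the dispatch-overhead figures quoted in the post.
# Every input is an approximate number from the post, not a benchmark result.

def dispatch_ms(launches_per_token: int, us_per_launch: float) -> float:
    """CPU-side kernel launch overhead per decode token, in milliseconds."""
    return launches_per_token * us_per_launch / 1000.0

LEVEL_ZERO_US = 86        # ~86 µs per kernel launch on Level Zero (from the post)
UNFUSED_LAUNCHES = 1100   # ~1,100 launches per decode token before fusion
FUSED_LAUNCHES = 250      # ~250 launches per token with fused kernels
OVERHEAD_SHARE = 0.65     # "65% of your token time" spent on dispatch

unfused_ms = dispatch_ms(UNFUSED_LAUNCHES, LEVEL_ZERO_US)
fused_ms = dispatch_ms(FUSED_LAUNCHES, LEVEL_ZERO_US)

# Assumption: the 65% share refers to the unfused path, so the implied
# total decode step time is overhead / share.
implied_step_ms = unfused_ms / OVERHEAD_SHARE

print(f"unfused dispatch: {unfused_ms:.1f} ms/token")
print(f"fused dispatch:   {fused_ms:.1f} ms/token")
print(f"implied step:     {implied_step_ms:.1f} ms")
```

So at these numbers, launch overhead alone is roughly 95 ms per token unfused versus about 21.5 ms fused, which is why kernel fusion matters so much when there is no graph replay to fall back on.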
Every model released in 2026 with hybrid attention (Qwen3.5 of all sizes including the dense 27B, Nemotron, LFM2) is completely broken on Arc because nobody has written the GatedDeltaNet/Mamba SYCL kernels and nobody is going to. Standard transformer attention is becoming the legacy path and that's all my cards can run. Would I recommend this build? Only if you're comfortable being locked into last generation's architectures while the rest of the world moves on. The price-to-VRAM ratio is unbeatable ($450 for 48GB), the performance is genuinely good on models that work, but the software stack is a house of cards. If I were starting over I'd buy used 3090s. The extra $300 buys you the entire model ecosystem instead of reverse-engineering closed-source Intel kernels. But I'm not starting over. Full technical writeup coming soon.
@sudoingX It's clean, and I'm tired of explaining to OpenClaw/Claude Code that a binary is symlinked, or that we need to switch to x branch to do x, or whatever other weird workflow quirk I may be dealing with. Hermes just remembers, zero effort required. Glad I tried it.
@stevibe Oh for sure, and knowing tps + context per slot. I plan to compare the old 32b dense to the new 27b dense with toolcall-15. Trying to get GDN to work on Intel GPUs is pure pain. 😭
@stevibe Awesome repo, thanks for sharing. I've been downloading models from HF and benching them for my tools and somehow didn't even consider openrouter/ollama:cloud to get an effective score before downloading.
Interesting, but possibly wasting VRAM, no? You can run NVMe -> VRAM inference on larger models at 5-6 tps without fancy tensor overrides. With a model that big, you might as well keep only the active experts in VRAM and run a single-GPU instance on each card (you can even run -np 2 and have two slots per card if you have enough VRAM). Then you have ~20 tps in aggregate across 4 agents running at 400B params.
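The aggregate figure is just multiplication, sketched below. The assumptions are two cards (reading "both" literally), two slots per card, and the low end of the quoted per-slot rate. The helper name is hypothetical, and it assumes every slot sustains the same rate, which shared NVMe bandwidth may not allow in practice.

```python
# Rough aggregate-throughput estimate for the setup described above.
# Assumes each slot sustains the same tokens/sec independently.

def aggregate_tps(cards: int, slots_per_card: int, tps_per_slot: float) -> float:
    """Total tokens/sec across all parallel slots on all cards."""
    return cards * slots_per_card * tps_per_slot

CARDS = 2            # assumption: "both" means two GPUs
SLOTS_PER_CARD = 2   # two parallel slots per card
TPS_PER_SLOT = 5.0   # low end of the quoted 5-6 tps

agents = CARDS * SLOTS_PER_CARD
total = aggregate_tps(CARDS, SLOTS_PER_CARD, TPS_PER_SLOT)
print(f"{agents} agents, ~{total:.0f} tps aggregate")
```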
@DaveShapi They are seeing demos that have been filtered retroactively, so of course it doesn't look fully coherent with the art direction. Once studios start designing their games with these features in their hands, they will tweak their design around them and I bet it will be amazing.
Getting close now. Bought an open-air case, then soon after got a kitten. Tricky combo. Now it's just a matter of getting vLLM tuned to keep active experts and most of the KV cache in VRAM, with cold experts in system RAM. Having to do it custom since Arc A770 support has gotten worse recently.