Ryan Stefan

@dashwizzle

Going all in on agent orchestration and connectivity with tools.

Austin, TX

Joined August 2011

476 Following

270 Followers

218 Posts

Ryan Stefan

@dashwizzle

about 2 months ago

Alright, the full gear-swapping pipeline with Mixamo animations is complete. Gear to Glory will have more loot than you can possibly imagine! 😈

Ryan Stefan

@dashwizzle

about 2 months ago

Good grief, after weeks of clean-room reverse engineering I finally matched IPEX speeds without their binaries, fully local. 55.8 t/s, correct output, 103% of the IPEX target with Qwen3 32B. Now I can finally start plugging in Gemma 4 / Qwen3.5.

Ryan Stefan

@dashwizzle

about 2 months ago

@DaveShapi Anyone can generate power. Only one can properly make EUV. A true human bottleneck; right now, collective effort doesn't change that.

114

Ryan Stefan

@dashwizzle

2 months ago

Just dropped a new keyboard concept for future generations of developers. It's minimal, intuitive, and perfectly covers 100% of my daily workflows.

[Photo attached: https://t.co/zxEPM7NuKw]

Ryan Stefan

@dashwizzle

2 months ago

@DesignKoalas Damn reality. Could have him come out of a wet and sticky portal and die after 30 min from the arid temperature. Just an idea.

Ryan Stefan

@dashwizzle

2 months ago

@TheAhmadOsman Yep, they will ignore the template params, go wild thinking, produce gibberish. I tried several and it was a huge waste of time.

362

Ryan Stefan

@dashwizzle

2 months ago

@ShankhadeepSho1 @TheAhmadOsman Fair enough, "NVIDIA's kernels are source-available while Intel's are binary-only" is more accurate.

Ryan Stefan

@dashwizzle

2 months ago

An update on my 3×Arc A770 build. I was hoping to post about this in celebration right after building it, but I did not understand how fundamentally limited these cards were. I thought "hey, 16GB of VRAM is 16GB of VRAM, am I right???" while at the same time Claude and Codex were gassing me up: "what a genius PC build Ryan, you are a genius."
TL;DR: I can get up to 54 tps in aggregate with 8 slots in parallel at 8-16k context per slot with a 32B dense model. Not bad, right? Well, IPEX dropped support in January (at least for A770s) and Intel never open-sourced their kernels, so while I can technically justify the PC with the Qwen3 pure transformer dense models, I'm sitting on the sidelines drooling, watching people bench the newer flash attention GDN models, not to mention newer architectures that will come later.
The problem with using non-NVIDIA GPUs is they can't efficiently batch kernel dispatches. NVIDIA has CUDA Graphs (record once, replay thousands of launches as a single dispatch at 5-10µs overhead) and CUDA Streams for overlapping compute with memory transfers across multiple GPUs via NVLink. Intel's Level Zero dispatch costs ~86µs per kernel launch; with ~1,100 launches per decode token, 65% of your token time is just the CPU telling the GPU what to do, not actual compute. Both Intel and AMD work around this by fusing multiple operations into monolithic kernels, getting it down to ~250 launches per token. AMD also has HIP Graphs (same concept as CUDA Graphs) and lower base dispatch overhead (~20-30µs). Intel solved it in IPEX with closed-source fused kernels, then archived the repo, and the team left Intel.
So here I am. Every model released in 2026 with hybrid attention (Qwen3.5 of all sizes including the dense 27B, Nemotron, LFM2) is completely broken on Arc because nobody has written the GatedDeltaNet/Mamba SYCL kernels and nobody is going to. Standard transformer attention is becoming the legacy path, and that's all my cards can run.
Would I recommend this build? Only if you're comfortable being locked into last generation's architectures while the rest of the world moves on. The price-to-VRAM ratio is unbeatable ($450 for 48GB), the performance is genuinely good on models that work, but the software stack is a house of cards. If I were starting over I'd buy used 3090s. The extra $300 buys you the entire model ecosystem instead of reverse-engineering closed-source Intel kernels. But I'm not starting over. Full technical writeup coming soon.
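
Editor's sketch: the dispatch-overhead arithmetic in the post above can be checked with a few lines of Python. All inputs are the figures quoted in the post; nothing is measured independently. Two things are assumptions added here for illustration and are not stated by the author: that the 65% share refers to the ~1,100-launch (unfused) case, and that one decode step produces one token for each of the 8 parallel slots.

```python
# Figures quoted in the post; nothing here is measured independently.
LEVEL_ZERO_LAUNCH_US = 86   # ~86 µs per kernel launch (Intel Level Zero)
LAUNCHES_UNFUSED = 1_100    # ~1,100 launches per decode token
LAUNCHES_FUSED = 250        # ~250 launches per token after kernel fusion
DISPATCH_SHARE = 0.65       # "65% of your token time" spent on dispatch
PARALLEL_SLOTS = 8          # 8 slots running in parallel


def dispatch_ms(launches: int, launch_us: float) -> float:
    """CPU-side dispatch time for one decode step, in milliseconds."""
    return launches * launch_us / 1_000


unfused_ms = dispatch_ms(LAUNCHES_UNFUSED, LEVEL_ZERO_LAUNCH_US)
fused_ms = dispatch_ms(LAUNCHES_FUSED, LEVEL_ZERO_LAUNCH_US)

# Assumption (not from the post): the 65% share applies to the unfused case.
implied_step_ms = unfused_ms / DISPATCH_SHARE

# Assumption (not from the post): each step yields one token per slot.
implied_aggregate_tps = PARALLEL_SLOTS * 1_000 / implied_step_ms

print(f"unfused dispatch per step: {unfused_ms:.1f} ms")
print(f"fused dispatch per step:   {fused_ms:.1f} ms")
print(f"implied step time:         {implied_step_ms:.1f} ms")
print(f"implied aggregate rate:    {implied_aggregate_tps:.1f} tokens/s")
```

Under those assumptions, 1,100 launches at 86µs come to about 94.6 ms of dispatch per step, fusion to ~250 launches cuts that to about 21.5 ms, and the implied aggregate rate lands near 55 tokens/s, in the same range as the 54 tps the post reports. This is only a consistency check on the quoted numbers, not a description of how the author measured them.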

103

Ryan Stefan

@dashwizzle

2 months ago

@TheAhmadOsman 😂😂

358

Ryan Stefan

@dashwizzle

2 months ago

@sudoingX It's clean, and I'm tired of explaining to OpenClaw/Claude Code that a binary is symlinked, or that we need to switch to x branch to do x, or whatever other weird workflow quirk I may be dealing with. Hermes just remembers, zero effort required. Glad I tried it.

419

Ryan Stefan

@dashwizzle

2 months ago

@DesignKoalas Creepy, love it

Ryan Stefan

@dashwizzle

2 months ago

@stevibe Oh for sure, and knowing tps + context per slot. I plan to compare the old 32b dense to the new 27b dense with toolcall-15. Trying to get GDN to work on Intel GPUs is pure pain. 😭

Ryan Stefan

@dashwizzle

2 months ago

@stevibe Awesome repo, thanks for sharing. I've been downloading models from HF and benching them for my tools and somehow didn't even consider openrouter/ollama:cloud to get an effective score before downloading.

407

Ryan Stefan

@dashwizzle

2 months ago

Interesting, but possibly wasting VRAM, no? You can run NVMe -> VRAM inference on larger models at 5-6 tps without fancy tensor overrides. With a model that big, you might as well keep only the active experts in VRAM and run a single GPU instance on both (you can even run -np-2 and have two per card if you have enough VRAM). Then you have ~20 tps in aggregate across 4 agents running at 400B params.

Ryan Stefan

@dashwizzle

2 months ago

@MiniMax_AI I bought the agent plan thinking it was the token plan, but got fucked. Complete waste of money. So great first impression guys.

177

Ryan Stefan

@dashwizzle

3 months ago

@DaveShapi They are seeing demos that have been filtered retroactively, so of course it doesn't look fully coherent with the art direction. Once studios start designing their games with these features in their hands, they will tweak their design around them and I bet it will be amazing.

197

Ryan Stefan

@dashwizzle

3 months ago

Getting close now. Bought an open-air case, then soon after got a kitten: tricky combo. Now it's just a matter of getting vLLM tuned to keep active experts and most of the KV cache in VRAM, with cold experts in system RAM. Having to do it custom since Arc A770 support has gotten worse recently.
