ML systems engineer building a 100% local AI image+video+sound studio — open-source models, consumer AMD GPU, self-built pipeline, $0 cloud. Own your stack.
@redwooddesigns I just write or speak whatever I want rendered. When I get an image with the look I'm after, I study the prompt's structure and swap key words to change the look, character, or materials. Nice picture, the scene looks very detailed!
This isn’t just “low VRAM mode.”
It’s closer to a tiny GPU operating system for diffusion: residency planning, phase scheduling, PCIe minimization, and warm model reuse. Still early days, but the results are promising. More updates as I push it further. What part of local inference are you fighting with right now? 7/7
Most people treat 16 GB VRAM as a hard wall.
I’m treating it like a managed L1 cache. I’m building a full render-pipeline scheduler for diffusion models on a 16 GB AMD RX 7800 XT. Not just “quantize until it fits” or basic offload.
This is residency decisions, smart streaming, GPU-side dequant, and precision where it actually matters.
🧵
The Flux Schnell case is especially interesting. Right now the Q4 transformer fits, but the fp16 T5 encoder forces constant shuffling. If I quantize the encoder and keep it resident, the transformer can move from Q4 toward Q6 — faster and higher quality. That’s the kind of trade-off this scheduler unlocks. 6/
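A rough back-of-envelope version of that trade-off, as a sketch only. The parameter counts (about 12B for the transformer, about 4.7B for the T5 encoder), the GGUF-style bits per weight, and the 2 GiB headroom for activations and the VAE are all my assumptions for illustration, not measurements from this thread; `size_gib` and `fits` are hypothetical helper names.

```python
GIB = 1024 ** 3

# ASSUMPTIONS (not from the thread): approximate parameter counts and
# GGUF-style bits per weight, which include per-block scale overhead.
PARAMS = {"transformer": 12.0e9, "t5_encoder": 4.7e9}
BITS_PER_WEIGHT = {"fp16": 16.0, "q8": 8.5, "q6": 6.5625, "q4": 4.5}


def size_gib(component: str, precision: str) -> float:
    """Estimated weight footprint of one component in GiB."""
    return PARAMS[component] * BITS_PER_WEIGHT[precision] / 8 / GIB


def fits(plan: dict, budget_gib: float = 16.0, headroom_gib: float = 2.0) -> bool:
    """True if every component in the plan can stay resident at the same time.

    headroom_gib is an assumed reserve for activations, VAE, and the driver.
    """
    total = sum(size_gib(c, p) for c, p in plan.items())
    return total + headroom_gib <= budget_gib


today = {"transformer": "q4", "t5_encoder": "fp16"}
proposed = {"transformer": "q6", "t5_encoder": "q4"}

for name, plan in (("today", today), ("proposed", proposed)):
    total = sum(size_gib(c, p) for c, p in plan.items())
    print(f"{name}: {total:.1f} GiB of weights, both resident: {fits(plan)}")
```

Under these assumed numbers the Q4 transformer plus fp16 encoder does not fit together, while a Q6 transformer plus a quantized encoder does, which is the shape of the argument above. Real footprints depend on the exact checkpoint and quant format.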
This leads to smarter precision choices:
Resident parts → higher precision where possible
Streaming parts → stay quantized to minimize PCIe traffic
Encoder → safe place to save VRAM
Transformer → where image quality lives
VAE → keep fp16
Not “make everything smaller.” Allocate cost where it matters. 5/
Here’s the encoder pass timed three ways on my setup. GPU compute itself is basically free (~0.15s).
Moving the weights is where the real cost lives (24–33s for the encoder transfer). [Attach the residency chart] Compute is solved. Residency is the lever. 4/
For models that still don’t fit, I stream quantized blocks:
Keep weights compressed in CPU RAM
Send compressed block over PCIe
Dequantize on GPU
Run the compute
Discard the expanded fp16 block (no unnecessary round-tripping)
The bottleneck isn’t usually the GPU compute. It’s moving the model. 3/
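The streaming steps above can be sketched in plain Python. This is a toy CPU-only illustration, not my actual GPU kernels: the 32-weight block size, the float32-scale-plus-4-bit layout, and the names `quantize_block`, `dequantize_block`, and `stream_dot` are assumptions made up for the example. The point is only that compressed bytes are what cross the "bus", and the expanded block is thrown away after use.

```python
import struct
from typing import List, Tuple

BLOCK = 32  # weights per block (an assumption; real quant formats vary)


def quantize_block(weights: List[float]) -> bytes:
    """Pack one block as a float32 scale followed by 4-bit codes, two per byte."""
    assert len(weights) == BLOCK
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0
    # Signed codes in -8..7, stored with a +8 offset so they fit in 0..15.
    codes = [max(-8, min(7, round(w / scale))) + 8 for w in weights]
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return struct.pack("<f", scale) + packed


def dequantize_block(blob: bytes) -> List[float]:
    """Expand a packed block back to floats (the step that would run on the GPU)."""
    (scale,) = struct.unpack_from("<f", blob, 0)
    out = []
    for byte in blob[4:]:
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out


def stream_dot(blocks: List[bytes], x: List[float]) -> Tuple[float, int]:
    """Dot product against weights streamed block by block.

    Returns (result, bytes_sent), where bytes_sent counts compressed bytes only.
    """
    total, sent = 0.0, 0
    for i, blob in enumerate(blocks):
        sent += len(blob)                  # only the compressed block is "transferred"
        w = dequantize_block(blob)         # expand on the "device"
        xs = x[i * BLOCK:(i + 1) * BLOCK]
        total += sum(a * b for a, b in zip(w, xs))
        del w                              # expanded block is discarded, never sent back
    return total, sent


weights = [(i % 15 - 7) / 7 for i in range(2 * BLOCK)]
x = [1.0] * len(weights)
blocks = [quantize_block(weights[i:i + BLOCK]) for i in range(0, len(weights), BLOCK)]
result, sent = stream_dot(blocks, x)
fp16_bytes = len(weights) * 2
print(f"sent {sent} compressed bytes instead of {fp16_bytes} fp16 bytes")
```

In this toy layout each block of 32 weights costs 20 bytes on the wire versus 64 bytes as fp16, so the transfer shrinks by about 3.2x while the compute still sees expanded values.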
The normal path (ComfyUI, A1111, Forge, etc.):
Load model → OOM → basic offload or heavy quantization → accept the slowdown. My approach is different. The diffusion pipeline has distinct phases — text encode, transformer/DiT denoise, VAE decode — that don’t all need to be resident at the same time. So I schedule them like an OS: encode → evict → stream/denoise → evict → decode. 2/
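A minimal sketch of that phase loop, assuming made-up footprints and a hypothetical `PhaseScheduler` class (none of these names or numbers come from my real pipeline). It evicts lazily: earlier phases are dropped only when the next one would not fit beside them, so small phases can stay warm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Phase:
    name: str               # e.g. "text_encode"
    vram_gb: float          # hypothetical resident footprint of this phase
    run: Callable[[], str]  # stand-in for the real work


@dataclass
class PhaseScheduler:
    budget_gb: float
    resident: Dict[str, float] = field(default_factory=dict)  # name -> GB on the GPU
    log: List[str] = field(default_factory=list)

    def used(self) -> float:
        return sum(self.resident.values())

    def evict_all(self) -> None:
        for name in list(self.resident):
            self.log.append(f"evict {name}")
            del self.resident[name]

    def run_phase(self, phase: Phase) -> str:
        # Evict earlier phases only if the new one would not fit alongside them.
        if self.used() + phase.vram_gb > self.budget_gb:
            self.evict_all()
        if phase.vram_gb > self.budget_gb:
            # Too big even on its own: this is where block streaming would take over.
            self.log.append(f"stream {phase.name}")
        else:
            self.resident[phase.name] = phase.vram_gb
            self.log.append(f"load {phase.name}")
        return phase.run()


# Footprints below are illustrative assumptions, not measurements.
phases = [
    Phase("text_encode", 9.4, lambda: "embeddings"),
    Phase("denoise", 12.0, lambda: "latents"),
    Phase("vae_decode", 0.3, lambda: "image"),
]
sched = PhaseScheduler(budget_gb=16.0)
outputs = [sched.run_phase(p) for p in phases]
print(sched.log)
```

With these numbers the encoder is evicted before denoising, while the small VAE phase loads next to the transformer without another eviction. A stricter encode → evict → denoise → evict → decode policy would simply call `evict_all()` after every phase.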
@thatcofffeeguy @ASUS @nvidia I would be optimizing that thing for ages lol. I'm working on quant and dequant optimizations and found that smaller quants dequantize slower when they pass over the PCIe bus, while larger quants don't spend as much time transferring over the PCIe bus, so it's a free upgrade. 😂😎