Brian

Verified account

@Brjen

ML systems engineer building a 100% local AI image+video+sound studio — open-source models, consumer AMD GPU, self-built pipeline, $0 cloud. Own your stack.

Ontario, Canada

Joined October 2010

344 Following

524 Followers

9.8K Posts

about 2 hours ago

@redwooddesigns I just write or speak whatever I want rendered. When I get an image that has the look I'm after, I look at the structure and swap key words to change the look, character, or materials. Nice picture, the scene looks very detailed!

0

1

0

0

10

about 5 hours ago

@illusionkrafter Thanks Krafter!

0

0

0

0

1

about 19 hours ago

Brjen's tweet photo. https://t.co/kOgSqM7Afw

1

3

0

0

40

about 16 hours ago

@Novasynthetica

Brjen's tweet photo. https://t.co/i0rgoWLRlF

0

0

0

0

9

about 16 hours ago

This isn’t just “low VRAM mode.” It’s closer to a tiny GPU operating system for diffusion: residency planning, phase scheduling, PCIe minimization, and warm model reuse. Still early days, but the results are promising. More updates as I push it further. What part of local inference are you fighting with right now? 7/7

0

2

0

0

29

about 16 hours ago

Most people treat 16 GB VRAM as a hard wall. I’m treating it like a managed L1 cache. I’m building a full render-pipeline scheduler for diffusion models on a 16 GB AMD RX 7800 XT. Not just “quantize until it fits” or basic offload. This is residency decisions, smart streaming, GPU-side dequant, and precision where it actually matters. 🧵

6

2

0

0

49

about 16 hours ago

The Flux Schnell case is especially interesting. Right now the Q4 transformer fits, but the fp16 T5 encoder forces constant shuffling. If I quantize the encoder and keep it resident, the transformer can move from Q4 toward Q6 — faster and higher quality. That’s the kind of trade-off this scheduler unlocks. 6/
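The trade-off in this post can be sketched as simple budget arithmetic. This is only an illustration: the parameter counts below are placeholder round numbers, not measurements from the thread, the 8-bit encoder is an assumed choice, and Q4/Q6 are treated as a nominal 4 and 6 bits per weight with no allowance for scales, activations, or the VAE.

```python
VRAM_GB = 16.0  # card size stated in the thread

# Placeholder parameter counts for illustration only (assumptions, not measured).
HYPOTHETICAL_ENCODER_PARAMS = 4_700_000_000
HYPOTHETICAL_TRANSFORMER_PARAMS = 12_000_000_000


def weights_gb(params: int, bits_per_weight: float) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes), ignoring quant metadata."""
    return params * bits_per_weight / 8 / 1e9


def plan_total_gb(encoder_bits: float, transformer_bits: float) -> float:
    """Total weight footprint if encoder and transformer are both kept resident."""
    return (weights_gb(HYPOTHETICAL_ENCODER_PARAMS, encoder_bits)
            + weights_gb(HYPOTHETICAL_TRANSFORMER_PARAMS, transformer_bits))


# fp16 encoder with a nominal 4-bit transformer.
current_gb = plan_total_gb(encoder_bits=16, transformer_bits=4)
# Assumed 8-bit encoder with a nominal 6-bit transformer.
proposed_gb = plan_total_gb(encoder_bits=8, transformer_bits=6)

for label, total in (("fp16 encoder + 4-bit transformer", current_gb),
                     ("8-bit encoder + 6-bit transformer", proposed_gb)):
    print(f"{label}: {total:.1f} GB of weights, {VRAM_GB - total:.1f} GB left")
```

Under these placeholder numbers, shrinking the encoder frees more room than the transformer's extra bits consume, which is the shape of the trade-off the post describes.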

0

1

0

0

16

about 16 hours ago

This leads to smarter precision choices:
Resident parts → higher precision where possible
Streaming parts → stay quantized to minimize PCIe traffic
Encoder → safe place to save VRAM
Transformer → where image quality lives
VAE → keep fp16
Not “make everything smaller.” Allocate cost where it matters. 5/
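The precision rules in this post could be written down as a small policy function. This is a sketch, not the author's scheduler: the `Component` type, the component names, and the order in which the rules are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str        # hypothetical label
    role: str        # "encoder", "transformer", or "vae"
    resident: bool   # True if it stays in VRAM, False if streamed over PCIe


def choose_precision(c: Component) -> str:
    """Map a pipeline component to a precision class (rule order is assumed)."""
    if c.role == "vae":
        return "fp16"              # VAE: keep fp16
    if not c.resident:
        return "quantized"         # streamed parts stay small to limit PCIe traffic
    if c.role == "encoder":
        return "quantized"         # encoder: safe place to save VRAM
    return "higher precision"      # resident transformer: where image quality lives


COMPONENTS = [
    Component("text_encoder", "encoder", resident=True),
    Component("transformer_resident", "transformer", resident=True),
    Component("transformer_streamed", "transformer", resident=False),
    Component("vae", "vae", resident=True),
]
PLAN = {c.name: choose_precision(c) for c in COMPONENTS}

for name, precision in PLAN.items():
    print(f"{name}: {precision}")
```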

0

0

0

0

14

about 16 hours ago

Here’s the encoder pass timed three ways on my setup. GPU compute itself is basically free (~0.15s). Moving the weights is where the real cost lives (24–33s for the encoder transfer). [Attach the residency chart] Compute is solved. Residency is the lever. 4/

Brjen's tweet photo. https://t.co/B1NIqBv26A
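Using only the two figures quoted in this post (about 0.15 s of compute, 24 to 33 s of transfer), the imbalance can be worked out directly. The helper name `transfer_share` is made up for this sketch.

```python
COMPUTE_S = 0.15             # encoder GPU compute time quoted in the post
TRANSFER_S = (24.0, 33.0)    # encoder weight transfer range quoted in the post


def transfer_share(transfer_s: float, compute_s: float) -> float:
    """Fraction of the encoder pass spent moving weights rather than computing."""
    return transfer_s / (transfer_s + compute_s)


for t in TRANSFER_S:
    ratio = t / COMPUTE_S
    share = transfer_share(t, COMPUTE_S)
    print(f"transfer {t:.0f}s: {ratio:.0f}x the compute time, {share:.1%} of the pass")
```

On those numbers the transfer costs roughly 160 to 220 times the compute, so nearly all of the encoder pass is spent moving weights.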

0

0

0

0

17

about 16 hours ago

For models that still don’t fit, I stream quantized blocks:
Keep weights compressed in CPU RAM
Send compressed block over PCIe
Dequantize on GPU
Run the compute
Discard the expanded fp16 block (no unnecessary round-tripping)
The bottleneck isn’t usually the GPU compute. It’s moving the model. 3/
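The streaming steps in this post can be mimicked on the CPU with a toy example. This is a minimal sketch under stated assumptions: it uses plain symmetric int8 quantization with one scale per block (the thread does not say which quant format is used), a dot product stands in for the real GPU compute, and every function name is hypothetical.

```python
import math
from array import array


def quantize_block(weights):
    """Symmetric int8 quantization with one float scale per block (assumed format)."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return array("b", (round(w / scale) for w in weights)), scale


def dequantize_block(q, scale):
    """Stand-in for a GPU-side dequant kernel: expand int8 back to floats."""
    return [v * scale for v in q]


def stream_blocks(cpu_blocks, x):
    """Process blocks one at a time, moving only the compressed form."""
    outputs, bytes_moved = [], 0
    for q, scale in cpu_blocks:                  # weights stay compressed in "CPU RAM"
        bytes_moved += len(q) * q.itemsize + 4   # int8 payload plus one fp32 scale
        w = dequantize_block(q, scale)           # expand only at the point of use
        outputs.append(sum(wi * xi for wi, xi in zip(w, x)))  # stand-in compute
        del w                                    # expanded block is dropped, never sent back
    return outputs, bytes_moved


N = 64
raw_blocks = [[math.sin(i + k) for i in range(N)] for k in range(2)]
x = [1.0] * N
cpu_blocks = [quantize_block(b) for b in raw_blocks]

outputs, bytes_moved = stream_blocks(cpu_blocks, x)
exact = [sum(b) for b in raw_blocks]
fp16_bytes = len(raw_blocks) * N * 2             # same blocks sent as fp16 instead

print(f"bytes moved compressed: {bytes_moved}, as fp16: {fp16_bytes}")
print(f"max abs error vs exact: {max(abs(a - e) for a, e in zip(outputs, exact)):.4f}")
```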

0

0

0

0

15

about 16 hours ago

The normal path (ComfyUI, A1111, Forge, etc.): Load model → OOM → basic offload or heavy quantization → accept the slowdown. My approach is different. The diffusion pipeline has distinct phases — text encode, transformer/DiT denoise, VAE decode — that don’t all need to be resident at the same time. So I schedule them like an OS: encode → evict → stream/denoise → evict → decode. 2/
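The encode → evict → denoise → evict → decode idea can be sketched as a loop that holds one phase in VRAM at a time. This is an illustration only: the `Phase` type, the `schedule` function, and the per-phase sizes are hypothetical, and a real scheduler would also handle streaming and warm reuse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    vram_gb: float   # hypothetical footprint, not a measurement


def schedule(phases, budget_gb):
    """Run phases one at a time, evicting each before loading the next."""
    steps, peak_gb = [], 0.0
    for phase in phases:
        if phase.vram_gb > budget_gb:
            # A phase this large would need block streaming instead.
            raise MemoryError(f"{phase.name} needs {phase.vram_gb} GB, budget is {budget_gb} GB")
        steps.append(f"load {phase.name}")
        peak_gb = max(peak_gb, phase.vram_gb)
        steps.append(f"run {phase.name}")
        steps.append(f"evict {phase.name}")
    return steps, peak_gb


PHASES = [Phase("text_encode", 9.0), Phase("denoise", 12.0), Phase("vae_decode", 1.0)]
steps, peak_gb = schedule(PHASES, budget_gb=16.0)
all_resident_gb = sum(p.vram_gb for p in PHASES)

print(" -> ".join(steps))
print(f"peak with phase scheduling: {peak_gb} GB; all resident at once: {all_resident_gb} GB")
```

With these made-up sizes, the phases would not fit in 16 GB together, but the peak drops to the largest single phase once they are scheduled one after another.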

0

0

0

0

29

about 17 hours ago

@thatcofffeeguy @ASUS @nvidia I would be optimizing that thing for ages lol. I'm working on quant and dequant optimizations and found that smaller quants dequantize slower if they pass over the PCIe bus, while larger ones don't spend as much time transferring over the PCIe bus, so it's a free upgrade.😂😎

0

0

0

0

14

about 17 hours ago

@RobotCleopatra

Brjen's tweet photo. https://t.co/sg6fAqTWO0

0

0

0

0

10

about 18 hours ago

@davidavenueai Yeah that's an epic cataclysm happening! 😎

0

1

0

0

6

about 20 hours ago

Brjen's tweet photo. https://t.co/O6ONEISIUV

1

10

2

0

77

Brjen retweeted

InfinityBrushAI @InfinityBrushAI

1 day ago

#mimiめる Theme: [Depth]. Prompt words to include: [foreground and depth]

InfinityBrushAI's tweet photo. https://t.co/7UukLHyn8A

3

108

7

14

1K

about 19 hours ago

@InfinityBrushAI 😮 wow that's impressive!

1

1

0

0

35

about 19 hours ago

Brjen's tweet photo. https://t.co/oB6tVaXO7w

0

6

0

1

35

about 20 hours ago

@vextral_vex 👏😎 oh this just gets better and better!

1

2

0

0

21

about 20 hours ago

@RobotCleopatra

Brjen's tweet photo. https://t.co/deO1vPikPE

0

2

0

0

17
