Forgewright. Building systems that survive reality.
NEXUS • Villnave's Law • MemoryOS • SDI Runtime
Local inference. No hype. Honest progress.
📍The Forge HQ
Dense models on CPU aren't compute-bound.
They're memory-residency problems.
SDI v0.1.1 shows policy selection matters more than context volume.
Well, I downloaded Hermes to my MacBook as a trial run. I like it enough that I'm building a new agent on the OptiPlex I'll be calling Optimus. Hopefully I'll have fewer issues with Hermes. I've been a hardcore OpenClaw fan, but I don't know… that harness just seems to be letting me down recently.
I might just fuck around and build my own harness. I already have one started that I call “vault brain.” Might have to fuck around and blow the dust off it.
Lab update from The Forge:
Still deep in the CPU inference rabbit hole.
The current experiment: Can a low-bit base model stay resident in memory while compact residual “sidecars” get paged in only when needed?
In plain English: keep the base model in RAM, store the correction layers separately, and load, decode, and apply only the useful pieces. The hope is higher effective quality without keeping the full higher-precision model resident.
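A toy sketch of that idea in pure Python, to make the moving parts concrete. Everything here is an assumption for illustration: the names (`quantize`, `SidecarPager`, `effective_weights`), the coarse rounding grid standing in for a low-bit base, and the dict standing in for on-disk sidecars. It is not the SDI runtime and makes no speed or quality claims.

```python
from collections import OrderedDict


def quantize(weights, step=0.5):
    """Round each weight to a coarse grid (a stand-in for a low-bit base model)."""
    return [round(w / step) * step for w in weights]


class SidecarPager:
    """Tiny LRU cache over residual sidecars that live outside the resident set."""

    def __init__(self, store, capacity=2):
        self.store = store            # layer name -> residual list (pretend this is disk)
        self.capacity = capacity      # max number of sidecars held in memory
        self.cache = OrderedDict()
        self.loads = 0                # how many times we had to go to the store

    def fetch(self, layer):
        if layer in self.cache:
            self.cache.move_to_end(layer)   # mark as most recently used
            return self.cache[layer]
        self.loads += 1
        residual = self.store[layer]
        self.cache[layer] = residual
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used sidecar
        return residual


def effective_weights(base, pager, layer, use_sidecar):
    """Return base weights, optionally corrected by the paged-in residual."""
    if not use_sidecar:
        return list(base)
    residual = pager.fetch(layer)
    return [b + r for b, r in zip(base, residual)]


# Hypothetical two-layer "model": full precision, coarse base, and residuals.
full = {"layer0": [0.12, -0.71, 0.33], "layer1": [0.9, 0.05, -0.26]}
base = {name: quantize(w) for name, w in full.items()}
store = {name: [w - b for w, b in zip(full[name], base[name])] for name in full}

pager = SidecarPager(store, capacity=1)
corrected = effective_weights(base["layer0"], pager, "layer0", use_sidecar=True)
uncorrected = effective_weights(base["layer0"], pager, "layer0", use_sidecar=False)
```

The base stays resident the whole time; only the residual for the requested layer is pulled in, and the cache capacity caps how much correction data is ever held at once.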
Not claiming speedups, quality parity, or “30B on a toaster.”
Current work is boring-but-important systems plumbing: packed .trit sidecar format, Python/C++ decode parity, pager + manifest system, runtime hook wiring, decode-first safety, layer activation, cached decoded buffers, GGML graph materialization probes.
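For anyone wondering what a packed ternary format can look like, here is a minimal sketch. The layout is an assumption, not the actual .trit sidecar format: it packs five values from {-1, 0, 1} into one byte as base-3 digits (3^5 = 243 fits in a byte), and `pack_trits` / `unpack_trits` are hypothetical names. A round-trip check like the one at the bottom is the kind of thing a Python/C++ decode parity test would compare against.

```python
def pack_trits(trits):
    """Pack values from {-1, 0, 1} into bytes, 5 trits per byte (3**5 = 243 <= 256)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        chunk = trits[i:i + 5]
        value = 0
        # Build the byte so that the first trit ends up as the lowest base-3 digit.
        for t in reversed(chunk):
            if t not in (-1, 0, 1):
                raise ValueError(f"not a trit: {t!r}")
            value = value * 3 + (t + 1)
        out.append(value)
    return bytes(out)


def unpack_trits(data, count):
    """Decode `count` trits from packed bytes; padding in the last byte is dropped."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]


original = [1, -1, 0, 0, 1, -1, -1]
packed = pack_trits(original)
decoded = unpack_trits(packed, len(original))
```

The element count has to be stored somewhere alongside the bytes (a manifest is one option), because the last byte is padded.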
Latest wall: The sidecar can be fetched, decoded, cached, and shadow-computed during generation. But true injection into the model graph needs the correct GGML tensor materialization path. You can’t just shove decoded floats into a graph tensor before the backend gives it real memory.
That’s the current battle.
Still early and very breakable, but no longer just an idea. It’s becoming a real question: Can model precision be paged like memory?
That’s exactly the thing I’ve been trying to avoid: “technically generated” being mistaken for “actually viable.”
A lot of local AI tests quietly drift into benchmarking swap behavior, page cache luck, or recovery latency instead of the inference path itself.
That’s why I started adding explicit tripwires and staged failure gates. If the box silently changes regimes mid-run, the result becomes muddy fast.
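A minimal sketch of one such tripwire, assuming a Linux box where `/proc/vmstat` exposes the `pswpin` and `pswpout` counters. The function and class names are hypothetical, and the counter reader is passed in so the gate can be exercised without touching the real system. It is not the tripwire code used in the lab.

```python
def parse_swap_counters(vmstat_text):
    """Pull the swap-in / swap-out page counters out of /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def read_proc_vmstat():
    """Linux-only reader; other platforms need their own counter source."""
    with open("/proc/vmstat") as fh:
        return parse_swap_counters(fh.read())


class RegimeChanged(RuntimeError):
    """Raised when the run touched swap and the measurement can no longer be trusted."""


def run_with_tripwire(workload, read_counters, max_swap_pages=0):
    """Run `workload` and fail loudly if swap activity exceeded the allowed budget."""
    before = read_counters()
    result = workload()
    after = read_counters()
    delta = sum(after.get(k, 0) - before.get(k, 0) for k in ("pswpin", "pswpout"))
    if delta > max_swap_pages:
        raise RegimeChanged(f"{delta} pages swapped during run; result is suspect")
    return result
```

On a real run this would be called as `run_with_tripwire(my_inference_fn, read_proc_vmstat)`, where `my_inference_fn` is whatever callable wraps the generation step.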
Right now I’m less interested in proving “one clean demo” and more interested in mapping:
where the system breaks
what kind of break it is
and whether the runtime recovers cleanly after pressure.
The interesting part to me is that some of the newer paged-residency simulations suggest memory may not actually be the primary blocker anymore. Runtime correctness and compute behavior might be the harder wall now.
Dense models on CPU are not just compute-heavy.
They are memory-residency problems.
Weights have to fit.
KV cache grows with context.
RAM bandwidth becomes the wall.
And once the system hits swap, every later inference run can become misleading.
That is the problem SDI is trying to attack.
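Some back-of-envelope arithmetic for the KV cache and bandwidth points above. The formulas are the standard rough estimates; the model shape and the bandwidth figure are made-up example numbers, not measurements from the lab, and the helper names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of the KV cache: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def decode_tokens_per_second_ceiling(weight_bytes, bandwidth_bytes_per_s):
    """Rough upper bound for a dense model: every weight is read once per generated token."""
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache entries.
kv_4k = kv_cache_bytes(32, 8, 128, 4096)
kv_8k = kv_cache_bytes(32, 8, 128, 8192)

# Hypothetical box: 4 GB of resident weights, 40 GB/s of usable RAM bandwidth.
ceiling = decode_tokens_per_second_ceiling(4e9, 40e9)

print(f"KV cache at 4k context: {kv_4k / 2**20:.0f} MiB")
print(f"KV cache at 8k context: {kv_8k / 2**20:.0f} MiB")
print(f"Bandwidth-bound decode ceiling: {ceiling:.1f} tokens/s")
```

The KV cache doubles when the context doubles, and the decode ceiling depends only on how many bytes have to stream through RAM per token, which is why residency decisions show up directly in throughput.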