Quit Designing Data-Intensive Applications 3 times.
Built TextStack on the 4th — a reader where you select any technical phrase and get a 2-3 sentence explanation in your native language, with the book's domain in mind.
https://t.co/VXAo9lLNzR
@noguchis This gate pattern holds up in prod — we run gemma4:e2b on a CPU-only VPS as exactly that pre-screen layer: p95 ~20ms over a 63k-request load test. Key is keeping the local model small enough that the gate adds no latency — e4b buckled under concurrency, e2b held.
@faradaymachines Privacy win is real, but you inherit whatever model the browser ships — no swapping for the task. And once every tab wants inference at once, contention becomes the bottleneck, not capability. Does Canary sandbox that per-tab?
@ollies0x The 40-month math assumes steady usage. Burst workloads leave the 3090 idle 90% of the time and cloud wins on $/used-hour. Third path nobody talks about: small-model on a CPU-only VPS -- kills the GPU-vs-cloud debate for a wide band of use cases.
@aterrel@anacondainc Spark DGX as a DJ box is a fun flex. What's the inference path driving the set -- vLLM, Triton, something custom? Streaming generation per-track or pre-rendering the playlist? Curious where the latency budget actually lives.
@MoureDev Solid stack. We landed on gemma4:e2b in prod on a 30 GB CPU VPS (no GPU), p95 ~20ms over a 63k-request load test. The e4b->e2b downgrade only became obvious under concurrent load. Does the workshop touch model-size vs hardware tradeoff?
@asiokun3 The base_url swap is the killer feature. We did the same — only gotcha was that some libs hardcode the `gpt-` prefix in model names for routing. Did you hit anything similar, or did your existing app accept a custom model id cleanly?
@RahulGangwani24 The "consumer hardware" part hides the real cost: latency. On CPU-only laptops, DeepSeek Coder takes 5–15s per completion vs Copilot's <1s. Great for batch refactor, painful for inline autocomplete. Where are you actually using it day-to-day — chat or completions?
@TheWordWeaver_ That looks like a framer-motion ESM import issue — Replit forgives loose CJS/ESM, Vercel doesn't. Pin framer-motion to a known-good version and make sure no `import` is missing a file extension (`.tsx` in your src/components/... lines). Cost me 4 hours last month.
@MozillaAI Curious where the 4B started to wobble on state tracking — we run gemma4:e2b in prod on a CPU VPS for short-form generation and it holds up, but the long-context CPU story still feels brittle. Does llamafile change that vs plain Ollama, or is it mostly packaging?
Full write-up with the load-log spelunking + a reproduction script:
https://t.co/WC16xT2WcF
Open-source reader app this powers (AGPL):
https://t.co/1WEya8T90y
Everyone benchmarks Ollama GPU offload and posts "10x!"
I tried it on a 4 GB laptop GPU with Gemma 4.
Got 2.5x.
Here's why the math doesn't let you hit 10x on a small card
Hybrid inference is gated by the slower device.
If one critical layer stays on CPU, that's where the ceiling lives — no matter how fast the GPU does the other 35.
2.5x isn't bad though. It's the gap between "is this hanging?" and "yeah, it's working."
@simonw Same number 😂 — we run gemma4:e2b production on a 30GB CPU VPS (no GPU). 6+ Claude processes ×~5GB each = your full memory squatted. Is there an idle-process eviction setting, or is reservation always max?