Vasyl Vdovychenko @rexetdeus - Twitter Profile

Pinned Tweet

27 days ago

Quit Designing Data-Intensive Applications 3 times. Built TextStack on the 4th — a reader where you select any technical phrase and get a 2-3 sentence explanation in your native language, with the book's domain in mind. https://t.co/VXAo9lLNzR

1

0

1

0

143

Vasyl Vdovychenko @Rexetdeus

17 days ago

@noguchis This gate pattern holds up in prod — we run gemma4:e2b on a CPU-only VPS as exactly that pre-screen layer: p95 ~20ms over a 63k-request load test. Key is keeping the local model small enough that the gate adds no latency — e4b buckled under concurrency, e2b held.

0

1

0

41

Vasyl Vdovychenko @Rexetdeus

17 days ago

@faradaymachines Privacy win is real, but you inherit whatever model the browser ships — no swapping for the task. And once every tab wants inference at once, contention becomes the bottleneck, not capability. Does Canary sandbox that per-tab?

2

0

19

Vasyl Vdovychenko @Rexetdeus

22 days ago

@ollies0x The 40-month math assumes steady usage. Burst workloads leave the 3090 idle 90% of the time and cloud wins on $/used-hour. Third path nobody talks about: small-model on a CPU-only VPS -- kills the GPU-vs-cloud debate for a wide band of use cases.

1

0

14

Vasyl Vdovychenko @Rexetdeus

22 days ago

@aterrel @anacondainc Spark DGX as a DJ box is a fun flex. What's the inference path driving the set -- vLLM, Triton, something custom? Streaming generation per-track or pre-rendering the playlist? Curious where the latency budget actually lives.

1

0

16

Vasyl Vdovychenko @Rexetdeus

22 days ago

@MoureDev Solid stack. We landed on gemma4:e2b in prod on a 30 GB CPU VPS (no GPU), p95 ~20ms over a 63k-request load test. The e4b->e2b downgrade only became obvious under concurrent load. Does the workshop touch model-size vs hardware tradeoff?

0

57

Vasyl Vdovychenko @Rexetdeus

23 days ago

@asiokun3 The base_url swap is the killer feature. We did the same — only gotcha was that some libs hardcode the `gpt-` prefix in model names for routing. Did you hit anything similar, or did your existing app accept a custom model id cleanly?

0

17

Vasyl Vdovychenko @Rexetdeus

23 days ago

@RahulGangwani24 The "consumer hardware" part hides the real cost: latency. On CPU-only laptops, DeepSeek Coder takes 5–15s per completion vs Copilot's <1s. Great for batch refactor, painful for inline autocomplete. Where are you actually using it day-to-day — chat or completions?

1

0

17

Vasyl Vdovychenko @Rexetdeus

23 days ago

@TheWordWeaver_ That looks like a framer-motion ESM import issue — Replit forgives loose CJS/ESM, Vercel doesn't. Pin framer-motion to a known-good version and make sure no `import` is missing a file extension (`.tsx` in your src/components/... lines). Cost me 4 hours last month.

1

0

11

Vasyl Vdovychenko @Rexetdeus

23 days ago

@MozillaAI Curious where the 4B started to wobble on state tracking — we run gemma4:e2b in prod on a CPU VPS for short-form generation and it holds up, but the long-context CPU story still feels brittle. Does llamafile change that vs plain Ollama, or is it mostly packaging?

0

10

Vasyl Vdovychenko @Rexetdeus

24 days ago

@Samya_shine hello

0

1

Vasyl Vdovychenko @Rexetdeus

25 days ago

@mrtbrglMB hello

0

1

0

8

Vasyl Vdovychenko @Rexetdeus

25 days ago

@Christiiana0 hey

0

6

Vasyl Vdovychenko @Rexetdeus

25 days ago

Full write-up with the load-log spelunking + a reproduction script: https://t.co/WC16xT2WcF Open-source reader app this powers (AGPL): https://t.co/1WEya8T90y

0

14

Vasyl Vdovychenko @Rexetdeus

25 days ago

Everyone benchmarks Ollama GPU offload and posts "10x!" I tried it on a 4 GB laptop GPU with Gemma 4. Got 2.5x. Here's why the math doesn't let you hit 10x on a small card

1

0

22

Vasyl Vdovychenko @Rexetdeus

25 days ago

Hybrid inference is gated by the slower device. If one critical layer stays on CPU, that's where the ceiling lives — no matter how fast the GPU does the other 35. 2.5x isn't bad though. It's the gap between "is this hanging?" and "yeah, it's working."

1

0

14

Vasyl Vdovychenko @Rexetdeus

25 days ago

@Gooddlovee Hi

0

3

Vasyl Vdovychenko @Rexetdeus

25 days ago

@ChainChaserVN hi

0

3

Vasyl Vdovychenko @Rexetdeus

26 days ago

@simonw Same number 😂 — we run gemma4:e2b production on a 30GB CPU VPS (no GPU). 6+ Claude processes ×~5GB each = your full memory squatted. Is there an idle-process eviction setting, or is reservation always max?

0

313

Vasyl Vdovychenko

@Rexetdeus

Last Seen Users on Sotwe

Trends for you

Most Popular Users