I calculated roughly what I'd need to go fully local with my project: 2 @NVIDIAAI DGX Sparks.
For now 24GB RAM will have to do haha! But I also believe local models will become smaller, better and perhaps even a commodity. The apps we build on top (output) and the energy that goes in (input) might become the true value!
What would you run if you had 2x DGX Spark (and 1x Mac mini 24GB…) and wanted to go 100% local? A multi-agent system that needs to run computations and cross-references 24/7 on an ever-growing vectorized data pool. Nemoclaw strongly preferred.
- Yes, I'm relatively new to it
- Yes, I first got the Mac mini and learned a lot (hence the upgrade)
- Yes, I need some help here.
@witcheer @AlexFinn I just added Qwen3.6-plus-preview for non-sensitive prompts and crons (research pulls in my case); previously I ran these on Haiku or Sonnet. Moving on! https://t.co/qs12WaN7lx
Just cut my agent's cloud costs significantly without sacrificing quality.
Thanks to @witcheer and @AlexFinn, I started building from your posts!
The journey:
Started on Mac mini M4 (24GB…) running Claude Haiku/Sonnet for all background tasks, ~60 API calls/day.
Tried Qwen3.5-35B-A3B via Ollama first. 23GB model. OOM. Killed.
Tried Qwen2.5-14B. Fit fine, but reasoning quality too weak for my workflow.
Then TurboQuant dropped from @GoogleResearch: KV cache compression, 4.6× smaller memory footprint!
My new stack:
Qwen3.5-27B-IQ3_XXS (10.7GB) via llama.cpp TurboQuant fork → 13.6 tok/s on M4 Pro → zero API cost.
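A rough way to see why the 23GB model got killed while the 10.7GB one fits: add the weights to the KV cache and compare against usable RAM. This is only a back-of-the-envelope sketch. The layer count, KV head count, head dimension, context length and the 8GB reserved for macOS are all made-up assumptions, not the real Qwen3.5 architecture, and the 4.6× factor is assumed to apply to the KV cache only.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Uncompressed KV cache size in GB (keys + values, all layers, all tokens)."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 1e9


def memory_check(weights_gb, kv_gb, ram_gb, reserved_gb, kv_compression=1.0):
    """Return (needed_gb, fits) for weights plus a possibly compressed KV cache."""
    needed_gb = weights_gb + kv_gb / kv_compression
    return needed_gb, needed_gb <= ram_gb - reserved_gb


# Hypothetical architecture and context length, only to make the arithmetic concrete.
kv = kv_cache_gb(n_layers=64, n_kv_heads=8, head_dim=128, context_len=32768)

RAM_GB = 24        # Mac mini from the post
RESERVED_GB = 8    # assumed headroom for macOS and other processes

for label, weights_gb, ratio in [
    ("35B-A3B, 23GB file", 23.0, 1.0),
    ("27B IQ3_XXS, plain KV cache", 10.7, 1.0),
    ("27B IQ3_XXS, 4.6x smaller KV cache", 10.7, 4.6),
]:
    needed, ok = memory_check(weights_gb, kv, RAM_GB, RESERVED_GB, ratio)
    print(f"{label}: {needed:.1f} GB needed, fits={ok}")
```

With these placeholder numbers only the last row fits. Swap in the real model config and your own context length before trusting the result.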
Validation before shipping:
- 15 tool-use scenarios: 12 pass, 3 time out on error recovery
- Shadow test against Haiku on live data: output indistinguishable
- The 3 timeouts are not a quality issue: a lot of crons can run overnight with a higher timeout
My full model stack today:
🟢 Local (free)
- Qwen3.5-27B IQ3_XXS: health monitoring, training load, environmental logging. (Based on multimodal biomarkers, keen on getting that data vectorized!)
- MedGemma 4B: offline domain-specific model
🔵 Anthropic
- Claude Sonnet: interactive sessions, memory synthesis
- Claude Haiku: research briefings, clinical alerts, complex reasoning chains
🟡 Google
- Gemini 2.5 Pro: fallback when Anthropic unavailable
⚫ DeepSeek
- DeepSeek V3.2: cost-efficient tasks when applicable
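The stack above is basically a routing table with a fallback. Here is a minimal sketch of that idea, assuming a hypothetical `call_model(backend, prompt)` client and a hypothetical `ProviderError`. The backend labels are illustrative, not real model IDs, and this is not necessarily how my agent is wired.

```python
class ProviderError(Exception):
    """Raised by the (hypothetical) client when a provider cannot serve a request."""


# Task -> ordered list of backends to try. Labels are illustrative, not real model IDs.
ROUTES = {
    "health_monitoring": ["local-qwen"],
    "interactive_session": ["claude-sonnet", "gemini-pro"],
    "research_briefing": ["claude-haiku", "gemini-pro"],
}


def run_task(task, prompt, call_model):
    """Try each backend for the task in order; return (backend, output) from the first success."""
    failures = []
    for backend in ROUTES[task]:
        try:
            return backend, call_model(backend, prompt)
        except ProviderError as exc:
            failures.append(f"{backend}: {exc}")
    raise ProviderError("all backends failed: " + "; ".join(failures))


def fake_call_model(backend, prompt):
    """Stand-in client that simulates an Anthropic outage."""
    if backend.startswith("claude"):
        raise ProviderError("unavailable")
    return f"[{backend}] {prompt}"


print(run_task("research_briefing", "summarize today's papers", fake_call_model))
```

Keeping the order in one table means adding or swapping a model is a one-line change.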
The "local = cheap but dumb" assumption is breaking down fast fast faster!
It's probably already outdated, it goes so fast! Curious to see it get smarter and cheaper every day 🥳
@VadimStrizheus Ok. This already changed.... just added Qwen3.6-plus-preview (free tier). Nothing sensitive or private. I run my research crons and prompts through it now instead of Sonnet 4.6
https://t.co/qs12WaN7lx
Things move fassssst
@VadimStrizheus I'd do Qwen 27, not 35, to be honest. With 35 you won't have enough headroom. And run TurboQuant on top to compress the KV cache. Then make sure to have some cloud API for when you really need it
Things really do go fast!
Just set all my online research crons and prompts to run on Qwen3.6 Plus Preview (free tier) on OpenRouter.
No personal data, no MEMORY.md, no strategy, nothing sensitive because Alibaba collects prompts on the free tier.
Check it out here: https://t.co/0lfGTovTtw
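The "nothing sensitive on the free tier" rule can be enforced in code instead of by memory. This is a minimal sketch under my own assumptions: `Job`, `pick_backend`, and the `"local"` / `"free-tier"` labels are hypothetical names, and jobs default to sensitive so that a forgotten flag fails closed.

```python
from dataclasses import dataclass, field

# Files that must never leave the machine (MEMORY.md is the one named in the post).
PRIVATE_FILES = {"MEMORY.md"}


@dataclass
class Job:
    name: str
    prompt: str
    sensitive: bool = True  # fail closed: private unless explicitly marked otherwise
    files: list = field(default_factory=list)


def pick_backend(job):
    """Return 'local' for anything private, 'free-tier' only for explicitly public jobs."""
    touches_private_file = any(f.rsplit("/", 1)[-1] in PRIVATE_FILES for f in job.files)
    if job.sensitive or touches_private_file:
        return "local"
    return "free-tier"


research = Job("research-cron", "latest papers on HRV", sensitive=False)
print(pick_backend(research))
```

An explicit per-job flag is crude, but it is easier to audit than trying to detect sensitive text inside a prompt.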