The back cover of Lithium Confidential. Hopefully I have the printer's first copy to approve by the end of next week. Episode 235 of the podcast will be online tonight at 8:01 pm Eastern. A solo episode.
Some of the OG local LLM guys are stuck in the past.
I still see takes saying quantized models are bad.
No, new methods like DQ, TQ, and dynamic quants cut the quality loss.
You can shrink a model several times over and keep the loss under 1%.
Big model quantized > tiny model in BF16. It's not even a debate.
The local LLM scene a year ago vs today is a completely different universe.
Honestly, you can't even compare it to a month ago.
We have dozens of model options for every size now.
The dark ages of running weak early models that felt like a joke compared to cloud APIs are over.
Time to update your priors, accept the new meta, and admit your old takes might be wrong.
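To put rough numbers on the "big model quantized > tiny model in BF16" point, here is a back-of-envelope sketch of weight memory only. The model sizes (32B and 8B) and the ~4.5 bits per weight figure (4-bit values plus an allowance for scales) are assumptions picked for illustration, not measurements, and the arithmetic says nothing about output quality.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Hypothetical comparison: sizes and bit widths are illustrative assumptions.
big_quantized_gb = weight_memory_gb(32, 4.5)  # 32B model at ~4.5 bits/weight
small_bf16_gb = weight_memory_gb(8, 16)       # 8B model at BF16 (16 bits/weight)
shrink_factor = 16 / 4.5                      # BF16 size divided by quantized size

print(f"32B @ ~4.5 bpw: {big_quantized_gb:.1f} GB")
print(f"8B  @ BF16:     {small_bf16_gb:.1f} GB")
print(f"shrink factor:  {shrink_factor:.2f}x")
```

Under these assumptions the two land in a similar memory budget, which is the comparison the post is making: for the same RAM you can fit a model with about 4x the parameters.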
if you maintain an mlx inference engine, pls check out https://t.co/QwSH3DPkCz - the fixes for dsv4 runtime are there with the prefix/paged cache working for the HSA/CSA/SWA attention combo. use this to get dsv4 coherent and working in ur engine.
@hornsby_andrew @dealignai I will be running this soon. Curious to know what context size you're able to get? Are you using TurboQuant for the KV cache as well?
I built an AI app for iPhone where the AI runs 100% on your phone. Agent capabilities being built in v2.
No cloud. No account. No subscription. Works in airplane mode. Has options
13+ models. Built-in tools. MCP support.
$2.99 one-time.
Here's what it actually does 🧵
https://t.co/7eYYjUeb8L
App Store (iPhone + iPad): https://t.co/2RmlwDhN8R
$2.99 one-time. No subscription. Works offline.
Happy to answer anything about how it works.
Why I built it (besides just for fun)
Every AI app on my phone was sending my prompts somewhere else.
I wanted an AI that could actually do useful things - read my PDFs, run code,
search the web when needed, translate on-device - without handing all of that to a server.
Privacy:
NEVER leaves your device:
- LLM inference (runs on your iPhone's chip)
- OCR, translation, code interpreter, file ops, clipboard
- All 13 models after download
Uses network only when you explicitly invoke:
- Web search / web fetcher
- Your own cloud subscriptions
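
The split in the two lists above (on-device by default, network only on explicit request) can be sketched as a small gating rule. This is not the app's actual code: the tool names, the `may_run` helper, and the deny-by-default choice for unknown tools are all hypothetical, shown in Python only for illustration.

```python
# Hypothetical tool identifiers, grouped the way the lists above group them.
LOCAL_TOOLS = {"ocr", "translation", "code_interpreter", "file_ops", "clipboard"}
NETWORK_TOOLS = {"web_search", "web_fetcher", "cloud_subscription"}


def may_run(tool: str, user_invoked: bool) -> bool:
    """Return True if the tool is allowed to run for this request."""
    if tool in LOCAL_TOOLS:
        return True              # never touches the network, always allowed
    if tool in NETWORK_TOOLS:
        return user_invoked      # network use only on explicit user request
    return False                 # unknown tools denied (an assumption)


print(may_run("ocr", user_invoked=False))
print(may_run("web_search", user_invoked=False))
print(may_run("web_search", user_invoked=True))
```

The point of a rule like this is that the model cannot reach the network on its own initiative; only a user action flips `user_invoked`.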
Many models across 6 families:
Gemma 4 / Llama / Qwen / Mistral / DeepSeek / Phi
Small models (1B–3B) for speed.
Mid-size (4B–7B) for “reasoning.”
No per-model unlocks. $2.99 gets you everything.
MCP support for connecting external tools if you want to go deeper.
The tools are the part most local LLM apps don't have.
SoloLLM:
📄 PDF reader + document reader
✍️ File writer + extraction
🌐 Web search
💻 Code interpreter
📊 Data analyzer
🖼️ OCR
📋 Clipboard
🌍 Translation
🔗 Siri Bridge
📁 Files app integration