Your LLM just recomputed 30,000 tokens it already did 5 minutes ago.
Every cache eviction = full prefill from scratch. 10–30 seconds. Again.
tierKV fixes this. Here's how 🧵
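A rough sketch of the general idea, not tierKV's actual code: when the fast tier fills up, demote a prefix's KV blocks to a slower tier instead of dropping them, then promote them back on the next hit. `TieredKVCache`, `hot_capacity`, and the dict-backed cold tier are hypothetical names made up for this illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Two-tier LRU: evictions from the hot tier are demoted, not discarded."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # stands in for GPU/unified memory
        self.cold = {}             # stands in for a slower tier (RAM, SSD)
        self.prefills = 0          # how many times we had to recompute

    def get(self, prefix_key, compute_kv):
        if prefix_key in self.hot:
            self.hot.move_to_end(prefix_key)      # mark as most recently used
            return self.hot[prefix_key]
        if prefix_key in self.cold:
            kv = self.cold.pop(prefix_key)        # promote: a copy, not a prefill
        else:
            kv = compute_kv(prefix_key)           # true miss: full prefill
            self.prefills += 1
        self._put_hot(prefix_key, kv)
        return kv

    def _put_hot(self, prefix_key, kv):
        self.hot[prefix_key] = kv
        self.hot.move_to_end(prefix_key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_kv = self.hot.popitem(last=False)  # least recently used
            self.cold[old_key] = old_kv                     # demote instead of drop


cache = TieredKVCache(hot_capacity=1)
fake_prefill = lambda key: f"kv({key})"   # placeholder for the real prefill
cache.get("report-v1", fake_prefill)
cache.get("other-chat", fake_prefill)     # pushes report-v1 to the cold tier
cache.get("report-v1", fake_prefill)      # served from the cold tier, no new prefill
print(cache.prefills)
```

The third lookup costs a copy between tiers rather than a recompute, which is the whole point of keeping evicted blocks around.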
@wuhanbat @alexabelonix Works well with larger prompts and repetitive use, like iterating over a report. Shared more on metrics here: https://t.co/Ke2nd54ppP
What if we stopped treating Sparse Autoencoders as fragile single-feature detectors for prompt injection?
Instead, I mined **conjunctive co-activation patterns** — groups of features that only fire together in real attacks.
On Gemma Scope (layers 6/12/18). 1/2
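To make "conjunctive co-activation" concrete, here is a toy sketch, not the actual mining pipeline: treat each prompt as the set of SAE features that fired on it, then keep feature groups that co-fire often in attacks and never in benign prompts. The function name, thresholds, and feature ids are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_conjunctions(attack_sets, benign_sets, size=2, min_support=0.5, max_benign=0.0):
    """Return feature groups that co-fire in attacks but (almost) never in benign prompts."""
    def support(sets):
        counts = Counter()
        for active in sets:
            for combo in combinations(sorted(active), size):
                counts[combo] += 1
        return counts

    attack_counts = support(attack_sets)
    benign_counts = support(benign_sets)
    patterns = []
    for combo, n in attack_counts.items():
        attack_rate = n / len(attack_sets)
        benign_rate = benign_counts[combo] / len(benign_sets)
        if attack_rate >= min_support and benign_rate <= max_benign:
            patterns.append(combo)
    return sorted(patterns)


# Toy data: each set holds the (made-up) SAE feature ids active on one prompt.
attacks = [{3, 17, 42}, {3, 17, 99}, {3, 17}]
benign = [{3, 8}, {17, 8}, {42, 99}]
print(mine_conjunctions(attacks, benign))
```

In the toy data, features 3 and 17 each show up alone in benign prompts, so neither works as a single-feature detector, but the pair only appears together in attacks.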
🧵 MLX said "pick one precision for all experts." We needed 9 at FP16, 119 at 4-bit. So we split what wasn't meant to be split. Here's how we got Qwen3-30B MoE running in 64GB on Apple Silicon 👇 1/🧵
Why this works on MLX:
✅ gather_mm and gather_qmm are independent kernels
✅ Each block only sees local indices [0-N]
✅ mx.where on Metal GPUs is basically free
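One way the index bookkeeping behind those three points could look, in plain Python so it runs anywhere. This is a sketch under assumptions, not the code from the writeup: `build_local_index` and `route` are hypothetical helpers, and the expert layout is made up. In an MLX version, the two index lists would presumably feed the FP16 and 4-bit gather kernels, and the mask would go to `mx.where` to pick each slot's result.

```python
def build_local_index(num_experts, fp16_experts):
    """Map each global expert id to (is_fp16, local index within its block)."""
    fp16_set = set(fp16_experts)
    is_fp16, local = [], []
    n_fp16 = n_q4 = 0
    for expert_id in range(num_experts):
        if expert_id in fp16_set:
            is_fp16.append(True)
            local.append(n_fp16)
            n_fp16 += 1
        else:
            is_fp16.append(False)
            local.append(n_q4)
            n_q4 += 1
    return is_fp16, local


def route(expert_ids, is_fp16, local):
    """Translate router output into per-block local indices plus a selection mask."""
    mask = [is_fp16[e] for e in expert_ids]
    # Each block still needs a valid index for every slot; the mask decides
    # which block's result is kept, so the unused slot points at index 0.
    fp16_idx = [local[e] if m else 0 for e, m in zip(expert_ids, mask)]
    q4_idx = [0 if m else local[e] for e, m in zip(expert_ids, mask)]
    return mask, fp16_idx, q4_idx


# Toy layout: 8 experts, with experts 2 and 5 kept at FP16 (made-up choice).
is_fp16, local = build_local_index(8, fp16_experts=[2, 5])
mask, fp16_idx, q4_idx = route([5, 0, 2, 7], is_fp16, local)
print(mask, fp16_idx, q4_idx)
```

Both blocks compute something for every slot and the mask throws away the half that doesn't apply, which only makes sense if the select step is cheap.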
Total cost: ~$2 GPU rental, one weekend, 4 Python scripts.
Model: PKSGIN/qwen3-30b-selective-quant on HuggingFace
Full technical writeup (profiling, quantization, MLX conversion, benchmarks): https://t.co/PFL0JE80py
Security teams deserve local models that actually work.
Running a 30B LLM for security analysis means choosing between:
Cloud APIs (ship sensitive data off-prem)
Local quantized models (too degraded to be useful)
I spent a weekend building a third option.
47 tok/s on a MacBook. Zero data leakage.
🧵
Unified memory changes the game.
Traditional GPU: model must fit in VRAM (RTX 4090 = 24GB)
Apple Silicon: CPU/GPU/Neural Engine share 32GB pool
18GB model loads once. GPU runs inference in-place. Headroom for OS + KV cache.
Air-gapped ready.
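A back-of-the-envelope way to size that headroom. Only the 32GB pool and the 18GB model come from the post; the 6GB OS reserve and the layer/head numbers below are placeholder assumptions, not values read from the model's config, and the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_context_tokens(pool_gb, model_gb, os_reserve_gb, per_token_bytes):
    """Tokens of KV cache that fit in whatever the model and OS leave free."""
    free_bytes = (pool_gb - model_gb - os_reserve_gb) * 1024**3
    return max(0, int(free_bytes // per_token_bytes))


# 32GB pool and 18GB model come from the post; everything else is a placeholder.
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128)
print(per_token)                                   # bytes of KV cache per token
print(max_context_tokens(32, 18, 6, per_token))    # rough upper bound on cached tokens
```

Swap in the real config values and your own OS reserve to get a number you can trust; the structure of the estimate stays the same.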