You don't pick an Inference Engine
You pick a Hardware Strategy
and the Engine follows
Inference Engines Breakdown
(Cheat Sheet at the bottom)
> llama.cpp
runs anywhere
CPU, GPU, Mac, weird edge boxes
best when VRAM is tight and RAM is plenty
hybrid offload, GGUF, ultimate portability
not built for serious multi-node scale
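A rough sketch of the hybrid offload math, to show how the "VRAM is tight, RAM is plenty" split works out. It assumes every layer is the same size and that a fixed chunk of VRAM is held back for KV cache and scratch buffers. The function name and the reserve figure are made up for illustration and are not part of llama.cpp.

```python
GIB = 1 << 30

def layers_that_fit(model_bytes: int, n_layers: int, vram_bytes: int,
                    reserve_bytes: int = GIB) -> int:
    """Estimate how many layers fit on the GPU; the rest stay in system RAM.

    Assumption: layers are equal-sized, so per-layer cost is model size / layer count.
    """
    per_layer = model_bytes / n_layers
    usable = max(vram_bytes - reserve_bytes, 0)
    return min(n_layers, int(usable // per_layer))

# Hypothetical numbers: a 40 GiB quantized model with 80 layers on a 24 GiB card,
# holding back 2 GiB for KV cache and buffers.
gpu_layers = layers_that_fit(40 * GIB, 80, 24 * GIB, reserve_bytes=2 * GIB)
print(gpu_layers)  # 44
```

A number like this is what you would feed to llama.cpp's `--n-gpu-layers` (`-ngl`) option as a starting point, then tune by watching actual VRAM usage.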
> MLX
Apple Silicon weapon
unified memory = "fits" bigger models
than VRAM would allow but also slower than GPUs
clean dev stack (Python/Swift/C++)
sits on Metal (and expanding beyond)
now supports CUDA + distributed too
great for Mac-first workflows, not prod serving
> ExLlamaV2
single RTX box go brrr
EXL2 quant, fast local inference
perfect for 1–4 GPU setups (4090/3090)
not meant for clusters or non-CUDA
> ExLlamaV3
same idea, but bigger ambition
multi-GPU, MoE, EXL3 quant
consumer rigs pretending to be datacenters
still CUDA-first, still rough edges depending on model
> vLLM
default answer for prod serving
continuous batching, KV cache magic
tensor / pipeline / data parallel
runs on CUDA + ROCm (and some CPUs)
this is your "serve 100s of users" engine
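To make "continuous batching" concrete, here is a toy step counter. It is not vLLM's scheduler: it ignores prefill, memory limits, and preemption, and just counts decode steps when each request needs a given number of output tokens. Both function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Slots are refilled from the queue as soon as a request finishes."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished ones free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [8, 2, 2, 2]  # output tokens per request, one long and three short
print(static_batching_steps(lengths, batch_size=2))      # 10
print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

The short requests stop waiting behind the long one, which is where the throughput win comes from when many users send mixed-length requests.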
> SGLang
vLLM but more systems-brained
routing, disaggregation, long-context scaling
expert parallel for MoE
built for ugly workloads at scale
lives on top of CUDA / ROCm clusters
this is infra nerd territory
> TensorRT-LLM
maximum NVIDIA performance
FP8/FP4, CUDA graphs, insane throughput
multi-node, multi-GPU, fully optimized
pure CUDA stack, zero portability
(And underneath all of it:
Transformers → model architecture
layer → CUDA / ROCm / TT-Metal
→ compute layer)
What actually happens under the hood:
> Transformers defines the model
> CUDA / ROCm executes it
> TT-Metal (if you're insane) lets you write the kernel yourself
The Inference Engine is just
the orchestrator (simplified)
When running LLMs locally,
the bottleneck isn't just "VRAM size"
It isn't even the model
It's:
- memory bandwidth (the real limiter)
- KV cache (explodes with long context)
- interconnect (PCIe vs NVLink vs RDMA)
- scheduler quality (batching + engine design)
- runtime overhead (activations, graphs, etc)
(and your compute stack decides all of this)
P.S. Unified Memory usually has less bandwidth than discrete-GPU VRAM
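Back-of-envelope numbers for the first two bullets above. This is a sketch under simple assumptions: the KV cache stores one key and one value vector per layer, per KV head, per token, and single-stream decoding has to read all weights plus the KV cache once per generated token. Function names and the example figures are illustrative, not measurements.

```python
GIB = 1 << 30

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

def decode_tokens_per_sec_ceiling(bandwidth_bytes_per_s, weight_bytes, kv_bytes=0):
    """Upper bound on decode speed when memory bandwidth is the limiter."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)

# Assumed shape: 32 layers, 32 KV heads, head dim 128, FP16, 4096-token context.
kv = kv_cache_bytes(32, 32, 128, 4096)
print(kv / GIB)  # 2.0 GiB for a single sequence

# Assumed hardware: ~1 TB/s memory bandwidth, 14 GB of FP16 weights.
print(decode_tokens_per_sec_ceiling(1e12, 14e9))  # roughly 71 tokens/s, before KV reads
```

Double the context and the cache doubles. Halve the bandwidth and the ceiling halves, no matter how much memory the model "fits" in.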
Cheat Sheet / Rules of Thumb
> laptop / edge / weird hardware → llama.cpp
> Mac workflows → MLX
> 1–4 RTX GPUs → ExLlamaV2/V3
> general serving → vLLM
> complex infra / long context / MoE → SGLang
> NVIDIA max performance → TensorRT-LLM
Building out my local AI setup at home. Currently using a spare laptop running Proxmox, RHEL, Ollama.
Need to swap out the laptop for hardware that can support larger models (VRAM constraint).
Recommendations?
Imagine every pixel on your screen, streamed live directly from a model. No HTML, no layout engine, no code. Just exactly what you want to see.
@eddiejiao_obj, @drewocarr and I built a prototype to see how this could actually work, and set out to make it real. We're calling it Flipbook. (1/5)