Gemma 4 12B can now run locally on just 8GB RAM via Dynamic GGUFs.
Google's new model, Gemma 4 12B Unified, supports image, audio and 256K context.
You can run and train the model via Unsloth Studio.
GGUF: https://t.co/8cL321pVDh
Guide: https://t.co/odRo9WjRpA
OpenAI's GPT-OSS-120B runs on a single RTX 5090.
it's a 59GB model in native MXFP4, so it doesn't fit in the card's 32GB of VRAM.
the move is MoE offload: keep attention on the GPU, spill the expert weights to system RAM (llama.cpp --n-cpu-moe).
this works because only 5.1B of the 117B params fire per token, so the CPU side stays cheap.
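a rough sketch of the arithmetic behind the offload, in Python. the sizes and param counts are the ones quoted above; `reserve_gb` is a hypothetical allowance for KV cache and activations, an assumption for illustration and not a measured number.

```python
# Figures quoted in the post.
MODEL_GB = 59.0          # GPT-OSS-120B in native MXFP4
VRAM_GB = 32.0           # RTX 5090
TOTAL_PARAMS_B = 117.0   # total parameters, in billions
ACTIVE_PARAMS_B = 5.1    # parameters active per token, in billions


def min_spill_gb(model_gb, vram_gb, reserve_gb=0.0):
    """Smallest amount of weights (GB) that cannot fit on the GPU.

    reserve_gb is a hypothetical allowance for KV cache and activations.
    """
    usable_vram = vram_gb - reserve_gb
    return max(0.0, model_gb - usable_vram)


def active_fraction(active_b, total_b):
    """Share of the parameters that fire for a single token."""
    return active_b / total_b


spill = min_spill_gb(MODEL_GB, VRAM_GB)
frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)

print(f"at least {spill:.0f} GB of weights must sit in system RAM")
print(f"{frac:.1%} of parameters are active per token")
```

so even with every byte of VRAM given to weights, about 27GB has to live in RAM, and any reserve you set aside on the GPU pushes that number up. the small active fraction is why the RAM-resident experts stay tolerable.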
with reasoning on, measured on my box, temperature 0, ~100 items per task (114 for MMLU):
- MMLU 89.5
- GSM8K 97.0
- HumanEval 98.0 pass@1
- ARC-Challenge 95.0
those are frontier-grade scores, on one consumer GPU.
~~~
it is quite slow tho: 47 tok/s generation.
that's because the experts live in RAM, so token speed waits on the CPU, not the 5090.
prefill is fine with 473 tok/s at 512 ctx. it is generation that pays the offload tax.
the model is usable, not fast. but you get a real frontier model you fully own, on hardware you can buy, for the price of patience.