For your pandemic Friday viewing enjoyment. 'kubectl run -it --rm --image=https://t.co/y1ra99OfKZ tif' or for the docker inclined 'docker run -it --rm https://t.co/y1ra99OfKZ'
@Nanka696 @LyalinDotCom Unfortunately, the nvfp4 weights are macOS-only right now. We're getting closer on other platforms.
On Windows you can use `gemma4:12b-it-qat`, which should give similar results.
Gemma 4 Quantization-Aware Training (QAT) weights are now available on Ollama!
They reduce memory requirements while maintaining model quality.
E2B:
ollama run gemma4:e2b-it-qat
E4B:
ollama run gemma4:e4b-it-qat
12B:
ollama run gemma4:12b-it-qat
26B:
ollama run gemma4:26b-a4b-it-qat
31B:
ollama run gemma4:31b-it-qat
Try them with ollama launch integrations to use with your favorite tools.
@synthshareai @ai_for_success That particular model is tuned with Nvidia's model optimizer for nvfp4. I'm still running MMLU-Pro right now to test it out and it's getting 74.6% accuracy (at about 15% completed). Google published it at 77.2% for BF16s.
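A rough way to read a partial-run number like this is to put a binomial error bar on it. The sketch below is only illustrative: the total question count (`TOTAL_QUESTIONS`) is an assumption that is not stated in the post, and it treats the answered questions as a random sample, which may not hold if the benchmark runs category by category.

```python
import math

def accuracy_interval(accuracy, n_answered, z=1.96):
    """Normal-approximation confidence interval for an observed accuracy."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_answered)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Assumption: roughly 12,000 questions in the full benchmark (not from the post).
TOTAL_QUESTIONS = 12_000
n_answered = round(0.15 * TOTAL_QUESTIONS)  # "about 15% completed"

low, high = accuracy_interval(0.746, n_answered)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval is about two points either way, so the gap to the published BF16 figure is hard to judge until the run finishes.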
@ivanfioravanti I'm wondering if they would notice the difference w/ bf16 vs. mxfp8. The microscaling formats seem really decent. nvfp4 is surprisingly good if you tune it correctly.
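For anyone wondering what "microscaling" means in practice, here is a minimal sketch of block-scaled 4-bit quantization. It is a simplification and not Ollama's or Nvidia's implementation: the scale is kept as a plain Python float (real formats store it in a compact 8-bit type), and the block size of 16 and the helper names are assumptions chosen for illustration.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block(block):
    """Return (scale, codes) so that scale * code approximates each value."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes

def quantize(values, block_size=16):
    """Quantize then dequantize, one shared scale per block of values."""
    restored = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        restored.extend(scale * code for code in codes)
    return restored

weights = [0.02, -0.11, 0.37, 0.05, -0.29, 0.18, 0.01, -0.06]
print(quantize(weights))
```

Because each small block gets its own scale, one outlier only costs precision for its neighbours in that block rather than for the whole tensor.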
@LyalinDotCom For Ollama I'd recommend the `gemma4:12b-nvfp4` model which is tuned with Nvidia's model optimizer. I realized though that I should have quantized the qkvo attention tensors to nvfp4 (they're at mxfp8) and I have some tweaks for the embedding layer.
@LyalinDotCom For the Mac try `ollama run gemma4:31b-mlx` which will give you significantly better performance. For the DGX Spark you can get a significant performance boost with Ollama 0.30.0 which is just about to come out (it's in prerelease).
@yoeven On your Mac you should run `gemma4:e4b-nvfp4` and you should get a pretty big speed bump over `gemma4:e4b`. I realize the model names are confusing, but we are trying to make this easier!
I think almost all of the MTP/DFlash demos I've been seeing over the last few weeks have been using simple greedy sampling. That's great if you can live with temperature = 0, but I think most people want more sampling options.
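To make the distinction concrete, here is a small sketch of greedy decoding versus temperature sampling over a vector of logits. It only illustrates the sampling step; it is not the MTP or DFlash verification logic, and the function names are made up for this example.

```python
import math
import random

def softmax(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Pick a token index: argmax at temperature 0, otherwise sample."""
    if temperature == 0:
        # Greedy: always the single most likely token, fully deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [1.0, 3.0, 2.0]
print(sample_token(logits, temperature=0))    # greedy pick
print(sample_token(logits, temperature=0.8))  # random pick, varies per call
```

With greedy decoding a drafted token is either the argmax or it is not, which is part of why it makes for a simple demo; sampling at a nonzero temperature means the accepted tokens have to follow a distribution rather than match a single answer.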
DeepSeek v4 Pro is now on Ollama's cloud!
Try it with Claude Code:
ollama launch claude --model deepseek-v4-pro:cloud
Try it with Hermes Agent:
ollama launch hermes --model deepseek-v4-pro:cloud
Chat with the model:
ollama run deepseek-v4-pro:cloud
🧵
@ivanfioravanti Maybe try Ollama w/ `qwen3.6:27b-coding-nvfp4`? That's the MLX runner variant which is less quantized than the affine 4 bit integer quants, and it has the hyper-parameters set for coding/agentic use cases.
@dannytt @julien_c @huggingface With ollama, make sure you're using `qwen3.6:27b-coding-nvfp4`. For generation on an M5 I get just about 30 toks/sec, but the real magic is in the prefill speeds and the LRU cache.
deepseek-v4-flash is now available on Ollama's cloud! Hosted in the US.
Try it with Claude Code:
ollama launch claude --model deepseek-v4-flash:cloud
Try it with OpenClaw:
ollama launch openclaw --model deepseek-v4-flash:cloud
Try it with Hermes:
ollama launch hermes --model deepseek-v4-flash:cloud
Try it with chat:
ollama run deepseek-v4-flash:cloud
(DeepSeek V4 Pro is coming shortly)
🧵
@iansltx @ollama The `coding` tags just have the recommended hyperparameters set for coding/agentic use. They share the same weights, so they don't take up extra disk space.