everyone says NVFP4 makes blackwell cards "faster."
I benchmarked Qwen3.6-27B three ways on my 5090:
>NVFP4
>plain Q4_K_M (same 4-bit budget)
>Q6_K
all on the same llama.cpp b9365 build and the same harness.
~~~ prefill (processing your prompt):
NVFP4 wins big, and it's real. +32 to 42% over equal-bit Q4_K_M at every context from 512 to 16k, so that gain is pure FP4-tensor-core compute.
vs Q6 it's +52 to 68%. concretely at pp512: 5415 tok/s vs 3826 (Q4) vs 3222 (Q6).
~~~ decode (generating tokens):
here's the myth. vs an equal-size Q4 it moves only +9% (84 vs 77 tok/s). the headline "+36% vs Q6" decode number isn't the FP4 cores at all; it's just NVFP4 being smaller (14.6GB vs 21GB).
decode is memory-bandwidth bound, so it tracks footprint, not how the weights are packed.
prefill = compute, decode = size.
~~~
the 4-bit tax is almost nothing: 93.2 q_avg across five tasks vs 94.0 for Q6. MMLU, ARC, HellaSwag, GSM8K all land within half a point; only code dips meaningfully (HumanEval 90.2 vs 92.7).
net, vs the Q6 a lot of people serve:
~+60% prefill
+36% decode
-26% VRAM (17.3 vs 23.5GB) for -0.8 quality.
for an always-on local agent that's an easy yes - faster replies, more context headroom, and 6GB of VRAM handed back.
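The percentages in the thread above can be rechecked from the raw numbers it quotes. This is a minimal sketch, not the author's harness: the dictionary names are made up for illustration, and the last line assumes an idealized decode that is purely bandwidth bound, so tok/s scales with 1 / footprint.

```python
def pct_gain(new: float, base: float) -> float:
    """Percent speedup of `new` over `base`."""
    return (new / base - 1.0) * 100.0

# Figures quoted in the thread: prefill tok/s at pp512, decode tok/s,
# and model footprint in GB. Names are illustrative only.
PP512 = {"NVFP4": 5415, "Q4_K_M": 3826, "Q6_K": 3222}
DECODE = {"NVFP4": 84, "Q4_K_M": 77}
SIZE_GB = {"NVFP4": 14.6, "Q6_K": 21.0}

prefill_vs_q4 = pct_gain(PP512["NVFP4"], PP512["Q4_K_M"])
prefill_vs_q6 = pct_gain(PP512["NVFP4"], PP512["Q6_K"])
decode_vs_q4 = pct_gain(DECODE["NVFP4"], DECODE["Q4_K_M"])

# Assumption: if decode were purely bandwidth bound, the speedup over Q6
# would be capped by the footprint ratio (fewer bytes read per token).
decode_ceiling_vs_q6 = pct_gain(SIZE_GB["Q6_K"], SIZE_GB["NVFP4"])

print(f"prefill vs Q4_K_M: {prefill_vs_q4:+.1f}%")   # +41.5%
print(f"prefill vs Q6_K:   {prefill_vs_q6:+.1f}%")   # +68.1%
print(f"decode vs Q4_K_M:  {decode_vs_q4:+.1f}%")    # +9.1%
print(f"decode ceiling vs Q6_K from size alone: {decode_ceiling_vs_q6:+.1f}%")  # +43.8%
```

Under that idealized assumption, the measured +36% decode gain over Q6 sits below the roughly +44% that the smaller footprint alone would allow, which fits the thread's reading that the gain comes from size rather than from the FP4 cores.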
@LottoLabs I have been using Qwen 9b as a second model running on the second GPU, mainly for memory extraction, compression and summaries. It worked decently with a very strict prompt. I'm testing Gemma now, hopefully it will be better (apparently Gemma models are good at writing)
@slippyfox @sudoingX I've got a 3080 10gb, the Unsloth Q4_K_M fits perfectly at 160 ctx, and you can probably squeeze in more context if you quantize the kv cache to q4
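To see roughly how much context a q4 KV cache buys, here is a minimal sketch of the arithmetic. The layer count, KV-head count and head dimension are placeholders, not the dimensions of any particular model, and the estimate ignores everything except the K and V tensors. The bits-per-element figures follow from llama.cpp's block layout for `q4_0` and `q8_0` (32 values per block plus one fp16 scale). In llama.cpp the cache type is usually set with `--cache-type-k` and `--cache-type-v`; check your build's `--help` for the exact options.

```python
# Bits per cached element. q4_0 packs 32 values into 18 bytes and q8_0
# packs 32 values into 34 bytes (payload plus one fp16 scale per block).
BITS_PER_ELEM = {"f16": 16.0, "q8_0": 34 * 8 / 32, "q4_0": 18 * 8 / 32}

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       cache_type: str) -> float:
    """Bytes of KV cache needed for one token of context."""
    # K and V each store n_kv_heads * head_dim elements per layer.
    elems = 2 * n_layers * n_kv_heads * head_dim
    return elems * BITS_PER_ELEM[cache_type] / 8

def tokens_in_budget(budget_gib: float, n_layers: int, n_kv_heads: int,
                     head_dim: int, cache_type: str) -> int:
    """How many tokens of context fit in `budget_gib` GiB of VRAM."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type)
    return int(budget_gib * 2**30 // per_token)

# Hypothetical model dimensions, chosen only for illustration.
DIMS = dict(n_layers=32, n_kv_heads=8, head_dim=128)

f16_tokens = tokens_in_budget(1.0, cache_type="f16", **DIMS)
q4_tokens = tokens_in_budget(1.0, cache_type="q4_0", **DIMS)
ratio = BITS_PER_ELEM["f16"] / BITS_PER_ELEM["q4_0"]

print(f16_tokens, q4_tokens, round(ratio, 2))
```

Whatever the model dimensions, the f16 to q4_0 ratio is 16 / 4.5, so the same VRAM budget holds about 3.6x as many tokens, at some cost in cache precision.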
Get ready for a lot of unemployment 🥶
Peter Steinberger, creator of OpenClaw, came to Microsoft Build to explain how OpenClaw will be integrated as a native Windows app, complete with a new security feature called Microsoft Execution Containers
He said, "Now you can run OpenClaw directly in your company's environment more securely"
The demo was also shown live on stage using a Surface Laptop Ultra
#microsoft #ai #tech
@meabed 😅 my bad, got mixed up with another tailscale project. What I actually meant is Docker support, most of my services run in containers without host ports so tsp can't discover them. I'll try something with docker API, this would make everything discoverable at the same time
Just a reminder, if you are using llama.cpp as backend, this might be helpful. Full llm serving, voice pipeline, config for hot swapping, playtest, model management. Based on llama.swap.
@loktar00 I have a project folder with 18 different sub-folders. Each one with half-baked code and architecture files. Guess how many are on my Github? Two.
@pangshuo1981 I’m all local with this Qwen3-TTS setup, but I see how a unified workflow layer like yours could help organize these kinds of pipelines. Great project
Got my agent pretty close to real-time voice on local hardware.
STT: Parakeet TDT 0.6 (ONNX INT8, CPU)
TTS: Qwen3-TTS 0.6B on RTX 3080 (torch.compile + CUDA graphs) — RTF ~0.41x
The bottleneck is of course the TTS path, but the quality is unmatched so I kept it.
Continue down below
nobody told me Hermes Agent could just... join your Discord VC and talk back
for those using Discord w/ Hermes Agent: there's a feature where your Hermes agent can jump into a VC call with you and hold a normal conversation, as if you were talking to another human being. it's pretty cool
@TeksEdge I was doing the same: PC always on, me and my partner using the chatbot for different reasons at all times. I tried debloating Windows as much as possible but there was always some sort of friction. Then I moved to Ubuntu, you should try it if you are willing to change OS