A bee does not waste its energy trying to convince a fly that honey is better than shit; it simply goes about its business.
Not every mind is open to growth. Ego builds walls so thick that wisdom simply walks away.
Conserve your energy for something that matters.
@mick__net @leftcurvedev_ I get similar results on the Vulkan backend with an AMD APU: about a 30% speed increase at n max 2, and that is the best gain I see. Increasing n max or adding ngram reduces the speedup.
@FunkyClam @danielhanchen The PR is not merged into mainline llama.cpp yet, so the only way to use it right now is to merge the PR manually and build llama.cpp yourself.
If you use a pre-built llama.cpp release or LM Studio, you will have to wait for it.
The PR for DFlash is also still a draft in the llama.cpp repo.
@Wronglebowsk @ggerganov It cannot be supported by llama-bench. This type of speculative decoding depends on your previous prompt and the cache of the previous conversation, and llama-bench runs on dummy data. In real use, ngram has no advantage in creative writing or open-ended reasoning either, only in coding.
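A rough way to see why: an ngram drafter can only propose tokens it has already seen in the context. The sketch below is a toy illustration, not llama.cpp's actual implementation; `ngram_draft`, `n`, and `max_draft` are hypothetical names made up for this example.

```python
def ngram_draft(history, n=2, max_draft=4):
    """Toy n-gram lookup drafter (illustrative only, not llama.cpp code).

    Takes the last n tokens of the context, looks for the most recent
    earlier occurrence of that same n-gram, and proposes the tokens that
    followed it as a draft for the main model to verify.
    """
    if len(history) < n + 1:
        return []
    key = history[-n:]
    # Scan backwards, skipping the trailing n-gram itself.
    for start in range(len(history) - n - 1, -1, -1):
        if history[start:start + n] == key:
            return history[start + n:start + n + max_draft]
    return []  # nothing repeated, so nothing to draft


# Code-like context that repeats itself: the drafter finds a match.
code_tokens = ["for", "i", "in", "range", "(", "n", ")", ":", "for", "i"]
code_draft = ngram_draft(code_tokens)

# Benchmark-style dummy context with no repetition: no draft at all.
dummy_draft = ngram_draft(list(range(10)))
```

With repetitive input like code, the drafter has something to reuse. With non-repeating dummy tokens it returns an empty draft, so a benchmark on that kind of data would not show the speedup.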
Damn. I gave a problem to Google Gemini 3.1 Pro and asked Qwen3.6 Max Preview the same thing.
Qwen gave a better solution with a more comprehensive response.
On another note, the improvement in AI over the last year is scary good. Traditional IT jobs are in absolute danger.
@WuMinghao_nlp Are you going to continue with the next architecture? The good thing about that architecture was the very small drop in tps, for both pp and tg, as context increased.
@basecampbernie This is interesting: Gemma4 is faster than qwen3.5 A3B at q8 on your machine, and the reverse is true at q4.
I think DGX Spark is Blackwell architecture, so you can also try bf16 variants rather than q8 xl, which I believe is q8 plus fp16 for these models. It might be faster. Or maybe try nvfp4.