@stevibe Thank you. You probably saw it already, just for others PR is merged and one could just use "vllm/vllm-openai:nightly" image.
If you have resources it would be nice to compare google's MTP with RedHatAI/gemma-4-31B-it-NVFP4 as verifier model against original google model
@CodyKnowsCode@vllm_project With the latest changes from vllm. My understanding is you were getting 35tok/s with old vllm and that you expect better results with this 0.20 version of vllm?
@RedHat_AI thank you. Whatever I do I just cannot pass the attention backend error "Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason:
['partial multimodal token full attention not supported']"
vllm-0.20.0
transformers-5.6.2
speculators-0.5.0
@johnny_everson@stevibe yeah, me too, just my personal feeling that gemma4 dense model is way faster and better... Trying to make it work with dflash speculative model, but just cannot make it work at the moment...
@RedHat_AI@vllm_project "Validated for H100" - does this mean it depends on GPU architecture? Tried to make it work on dgx spark but with no success regardless of verifier model and type of attention backend...
@dsikka84@BitcoinComfy@RedHat_AI@vllm_project I tried it with both fp8 and nvfp4, but couldn't make it work with containerised vllm nightly on dgx spark at all... Since it's mentioned it's verified only on H100, does that means its dependable on GPU architecture?
I'm amazed none of these videos are picked up by YouTube's 'Likeness Detection' feature (though I guess, that's why it's in Beta?). cc @YouTubeInsider@YouTubeCreators
(Yes, all of these videos contain my AI-cloned likeness and cloned voice. I submitted a 'privacy violation'.)
@RedHat_AI@dsikka84 Thank you and thanks to @dsikka84 and team. I probably could do some simple and quick tests if someone points me what, where and how - which is I guess more time consuming then running those tests :) thanks again
@RedHat_AI It would be interesting to see results of speculator model RedHatAI/gemma-4-31B-it-speculator.eagle3 with RedHatAI/gemma-4-31B-it-NVFP? Any possibility to do that?
@protect_whales@bnjmn_marie +1, additional to that list I would include two from Nvidia (nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 and nvidia/Nemotron-Cascade-2-30B-A3B)