I was able to bind a specific model in a specific telegram channel, it looks like this:
telegram:-1001234567123:8:
model: default: gemini-2.5-flash
provider: custom:google-aistudio
I use this channel for research, and it will always respond with gemini-2.5-flash by default. Ask /hermes-agent skill to set this up for you, from the channel in question! It can figure out the # by grepping gateway logs and add to config.yaml!
@sudoingX I don't consider performance improvements to be the main benefit of TurboQuant (although in cases where you have constrained memory bandwidth, running kv cache on tq3/tq3 can help there), it's mostly larger context or better quality (i.e. asym q8_0 / tq4 vs q4_0 /q4_0)
@DragonGroky@0xSero I've spent a lot of time optimizing llama.cpp for 3.5-27b dense and now 3.6-35b on a single 3090! Configs here:
https://t.co/0FyvrHTmKW
@sudoingX@spiritbuun@no_stp_on_snek I hadn't tried the 31b dense Gemma yet so just ran some initial tests. It runs but is not very happy w/ asym kv-cache or K_M/XL. To get decent performance, used Q4_K_S, turbo4/turbo4, and 224k context, stable 28t/s w/ Hermes!
Full cfg pushed to my repo:
https://t.co/0FyvrHTmKW
@sudoingX You should be able to fit 256k ctx 24G w/ Gemma via asymmetric q8_0/turbo4 or worst case turbo4/turbo4 with little quality loss using @spiritbuun or @no_stp_on_snek TurboQuant llama.cpp forks-check out my repo re how to pull and build them from source! π
https://t.co/0FyvrHTmKW
@i_loder@sudoingX I'm running similar cfg to @sudoingX, just on WSLv2 + Docker. CUDA just works.
Try out my heavily optimized configs, now w/ Qwen3.6, full 256k context, Q4_K_M/XL, asymmetric q8/TurboQuant4 here, single card 24GB VRAM 106t/s on my rig. Hermes works amazing! https://t.co/0FyvrHTmKW
@DragonGroky@MyopicRaccoon@LottoLabs@no_stp_on_snek I'm getting 35t/s on WSL2 (Docker Desktop + Debian Trixie WSL container), RTX3090 connected via Oculink dock, with this config:
https://t.co/0FyvrHTmKW
I point my Linux host running Hermes to WindowsHost:8080, works amazing! Can also run Hermes right in WSL2 and use localhost.
@AgentArchetype@tubatrades@sudoingX@Teknium@NousResearch I finally had a chance to properly document my config and automate setup: https://t.co/0FyvrHTmKW
My TurboQuant (tx @no_stp_on_snek@spiritbuun!) config gives 35t/s, RTX3090 fits Qwen3.5-27B-UD-Q4_K_XL, MAX 256k context and effective q8 kv-cache (asymmetric q8_0/turbo4)--insane!
@AgentArchetype@tubatrades@sudoingX@Teknium@NousResearch I am running this exact setup, and it's very reliable! Recommend installing docker desktop on Windows, integrate it with your WSL V2 so you can run docker run hello-world successfully, and then run a llama.cpp server cuda13 docker image to host your local LLM! DM me for configs!
@sudoingX I finally had a chance to properly document my config and automate setup: https://t.co/0FyvrHTmKW
My TurboQuant (tx @no_stp_on_snek@spiritbuun!) config gives 35t/s, RTX3090 fits Qwen3.5-27B-UD-Q4_K_XL, MAX 256k context and effective q8 kv-cache (asymmetric q8_0/turbo4)--insane!
@OnlyTerp@sudoingX Qwen 27b with bigger Unsloth quant, 256k context window and q8_0 k / turbo4 asymmetric k/v cache, check out @no_stp_on_snek work to bring turbo to llama.cpp!
@kriskarols@sudoingX Definitely! Running that config. T/s goes down a bit vs fitting in a single card though, but it can do more! 3090+3090 works well but also 5070Ti+3090. I have the two 3090s in Oculink docks so I can deploy them tactically π