a new dense local LLM leader for 8GB VRAM GPUs was born yesterday
runs on: RTX 4060 / RTX 3070 / RTX 2080. any 8GB card
Qwen 3.5 9B (dense) was the go to for 6-8GB VRAM builds.
Gemma 4 12B QAT (dense) just changed that.
same llama.cpp + cuda 13.2. i7 12700H. 16GB RAM. same -ngl 99 flags. same 48k context.
unsloth gemma-4-12b-it-Q4_K_M.gguf
→ 15 tok/sec @ 48k ctx
unsloth gemma-4-12B-it-qat-UD-Q4_K_XL.gguf
→ 32 tok/sec @ 48k ctx
→ 26 tok/sec @ 64k ctx
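quick sanity check on the numbers above: a minimal Python sketch that turns the quoted tok/sec figures into speedup ratios and wall-clock time. the constant and function names are mine and purely illustrative, they don't come from llama.cpp or any benchmark tool.

```python
# Throughput figures quoted above, in tokens per second.
BASELINE_Q4_K_M_48K = 15.0   # gemma-4-12b-it-Q4_K_M @ 48k ctx
QAT_UD_Q4_K_XL_48K = 32.0    # gemma-4-12B-it-qat-UD-Q4_K_XL @ 48k ctx
QAT_UD_Q4_K_XL_64K = 26.0    # same QAT file @ 64k ctx


def speedup(new_tps: float, old_tps: float) -> float:
    """Ratio of two generation speeds (hypothetical helper)."""
    return new_tps / old_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Time to generate `tokens` tokens at a steady `tps`."""
    return tokens / tps


if __name__ == "__main__":
    print(f"48k ctx speedup: {speedup(QAT_UD_Q4_K_XL_48K, BASELINE_Q4_K_M_48K):.2f}x")
    print(f"64k ctx vs 48k baseline: {speedup(QAT_UD_Q4_K_XL_64K, BASELINE_Q4_K_M_48K):.2f}x")
    print(f"1000 tokens, baseline: {seconds_for(1000, BASELINE_Q4_K_M_48K):.1f}s")
    print(f"1000 tokens, QAT: {seconds_for(1000, QAT_UD_Q4_K_XL_48K):.1f}s")
```

so the QAT file at 64k context is still faster than the non-QAT file at 48k.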
64k context is a big deal. Hermes 3 agent requires 64k minimum to run, so you're now getting full Hermes-compatible context on a budget consumer GPU at 26 tok/sec locally.
2.1x faster on identical hardware. and here's the part that breaks your brain:
the QAT-UD-Q4_K_XL is actually SMALLER than the Q4_K_M, despite the "XL" in its name. why?
QAT = Quantization Aware Training. Google didn't train the model first and compress it later. they trained it to be quantized from day one. the weights already know how to survive low precision. that's why you get more quality per byte.
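to make "trained to be quantized" concrete, here is a minimal sketch of the generic fake-quantization step that QAT schemes typically put in the forward pass. this is an assumption-laden illustration in plain Python (symmetric rounding, one scale per tensor, a function name I made up), not Google's actual recipe and not the GGUF Q4_K format.

```python
def fake_quantize(weights, bits=4):
    """Snap weights to a symmetric integer grid, then map them back to floats.

    Illustrative only: real QAT pipelines and GGUF quant types use
    per-block scales and other details not shown here.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit symmetric
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)                    # nothing to scale
    scale = max_abs / qmax                      # float value of one integer step
    return [round(w / scale) * scale for w in weights]


if __name__ == "__main__":
    original = [0.9, -0.31, 0.05, 0.0, -0.9]
    quantized = fake_quantize(original)
    for w, q in zip(original, quantized):
        print(f"{w:+.3f} -> {q:+.3f} (error {abs(w - q):.3f})")
```

during QAT the loss is computed on the rounded weights, so training can push the model toward values that lose little when rounded. post-training quantization applies the same rounding only after training is finished.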
llama.cpp flags: -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -cnv -ngl 99 -c 48000 -v
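if you want to script runs at different context sizes, a small Python sketch like this can assemble the same flags. the binary name `llama-cli` is my assumption (the post only lists flags), and `build_llama_args` is a hypothetical helper, not part of llama.cpp.

```python
import shlex


def build_llama_args(model_path, ctx=48000, gpu_layers=99, binary="llama-cli"):
    """Assemble the argument list using the flags quoted above.

    `binary` is an assumed executable name; adjust it to your build.
    """
    return [
        binary,
        "-m", model_path,          # model file
        "-cnv",                    # conversation mode
        "-ngl", str(gpu_layers),   # layers to offload to the GPU
        "-c", str(ctx),            # context size in tokens
        "-v",                      # verbose logging
    ]


if __name__ == "__main__":
    args = build_llama_args("gemma-4-12B-it-qat-UD-Q4_K_XL.gguf")
    print(shlex.join(args))        # paste into a shell, or pass `args` to subprocess.run
```

keeping the arguments as a list avoids shell-quoting problems if the model path contains spaces.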
fits in 8GB VRAM clean. no API. no cloud. no subscription.
and this isn't even the MTP variant yet
Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB.
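a minimal Python sketch of how you might pick a variant from that list for a given memory budget. the table just restates the figures above; the dictionary and function names are hypothetical, and real usage also depends on context length and KV cache, which this ignores.

```python
# Approximate memory needs (GB) for the QAT variants, as quoted above.
QAT_MEMORY_GB = {
    "E2B": 3,
    "E4B": 5,
    "12B": 7,
    "26-A4B": 15,
    "31B": 18,
}


def largest_fitting(budget_gb):
    """Return the variant with the highest quoted memory need that still fits, or None."""
    fitting = [(gb, name) for name, gb in QAT_MEMORY_GB.items() if gb <= budget_gb]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for budget in (4, 6, 8, 16, 24):
        print(f"{budget}GB -> {largest_fitting(budget)}")
```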
I have also benchmarked the 26B and 31B QAT on a single RTX 4090, check out the comments for details.
If you have a 6GB or 8GB VRAM GPU, post your numbers.
more benchmarks and configs coming soon
Gemma 4 12B can now run locally on just 8GB RAM via Dynamic GGUFs.
Google's new model, Gemma 4 12B Unified, supports image, audio and 256K context.
You can run and train the model via Unsloth Studio.
GGUF: https://t.co/8cL321pVDh
Guide: https://t.co/odRo9WjRpA
@muso_am @openclaw @Microsoft Literally do work for you while you are in a Teams meeting or commuting. It had a bad growth phase until last week. But if it can finally replace Copilot in a sandboxy way and not leak data from the EU to the US, it is worth a chance to reduce worker stress.
Qwen3.6 35B A3B can't fill out a paper form on its own. But give it NVIDIA's LocateAnything-3B — the #1 trending model on HuggingFace — as its eyes, and the two small models get it done together.
(The test: place each element at the right pixel position on a blank form image, not type into a field.)
Setup:
> Qwen is the brain (main model), LocateAnything is the eyes (helper model acting as a tool).
> I gave Qwen a new tool: ask "where's the email field?" and LocateAnything returns the exact x, y, width, height.
> The blue boxes on the screen are its detections. Look how tight they are — it nails every field.
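Here is a minimal Python sketch of what the "brain" side of that tool call might do with a returned box. Everything in it is hypothetical: `Box`, `locate_stub`, and `text_anchor` are illustrative names, the coordinates are made up, and the stub stands in for the real LocateAnything call, whose actual API is not described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A detection in pixels: top-left corner plus size."""
    x: int
    y: int
    width: int
    height: int


def locate_stub(query: str) -> Box:
    """Stand-in for the helper model; returns made-up coordinates."""
    fields = {"email": Box(120, 340, 300, 28)}
    return fields[query]


def text_anchor(box: Box, text_height: int, left_padding: int = 4):
    """Top-left pixel for drawing text so it sits inside the box,
    left-aligned with padding and vertically centered."""
    top = box.y + max(0, (box.height - text_height) // 2)
    return box.x + left_padding, top


if __name__ == "__main__":
    box = locate_stub("email")
    print(text_anchor(box, text_height=16))
```

The point of the split is that the main model only has to decide what to write, while the pixel geometry comes from the detection.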
Result:
> Qwen3.6 35B A3B + LocateAnything-3B: form completed, all info correct.
> Name, DOB, ID, gender, marital status, nationality, email, phone, address, postal code: all landed in the right field areas.
> Character-box alignment still a touch loose, but every value is where it belongs.
> 9m10s, 224.5k input, 24.3k output, 21 turns.
Why it matters:
> Qwen alone can't finish this test. Bolt on a 3B model that does exactly one thing (locate) and suddenly it can.
> A combination of small models can do the work of a single large one.
@helloiamleonie Using Step 3.7 free for 30 days via Hermes to fix and plan out my agents' config files, fix and contribute to open source projects, and help me learn to build local scripts with reliable execution of skills and deterministic agent output. Also for help with my CV and local private data.
Step 3.7 Flash is now free for 30 days via Nous Portal
It is a new MoE vision-language model focused on agent efficiency, coding, search, and multimodal workflows — and Hermes Agent users have been loving it, so thank you to @StepFun_ai for hooking them up!
Hello again, everyone! Welcome, Qwopus 3.5-Coder 4B!
Lots of awesome model drops are coming out, so we've got so many great new candidates for fine-tuning and dataset generation. We're so pumped and have a lot of great experiments running currently!
We've put together this significantly smaller coder model, Qwopus Coder 4B, and it seems to be impressive for something that could run well on most smartphones, or really fast on older GPUs.
It scored 43.5% on a 225-instance slice of SWE-bench mini when counting only the patches it actually submitted (73 of 168), and 32.5% across all instances, including the empty outputs caused by missing the specific format SWE-bench requires. On the ones where it did output a patch, it performed surprisingly well.
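For anyone checking the math, here is a small Python sketch of the two rates. It assumes the slice has 225 instances and that the lower figure is simply 73 resolved over all 225; on that assumption the second rate works out to about 32.4%, a hair under the quoted 32.5%.

```python
# Counts taken from the post; the 225 total is assumed to be the slice size.
TOTAL_INSTANCES = 225
PATCHES_SUBMITTED = 168
PATCHES_RESOLVED = 73


def percent(part: int, whole: int) -> float:
    """Share of `whole` covered by `part`, as a percentage."""
    return 100.0 * part / whole


if __name__ == "__main__":
    empty = TOTAL_INSTANCES - PATCHES_SUBMITTED
    print(f"empty or malformed outputs: {empty}")
    print(f"resolved / submitted: {percent(PATCHES_RESOLVED, PATCHES_SUBMITTED):.1f}%")
    print(f"resolved / all instances: {percent(PATCHES_RESOLVED, TOTAL_INSTANCES):.1f}%")
```

The gap between the two rates comes from the 57 instances with no usable patch, so fixing the output format would raise the overall score without changing the model's coding ability.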
Bear in mind, this is a tiny 4B model with additional coding training and CoT improvements. I was able to make a neon snake game (HF space link in comments to try) in just a few turns of the model. It's laser fast, running at 270 tps at Q8 with MTP on my 5090, with tons of headroom for concurrent instances! I was able to get over 500 tps aggregate with parallel requests while running SWE-bench with it!
It also shows improvement in @stevibe's BenchLocal agent and coding benchmarks! Check out the full results in the model card!
If you want to do some simple HTML game coding at lightning speeds on older hardware or less VRAM, I strongly recommend playing with it! Or if you want an intelligent model to do some serious swarm data cleaning or large dataset processing, this could be an excellent option!
Blessed to be here; you all are so enjoyable to engage with! Please let us know your thoughts in the comment section, and let us know what use cases jump out to you for a small 4b model like this one!
https://t.co/mykGsjmESv
💧 LFM2.5-8B-A1B is HERE and Liquid AI means business this time
🔹 8.3B params, only 1.5B active per token — MoE done right
🔹 Beats Gemma-4-26B and Qwen3-30B on instruction following
🔹 Strongest agentic tool use in its class
🔹 128K context, 38T training tokens, reasoning built in
🔹 Tested live: Hermes Agent + 80-language translation — all local, open weights
🔥 Full video below 👇
https://t.co/zZE5pkNcVB
@liquidai
@bradmillscan Aren't you the guy who is just writing his agent in WhatsApp? Please use Telegram group topics and give your agents/bots their own separate context windows. Use a Nous Portal sub (even free) and models like owl-alpha for the agents