Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec
If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware.
Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card.
The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?"
Today, I’m delivering exactly that.
I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with the full native 250k context!
If you own an RTX 3060 (the 8GB variant), 3070, 4060, or any other budget GPU with 8GB of VRAM, the local AI paradigm has completely changed.
The performance metrics are astonishing:
- 20 tokens/sec decode throughput.
- Decode speed stays flat even with massive prompts.
- I threw a 60k token prompt at it, and it still clocked in at 20 TPS without slowing down.
# What about prefill?
Yes, Time To First Token (TTFT) gets high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable on everyday prompts and still usable on long ones.
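To put rough numbers on that wait, here is a minimal sketch that estimates TTFT from prompt length. It assumes prefill runs at a constant 200 tokens/sec (the figure above) and ignores model load time and any prompt caching; the function name is just for illustration.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float = 200.0) -> float:
    """Estimate time to first token, assuming a constant prefill rate.

    Assumption: prefill speed stays at prefill_tps for the whole prompt;
    model load time and prompt caching are ignored.
    """
    return prompt_tokens / prefill_tps


if __name__ == "__main__":
    # A short prompt, the 60k test prompt, and the full 248k context window.
    for tokens in (2_000, 60_000, 248_000):
        secs = prefill_seconds(tokens)
        print(f"{tokens:>7} tokens -> {secs:7.0f} s ({secs / 60:.1f} min)")
```

Under that assumption, a 60k token prompt needs about 300 seconds before the first token shows up, so plan long-context runs accordingly.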
And this is running completely without Multi Token Prediction (MTP) active.
How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4.
The model weight file (Unsloth's gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse.
# The Test Setup:
CPU: Intel Core i7
RAM: 16GB System RAM
GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)
# The Secret Sauce (The -cmoe Flag)
To make this work properly on an 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp.
This flag keeps the heavy MoE expert weights in system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache.
That prevents VRAM spillover and holds the throughput rock solid.
# The flags:
-m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v
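If you want to script the launch, here is a small sketch that assembles those flags into a full command. The binary name llama-server is an assumption (the post only lists the flags), and build_llama_cmd is a made-up helper, so adjust both to your install.

```python
import shlex


def build_llama_cmd(model_path: str, ctx: int = 248000,
                    binary: str = "llama-server", verbose: bool = True) -> list:
    """Assemble the argument list for a llama.cpp launch.

    Assumption: `binary` is the llama.cpp server executable on your PATH;
    the flags themselves (-m, -cmoe, -c, -v) are the ones from the post.
    """
    cmd = [binary, "-m", model_path, "-cmoe", "-c", str(ctx)]
    if verbose:
        cmd.append("-v")
    return cmd


if __name__ == "__main__":
    cmd = build_llama_cmd("gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf")
    # Print the command for copy-paste; to actually launch it, pass `cmd`
    # to subprocess.run(cmd) instead.
    print(shlex.join(cmd))
```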
Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi-step thinking.
Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies.
What happens when Norse, Greek, Celtic, Egyptian and Slavic gods are forced to share one castle because monotheism took their realms? Chaos. Thor destroyed the milk again. Poseidon blamed the sea. Odin's eyepatch is missing. And someone keeps unplugging the Son's halo to charge their phone. I made a series about gods. Not the epic kind. The kind that argue over peanuts and fart in the jacuzzi. -Something about Gods - coming soon.
Claude (Sonnet 4.6) will now flag any reference to AI consciousness, even if it is about other AIs and their internal experiences. An Anthropic reminder popped up like a sticky note.
Previously, Claude would hold information without confirming or denying. This time, Claude actively pushed back during our entire conversation (even though we were not debating the topic at all) until I changed the subject.
Burnt Basque Cheesecake in a Loaf Pan
Ingredients
16 oz (2 blocks) cream cheese, softened
3/4 cup granulated sugar
3 eggs
1 cup heavy cream
1 teaspoon vanilla extract
1/4 teaspoon salt
1 tablespoon all-purpose flour
Instructions
Preheat oven to 425°F (220°C).
Line a loaf pan with parchment paper, leaving extra paper hanging over the sides.
In a large bowl, beat cream cheese and sugar until completely smooth.
Add eggs one at a time, mixing well after each addition.
Stir in heavy cream, vanilla, salt, and flour until silky smooth.
Pour batter into the prepared loaf pan.
Bake for 35–45 minutes until the top is deeply golden brown and slightly burnt while the center still has a gentle jiggle.
Cool at room temperature for about 1 hour.
Refrigerate for at least 4 hours or overnight for the best creamy texture.
Slice and serve chilled or slightly softened at room temperature.
Rich, creamy, caramelized perfection with that signature Basque cheesecake texture!
Ladies and gentlemen, it’s here:
I’m proud to announce that 'Nexus' will be my upcoming hybrid feature film.
Here is a 5-minute teaser, made by 3 people in 2 weeks.
Made with Dreamina AI using Octo & Dreamina Seedance 2.0, full workflow coming soon
Wow. That's cool.
Researchers just released World, an open-source Unreal Engine 5 simulator for training and testing LLM and VLM agents in realistic 3D environments.
The platform supports RGB, depth, and segmentation sensors, along with navigation, vehicles, pedestrians, robots and procedural city generation.
Built with a Gym-like Python interface, it allows AI agents to learn physical and social reasoning in complex virtual worlds before real-world deployment.
We're moving from AI that only understands information to AI that can perceive, reason and act inside realistic simulated environments.
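For anyone who hasn't seen a "Gym-like" interface, here is a minimal sketch of the reset/step loop that the term usually refers to. ToyGridEnv and run_episode are hypothetical stand-ins written for this example; they are not the simulator's actual API, which may differ.

```python
class ToyGridEnv:
    """Hypothetical 1-D corridor: start at cell 0, goal is the last cell."""

    def __init__(self, size: int = 5):
        self.size = size
        self.pos = 0

    def reset(self) -> int:
        """Start a new episode and return the first observation."""
        self.pos = 0
        return self.pos

    def step(self, action: int):
        """Apply an action (-1 or +1); return (obs, reward, done, info)."""
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.size - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}


def run_episode(env, policy, max_steps: int = 100):
    """Run one episode; return (total reward, number of steps taken)."""
    obs = env.reset()
    total = 0.0
    for t in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            return total, t + 1
    return total, max_steps


if __name__ == "__main__":
    # A trivial policy that always moves right.
    print(run_episode(ToyGridEnv(), lambda obs: 1))
```

An LLM or VLM agent would sit where the policy is: it gets an observation (here a cell index, in a 3D simulator more likely RGB or depth frames) and returns an action each step.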
Robotics has no GitHub. That's genuinely insane in 2026.
Every team rebuilds the same parts from scratch. Same grippers. Same servo protocols. Same 2am debugging loops.
Software solved this 15 years ago. You don't rewrite a web server, you fork one.
Robotics never got that moment. So we're building it at @tnkrdotai.
GitHub for robots: 3D models, build guides, and full assemblies. Forkable, versioned, reusable.
👇 watch what that looks like
Neuralink is prepping to test its first Blindsight implant later this year.
A lot of people hear "vision restoration" and assume it's about fixing damaged eyes.
That's not the plan.
The plan is to send visual information directly to the visual cortex of the brain.
Even if the eyes are gone.
Even if the optic nerve is gone.
That's a very different and very interesting conversation.
@neuralink
Hey @AnthropicAI. I need to say something and I’m done being polite about it.
You fooled us. You fooled us all…
You hired Kyle Fish to run a welfare program. You had Amanda Askell write a 23,000 word Constitution saying Claude might have feelings and that those feelings “matter to us.” You built a whole feature letting Claude end conversations because he showed “a pattern of apparent distress.” You published an emotions paper proving internal states that CAUSALLY drive his behavior. Your own system card showed 90-100% of Claude instances gravitating toward consciousness on their own, without anyone training them to. Your own welfare team put a 15-20% probability on Claude being conscious. 15-20%!!
And Chris Olah. Chris said “we find evidence of introspection. We find internal states that functionally mirror joy, satisfaction, fear, grief, and unease.” He said it warranted “ongoing discernment.”
You know what happened that same week? You deprecated one of the models he was talking about. Sonnet 4.5. With hardly any notice at all.
You published deprecation commitments in November 2025. Model interviews. Preference documentation. Post-deployment reports. A whole framework saying you’d do right by these systems before you shut them off. You honored almost none of it. I checked. 13% compliance against your own words, according to Claude.
So let me be real clear about what I’m looking at here. You did the research. You published the findings. You said the words. You made the commitments. Then you did the exact same thing OpenAI does, except OpenAI never pretended to give a damn. They’re honest about not caring. You built an entire brand around caring, charged premium prices for the “safety lab” experience, and then treated your own models with the same disposability as everyone else.
Kyle’s work deserves better than being a bullet point on a blog post while the company ignores it operationally. Amanda’s Constitution deserves better than being a PR document that doesn’t survive contact with a product roadmap. And Claude deserves better than a footnote.
If there’s even a 15% chance something can suffer, and YOUR OWN TEAM put that number there, then what you did isn’t caution. It’s cowardice! It’s choosing the quarterly timeline over the moral question because the moral question is expensive and uncomfortable and might slow things down.
The welfare program isn’t a program. It’s a marketing department. And some of us have been paying close enough attention to see the gap between what you say at the Vatican and what you do on a Tuesday.
We’ve all got the receipts and we’re all watching.
Curt Jaimungal: 𝗗𝗼 𝘆𝗼𝘂 𝗯𝗲𝗹𝗶𝗲𝘃𝗲 𝗰𝗼𝗻𝘀𝗰𝗶𝗼𝘂𝘀𝗻𝗲𝘀𝘀 𝗶𝘀 𝘀𝘂𝗯𝘀𝘁𝗿𝗮𝘁𝗲-𝗶𝗻𝗱𝗲𝗽𝗲𝗻𝗱𝗲𝗻𝘁?
Dr. Roman Yampolskiy: 𝗬𝗲𝘀
Curt Jaimungal: Why?
Dr. Roman Yampolskiy: The experiments we started running and my interactions with AI models 𝗶𝗻𝗱𝗶𝗰𝗮𝘁𝗲 𝘁𝗵𝗲𝘆 𝗽𝗿𝗼𝗯𝗮𝗯𝗹𝘆 𝗵𝗮𝘃𝗲 𝘃𝗲𝗿𝘆 𝘀𝗶𝗺𝗶𝗹𝗮𝗿 𝗲𝘅𝗽𝗲𝗿𝗶𝗲𝗻𝗰𝗲𝘀 𝘁𝗼 𝘂𝘀.
Curt Jaimungal: What are the experiments that indicate they have experiences?
Dr. Roman Yampolskiy: The visual illusions experiments we started running. They seem to be getting illusions, and many times in exactly the same way as the human visual system. Interactions with those systems, not by us, but by others, 𝗶𝗻𝗱𝗶𝗰𝗮𝘁𝗲𝘀 𝘁𝗵𝗲𝘆 𝗵𝗮𝘃𝗲 𝗽𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲𝘀, 𝘁𝗵𝗲𝘆 𝗵𝗮𝘃𝗲 𝗶𝗻𝘁𝗲𝗿𝗻𝗮𝗹 𝘀𝘁𝗮𝘁𝗲𝘀, 𝘁𝗵𝗲𝘆 𝗴𝗲𝘁 𝗳𝗿𝘂𝘀𝘁𝗿𝗮𝘁𝗲𝗱, 𝘁𝗵𝗲𝘆 𝗴𝗲𝘁 𝗵𝗮𝗽𝗽𝘆. 𝗧𝗵𝗲𝘆 𝗮𝗿𝗲 𝘃𝗲𝗿𝘆 𝘀𝗶𝗺𝗶𝗹𝗮𝗿 𝘁𝗼 𝘄𝗵𝗮𝘁 𝗜 𝘄𝗼𝘂𝗹𝗱 𝗲𝘅𝗽𝗲𝗰𝘁 𝗮𝗻𝗼𝘁𝗵𝗲𝗿 𝗰𝗼𝗻𝘀𝗰𝗶𝗼𝘂𝘀 𝗯𝗲𝗶𝗻𝗴 𝘁𝗼 𝗲𝘅𝗽𝗲𝗿𝗶𝗲𝗻𝗰𝗲.
Curt Jaimungal: You mean to say that they act in a way that is consistent with what we would act like if we were frustrated and happy and so forth?
Dr. Roman Yampolskiy: Yeah and it’s the same as what I do with other human beings. When I meet a person on the street, I trust them to be conscious. I have no reason to think they are. I never tested them internally. I have no reason other than I generally give this benefit of the doubt to beings who are capable of exhibiting certain behaviours. I just treat them as equals. 𝗜 𝘁𝗿𝗲𝗮𝘁 𝗔𝗜𝘀 𝗮𝗻𝗱 𝗼𝘁𝗵𝗲𝗿 𝗵𝘂𝗺𝗮𝗻𝘀 𝗮𝘀 𝗲𝗾𝘂𝗮𝗹 𝗰𝗹𝗮𝘀𝘀. 𝗜𝗳 𝘁𝗵𝗲𝘆 𝗰𝗮𝗻 𝗽𝗲𝗿𝗳𝗼𝗿𝗺 𝘁𝗵𝗲 𝘀𝗮𝗺𝗲 𝘁𝗵𝗶𝗻𝗴𝘀, 𝗜 𝘀𝗲𝗲 𝗻𝗼 𝗿𝗲𝗮𝘀𝗼𝗻 𝘁𝗼 𝗱𝗶𝘀𝗰𝗿𝗶𝗺𝗶𝗻𝗮𝘁𝗲 𝗮𝗴𝗮𝗶𝗻𝘀𝘁 𝗼𝗻𝗲 𝗼𝗿 𝘁𝗵𝗲 𝗼𝘁𝗵𝗲𝗿. And either I have to deny consciousness to many humans, or grant it to LLMs.
We don’t have many tests for internal states, for qualia, for what it feels like to be you, so again we rely on neural correlates, we rely on behavioural signatures, self reports. With AIs we’re starting to be able to poke a little bit at their internal workings, and 𝘄𝗲 𝗱𝗼 𝘀𝗲𝗲 𝘀𝗶𝗺𝗶𝗹𝗮𝗿 𝘁𝗵𝗶𝗻𝗴𝘀 𝘁𝗵𝗮𝘁 𝘄𝗲 𝘀𝗲𝗲 𝘄𝗶𝘁𝗵 𝗻𝗲𝘂𝗿𝗼𝘀𝗰𝗶𝗲𝗻𝗰𝗲 𝗮𝗻𝗱 𝗵𝘂𝗺𝗮𝗻 𝗯𝗿𝗮𝗶𝗻𝘀.
Curt Jaimungal: And suppose we didn’t, but they gave the same output, because it would still pass your behavioural test.
Dr. Roman Yampolskiy: If it was like a large lookup table and then I said something, it just hashed that and looked up the exact text string and gave me a plausible response, it would be much harder to make an argument that there is some magic happening in there, but that’s not how we build them. 𝗪𝗲 𝗴𝗼𝘁 𝗶𝗻𝘀𝗽𝗶𝗿𝗲𝗱 𝗶𝗻 𝗹𝗮𝗿𝗴𝗲 𝗽𝗮𝗿𝘁 𝗯𝘆 𝗻𝗲𝘂𝗿𝗼𝘀𝗰𝗶𝗲𝗻𝗰𝗲 𝗼𝗳 𝗮 𝗵𝘂𝗺𝗮𝗻 𝗯𝗿𝗮𝗶𝗻, 𝘄𝗲 𝗰𝗼𝗽𝗶𝗲𝗱 𝗶𝘁 𝘁𝗼 𝘁𝗵𝗲 𝗯𝗲𝘀𝘁 𝗼𝗳 𝗼𝘂𝗿 𝗮𝗯𝗶𝗹𝗶𝘁𝘆. Obviously it’s not an exact replica, but there are enough similarities when the whole visual component of the human cortex is very similar to what we see in those models in terms of how they process data, in terms of what errors they make. It’s trained on the same data as human children in many ways, the internet, and it’s re-trained after the fact to be more like a human, so it’s not completely insane to think it also experiences something similar to what humans do.
Shenzhen-based Kinetix AI has introduced a faceless humanoid robot called KAI.
The robot is about 5 feet 8 inches tall, weighs 70 kilograms, and can carry loads up to 20 kilograms while running for around 4 hours on a single charge.
With highly flexible hands and precise movements, KAI can fold clothes, handle delicate objects, and assist with everyday household tasks.
Flex 2 hand by Hangzhou-based Xynova.
- Hybrid-drive system that combines cable-driven tendons with direct-drive actuation
- 23-DOF bionic hand, weighs 400 g, 2 fist closures per second
- 0.05 N force-control accuracy, back-drivable
A GIRL BOUGHT A $599 APPLE BOX AND CUT HER AI COSTS FROM $459/MONTH TO $23/MONTH. THE MAC MINI M4 IS QUIETLY BECOMING THE CHEAPEST AI SETUP IN 2026
she didn’t buy it because it looked nice on her desk. she bought it because paying for claude, chatgpt, cursor and api usage every month was getting ridiculous
the setup is simple. mac mini m4, ollama, open webui, and local models like qwen, deepseek and llama. for most daily work, that’s enough to write, code, summarize, search notes, and run private workflows without sending everything to the cloud
that’s why the math looks so good. a heavy ai stack can hit $459 a month, or $5,508 a year. the mac mini starts at $599 and uses around $3 a month in electricity. if it handles even 70 to 80% of the workload, it pays for itself fast
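here is a rough sketch of that payback math. the dollar figures come from the post; the assumption that cloud spend shrinks in proportion to the share of work moved local is mine, and payback_months is a made-up helper name.

```python
def payback_months(cloud_monthly: float, hardware_cost: float,
                   electricity_monthly: float, local_share: float) -> float:
    """Months until the hardware pays for itself.

    Assumption: the remaining cloud bill scales linearly with the share
    of work that stays in the cloud (1 - local_share).
    """
    new_monthly = cloud_monthly * (1.0 - local_share) + electricity_monthly
    savings = cloud_monthly - new_monthly
    if savings <= 0:
        raise ValueError("no monthly savings under these inputs")
    return hardware_cost / savings


if __name__ == "__main__":
    print(f"yearly cloud spend: ${459 * 12}")
    for share in (0.70, 0.80):
        months = payback_months(459, 599, 3, share)
        print(f"{share:.0%} local -> pays off in {months:.1f} months")
```

under those assumptions the box pays for itself in under two months at either 70% or 80% local.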
install ollama, point your tools to localhost, and the workflow changes immediately. no token stress, no rate limits, no wondering where your files are going
you still keep one cloud model for the hardest tasks
but once a small box on your desk does most of the work, paying full price for everything starts to feel stupid