BOOM!
Meet the open source Cambrian Explosion of repulsion of Anthropic!
Meet Qwythos 9B, a Qwen3.5 based GGUF that's both uncensored and quantized for efficiency.
I am running it now and it is brilliant!
A model that can reason through 1 million tokens of context, understand images and text, and even call functions.
Come and take it!
https://t.co/UFU3fas9OD
I just got Gemma 4 26B A4B MoE model running fully locally with Hermes agent on an 8GB RTX 4060 and it's now backtesting trading strategies end to end, no hand holding.
If you’re a trader or work on Wall Street, you don’t want to miss this.
Yes. fully automated. No cloud. No APIs beyond market data.
# Here's what I did:
Setup:
- Model: Gemma 4 26B-A4B QAT (MoE), Q4_K_XL Unsloth's quant (link in the comments)
- Inference: llama.cpp (turboquant fork by @no_stp_on_snek link in the comments)
- Hardware: RTX 4060, 8GB VRAM + 16GB RAM only (with 50 other chrome tabs open)
- Context: 64K
llama.cpp turboquant flags:
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080
turboquant helps achieve high prefill and decode throughput for interactive sessions.
throughput with Hermes agent:
decode: 25+ tokens/sec
prefill: 250+ tokens/sec
# Then I gave the agent one task:
Backtest a strategy:
- Buy when RSI crosses above 30
- Sell at +2% profit or -1% stoploss
- No overlapping positions
- Use Google stock via yfinance
- Generate a full HTML report with candlestick charts + signals
What happened next was wild. It didn't just write code, it ran the entire workflow itself:
Audited the environment (pip list, dependency check)
Hit a ModuleNotFoundError, multiple Python installs were conflicting
Ran where python to map every interpreter on the system
Manually selected the correct Python 3.13 path and re ran the script
Wrote a clean statevmachine backtester (strict no overlapping trades logic)
Patched a yfinance MultiIndex quirk that would've crashed the script
Built Plotly candlestick + RSI charts with buy/sell markers
Calculated win rate, PnL, and summary stats
Exported a polished single file HTML report. check the report at the end of the video or in the comments.
Biggest takeaway: local LLMs aren't just "chat assistants" anymore. They debug their own environment, write production code, and ship a finished deliverable on consumer hardware, for $0 in API costs.
If you're still calling local models "toys," you're already behind.
This is just the beginning.
Hermes agent just surpassed 1 trillion tokens in a single day on OpenRouter. Think about the scale of total token generation happening right now.
Disclaimer: This is not financial advice. Consult a professional before making any trading decisions.
Web scraping will never be the same.
(100% open-source visual search at scale)
PixelRAG is a retrieval system that skips HTML parsing completely.
Instead of scraping a page into text and embedding chunks, it screenshots the page and retrieves the image. A vision-language model reads the answer straight off the pixels.
Why that matters: parsing is where web RAG quietly loses information.
- A single HTML-to-text parser can drop 40%+ of a page.
- Tables, charts, and layout get flattened or thrown out.
- Swapping parsers alone can move accuracy ~10 points on the same docs.
PixelRAG indexes the page a person actually sees. The team built a visual index of all of Wikipedia, 30M+ screenshots, and it still beats the strongest text RAG baseline by 18.1% on text-only QA.
The repo also ships a Claude Code plugin that gives Claude eyes.
It lets Claude screenshot any URL and read the rendered page instead of scraping the DOM. So you can hand it a live page, an arXiv paper, or your local site and ask what it actually looks like.
One setup script. No MCP server, no backend.
How the pipeline works:
- Renders each document (web, PDF, image) to image tiles.
- Embeds them with Qwen3-VL-Embedding, LoRA fine-tuned on screenshots.
- Builds a FAISS index and serves a search API.
A stronger reader model lifts accuracy with no re-indexing, since the index is just pixels.
Everything is open-source under Apache-2.0.
GitHub repo: https://t.co/qun9TjAdmw
Talking about RAG, I recently wrote an article on a new approach that makes retrieval much more efficient by cutting corpus size by 40x, reducing tokens per query by 3x, and improving vector search relevance by 2.3x.
The article is quoted below.
gemma-4-12B-agentic-fable5-composer2.5 V2 is out.
the agentic upgrade to the model trained on Fable 5's reasoning. Running it now with TurboQuant llama.cpp on a single RTX 4060( 8 GB VRAM) at 30 tokens/second with full 25000 context and reasoning:
# The benchmarks
v2 is built for coding + agentic work. writing code, running commands, using tools, debugging, multi step technical tasks. The clearest signal is tau2 bench telecom, an agentic tool use benchmark whose diagnose → fix → verify loop mirrors real terminal/debugging work:
tau2 bench telecom numbers:
base Gemma 4 12B: ~15%
this finetune: ~55%. (Self reported)
thats a huge jump
# TheTom/llama-cpp-turboquant flags:
llama-server.exe -m gemma4-v2-Q4_K_M.gguf -ngl 99 -c 25000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080
Flag breakdown:
-ngl 99 → full GPU offload
-c 25000 → 25K context
--cache-type-k q8_0 --cache-type-v turbo3 → mixed-precision KV cache — K at 8-bit, V at ~3-bit via TurboQuant (Walsh Hadamard rotated polar quant, Google's own KV-compression research).
Not even merged into mainline llama.cpp. running it off a fork.
No API. No cloud. Just llama.cpp. well, a fork of it and any 6gb+ GPU.
If you tried yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF, check this out and share your experience with the models
I genuinely don't understand why everyone isn't using this yet
Andrej Karpathy, a co-founder of OpenAI, posted a simple idea that hit 16 million views: stop using AI to write code, use it to build a second brain.
You point Claude Code at a folder, drop in any source, an article, a transcript, a PDF, and Claude reads it, links it, and files it into a living wiki of everything you know. It compounds like interest, the more you feed it, the smarter it gets.
Here's the whole thing:
> Install Obsidian, create a vault, open it in Claude Code
> Paste Karpathy's wiki idea file and tell Claude to build it
> Claude makes three folders: raw for sources, wiki for its pages, a CLAUDE.md that runs it
> Drop any source into raw and say "ingest this"
> Ask questions across everything, forever
Five minutes to set up, and you never start from a blank chat again.
Full step-by-step guide with Claude and Obsidian, link below.
Bookmark this
Un desarrollador chino llamado tw93 se hartó de que sus aplicaciones de escritorio le devoraran la RAM y el disco.
Abría Slack y desaparecían cientos de megabytes. Abría Discord, Notion o cualquier otra app y pasaba lo mismo. ¿La razón? Casi todas son lo mismo por dentro: un sitio web empaquetado con una copia completa del motor de Chrome (Electron).
Decidió que tenía que haber una forma mejor.
En 2022 empezó a construir Pake. Usó Rust + Tauri, que en vez de incluir un navegador completo, aprovecha el WebView nativo del sistema operativo.
El resultado fue brutal:
- Slack con Pake → 8 MB (en vez de 524 MB)
- Discord con Pake → 9 MB (en vez de 265 MB)
- ChatGPT con Pake → 9 MB (en vez de 260 MB)
Cuatro años después, su repositorio tiene más de 51.000 estrellas en GitHub. Tiene builds listos para Grok, ChatGPT, Gemini, Discord, YouTube, Twitter y muchos más. Todo bajo los 10 MB, ligero, rápido y gratis.
Y lo mejor: con un solo comando puedes convertir cualquier página web en una aplicación de escritorio nativa.
No fundó una startup. No levantó inversión. Solo resolvió un problema que molestaba a millones de personas.
A veces el cambio real lo hace una sola persona que se cansa de las cosas como están.
Esta brutal, repo en los comentarios 👇
Google Translate is cooked after this.
A developer built a local AI translation engine that runs 40 languages entirely on your own laptop.
It's called LibreTranslate.
No API key.
No usage limits.
No sending your documents to Google's servers.
You install it once. It runs forever.
Here's what it handles:
→ Paste text. Translated instantly.
→ Drop in a file. Outputs the translated version.
→ Point it at a URL. Returns the page in your language.
→ Build it into your own app via its local REST API.
The speed is not the story. The privacy is.
Google Translate reads every sentence you paste into it. Legal contracts. Medical records. Internal emails. Client documents. Every word goes to their servers and stays there.
LibreTranslate runs entirely offline. Nothing leaves your machine. Ever.
The numbers:
→ 40 languages supported
→ Runs on CPU -- no GPU needed
→ Self-hosted in under 5 minutes
→ REST API built in for developers
→ 10K+ stars on GitHub
100% open source. MIT licensed. Price: $0.
Google charges nothing for Translate either but it charges you something else.
GitHub: https://t.co/5XF7qgbMRB
Gemma 4 12B Coder is here and it's a game changer for local code generation. This GGUF model packs Google's latest gemma-4 architecture into a compact 12B size, perfect for running on consumer hardware. It's optimized for reasoning and thinking, making it ideal for developers who want fast, private coding assistance without the cloud.
Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec
If you own any 8GB VRAM graphics card, stop what you are doing. Local AI just had its absolute "Holy Shit" moment for budget hardware.
Yesterday, I benchmarked Unsloth Gemma 4 12B Q4_K_XL on an 8GB card.
The community went wild but immediately demanded more: "Can we run a 25B+ model on budget GPUs?"
Today, I’m delivering exactly that.
I am running a massive 26B parameter Mixture of Experts (MoE) model locally on a standard 8GB VRAM setup with 250k full native context!.
If you own an RTX 3060, 3070, 4060, or any budget GPU with 8GB of VRAM, the local AI paradigm has completely changed.
The performance metrics are astonishing:
- 20 tokens/sec flat decode throughput.
- Stable, flat decode speed even with massive prompts.
- I threw a 60k token prompt at it, and it still clocked in at 20 TPS without dropping a single frame.
# What about prefill?
Yes, Time To First Token (TTFT) is slightly high when swallowing massive contexts. But with a solid 200 tokens/sec prefill speed, the wait is barely noticeable and highly usable.
And this is running completely without Multi Token Prediction (MTP) active.
How is this possible? It’s the magic of Google's new QAT (Quantization Aware Training) quants for Gemma 4.
The model weight file (unsloth gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) is only 13.2 GB, making it the ultimate local powerhouse.
# The Test Setup:
CPU: Intel Core i7
RAM: 16GB System RAM
GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8GB VRAM)
# The Secret Sauce (The -cmoe Flag)
To make this work properly on any 8GB card, you must use the -cmoe (CPU MoE) flag in llama.cpp.
This flag isolates the heavy MoE expert weights directly to system memory (CPU/RAM) while letting your GPU focus strictly on the Attention layers and the KV Cache.
It prevents VRAM spillage and holds the throughput rock solid.
# The flags:
-m "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" -cmoe -c 248000 -v
Once running, just open the UI on localhost and toggle the new reasoning lightbulb icon in the text input box to watch the model perform multi step thinking.
Are you still running smaller models, or are you ready to scale up your budget local setups? Let's discuss in the replies
THIS FEELS ILLEGAL 🤯
NVIDIA is giving access to 120+ AI models FREE for an entire year.
No credit card.
No payment.
Just a free API key.
Hermes Studio already supports NVIDIA out of the box, so setup takes minutes.
→ 120+ models available
→ 40 requests per minute
→ Free access for 1 full year
While most people are spending money on AI APIs, this is hiding in plain sight.