AppleScript is the oldest way to drive a Mac. Frontier models butcher it.
So we trained two small ones that don't.
Open weights. On-device. 100% compile.
Today we’re releasing Laguna XS 2.1.
It’s a small upgrade to the Laguna XS.2 model: the same 33B total / 3B active MoE, with stronger results on multilingual coding and terminal-style tasks.
Available now on @huggingface, @OpenRouter, and via Poolside API.
The future of the firm is a learning loop in which human capital and token capital compound.
With our new Frontier Co., our ambition is to help every enterprise build its own AI capability, and to help create a frontier ecosystem where every organization can turn its knowledge, workflows, and judgment into its own AI systems that continuously improve. https://t.co/mvYhkRFyqa
Qwythos-9B is a full-parameter reasoning model post-trained on over 500 million tokens of high-quality Claude Mythos / Claude Fable traces with chain-of-thought generated in-house by Empero AI's internal rethink tool.
It dominates the base Qwen3.5-9B under matched evaluation (+34 pts MMLU, +30 pts gsm8k-strict, +19 pts gsm8k-flex), supports native function calling per the Qwen3.5 spec, and ships with a 1,048,576-token (1M) context window via YaRN rope-scaling enabled by default.
Exactly. I've been disseminating a similar message for years.
The concentration of power in AI and the desire for control is by far the biggest danger of AI. It could lead to a few private companies and/or countries being in control of access to information, access to knowledge, and access to the tools of economic expansion.
It's a kind of medieval obscurantism akin to the Ottoman Empire banning the use of the printing press for 200 years, in part to keep control of the dogma, but also to protect the corporation of the calligraphers and scribes.
Relevant historical bits about the Internet:
1. It took a deliberate decision by Al Gore and Bill Clinton to open up access to what was then ARPAnet to commercial entities and to the public, against the desires of the entrenched telecom industry. During a public roundtable about the "information superhighway" in 1993, the CEO of AT&T told Gore and Clinton "leave it to us". Gore said no.
2. In the late 1980s, setting up an Internet presence required buying proprietary hardware with a proprietary OS and software stack from Sun Microsystems, HP, IBM, or Dell. By the 2000s, all of this was wiped out by commodity hardware, Linux, Apache, and an entirely free/open software stack. This migration to open platforms was the result of market forces.
Infrastructure wants to be open.
Foundation models are becoming an infrastructure and will inevitably become commoditized.
Long term, the money is in the application layer, which is what I, Arthur Mensch, Alex Karp, and others have been saying.
Ornith 1.0 is not just another open-source model.
It changes how AI agents actually think through work.
Here’s the simple breakdown:
→ It is MIT licensed.
→ It can be used commercially.
→ It has 9B, 35B, and 397B versions.
→ The 9B can run on a laptop.
→ The 35B is 21.2GB at 4-bit quantization.
→ The flagship reportedly scores 82.4 on SWE-Bench Verified.
→ It uses self-scaffolding reinforcement learning.
That last part matters most.
Most AI agents need humans to build the workflow around them.
Ornith starts building the workflow while solving the task.
Save this video, you’ll understand why AI agents are changing fast.
Want the SOP? DM me. 💬
🎙️ Serving TTS isn't the same problem as serving an LLM. It has to hit a first-audio budget of a few hundred ms, keep audio continuous across streaming chunks, and sustain enough concurrent streams per GPU to keep serving cost down. It's also a multi-stage pipeline where each stage bottlenecks differently, so no single recipe carries across models. The vLLM-Omni TTS team tuned a different lever for each of four TTS models:
🗣️ Qwen3-TTS: decouple connector chunking from the Code2Wav decode window, batch the Stage-0 decode preprocessing. +61.5% audio throughput on H20×2, P99 latency nearly halved.
🌊 VoxCPM2: whole-forward torch.compile, plus CFM/LocDiT decode-tail batching across requests. +172% audio throughput.
🎚️ Higgs Audio V3: move the multi-codebook decode state machine into GPU-resident tensors. 2.7x speedup.
🐟 Fish Speech S2 Pro: a model-specific q_len=1 Triton attention kernel for the pure-decode path.
Full engineering deep-dive on how we picked each lever:
🔗 https://t.co/ZVROwJwYoT
Google is unexpectedly opening its vaults to developers, officially offering 1,000,000 tokens per minute completely free and with zero restrictions. 😳
No credit card required and no monthly subscriptions; just direct, official access through the Google AI Studio platform to massive computing power that used to cost thousands of dollars a month.
Here are the details of this opportunity and how to take advantage of it in your next project: 👇
This is actually wild. Hermes just let you merge any two AI models into one virtual model. 🤯
It is called Mixture of Agents. Here is how it works.
You pick any two models. GPT-5.5 and Claude Opus for example. One runs as the reference, one as the aggregator. Name the combo anything you want. It shows up as a single selectable model in your picker like any other.
Every task, both models run in parallel. The reference analyzes and responds.
The aggregator reads that, synthesizes everything, writes the final answer, and handles all tool calls. You see one clean output.
The results on hard agentic tasks:
→ 8% higher than Opus 4.8 alone
→ 11% higher than GPT-5.5 alone
Full Hermes features work untouched. Memory, tool use, skills, long sessions, cross-channel messaging. Nothing breaks.
The combo just performs better than either model on its own.
You can mix any providers too. OpenAI, Anthropic, OpenRouter, local models. Whatever you have access to.
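The reference/aggregator flow described above can be sketched roughly like this. It is not Hermes code: the `Model` type, `mixture_of_agents`, and the two stub models are hypothetical stand-ins for real provider calls, and the exact prompt the aggregator receives is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical interface: a model is any function from prompt text to reply text.
Model = Callable[[str], str]


def mixture_of_agents(task: str, reference: Model, aggregator: Model) -> str:
    """Run both models on the task in parallel, then let the aggregator synthesize."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ref_future = pool.submit(reference, task)
        agg_future = pool.submit(aggregator, task)
        ref_draft = ref_future.result()
        agg_draft = agg_future.result()

    # Assumed prompt layout: the aggregator sees the task plus both drafts.
    synthesis_prompt = (
        f"Task: {task}\n"
        f"Reference analysis: {ref_draft}\n"
        f"Your own draft: {agg_draft}\n"
        "Synthesize one final answer."
    )
    return aggregator(synthesis_prompt)


def stub_reference(prompt: str) -> str:
    # Stand-in for a call to the reference model's provider.
    return f"reference notes on: {prompt[:40]}"


def stub_aggregator(prompt: str) -> str:
    # Stand-in for a call to the aggregator model's provider.
    return f"answer built from {len(prompt)} characters of context"


if __name__ == "__main__":
    print(mixture_of_agents("Rename a function across the repo", stub_reference, stub_aggregator))
```

In this sketch only the aggregator's second call produces user-visible output, which matches the "one clean output" behavior; tool calls are left out entirely.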
Introducing LFM2.5-230M: our smallest model yet, built to run fast anywhere (CPUs, NPUs, and GPUs) to enable agentic tasks on phones, robots, home and network automation devices.
> 230M parameters, built on the LFM2 architecture
> Pre-trained on 19T tokens, with a 32K context extension
> Post-trained with distillation from LFM2.5-350M
> 213 tok/s decode speed on Galaxy S25 Ultra (CPU)
> 42 tok/s on a Raspberry Pi 5 (CPU)
> Competes with and often beats models more than twice its size on instruction following, data extraction, and tool use.
> use it for large-scale data extraction pipelines or lightweight on-device agentic workloads.
🧵
Aloha! 🌺 Meet Ornith-1.0, a family of open-source LLMs specialized for agentic coding.
Ornith-1.0 spans a full range of parameter sizes, including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. It achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks including:
✅ Terminal-Bench 2.1 (77.5)
✅ SWE-Bench (82.4 on Verified, 62.2 on Pro, 78.9 on Multilingual)
✅ NL2Repo (48.2)
✅ SWE Atlas (41.2 on QnA, 42.6 RF, 39.1 TW)
✅ ClawEval (77.1)
Post-trained on top of gemma4 and qwen3.5, Ornith-1.0 employs a novel self-improving training strategy in which reinforcement learning is used to generate not only solution rollouts, but also the task-specific scaffolds that drive those rollouts. By jointly optimizing the scaffold and the resulting solution, the model generates higher-quality solutions in agentic coding.😎
All models are released under the MIT license, enabling full commercial and research use.
📖Tech Blog: https://t.co/qT9N2HYWFn
🤗Huggingface: https://t.co/PRrwqjeBtM
Web scraping will never be the same.
(100% open-source visual search at scale)
PixelRAG is a retrieval system that skips HTML parsing completely.
Instead of scraping a page into text and embedding chunks, it screenshots the page and retrieves the image. A vision-language model reads the answer straight off the pixels.
Why that matters: parsing is where web RAG quietly loses information.
- A single HTML-to-text parser can drop 40%+ of a page.
- Tables, charts, and layout get flattened or thrown out.
- Swapping parsers alone can move accuracy ~10 points on the same docs.
PixelRAG indexes the page a person actually sees. The team built a visual index of all of Wikipedia, 30M+ screenshots, and it still beats the strongest text RAG baseline by 18.1% on text-only QA.
The repo also ships a Claude Code plugin that gives Claude eyes.
It lets Claude screenshot any URL and read the rendered page instead of scraping the DOM. So you can hand it a live page, an arXiv paper, or your local site and ask what it actually looks like.
One setup script. No MCP server, no backend.
How the pipeline works:
- Renders each document (web, PDF, image) to image tiles.
- Embeds them with Qwen3-VL-Embedding, LoRA fine-tuned on screenshots.
- Builds a FAISS index and serves a search API.
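The three steps above can be sketched in miniature. This is a toy under loud assumptions, not the PixelRAG code: pages are tiny grayscale grids instead of real screenshots, `embed_tile` is a hypothetical brightness-based stand-in for Qwen3-VL-Embedding, and `FlatIndex` is a brute-force stand-in for a FAISS index.

```python
import math
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale page render: rows of 0-255 pixel values


def render_to_tiles(image: Image, tile_height: int) -> List[Image]:
    """Split a tall rendered page into fixed-height tiles."""
    return [image[top:top + tile_height] for top in range(0, len(image), tile_height)]


def embed_tile(tile: Image, dim: int = 4) -> List[float]:
    """Toy embedding: mean brightness of `dim` vertical strips, L2-normalized.

    Hypothetical stand-in for a vision embedding model. Columns past
    dim * strip are ignored in this toy version.
    """
    width = len(tile[0])
    strip = max(1, width // dim)
    vec = []
    for i in range(dim):
        cols = range(i * strip, min(width, (i + 1) * strip))
        values = [row[c] for row in tile for c in cols]
        vec.append(sum(values) / len(values) if values else 0.0)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class FlatIndex:
    """Brute-force inner-product search, standing in for a FAISS index."""

    def __init__(self) -> None:
        self.entries: List[Tuple[List[float], Tuple[str, int]]] = []

    def add(self, vector: List[float], payload: Tuple[str, int]) -> None:
        self.entries.append((vector, payload))

    def search(self, query: List[float], k: int = 1) -> List[Tuple[float, Tuple[str, int]]]:
        scored = [
            (sum(q * v for q, v in zip(query, vec)), payload)
            for vec, payload in self.entries
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


def build_index(pages: Dict[str, Image], tile_height: int = 2) -> FlatIndex:
    """Render each page to tiles, embed each tile, and store (doc_id, tile_number)."""
    index = FlatIndex()
    for doc_id, image in pages.items():
        for n, tile in enumerate(render_to_tiles(image, tile_height)):
            index.add(embed_tile(tile), (doc_id, n))
    return index


# Two made-up "pages": one bright on the left, one bright on the right.
pages = {
    "page_a": [[255, 255, 0, 0]] * 4,
    "page_b": [[0, 0, 255, 255]] * 4,
}
index = build_index(pages)
query = embed_tile([[200, 200, 10, 10]] * 2)  # a left-bright query image
hits = index.search(query, k=1)
print(hits[0][1])  # payload of the best-matching tile
```

The index stores only tile vectors and pointers back to the image, so nothing here depends on which reader model later looks at the retrieved tile.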
A stronger reader model lifts accuracy with no re-indexing, since the index is just pixels.
Everything is open-source under Apache-2.0.
GitHub repo: https://t.co/qun9TjAdmw
Talking about RAG, I recently wrote an article on a new approach that makes retrieval much more efficient by cutting corpus size by 40x, reducing tokens per query by 3x, and improving vector search relevance by 2.3x.
The article is quoted below.
Local AI hardware = capacity × bandwidth × software stack
- Capacity tells you what fits
- Bandwidth tells you how hard the box can breathe
- The software stack tells you how much of the spec sheet you can actually cash out.
Hardware by Memory Bandwidth
- Mac Studio M3 Ultra: up to 512GB @ 819 GB/s
- RTX PRO 6000 Blackwell: 96GB @ 1792 GB/s
- RTX 5090: 32GB @ 1792 GB/s
- RTX 4090: 24GB @ 1008 GB/s
- RX 7900 XTX: 24GB @ 960 GB/s
- Radeon PRO W7900: 48GB @ 864 GB/s
- AMD Radeon AI PRO R9700: 32GB @ 640 GB/s
- Intel Arc Pro B65: 32GB @ ~608 GB/s
- Tenstorrent Wormhole n300: 24GB @ 576 GB/s
- Tenstorrent Blackhole p150: 32GB @ 512 GB/s + 800G
- MacBook Pro M5 Max: 460-614 GB/s
- Arc Pro B60: 24GB @ ~456 GB/s
- MacBook Pro M5 Pro: 307 GB/s
- DGX Spark: 128GB @ 273 GB/s (coherent + CUDA)
- Mac mini M4 Pro: 273 GB/s
- Ryzen AI Max / Strix Halo: ~256 GB/s (~96GB usable GPU)
- MacBook Air M5: 153 GB/s
- Snapdragon X2 Elite: 152-228 GB/s
- Intel Lunar Lake: 136 GB/s
- Snapdragon X Elite: 135 GB/s
- Mac mini M4: 120 GB/s
Verdict
- GPUs are still the bandwidth kings
- Apple wins: when you need stupid amounts of memory and don’t want to shard across GPUs
- Apple loses: when raw tokens/sec & concurrency matter more
- DGX Spark: coherent memory + NVIDIA stack
- Strix Halo / Ryzen AI Max: first real x86 unified-memory contender
- Tenstorrent: fully OSS stack, excited to see this mature
Fitting ≠ serving
Even if it fits, you still pay for
- bandwidth during decode
- KV cache growth
- dequantization
- batching + concurrency
- scheduler quality
- framework overhead
The only mental model that matters:
1. What must fit?
2. What bandwidth tier do I need?
3. What software stack can actually deliver it?
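The first two questions can be turned into back-of-envelope arithmetic. This is a rough sketch, assuming decode is purely memory-bound and every active weight is read once per token; the helper names, the 2 GB overhead allowance, and the 9B 4-bit example model are illustrative assumptions. The third question (software stack) decides how far below this ceiling you actually land.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         capacity_gb: float, overhead_gb: float = 2.0) -> bool:
    """Question 1: do the weights plus an assumed overhead allowance fit in memory?

    overhead_gb is a placeholder for KV cache and runtime buffers; real
    overhead grows with context length and batch size.
    """
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= capacity_gb


def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Question 2: upper bound on single-stream decode speed.

    Assumes each generated token streams all active weights from memory once,
    and ignores KV cache reads, dequantization, and framework overhead.
    """
    return bandwidth_gb_s / weight_gb(active_params_billion, bits_per_weight)


if __name__ == "__main__":
    # Hypothetical 9B dense model at 4 bits per weight, on two boxes from the list.
    for name, capacity_gb, bandwidth_gb_s in [
        ("RTX 5090", 32, 1792),
        ("DGX Spark", 128, 273),
    ]:
        ok = fits(9, 4, capacity_gb)
        ceiling = decode_ceiling_tok_s(bandwidth_gb_s, 9, 4)
        print(f"{name}: fits={ok}, decode ceiling ~{ceiling:.0f} tok/s")
```

For an MoE model, pass the total parameter count to `fits` and the active parameter count to `decode_ceiling_tok_s`, since capacity is set by what must be resident and bandwidth by what is read per token.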
In short:
- NVIDIA → fastest raw speed
- Apple Studio M3 Ultra → biggest one-box memory
- Strix Halo → first real x86 unified
- DGX Spark → coherent NVIDIA dev appliance
- AMD / Intel Arc → rising alternatives
- Tenstorrent → fully opensource stack
Do ask: “which bottleneck am I buying?”
Not: “which hardware is best?”
Do not infer with AI that which can be queried without.
That's from an internal presentation I gave at HubSpot today.
---
LLMs are great, but there are a *lot* of use cases that are much better handled with a structured query (like SQL). It's much more economical, much faster, and more predictable.
Just because an LLM can potentially answer a question you have by passing a bunch of unstructured text into the context window doesn't mean you should.
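A minimal illustration of the point, assuming a hypothetical `deals` table with made-up rows: the question "how much closed-won revenue does each owner have?" is one deterministic SQL query, with no tokens spent and no context window involved.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deals (id INTEGER PRIMARY KEY, owner TEXT, amount REAL, stage TEXT)"
)
conn.executemany(
    "INSERT INTO deals (owner, amount, stage) VALUES (?, ?, ?)",
    [
        ("ana", 1200.0, "closed_won"),
        ("ana", 800.0, "closed_won"),
        ("ben", 500.0, "open"),
        ("ben", 300.0, "closed_won"),
    ],
)


def closed_won_total_by_owner(connection: sqlite3.Connection) -> dict:
    """Answer the question with a structured query instead of model inference."""
    rows = connection.execute(
        "SELECT owner, SUM(amount) FROM deals "
        "WHERE stage = ? GROUP BY owner ORDER BY owner",
        ("closed_won",),
    ).fetchall()
    return dict(rows)


totals = closed_won_total_by_owner(conn)
print(totals)
```

The same answer comes back every time for the same data, which is the predictability argument; an LLM is better saved for the parts of the question that really are unstructured.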
DiffusionGemma can now run at 2000+ tokens/sec! ⚡
We made local DiffusionGemma inference 1.8× faster.
Run it on 18GB RAM via Unsloth Studio.
GitHub: https://t.co/aZWYAtakBP
Guide: https://t.co/wYLfJWE6kG
MiniMax M3, Open-Weight, Now On Hugging Face, with only ~428B parameters and ~23B activated parameters
Weights:
https://t.co/g4Ybfa2kWH
MiniMax Sparse Attention:
https://t.co/HcTlWRotG3
Congrats to @GoogleDeepMind on the launch of DiffusionGemma.
The model generates 256 tokens in parallel per step, delivering 150+ TPS on DGX Spark, and 1,000+ TPS on a single H100.
We're supporting it from day one with:
• BF16 and NVFP4 checkpoints on @huggingface🤗
• Free GPU-accelerated endpoints on https://t.co/6T0R9P7EXS
• @vllm_project support with FP8 precision
Get started with DiffusionGemma on NVIDIA: https://t.co/vurk7GCQUs
WSL containers ⚡
At #MSBuild, we announced a built-in way to create, run, and interact with Linux containers on Windows.
Watch the demo on demand: https://t.co/NVotUyk1U9
Meet DiffusionGemma!
An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.
It moves beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇