I just shipped oMLX v0.4.0, the first official release with the new native Swift macOS app.
https://t.co/AJT6hKvMPL
oMLX now ships with a redesigned onboarding flow, settings UI, Hugging Face cache discovery, and a much more native-feeling way to manage and run local models on macOS.
Huge thanks to GitHub contributor popfido for the excellent work that drove the Swift transition.
My goal is still the same: I want oMLX to be the app someone can open on a Mac and immediately try Local AI with, without needing to understand all the machinery first.
If you try 0.4.0, I’d really appreciate feedback on the macOS app experience, especially first launch, model discovery, server start/stop, update checks, and anything that still feels confusing.
Mellum started with code completion.
Mellum2 is built for more – handling both natural language and code.
A 12B-parameter open-source LLM for routing, RAG, and sub-agents, optimized for ultra-low-latency inference.
Now on @huggingface.
Learn more: https://t.co/28sG8Ql52L
I just asked M3 to improve the https://t.co/h8nT16uH4t static pages and it had the excellent idea to add internal page navigation. Nice and useful. Good job @MiniMax_AI ! Looking forward to testing this model more.
🔥@KwaiKeye 's Keye-VL-2.0-30B-A3B is now officially live on ModelScope! A major milestone that brings DSA (DeepSeek Sparse Attention) into multimodal AI. 🎬🤖
By coupling sparse attention with advanced feature aggregation, Keye 2.0 unlocks a 256k context window, allowing seamless processing of hour-long videos with zero context degradation. 📈
🔗 Get the weights: https://t.co/ScMIfM6ExY
🌟 Core Technical Highlights:
• 🧠 MoE Performance, Flash Cost: Outperforms 200B+ open models on LongVideoBench (74.10) while slashing prefill costs by 50%.
• ⏱️ Frame-Level Precision: Captures complex causal chains and timestamps in long vlogs, handicraft tutorials, and gaming clips.
• 🚀 Anti-Decay Mastery: On VideoMME V2, expanding input from 64 to 512 frames actually boosts accuracy from 35.34% to 42.44%.
llama.cpp now has an official website: https://t.co/vztdUpdBWL
Our goal is to make local AI accessible to everyone, and improving the user experience is a big part of that. On the new landing page you’ll find a single-line cross-platform installer. The installation provides a single unified `llama` entrypoint which you can use to run/serve models and interface with 3rd-party agentic applications.
While oriented towards simplified user experience, the new `llama` application also provides all the advanced functionality of the existing llama.cpp tooling with which experienced users are already familiar. Also note that all GGUF models that you might have already downloaded with llama.cpp in the past will be automatically available to use without downloading again (they are stored in the common HF cache on your machine).
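For anyone curious which GGUF files are already sitting in that shared cache, here is a minimal sketch. It assumes the standard Hugging Face hub cache location (`HF_HUB_CACHE` if set, otherwise `$HF_HOME/hub`, defaulting to `~/.cache/huggingface/hub`) and simply walks it for `*.gguf` files. The function names are hypothetical, and this is not how `llama` itself discovers models.

```python
import os
from pathlib import Path


def hf_hub_cache_dir() -> Path:
    """Resolve the Hugging Face hub cache directory.

    Simplified assumption: only HF_HUB_CACHE and HF_HOME are consulted;
    other overrides (such as XDG_CACHE_HOME) are ignored in this sketch.
    """
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"]).expanduser()
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"


def find_gguf_files(cache_dir: Path) -> list[Path]:
    """Return every *.gguf file found under cache_dir, sorted by path."""
    if not cache_dir.is_dir():
        return []
    return sorted(cache_dir.rglob("*.gguf"))


if __name__ == "__main__":
    for path in find_gguf_files(hf_hub_cache_dir()):
        print(path)
```

Running the script prints one path per cached GGUF file, or nothing if the cache is empty or missing.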
We have many improvements in the pipeline at both the UX and the engine level, and we plan to ship them iteratively over the coming months. One of the main focuses will be seamless integration with local-friendly 3rd-party agents (such as Pi). In the meantime, we’ll continue to listen to feedback from the community and adjust accordingly, so keep letting us know what you think and need.
The HF science team just made async RL weight sync ~100x cheaper on bandwidth, and you don't need a shared cluster anymore.
The problem: every RL step, the trainer typically has to sync fresh weights to the inference engine. for a 7B in bf16 that's ~14GB. for a frontier 1T fp8 checkpoint, that's ~1TB; in bf16 it would be ~2TB. per sync.
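The numbers above are just parameter count times bytes per parameter. A quick sketch of that arithmetic, assuming dense weights only (no optimizer state) and decimal gigabytes; the helper name is made up for illustration:

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def checkpoint_gb(num_params: int, dtype: str) -> float:
    """Size of a dense checkpoint in decimal gigabytes (weights only)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params, dtype in [
    ("7B, bf16", 7 * 10**9, "bf16"),
    ("1T, fp8", 10**12, "fp8"),
    ("1T, bf16", 10**12, "bf16"),
]:
    print(f"{name}: ~{checkpoint_gb(params, dtype):,.0f} GB per sync")
```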
The insight: between two RL steps, ~99% of bf16 weights are bit-identical. at RL learning rates, the optimizer is whispering and bf16 literally cannot hear most of it. the stored bf16 bits don't change.
What they shipped in TRL: only the changed elements get encoded as a sparse safetensors file, dropped into a Hugging Face Bucket, and fetched by vLLM. on Qwen3-0.6B, per-step payload drops from 1.2 GB to between 20 and 35 MB. This is exactly what we built Buckets for: S3-like object storage on the Hub, Xet-backed (so even full snapshots only transfer the changed chunks).
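To make the idea concrete, here is a toy sketch of a bit-level sparse delta. It is not the TRL implementation and does not use safetensors: it uses only the Python standard library, models bf16 as the rounded top 16 bits of a float32, and every helper name is hypothetical. It shows why a tiny update often leaves the stored bf16 bits untouched, and how only the changed (index, value) pairs need to travel.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x.

    Assumption: round-to-nearest-even from float32; NaN/inf edge cases
    are not handled in this sketch.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def sparse_delta(old_bits, new_bits):
    """List (index, new_value) pairs for elements whose bits changed."""
    return [(i, n) for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n]


def apply_delta(old_bits, delta):
    """Rebuild the new weights from the old ones plus a sparse delta."""
    out = list(old_bits)
    for i, value in delta:
        out[i] = value
    return out


# Made-up weights and updates: three tiny steps and one larger one.
old = [1.0, 0.5, -2.0, 0.003]
new = [1.0 + 1e-6, 0.5 - 1e-6, -2.0 + 1e-6, 0.003 + 1e-4]

old_bits = [to_bf16_bits(w) for w in old]
new_bits = [to_bf16_bits(w) for w in new]

delta = sparse_delta(old_bits, new_bits)
print(f"{len(delta)} of {len(old_bits)} elements changed")
```

The inference side only needs the previous weights plus `delta`: `apply_delta(old_bits, delta)` reproduces `new_bits` exactly.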
The cherry on top: we ran a FULL disaggregated training run where:
- the trainer lived on one box
- vLLM ran inside a Hugging Face Space
- the Wordle environment ran in another Space
- weights flowed through one Hub bucket
no shared cluster. no RDMA. no VPN. no NCCL across clouds. just HTTPS and a bucket.
one GPU + a Hugging Face account is now enough to do real disaggregated RL. multi-replica inference fleets across regions become a small devops exercise, not a research project.
Full write-up: https://t.co/CG115IjT0q
Open source RL keeps eating the moat!
GalaxDB looks interesting. SQL + vector search + local embeddings in one binary. No Pinecone. No OpenAI API. No data pipeline. Your existing psycopg2 code works unchanged: https://t.co/qpdop7k2o1
cool new release: a tiny open video VLM that understands what happens in videos and when 👀
Marlin-2B (Apache 2.0!) can caption clips into timestamped events, or find a natural-language moment inside the video (can see a ton of cool use cases with it)
Made a Hugging Face demo for it ⬇️