LocalAI ( @LocalAI_API ) 4.2.0 is out, just few numbers and facts:
- +392 commits ( we squash these π )
- +11 Backends: voice and face recognition, vibevoice.cpp (from me), LocalQVE from @jichiep and among @sgl_project , @__tinygrad__ , @no_stp_on_snek 's Turboquant, ik_llama.cpp, sam.cpp from @el_PA_B
- Many new QoL improvements, increased sglang and VLLM support and hardening on distributed mode
- 16+ new contributors ! Thanks to the community!
LocalAI is all about give you flexibility to run the latest from the community, and ds4 support from @antirez is on its way!
This is the year of Local AI!
parakeet.cpp: native C++/ggml (@ggml_org) inference for @NVIDIAAIDev's Parakeet, one of the best speech-to-text models out there, from the @LocalAI_API team.
Every Parakeet model (TDT/CTC/RNNT/hybrid + cache-aware streaming), byte-for-byte identical output to NeMo, now running anywhere with no Python and even a bit faster, on CPU and GPU.
Quantized GGUF on @huggingface π€
Huge thanks to @ggerganov for ggml and to @NVIDIAAIDev for releasing Parakeet! π§΅
parakeet.cpp now does batched transcription.
Decode N clips in one pass and a single GB10 runs up to 12x faster at batch 16. Peak ~1,260 clips/s. CPU sees 3-5x.
Same model, bit-for-bit identical output. No accuracy traded for speed.
parakeet.cpp: native C++/ggml (@ggml_org) inference for @NVIDIAAIDev's Parakeet, one of the best speech-to-text models out there, from the @LocalAI_API team.
Every Parakeet model (TDT/CTC/RNNT/hybrid + cache-aware streaming), byte-for-byte identical output to NeMo, now running anywhere with no Python and even a bit faster, on CPU and GPU.
Quantized GGUF on @huggingface π€
Huge thanks to @ggerganov for ggml and to @NVIDIAAIDev for releasing Parakeet! π§΅
parakeet.cpp: native C++/ggml (@ggml_org) inference for @NVIDIAAIDev's Parakeet, one of the best speech-to-text models out there, from the @LocalAI_API team.
Every Parakeet model (TDT/CTC/RNNT/hybrid + cache-aware streaming), byte-for-byte identical output to NeMo, now running anywhere with no Python and even a bit faster, on CPU and GPU.
Quantized GGUF on @huggingface π€
Huge thanks to @ggerganov for ggml and to @NVIDIAAIDev for releasing Parakeet! π§΅
Scaling LLMs across nodes? When a follow-up lands on a replica that never saw your chat, the whole prompt is recomputed and the KV cache wasted.
LocalAI fixes this at the router: cache-aware routing across a mixed fleet of vLLM + SGLang + llama.cpp + ...
what a wonderful project: parakeet.cpp
https://t.co/idw7t2y106
GGML based parakeet inference pipeline that's 2x faster than my ONNX parakeet pipeline on Apple Silicon! (Needed a few local patches to get it going)
and I really mean it. Everyday we fight now:
- Lots of Security reports which aren't valid (while some are, but get buried in the mix now)
- AI Automated PR which looks legit, but then looking at detail you realize nothing was really well put or at least very superificially. And even if asking for fixes on the PR, the author just goes away (why opening it then?)
- And then, harassment, and github does nothing about it. here's the last one that I received: https://t.co/ZICcymfrTl
yeah that's very bad. People taking pitchforks are gonna push away OSS maintainers even more.
It's becoming already barely unsustainable: from thousands of bad security reports, github issues with attacks and violent phrasing, and fake automated PRs with zero to little mind put on it.
This is a great find. Qwen3.6-35B-A3B APEX by @mudler_it is surprisingly speedy for a 32GB Mac Mini M2 Pro.
Is able to fix basic Scala unit tests; prefill starts at 400 tk/s, drops to 120 by 32k ctx; tg around 25tk/s dropping to 13tk/s - not uber fast, but for a device not made for AI this is fantastic. Client: Mistral Vibe.
./llama-b9434/llama-server -hf mudler/Qwen3.6-35B-A3B-APEX-MTP-GGUF:I-Compact --spec-type draft-mtp --spec-draft-n-max 2 -fa on -ngl all --host 0.0.0.0 --port 8080 -c 70000 --parallel 1 --no-warmup -b 2048 -ub 2048 -ctk q4_0 -ctv q4_0
There's also an I-Nano quant available [https://t.co/kL9StcxrXb] which is 11.7 GB in size (!! - might work for those on 16GB VRAM)
yeah that's very bad. People taking pitchforks are gonna push away OSS maintainers even more.
It's becoming already barely unsustainable: from thousands of bad security reports, github issues with attacks and violent phrasing, and fake automated PRs with zero to little mind put on it.