Gemma 4 E2B, compressed by @TheStageAI from 9.3 GB to 1.4 GB, is running on an iPhone 16e with tool calls!
The smallest, best-quality checkpoints are now open-sourced! @GoogleDeepMind
The smallest checkpoints for Gemma 4 E2B and E4B for local inference. Results for E2B:
size: 9.3 GB → 1.4 GB
speed: 113 tok/s on Apple M3
quality: -3% on IFEval
runs with: MLX, llama.cpp (coming)
Pareto-optimal, open source! Links to the blog post and GitHub repo ⬇️
@GoogleDeepMind @lmstudio @ollama @huggingface @ggerganov
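For readers who want the size figures as a ratio, here is a minimal sketch of the arithmetic. It uses only the two sizes quoted above. The variable names are illustrative and nothing here comes from the released checkpoints themselves.

```python
# Sizes quoted in the post, in GB.
original_gb = 9.3
compressed_gb = 1.4

# How many times smaller the compressed checkpoint is.
compression_ratio = original_gb / compressed_gb

# Fraction of the original size that was removed.
size_reduction = 1 - compressed_gb / original_gb

print(f"compression ratio: {compression_ratio:.1f}x")  # about 6.6x
print(f"size reduction: {size_reduction:.0%}")         # about 85%
```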
Proud to team up with @brilliantlabsAR and @neuphonicspeech on Halo’s on-device privacy engine.
Coming to Brilliant Labs’ Halo smart glasses: real-time voice + vision, POV stays private.
ANNA + GPU/NPU SDK + memory manager for wake word, STT, TTS, diarization.
SDK demo 👇
@sebuzdugan It's not just an idea; the team is already applying it to model compression. You can check some benchmarks for compressed models here: https://t.co/vGOFcsBXrA
This month we are releasing a lot of benchmarks and ablation studies.
@Nau__One We have local engines, so it can run fully on-device. We also provide ready-to-go containers for inference on your GPUs. We are SOC 2 compliant, and you can easily scan the container for vulnerabilities.
Beyoncé heard cursing. TheWhisper heard Arsenal.
The fastest Whisper in the world.
Open-source real-time ASR.
Top 5 on OpenASR benchmarks.
1800 RTFx.
Built for live captions, transcription, and voice apps.
See the repo
@Beyonce heard cursing. TheWhisper heard @Arsenal.
Fastest open-source real-time ASR in the world.
Top 5 on OpenASR.
1800 RTFx.
Built for live captions, transcription, and voice apps.
See the repo
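A note on the metric: RTFx is commonly defined as the inverse real-time factor, that is, seconds of audio processed per second of compute. The sketch below assumes that definition and shows what 1800 RTFx would mean for an hour of audio. The function names are hypothetical and not part of any TheWhisper API, and the figure is a throughput number, so per-stream latency in a live setting may differ.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Compute time implied by a given RTFx for a clip of the given length."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


one_hour = 3600.0
print(processing_time(one_hour, 1800.0))  # 2.0 seconds of compute for one hour of audio
```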
For AI engineers, latency is product.
Wan 2.2 in Elastic Models now generates 5s of video in 34s on H100. Elastic Models is a library of accelerated open-source models.
Also new: TheWhisper at 1800 RTFx on a single H100 and instant FLUX LoRA switching.
Try it
@DnuLkjkjh @brilliantlabsAR @neuphonicspeech The NPU is used not only for VAD; it's also used for transcription and, partially, for TTS. We are using heterogeneous inference to deliver the best speed and lowest power consumption.
How do you make text-to-music run in real time in production?
The model has to keep audio generation ahead of playback.
Our new case study with @MireloAI shows how inference optimization delivered up to 2.4x higher throughput.
See the full case study ↓
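To make "keep audio generation ahead of playback" concrete, here is a minimal sketch of the buffering condition. It assumes audio is generated sequentially in fixed-length chunks and that playback starts after a small prebuffer. The function name, parameters, and numbers are illustrative assumptions and are not taken from the case study.

```python
def stays_ahead(chunk_seconds: float,
                gen_seconds_per_chunk: float,
                num_chunks: int,
                prebuffer_chunks: int = 1) -> bool:
    """Return True if every chunk is ready by the time playback needs it.

    Chunks are generated one after another. Playback starts once
    `prebuffer_chunks` chunks exist, then consumes audio at 1x speed.
    """
    playback_start = prebuffer_chunks * gen_seconds_per_chunk
    for i in range(num_chunks):
        ready_at = (i + 1) * gen_seconds_per_chunk        # chunk i finishes generating
        needed_at = playback_start + i * chunk_seconds    # chunk i starts playing
        if ready_at > needed_at:
            return False  # playback caught up: audible gap
    return True


# Hypothetical numbers: 2 s chunks generated in 1 s keep up,
# while 2 s chunks generated in 2.5 s eventually underrun.
print(stays_ahead(2.0, 1.0, 100))  # True
print(stays_ahead(2.0, 2.5, 100))  # False
```

In this simplified model the stream survives indefinitely only when generating a chunk takes no longer than playing it, which is why higher throughput translates directly into real-time headroom.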