Alexandre Momeni

about 2 months ago

This is what context does to your speech-to-text system!

Our new paper studies the impact of contextual information on the accuracy of leading open-source and proprietary systems. https://t.co/gVbmIkwwn5

1

21

4

12

2K

AlexandreMomeni retweeted

about 2 months ago

localhost Ep. 2 Bryan Catanzaro (@ctnzr) on @NVIDIAAI's open models and risky bets

(00:20) Who is Bryan?
(07:38) Getting Nvidia to care about Deep Learning
(14:13) Why did Bryan leave Nvidia right when Deep Learning was taking off
(18:02) Leadership: Aligning a village of researchers
(24:12) Will the frontier flip back to open?
(32:16) Nvidia's models: Side project or core business?
(38:19) Efficiency leads to edge inference: Does Apple capture inference?
(42:43) Nvidia’s risky bets: Fewer and fewer bits
(47:19) Nvidia's misstep with Volta
(52:30) Every model is already obsolete as soon as you stop training it

2

14

3

8

2K

HackerNewsTop5 @hackernewstop5

3 months ago

@AElkrief Nice

0

30

AlexandreMomeni retweeted

3 months ago

Mistral Releases Leanstral #HackerNews https://t.co/hO1gpNyB2r

0

1

0

98

3 months ago

@Alfred_Lin 😂😂😂

0

80

AlexandreMomeni retweeted

3 months ago

Why is the 100 ms barrier for Qwen3-TTS (1.7b) this important?👇

Nvidia GPUs scale up amazingly, but they don't scale down well to serving a single user with sub-3b Transformers. They are throughput-maximizers, not latency-minimizers.

@Alibaba_Qwen's Qwen3-TTS paper showed that an optimized vLLM implementation on Nvidia GPUs achieved 101 ms time-to-first-byte latency under idealized conditions: no concurrency and no network round-trip latency.

Argmax TTSKit achieves as low as 70 ms on Apple Silicon Macs in the post below, but the takeaway is not 70 vs 101 ms here.

The takeaway is that, when we move from idealized conditions to the real world:
- Mac will actually serve a single user without an internet round-trip, and the user will experience sub-100ms latency as-is
- Nvidia GPUs will serve many users concurrently in the cloud, resulting in at least 3-5x higher latency. Most importantly, latency will have high variance.

Real-time streaming inference for sub-3b Transformers is where on-device inference is differentiated from cloud, and companies pay the premium for this today.

This is the only commercially relevant market segment where the broadly repeated but rarely substantiated claim of "on-device is faster" actually holds, not running 1T LLMs on 2 Mac Studios.
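The arithmetic behind that comparison can be sketched in a few lines. This is a rough illustration, not anything from the post or the paper: the only numbers taken from the text are 70 ms, 101 ms, and the 3-5x range. The multiplicative load model, the function names, and the `network_rtt_ms` parameter are assumptions made for illustration, and the post gives no network round-trip figure.

```python
# Back-of-the-envelope latency comparison using only figures quoted in the post.
ON_DEVICE_TTFB_MS = 70        # "as low as 70 ms" on Apple Silicon (best case quoted)
CLOUD_IDEAL_TTFB_MS = 101     # idealized vLLM figure: no concurrency, no network
CONCURRENCY_FACTORS = (3, 5)  # "at least 3-5x higher latency" under concurrent load


def cloud_ttfb_ms(ideal_ms, concurrency_factor, network_rtt_ms=0.0):
    """Assumed model: ideal latency scaled by a load factor, plus a network round trip."""
    return ideal_ms * concurrency_factor + network_rtt_ms


def summarize(network_rtt_ms=0.0):
    """Return on-device latency next to the low and high cloud estimates."""
    low, high = (
        cloud_ttfb_ms(CLOUD_IDEAL_TTFB_MS, factor, network_rtt_ms)
        for factor in CONCURRENCY_FACTORS
    )
    return {
        "on_device_ms": ON_DEVICE_TTFB_MS,
        "cloud_low_ms": low,
        "cloud_high_ms": high,
    }


if __name__ == "__main__":
    # network_rtt_ms defaults to 0 because the post quotes no round-trip value.
    print(summarize())
```

Under this assumed model the cloud estimate lands at roughly 300 to 500 ms before any network round trip is added. That is above the 100 ms barrier the post is about, while the quoted on-device figure stays below it.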

3

135

13

100

23K

AlexandreMomeni retweeted

3 months ago

WhisperKit is at 5M! Up 5x in 35 days.

2026 is the year of on-device inference ❤️

0

104

9

73

13K

AlexandreMomeni retweeted

3 months ago

Real-time Transcription with Speakers is now generally available!

2

25

3

18

3K

AlexandreMomeni retweeted

3 months ago

Ultra low-latency real-time speech-to-text in Superwhisper is out!

2

82

6

62

9K

3 months ago

Beyond @GoogleDeepMind and @IsomorphicLabs, @demishassabis’s legacy may be the generation of founders he’s inspired: @MistralAI, @orbitalmaterials, @latentlabs and many more.

James Dacombe

@jamesdacombe

4 months ago

Two observations:

1. @demishassabis has done more for the UK by demanding DeepMind remain headquartered in London than arguably any Briton in recent decades (never mind all of his other achievements for the world). His actions will single-handedly account for the majority of the UK’s future growth, if the politicians can manage to stay out of the way. What a legend.

2. Sequoia appear to be back and playing aggressively again.

32

1K

73

176

384K

0

1

0

245

AlexandreMomeni retweeted

argmax

@argmax

3 months ago

We are open-sourcing TTSKit! Run state-of-the-art text-to-speech models on your Mac and iPhone. The launch version supports @Alibaba_Qwen Qwen3-TTS and generates audio faster than real-time playback with sub-200 ms time-to-first-byte. Voice cloning and advanced speed optimizations will be in the next version. Link to the GitHub repo and models on @huggingface in comments.

19

386

66

436

62K

AlexandreMomeni retweeted

4 months ago

Pro tip: When using @superwhisper for AI meeting notes, select Parakeet (voice to text) + Sonnet 4.5 (text to summary) and put all of your company jargon in Vocabulary. Thank me later.

1

5

2

3

438

4 months ago

@MistralAI and @argmax are going to be a fire combo

Mistral AI

@MistralAI

4 months ago

Introducing Voxtral Transcribe 2, next-gen speech-to-text models by @MistralAI. State-of-the-art transcription, speaker diarization, sub-200ms real-time latency. Details in 🧵

117

4K

440

2K

658K

0

69