Karan Hiremath

@karan_hiremath

Always Strive + Prosper | Currently Securing the Swarm @Cartesia | UPenn M&T ’16 | ex-Office of Naval Research, ex-@EasyPost | build better bots

Los Angeles, CA

Joined January 2010

555 Following

322 Followers

2.5K Posts

Pinned Tweet

Karan Hiremath @karan_hiremath

about 1 month ago

Excited to take the 👑 for TTS w/ @cartesia today! Give it a whirl at https://t.co/hIjnWg5dwy in 40+ languages and get in touch to learn more. Can’t wait to keep building the future of human-first AI across all modalities!

Artificial Analysis

@ArtificialAnlys

about 1 month ago

Cartesia’s Sonic-3.5 takes the #1 spot on the Artificial Analysis Speech Arena Leaderboard, surpassing Inworld Realtime TTS 1.5 Max and Google’s Gemini 3.1 Flash TTS

Sonic-3.5 is the latest TTS model from @cartesia. It supports 42 languages, including 9 Indian languages, with 500+ voices available out of the box. The model has been highly preferred among voters in the TTS Arena, with its demonstrated naturalness and accurate transcript following.

Key takeaways:
➤ Quality: Sonic-3.5 has an Elo score of 1,218 (+16/-16) based on 1,144 arena appearances, placing it ahead of Inworld Realtime TTS 1.5 Max at 1,194 and Gemini 3.1 Flash TTS at 1,209

➤ Pricing: Sonic-3.5 is priced at $39/1M characters, a premium compared to Gemini 3.1 Flash TTS at $18.3/1M characters, and Inworld Realtime TTS 1.5 Max at $35/1M characters

➤ Speed: 105.5 characters per second, compared to 205 characters per second for Inworld Realtime TTS 1.5 Max and 26.3 characters per second for Gemini 3.1 Flash TTS

See more details and listen to samples below 🧵
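To make the trade-offs in the post above concrete, here is a minimal Python sketch that turns the quoted figures into per-job cost, rough generation time, and an implied head-to-head preference rate. The helper names are hypothetical. Two things are assumptions rather than anything the post states: that generation time scales linearly with the quoted characters-per-second figure, and that the arena scores follow the standard Elo logistic curve with a 400-point scale.

```python
# Figures as quoted in the Artificial Analysis post; nothing here is measured independently.
MODELS = {
    "Sonic-3.5": {"elo": 1218, "usd_per_million_chars": 39.0, "chars_per_second": 105.5},
    "Gemini 3.1 Flash TTS": {"elo": 1209, "usd_per_million_chars": 18.3, "chars_per_second": 26.3},
    "Inworld Realtime TTS 1.5 Max": {"elo": 1194, "usd_per_million_chars": 35.0, "chars_per_second": 205.0},
}


def cost_usd(model: str, num_chars: int) -> float:
    """Price of synthesizing num_chars characters at the quoted per-1M rate."""
    return MODELS[model]["usd_per_million_chars"] * num_chars / 1_000_000


def generation_seconds(model: str, num_chars: int) -> float:
    """Rough generation time, ASSUMING time scales linearly with the quoted speed."""
    return num_chars / MODELS[model]["chars_per_second"]


def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Expected share of votes for A over B, ASSUMING the standard 400-point Elo curve."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))


if __name__ == "__main__":
    chars = 250_000  # hypothetical workload size, chosen only for illustration
    for name, stats in MODELS.items():
        win = expected_win_rate(MODELS["Sonic-3.5"]["elo"], stats["elo"])
        print(
            f"{name}: ${cost_usd(name, chars):.2f}, "
            f"{generation_seconds(name, chars) / 60:.1f} min, "
            f"Sonic-3.5 expected win rate vs this model: {win:.3f}"
        )
```

Under those assumptions, a 9-point Elo lead corresponds to only a slightly better than even preference rate, which is worth reading alongside the quoted +16/-16 interval.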

18

269

52

91

113K

1

20

1

0

2K

karan_hiremath retweeted

6 days ago

Cartesia has the best text-to-speech AND best speech-to-text models in the world. That’s really enough said - check out the demo below, and try them out at https://t.co/kXeTe8qv7c!

3

21

1

1

666

Karan Hiremath @karan_hiremath

6 days ago

The story is always deeper than just a single benchmark - and when you break down how models actually fit into real product solutions or automations at scale, speed and quality require equal consideration

Animesh Bohara @animeshbohara

6 days ago

Some teams train good models. Some train fast ones. We don't think you should have to choose. Today at @cartesia we shipped two: Sonic 3.5 (speaking) + Ink 2 (listening). Both SOTA, both realtime. 90ms TTFA. 3.6% WER, #1 on AA. Try them out at https://t.co/8FxdqNLp8t

5

51

2

25

2K

0

2

0

0

34

Karan Hiremath @karan_hiremath

6 days ago

@lulu32125 HUMAN FIRST AI for all!

0

0

0

0

25

karan_hiremath retweeted

6 days ago

Two new models just dropped 👀 Sonic-3.5 and Ink-2 are the #1 streaming models for text to speech and speech to text

10

98

23

32

14K

Karan Hiremath @karan_hiremath

6 days ago

the post-transformer revolution will help bring your agents with you everywhere you go - and @cartesia is leading the charge to bring the most human, lowest latency, and highest throughput models into the real world!

6 days ago

We released Sonic-3.5 and Ink-2, the #1 streaming models for text to speech and speech to text you can use in your voice agents today. New architectures enable new frontiers for speed and quality. We're now the only provider to have #1 models for both speaking and listening.

723

3K

594

2K

7M

0

7

0

0

269

karan_hiremath retweeted

13 days ago

The bottleneck of frontier robotics isn’t compute, labeling, or the models themselves.

It’s data collection.

While language models scaled effortlessly on open internet text, robotics requires physical trajectories, motor torques, and tactile forces that cannot simply be scraped from a webpage.

Every token has to be fought for.

Here is a breakdown of the 7 data types shaping the industry today, each representing a trade-off between collection cost and action-label purity:

1. Real Teleoperation (AgiBot World, DROID). Collected by humans guiding hardware, it scales linearly with human hours.

2. Low-cost Capture (Mobile ALOHA, UMI handheld). It drives collection cost down while keeping real physics, though it introduces an embodiment mapping problem when transferring human hand actions to robotic joints.

3. Fleet / Deployment Data (Tesla Optimus, Figure). These are trajectories from robots already working in the field. Tesla is betting its automotive fleet infrastructure transfers to Optimus. It generates powerful, real edge cases, but requires scaled deployment.

4. Simulation (NVIDIA Isaac Sim, Genesis). It offers near-infinite scale, but simulators still struggle to model contact-rich dynamics like slipping, twisting, and friction, which leaves a sim-to-real gap.

5. World-Model Synthetic (NVIDIA Cosmos 3). NVIDIA just shipped Cosmos 3, which natively outputs action trajectories, not only video pixels. If a world model can accurately simulate the laws of physics natively, it reduces the need for manual teleop data drastically.

6. Egocentric video (Ego4D, Meta’s Project Aria). First-person human video captured with head-mounted rigs. Far more scalable than teleop and closer to a robot’s own viewpoint. Still carries no robot action signals on its own.

7. Internet video (YouTube, TikTok). Maximum scale, lowest cost, effectively free. It captures the widest range of objects, tasks and physical situations, but with zero action labels and (mostly) a third-person viewpoint.

Collecting data is only step one.

The next great execution challenge is engineering a coherent training recipe that can blend these heterogeneous data sources into a single model.

michellelsun's tweet photo.

31

409

52

407

66K

karan_hiremath retweeted

16 days ago

It's truly insane that it's even possible to build software as bad as the Claude Code and Codex CLIs are. AI has opened up whole new capabilities that I never would have foreseen

5

15

1

0

531

Karan Hiremath @karan_hiremath

19 days ago

More slop more security nightmares 😅😅😅

International Cyber Digest

@IntCyberDigest

19 days ago

AI isn't taking our jobs. It's keeping cybersecurity people employed for life.

49

2K

251

236

144K

0

0

0

0

41

karan_hiremath retweeted

International Cyber Digest

@IntCyberDigest

19 days ago

AI isn't taking our jobs. It's keeping cybersecurity people employed for life.

49

2K

251

236

144K

karan_hiremath retweeted

21 days ago

meta gave their AI support agent the ability to modify your instagram account. no identity verification. people figured this out and accounts are being taken over right now

weezerOSINT's tweet photo. https://t.co/60yRrImnaZ

124

13K

1K

4K

2M

karan_hiremath retweeted

21 days ago

if you're under 50 and you stay healthy, i think you will live to 150 years old minimum

the medical singularity is happening. just in the past 2 months alone:

> revmed's pancreatic cancer drug (daraxonrasib) doubled survival in the deadliest cancer there is, 13.2 months vs 6.7 on chemo. it got a standing ovation from 40k+ doctors at the world's biggest cancer conference

> a one-time gene editing infusion (verve-102) permanently switched off the gene that drives bad cholesterol and cut it up to 62% from a single dose. one and done, no daily pill for life

> a lung cancer pill (lorlatinib) kept 60% of patients with spread cancer progression-free at 5 years. the longest anyone has ever held back a metastatic solid tumor with a single drug

> mayo built an ai that catches pancreatic cancer on routine ct scans up to 3 years before doctors can. it spotted 73% of the earliest cases vs 39% for human radiologists

> lilly's new weight loss drug (retatrutide) hit up to 30% body weight loss in its big phase 3 trial, and along the way it cut knee arthritis pain by 76% and dropped bad cholesterol about 20%

and we are still just at the beginning of the exponential

call me crazy but i'm a believer when Demis Hassabis says we will cure all disease in the next 10 years

233

3K

445

2K

1M

karan_hiremath retweeted

21 days ago

new grads often ask me what they should be doing so they don't fall behind in the ai space. there's a lot, but it's honestly super manageable. become intimate with model internals. proof based linear algebra. non-convex optimization. this is stuff you could've done in undergrad. it definitely takes some time and work, but it's doable. have taste, have opinions. train a small model, then train a big one. vLLM internals, tensor parallelism. hand roll kernels. cluster orchestration. do you have opinions on synthetic data? why don't you? SFT, PPO, you should know this. learn Triton. everyone is reproducing papers now so you need to be doing more. do you know the semi supply chain? where are the bottlenecks? hardware, man, hardware. your little gpu rig erector set in your basement isn't gonna cut it. build a cluster, a big one. pretrain an 800B model. now post-train it. serve it to millions of people. you should be able to beat deepseek on some benchmarks now. it's a lot to take in but it all snowballs. this is what job security looks like from now on. do you want to work in tech or not

102

4K

249

6K

740K

karan_hiremath retweeted

24 days ago

Earnings Before Tokens

brandonjcarl's tweet photo. https://t.co/TvXayAOj68

52

6K

741

524

2M

karan_hiremath retweeted

Eli @elipughresearch

24 days ago

ink-2 🙀 cartesia has stt ???

0

27

2

1

2K

karan_hiremath retweeted

24 days ago

Incredibly proud of the @cartesia team for this tremendous achievement - 2 different models in two weeks. AI models for Speech to Text (STT) are not just about transcribing words - though that is essential. If you're using STT in conversational AI Voice Agents then you need speed, accuracy, and - ideally - the ability to accurately detect when a user has started, finished, or resumed speaking. If the terminology is not clear, I've added a link in the thread to a glossary I created when learning this stuff.

1

10

1

1

322

karan_hiremath retweeted

Eli @elipughresearch

24 days ago

🧵 on some fun insider details on ink-2 😼

2

48

9

7

7K

karan_hiremath retweeted

vamshi chowdary @vamchowdary72

24 days ago

Thought it was over? Just getting started...

0

9

1

1

192

karan_hiremath retweeted

24 days ago

happened again (oc @radiuskia)

fluorane's tweet photo. https://t.co/2FhE9tumiW

1

29

4

3

2K

karan_hiremath retweeted

Ganesh Murugappan @guruthemuru

24 days ago

#1 text-to-speech last week… #1 speech-to-text this week…

0

21

3

1

404

karan_hiremath retweeted

24 days ago

@ArtificialAnlys We've built Ink-2 to excel on all of these axes in production. Give it a try on our website https://t.co/dMIz7SPgPe

0

20

2

2

2K
