mulberry can:
- respond in 162ms, the fastest ttfb on the market
- switch between two languages mid-sentence, naturally
- speak in different languages with their actual native accents, not a translated voice
launching silk mulberry 1.5
one of the fastest multilingual voice models in the world
it matches the best voice models in quality benchmarks (MOS)
all this at more than 95% lower cost: ₹0.40/min (~$0.0046/min)
try now 👇
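A quick sanity check on the pricing line, as a rough sketch: it only uses the two prices and the "95% lower" figure quoted above. The implied exchange rate and the implied baseline price are derived numbers, not figures from the announcement, and the variable names are made up for illustration.

```python
# Figures quoted in the post
price_inr_per_min = 0.40
price_usd_per_min = 0.0046
claimed_saving = 0.95  # "more than 95% lower cost"

# Exchange rate implied by the two quoted prices (assumption: both refer
# to the same per-minute price; this is not an official rate)
implied_inr_per_usd = price_inr_per_min / price_usd_per_min

# If the price is 95% below some baseline, baseline = price / (1 - saving).
# "More than 95%" means the real baseline would be at least this much.
implied_baseline_inr = price_inr_per_min / (1 - claimed_saving)

print(round(implied_inr_per_usd, 2))   # about 87 INR per USD
print(round(implied_baseline_inr, 2))  # about 8 INR per minute
```

So the claim amounts to comparing against alternatives priced at roughly ₹8/min or more.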
@wandering_mush haha i should have been more lucid but i wrote this half asleep, so some parts came out a bit looser than they should have. thanks for the follow lol. i started posting this stuff when like 20 people followed me. i'm mostly just learning and posting as i go.
also i'm not claiming gemma 4 12b proves encoder-free is already the best pure-performance choice, or that google dropped the encoders only because it performs better. it may very well be a deployment/latency/memory tradeoff. my point is more modest: gemma 4 12b shows that you can strip the input-side encoder down much more than the usual multimodal pipeline and still get a useful, capable model. that is the data point i care about. so i'm using it as evidence that this direction works at all, not as proof that the tradeoff is already solved everywhere.
@wandering_mush yeah I get u r not arguing against the premise.
i just wanted to clarify the gemma bit: i wasn't saying the whole gemma 4 family scales toward encoder-free as the models get bigger. i was only pointing to gemma 4 12b specifically, which is the one i mentioned.
i have no fucking idea why people pretend like ai or llms are the pinnacle of human achievement or some boundary we have reached. why do people act like there's nothing after this? if they are so fucking capable, where is the unified theory for all four fundamental forces? why isn't quantum physics solved? why is the three-body problem still a nightmare? why are people still dying from random viruses?
for fuck's sake, we still don't even fully know what's under half the ocean. space travel is absurdly expensive. i don't see cars flying over my head. six of the seven millennium prize problems are still unsolved. we don't understand aging. of course ai is impressive, but acting like it's the final chapter of science is insane. there are still entire fields of reality we barely understand.
we haven't even finished understanding intelligence itself, yet people are already talking as if ai is the final destination of science.
All of this is just trying to motivate something which should be intrinsic. There is nothing u need for research except utter, undeniable curiosity. It should be natural and should come from within. If u have been able to save your inner child, then u r a researcher. It doesn't get deeper than that.
Someone writing small cheques into American AI labs and doing PR tours about how they “backed the future” probably shouldn’t be the loudest voice lecturing everyone on what India must do after the Fable news.
For those of us actually building models from scratch across modalities, the bottlenecks are not a breaking headline or a geopolitical event. We live them every day: data, compute, talent, inference, distribution, and relentless execution.
You don’t wake up one morning, see a model get pulled down internationally, and suddenly discover the importance of sovereign AI.
At @rumik_ai we’ve always believed in owning our stack and building foundational capabilities ourselves. Soon, we’ll be open-sourcing India’s first expressive TTS model with deep code-switching support across Hindi, Hinglish, and multiple Devanagari-script languages. Not because it’s fashionable, but because we genuinely believe India can build world-class AI infrastructure and models.
What the ecosystem needs isn’t more hindsight experts chasing engagement after every headline. It needs patient builders, conviction, long-term capital, and VCs who help create enduring AI companies instead of pretending to be the smartest AI researchers on Twitter.
India doesn’t have a talent problem. It has a conviction problem.
Genuine question: your combined filter weights the DNSMOS, WER, and SR/VAD ranks equally. But Table 3 shows the SR/VAD signal is your weakest filter (5.20 avg rank vs 3.40 for combined), and filtering harder on it makes things worse (VAD-50% = 6.20).
Why give equal weight to SR?
Isn't Silero VAD unreliable on exactly the wild YouTube-type audio that's most of your pool?
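To make the question concrete, here is a minimal sketch of rank-based filtering with adjustable weights. It is not the paper's code: the function names, the toy scores, and the 0.25 weight are all made up for illustration, and the only thing taken from the thread is the idea of combining DNSMOS, WER, and SR/VAD ranks with equal weight.

```python
def ranks(values, higher_is_better=True):
    """Return a rank per item (0 = best) for one metric."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    out = [0] * len(values)
    for position, index in enumerate(order):
        out[index] = position
    return out


def combined_rank(dnsmos, wer, vad, weights=(1.0, 1.0, 1.0)):
    """Weighted average of per-metric ranks; lower is better."""
    r_dnsmos = ranks(dnsmos, higher_is_better=True)   # higher MOS is better
    r_wer = ranks(wer, higher_is_better=False)        # lower WER is better
    r_vad = ranks(vad, higher_is_better=True)         # higher speech ratio is better
    total = sum(weights)
    return [
        (weights[0] * a + weights[1] * b + weights[2] * c) / total
        for a, b, c in zip(r_dnsmos, r_wer, r_vad)
    ]


def keep_best(scores, fraction):
    """Indices of the best `fraction` of clips by combined rank."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])


# Toy scores for four hypothetical clips (not data from the paper)
dnsmos = [3.9, 3.6, 3.3, 2.5]
wer = [0.12, 0.03, 0.06, 0.30]
vad = [0.55, 0.85, 0.95, 0.70]

equal = keep_best(combined_rank(dnsmos, wer, vad), 0.5)
downweighted = keep_best(
    combined_rank(dnsmos, wer, vad, weights=(1.0, 1.0, 0.25)), 0.5)
print(equal, downweighted)
```

On this toy data the kept set changes once the SR/VAD rank is down-weighted, which is the kind of ablation the question is asking about.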
Raon-OpenTTS paper is finally out! We fully open-sourced 615K hours of TTS data and a 1B model competitive with Qwen3-TTS-1B and Voxtral-TTS-4B. Like DCLM and DataComp, our work closed the gap towards SOTA closed-data models in TTS, which will help push the TTS community forward!
@eigenron I'm sure people like Hitler, Napoleon, Alexander the Great, and countless revolutionaries weren’t just searching for “better explanations”. They explicitly wanted to reshape the world, and did.