Congratulations to CM Vijay on taking oath as Tamil Nadu's Chief Minister today. 🎉
Here's what I hope this government prioritizes:
Tamil has 78M speakers.
Google Translate still makes errors in Tamil.
Voice AI accuracy in Tamil: 67%.
Quality Tamil AI datasets: barely 3 exist.
An AI can write sonnets in English — but can't transcribe one line of Tamil.
That's not a language problem. It's a data problem.
We started TamilRootLabs to fix it: world-class AI data for South Indian languages.
Building in the open. Follow along. 🌳
We're building TamilRootLab's audio annotation infrastructure to make speech datasets more structured, reviewable, and culturally accurate.
The future of multilingual AI won't be built only with bigger models.
It will be built with better data.
#NLP #TamilAI
Most AI models still struggle with real-world Tamil speech.
Not because the models are weak.
Because the data is.
Different districts.
Different dialects.
Code-mixed conversations.
Background noise.
Emotional speech patterns.
That data layer is still massively underbuilt.
We started TamilRootLab to help bridge that gap through:
• speech datasets
• RLHF workflows
• AI evaluation
• transcription
• native-speaker intelligence
The future of AI will belong to multilingual systems that actually understand people.
Most AI systems still think India means “Hindi + English”.
But millions of people interact in Tamil every single day:
• different dialects
• mixed speech
• emotional variations
• cultural context
• code-switching patterns
1/2
This is why your Tamil AI model fails on real users.
Dialect is not an accent. It is a different data distribution.
We annotate by district. Not by language.
@gnaniai @SarvamAI @krutrim: this is the gap in your training data.
🌱 TamilRootsLab
I asked 10 Tamil speakers to describe a mango.
Chennai speaker: "மாம்பழம் — sweet, ripe"
Madurai speaker: "மாங்காய் ஆச்சா? கொஞ்சம் புளிப்பா இருக்கு" ("Is it a raw mango? It's a little sour")
Jaffna speaker: "கொழும்பு மாம்பழம் வேற மாதிரி தான்" ("Colombo mangoes are just a different kind")
Same fruit. 3 completely different Tamil descriptions.
TamilRootsLab from Tamil Nadu:
• Tamil voice datasets
• Human AI evaluation
• RLHF workflows
• Transcription & localization
• Native speaker workforce infrastructure
The next wave of AI will not belong only to English.
Regional language intelligence will define the future.
Building AI for India without Tamil data makes no sense.
78M+ Tamil speakers are still underrepresented in speech models, RLHF pipelines, transcription systems, and AI evaluation datasets.
That’s why we’re building
We're fixing this from the ground up.
District by district. Speaker by speaker.
@gnaniai @SarvamAI @AI4Bharat: this is the data problem your models have.
Building it at TamilRoots. 🌱
Tamil has 78M speakers.
But most Tamil voice AI is trained on one dialect — Chennai.
Madurai farmers don't speak Chennai Tamil. Coimbatore textile workers don't. Jaffna Tamils definitely don't.
Every dialect gap = lost accuracy on real users.
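One way to make a dialect gap visible is to score a model per dialect instead of reporting one overall number. The sketch below is a minimal illustration, not our production tooling: it computes word error rate (WER) per dialect label with a standard word-level edit distance. The function names, dialect labels, and placeholder transcripts are all hypothetical.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def wer_by_dialect(samples):
    """samples: iterable of (dialect, reference_text, model_output_text)."""
    edits = defaultdict(int)
    words = defaultdict(int)
    for dialect, reference, hypothesis in samples:
        ref, hyp = reference.split(), hypothesis.split()
        edits[dialect] += edit_distance(ref, hyp)
        words[dialect] += len(ref)
    # Total edits over total reference words, per dialect.
    return {d: edits[d] / words[d] for d in words if words[d] > 0}


# Placeholder tokens stand in for real transcripts (hypothetical data).
samples = [
    ("chennai", "a b c d", "a b c d"),
    ("madurai", "a b c d", "a x c"),
]
print(wer_by_dialect(samples))
```

A single averaged score would hide the difference between the two groups here, which is the reason to keep the dialect label on every sample.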
This week TamilRoots has a team:
→ Nithish: CEO, Coimbatore
→ Dhayanand: Biz Dev, Coimbatore
→ @motooto4: Annotation Tech, Nilgiris
→ @SugenthiD26: QA Lead, Coimbatore
→ Srini: Community, Salem
5 native Tamil speakers. 3 districts.
1 mission.
Building the data layer.
#TamilAI
India's AI market: $7.8 billion today.
$184 billion by 2035.
That's 23x growth in 9 years.
Every rupee of that needs training data.
Tamil speakers: 78 million.
Quality Tamil datasets: 3.
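For anyone checking the 23x figure: the short calculation below takes the two market sizes quoted above at face value (they are not independently verified here) and derives the growth multiple. The implied yearly rate assumes steady compound growth over the 9 years, which is a simplification.

```python
start_usd_bn = 7.8    # market size quoted in the post
end_usd_bn = 184.0    # 2035 projection quoted in the post
years = 9

multiple = end_usd_bn / start_usd_bn
# Assumes constant compound growth; real markets will not be this smooth.
cagr = multiple ** (1 / years) - 1

print(f"growth multiple: {multiple:.1f}x")
print(f"implied annual growth: {cagr:.1%}")
```

The multiple comes out slightly above 23x, and the implied annual rate is a bit over 40%.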
In 2026, every Indian bank, hospital, and government office is deploying AI.
Those AIs need to speak Tamil.
They need to speak Telugu.
They need to speak Kannada.
They need to speak Malayalam.
The companies that build that data infrastructure now will own that market for a decade.
150 million people use an Indian language keyboard.
Every time your Tamil keyboard suggests the wrong word
That is a training data problem.
Every time voice typing fails in Tamil
That is a training data problem.
Every time Tanglish transliteration gets it wrong
That is a training data problem.
1/2
Not because Tamil is hard.
Because Tamil training data barely exists.
Today TamilRootsLab became an officially registered company.
We're building that data.
UDYAM-TN-03-0325762 🇮🇳
#Tamil #TamilAI #TamilRootsLab #IndiaAI #BuildInPublic
2/2
Tamil is one of the world's oldest living languages.
2,000+ years of literature.
Classical status recognised by the Government of India.
78 million speakers worldwide.
And AI models still struggle to understand it.
1/2
15 days of building TamilRootsLab.
Here is what proper Tamil AI annotation actually needs:
✓ Native Tamil speakers — not translators
✓ 4 dialects — Madurai, Chennai, Coimbatore, Jaffna
✓ Code-switching data — Tanglish
✓ Colloquial text separated from formal
1/2
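As a rough sketch of how the checklist above could translate into a record format: each utterance carries its dialect, its register (formal or colloquial), a code-switching flag, and whether the annotator is a native speaker. Every class and field name below is hypothetical and shown only for illustration; it is not our actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Dialect(Enum):
    MADURAI = "madurai"
    CHENNAI = "chennai"
    COIMBATORE = "coimbatore"
    JAFFNA = "jaffna"


class Register(Enum):
    FORMAL = "formal"
    COLLOQUIAL = "colloquial"


@dataclass(frozen=True)
class Utterance:
    audio_id: str
    transcript: str
    dialect: Dialect
    register: Register
    code_switched: bool        # True for Tanglish-style mixed speech
    annotator_is_native: bool


def split_by_register(utterances):
    """Group utterances by register, keeping only native-annotated ones."""
    groups = {register: [] for register in Register}
    for u in utterances:
        if u.annotator_is_native:
            groups[u.register].append(u)
    return groups


# Placeholder records (hypothetical data).
records = [
    Utterance("a1", "t1", Dialect.MADURAI, Register.COLLOQUIAL, True, True),
    Utterance("a2", "t2", Dialect.CHENNAI, Register.FORMAL, False, True),
    Utterance("a3", "t3", Dialect.JAFFNA, Register.FORMAL, False, False),
]
groups = split_by_register(records)
print({register.value: len(items) for register, items in groups.items()})
```

Keeping dialect and register as explicit fields means formal and colloquial text can be separated, and per-dialect subsets pulled out, without re-labelling anything later.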