NITHISH

@NithishKumarAI

AI Data Annotation Specialist 🤖 | Tamil & South Indian Language Data | Worked with Toloka · Turing · Crowdgen | DM for free pilot 📩

Coimbatore, Tamil Nadu

Joined June 2022

671 Following

71 Followers

613 Posts

Pinned Tweet

NITHISH @NithishKumarAI

25 days ago

Congratulations to CM Vijay on taking oath as Tamil Nadu's Chief Minister today. 🎉 Here's what I hope this government prioritizes: Tamil has 78M speakers. Google Translate still makes errors in Tamil. Voice AI accuracy in Tamil: 67%. Quality Tamil AI datasets: barely 3 exist.

NITHISH @NithishKumarAI

3 days ago

An AI can write sonnets in English — but can't transcribe one line of Tamil. That's not a language problem. It's a data problem. We started TamilRootLabs to fix it: world-class AI data for South Indian languages. Building in the open. Follow along. 🌳

NITHISH @NithishKumarAI

5 days ago

We're working on building TamilRootLab's audio annotation infrastructure to make speech datasets more structured, reviewable, and culturally accurate. The future of multilingual AI won't be built only with bigger models. It will be built with better data. #NLP #TamilAI

NITHISH @NithishKumarAI

5 days ago

Most AI models still struggle with real-world Tamil speech. Not because the models are weak. Because the data is. Different districts. Different dialects. Code-mixed conversations. Background noise. Emotional speech patterns.

NITHISH @NithishKumarAI

6 days ago

That data layer is still massively underbuilt. We started TamilRootLab to help bridge that gap through speech datasets, RLHF workflows, AI evaluation, transcription, and native-speaker intelligence. The future of AI will belong to multilingual systems that actually understand people.

NITHISH @NithishKumarAI

6 days ago

Most AI systems still think India means “Hindi + English”. But millions of people interact in Tamil every single day: • different dialects • mixed speech • emotional variations • cultural context • code-switching patterns 1/2

NITHISH @NithishKumarAI

7 days ago

This is why your Tamil AI model fails on real users. Dialect is not an accent. It is a different data distribution. We annotate by district. Not by language. @gnaniai @SarvamAI @krutrim — this is the gap in your training data. 🌱 TamilRootsLab

NITHISH @NithishKumarAI

7 days ago

I asked 10 Tamil speakers to describe a mango. Chennai speaker: "மாம்பழம் — sweet, ripe" Madurai speaker: "மாங்காய் ஆச்சா? கொஞ்சம் புளிப்பா இருக்கு" Jaffna speaker: "கொழும்பு மாம்பழம் வேற மாதிரி தான்" Same fruit. 3 completely different Tamil descriptions.

NITHISH @NithishKumarAI

8 days ago

TamilRootsLab from Tamil Nadu: • Tamil voice datasets • Human AI evaluation • RLHF workflows • Transcription & localization • Native speaker workforce infrastructure The next wave of AI will not belong only to English. Regional language intelligence will define the future.

NITHISH @NithishKumarAI

8 days ago

Building AI for India without Tamil data makes no sense. 78M+ Tamil speakers are still underrepresented in speech models, RLHF pipelines, transcription systems, and AI evaluation datasets. That’s why we’re building

NITHISH @NithishKumarAI

9 days ago

We're fixing this from the ground up. District by district. Speaker by speaker. @gnaniai @SarvamAI @AI4Bharat — this is the data problem your models have. Building it at TamilRoots. 🌱

NITHISH @NithishKumarAI

9 days ago

Tamil has 78M speakers. But most Tamil voice AI is trained on one dialect — Chennai. Madurai farmers don't speak Chennai Tamil. Coimbatore textile workers don't. Jaffna Tamils definitely don't. Every dialect gap = lost accuracy on real users.

NITHISH @NithishKumarAI

10 days ago

This week TamilRoots has a team: → Nithish: CEO, Coimbatore → Dhayanand: Biz Dev, Coimbatore → @motooto4: Annotation Tech, Nilgiris → @SugenthiD26: QA Lead, Coimbatore → Srini: Community, Salem. 5 Tamil natives. 5 districts. 1 mission. Building the data layer. #TamilAI

NITHISH @NithishKumarAI

10 days ago

India's AI market: $7.8 billion today. $184 billion by 2035. That's 23x growth in 9 years. Every rupee of that needs training data. Tamil speakers: 78 million. Quality Tamil datasets: 3.

NITHISH @NithishKumarAI

11 days ago

This is not a prediction. @SarvamAI . @GnaniAi . @desh_keyboard . @BharatGPT. It's already happening. We're building the Tamil data layer. From Coimbatore. #IndiaAI #GovTech #TamilAI #RegionalLanguage #DataAnnotation

NITHISH @NithishKumarAI

11 days ago

In 2026, every Indian bank, hospital, and government office is deploying AI. Those AIs need to speak Tamil. They need to speak Telugu. They need to speak Kannada. They need to speak Malayalam. The companies that build that data infrastructure now will own that market for a decade.

NITHISH @NithishKumarAI

12 days ago

That is a training data problem. The technology exists. The data to make it work properly — barely. We're building it. From Coimbatore. #TamilKeyboard #TamilAI #IndiaAI #DataAnnotation @desh_keyboard 2/2

NITHISH @NithishKumarAI

12 days ago

150 million people use an Indian language keyboard. Every time your Tamil keyboard suggests the wrong word, that is a training data problem. Every time voice typing fails in Tamil, that is a training data problem. Every time Tanglish transliteration gets it wrong 1/2

NITHISH @NithishKumarAI

13 days ago

Not because Tamil is hard. Because Tamil training data barely exists. Today TamilRootsLab became an officially registered company. We're building that data. UDYAM-TN-03-0325762 🇮🇳 #Tamil #TamilAI #TamilRootsLab #IndiaAI #BuildInPublic 2/2

NITHISH @NithishKumarAI

13 days ago

Tamil is one of the world's oldest living languages. 2,000+ years of literature. Classical status recognised by Government of India. 78 million speakers worldwide. And AI models still struggle to understand it. 1/2

NITHISH @NithishKumarAI

14 days ago

Most annotation companies provide none of these. We provide all four. From Coimbatore. #TamilNLP #DataAnnotation #IndiaAI #BuildInPublic 2/2

NITHISH @NithishKumarAI

14 days ago

15 days of building TamilRootsLab. Here is what proper Tamil AI annotation actually needs: ✓ Native Tamil speakers — not translators ✓ 4 dialects — Madurai, Chennai, Coimbatore, Jaffna ✓ Code-switching data — Tanglish ✓ Colloquial text separated from formal 1/2
