Today, we’re excited to officially launch our Yoruba speech data campaign on Dialectra.
Over the past two months, we’ve seen contributors across Hausa, Kanuri, and Fulfulde help us build one of the fastest-growing African speech data communities.
Now it’s time to expand.
With Yoruba joining Dialectra, contributors can now participate in:
• Corpus script recordings
• Transcription tasks
• Live conversational speech through Dialect Connect
As always, every contribution goes through our transcription, annotation, standardization, and human verification pipeline before it becomes part of a training-ready dataset.
Yoruba is one of Africa’s most influential and widely spoken languages, yet high-quality conversational speech infrastructure for it remains limited.
We want to help change that.
If you speak Yoruba, you can now join Dialectra, contribute your voice, and help shape the future of African speech AI while earning rewards for your contributions.
More speech data doesn't mean better speech AI.
Let me give an example with Hausa, one of the largest datasets we have, followed by Yoruba. Hausa isn't just another language to throw into a voice model and expect it to work.
It has tones, implosives, ejectives, and consonant clusters that most ASR systems have never seen, e.g.:
ɓ → ɓera, ɓarna
ɗ → ɗaki, ɗalibi
ƙ → ƙasa, ƙofa
ƴ → ƴanci, ƴatsa
sh, ts, gw, gy, ƙw, kw, ky, ƙy...
Every single one of these is phonetically distinct. Miss one and your model breaks. So we don't just collect voice data.
Every submission in our pipeline is tagged with:
→ Full transcription text
→ Gender & age group
→ Dialect region
→ Phonetic metadata
Random data collection is not a dataset. STRUCTURE IS.
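As a rough illustration of what a tagged submission could look like, here is a minimal Python sketch. The class name, field names, and the `tag_hooked_letters` helper are hypothetical and chosen for this example only; they are not Dialectra's actual schema.

```python
from dataclasses import dataclass, field

# The four hooked letters listed above. Treating them as a fixed set
# is an assumption made for this sketch.
HOOKED_LETTERS = {"ɓ", "ɗ", "ƙ", "ƴ"}


@dataclass
class Submission:
    """One tagged recording (hypothetical field names)."""
    transcription: str
    gender: str
    age_group: str
    dialect_region: str
    phonetic_tags: list[str] = field(default_factory=list)


def tag_hooked_letters(text: str) -> list[str]:
    """Return the hooked letters that occur in the text, sorted by code point."""
    return sorted(HOOKED_LETTERS & set(text.lower()))


sample = Submission(
    transcription="ɗaki ƙofa",
    gender="female",
    age_group="25-34",
    dialect_region="unspecified",
)
sample.phonetic_tags = tag_hooked_letters(sample.transcription)
```

Keeping the tags next to the transcription means a later check can ask, for example, how many recordings actually cover each hooked letter.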
All our datasets go through human-in-the-loop verification. Our annotators aren't just listening for audio quality; they're checking phonetic accuracy, transcription fidelity, and dialect consistency for every single submission.
And we still don't stop there.
After human review, every dataset runs through WER and CER measurement to catch anything the human pass might have missed.
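For readers new to the two metrics: WER and CER are usually computed as the edit distance between a reference transcript and a hypothesis, divided by the reference length, counted in words or in characters. The sketch below is the textbook definition in plain Python, not Dialectra's implementation; the function names are assumptions for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# A transcript that writes "daki" for "ɗaki" gets one whole word wrong
# but only one character wrong, so WER and CER tell different stories.
word_rate = wer("ɗaki ƙofa ƙasa", "daki ƙofa ƙasa")
char_rate = cer("ɗaki ƙofa ƙasa", "daki ƙofa ƙasa")
```

Looking at both numbers together helps separate "the wrong word" from "the right word with a plain letter in place of a hooked one".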
We believe in two layers. No compromises. Because we're not building a data collection platform.
We're building the infra layer for African voice AI.
The foundation that didn't exist until now.
Today we're making part of our Yoruba speech datasets publicly available for the community.
It's a small release compared to what we've collected so far, but we believe African language AI advances faster when researchers, developers, students, and startups have access to quality datasets.
Over the last few months, Dialectra has grown to more than 500,000 voice samples and over 500 hours of speech data across multiple collection workflows.
One of the most exciting contributors to that growth has been Dialect Connect, our conversational speech platform that captures how people actually speak, not just scripted recordings.
We're committed to releasing datasets where it makes sense and where contributor permissions allow.
For organizations, research teams, and enterprises looking for larger datasets, custom collections, dialect-specific corpora, conversational speech data, transcription projects, or benchmarking support, our team has a structured onboarding process and would be happy to discuss your requirements.
We're not only collecting speech data; we're building the infrastructure layer for African voice AI.
HF: https://t.co/Htqb6j2gKe
We released less than 0.05% of our Hausa speech dataset.
Within days, researchers, startups, enterprises, and builders were in our inbox.
That's not a coincidence.
High-quality African speech data is one of the most valuable layers in AI right now, and almost nobody is building it seriously.
Dialectra is.
Enterprise access: https://t.co/4389JazlEJ
And next up: our first Hausa TTS model, trained on Dialectra data, available via API.
The real work is just beginning.
One interesting thing happened after we open-sourced a very small portion of our datasets on Hugging Face.
We're talking about less than 0.05% of what we've collected so far.
Within days, we started receiving messages from researchers, startups, enterprises, and builders interested in partnerships, collaborations, and even purchasing access to larger datasets.
That was a strong signal.
It confirmed something we've believed from the beginning: high-quality African speech data remains one of the most valuable and underserved layers in the AI ecosystem.
As interest continues to grow, Dialectra has a structured onboarding process for enterprise users, research teams, and organizations looking to work with our datasets and speech infrastructure. Getting in touch is straightforward through our platform. https://t.co/eCLj0KHPQu
What's even more exciting is what's coming next.
We're preparing to release our first Hausa TTS model, trained on Dialectra datasets and designed to give builders, developers, and Hausa content creators access to real Hausa voice AI through an API.
The work isn't finished.
Today, Dialectra releases its first open Hausa voice dataset on Hugging Face for free.
This is a small but meaningful step toward something bigger: dialect-aware AI infrastructure for African languages.
Hausa is one of the most widely spoken languages in Africa, with over 70 million speakers across Nigeria, Niger, Ghana, and beyond. Yet most voice AI systems fail Hausa speakers not because the technology can't work, but because the right data doesn't exist.
Dialectra is building that data layer.
This first dataset includes:
→ Dialect-tagged audio recordings from native Hausa speakers
→ Rich metadata: region, speaker profile, speaking style, quality score
→ Benchmark-ready structure for ASR and TTS development
We're making it available to researchers, AI teams, and developers who are building voice technology that actually works for African users.
This is just the beginning. More languages, more dialects, more infrastructure coming soon.
Dataset: https://t.co/02zqaBPuI6
Learn more: https://t.co/4389JazlEJ
This is a solid step in the right direction.
African AI progress will depend heavily on datasets like this - not just models. Hausa being properly represented means real users can finally be included, not ignored.
If this momentum continues, we’re not just building AI for Africa… we’re building AI with Africa.
Excited to share that @_dialectra has been selected for the iHatch @ONDINigeria Cohort 5 Startup Incubation Programme 🇳🇬
Less than a year ago, we started with a simple belief: African languages deserve better representation in AI.
When two of the biggest national incubators, @iDICEStartup Startup Bridge Founders Lab and iHatch, select your solution in less than a month, it's another reminder that we're moving in the right direction.
Thank you to our contributors, partners, team, and everyone supporting the mission.
A lot more work ahead.
Slowly building the team behind Dialectra.
Happy to welcome a new AI/ML Engineer as we continue laying the foundations for African speech AI.
We've spent months building datasets, annotation workflows, benchmarking systems, and speech infrastructure.
Now it's time to push deeper into models, evaluation, and product development.
Welcome to the journey @nababa_h
Most people talk about model accuracy when it comes to AI.
We spend a lot of time thinking about what happens before the model ever sees the data.
At https://t.co/eCLj0KHPQu, our WER pipeline isn't just a final evaluation score.
Every voice submission goes through the following:
→ transcription
→ human review
→ normalization
→ annotation
→ verification
→ benchmark testing
This helps us catch errors early and continuously improve dataset quality before training begins.
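One way to picture "catching errors early" is a staged pipeline that stops at the first failing check and reports which stage failed. This is a minimal sketch only: the `Stage` enum mirrors the list above, while `run_pipeline` and the example check are hypothetical and do not describe Dialectra's internal system.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages in the order listed above."""
    TRANSCRIPTION = 1
    HUMAN_REVIEW = 2
    NORMALIZATION = 3
    ANNOTATION = 4
    VERIFICATION = 5
    BENCHMARK_TESTING = 6


def run_pipeline(submission, checks):
    """Run the registered checks in stage order.

    `checks` maps a Stage to a function that returns True when the
    submission passes. Returns (passed, failed_stage).
    """
    for stage in Stage:  # enums iterate in definition order
        check = checks.get(stage)
        if check is not None and not check(submission):
            return False, stage
    return True, None


# Hypothetical check: normalized text has no stray leading/trailing spaces.
checks = {Stage.NORMALIZATION: lambda s: s["text"] == s["text"].strip()}

ok, failed_at = run_pipeline({"text": " ƙasa "}, checks)
```

Recording the failing stage, rather than only a pass/fail flag, is what makes it possible to see where in the workflow quality problems tend to appear.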
The goal isn't simply to achieve a lower WER.
The goal is to understand "WHY" errors happen across different dialects, accents, speech patterns, and conversational contexts.
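A simple first step toward the "why" is to break the error rate down by a metadata field instead of reporting one global number. The sketch below assumes each evaluated sample already carries its word error count and reference word count; the record keys and group labels are placeholders, and the numbers are made up for illustration.

```python
from collections import defaultdict


def wer_by_group(results, key):
    """Aggregate word errors per group and return {group: WER}.

    Errors and reference words are summed per group before dividing,
    so long utterances weigh more than short ones.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [errors, reference words]
    for r in results:
        totals[r[key]][0] += r["word_errors"]
        totals[r[key]][1] += r["ref_words"]
    return {g: e / n for g, (e, n) in totals.items() if n > 0}


# Made-up numbers, for illustration only.
results = [
    {"dialect": "region_a", "word_errors": 2, "ref_words": 10},
    {"dialect": "region_a", "word_errors": 1, "ref_words": 10},
    {"dialect": "region_b", "word_errors": 5, "ref_words": 10},
]
per_dialect = wer_by_group(results, "dialect")
```

The same function works for any tag on the submission, such as age group or speaking style, by changing the `key` argument.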
The infrastructure layer matters more than most people think.
Thanks, Msageer, for the clear and powerful article!
It perfectly explains why we’re focused on building real dialect-aware data for African languages and beyond.
If you want AI that actually understands every voice, jump in at https://t.co/V8CWYjEHsC and start contributing.
Let’s build this together.
Founders supporting Founders.
Drop what you're building. I'll help with users, feedback, a breakdown post about your project/product, and of course funding.
I will go first https://t.co/czWIKAxbEY
👇👇
Are you familiar with our new update?
Dialect Connect lets you invite a partner, discuss a topic together, and submit when you're done.
Collaborate. Contribute. Earn.