Dialectra

5 days ago

More speech data doesn't mean better speech AI.

Let me give an example with Hausa, one of the largest datasets we have, followed by Yoruba. Hausa isn't just another language to throw into a voice model and expect it to work.

It has tones, implosives, and consonant clusters that most ASR systems have never seen, e.g.:

ɓ → ɓera, ɓarna
ɗ → ɗaki, ɗalibi
ƙ → ƙasa, ƙofa
ƴ → ƴanci, ƴatsa
sh, ts, gw, gy, ƙw, kw, ky, ƙy...

Every single one of these is phonetically distinct; miss one and your model breaks. So we don't just collect voice data.
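One way to make the point above concrete is a small check that a transcript has not lost its hooked letters during cleanup. This is a minimal sketch, not Dialectra's actual tooling: the letter and digraph inventory is copied from the post and is not exhaustive, and the function names `segment` and `lost_hooked_letters` are hypothetical.

```python
import unicodedata

# Inventory copied from the post; illustrative, not a full Hausa orthography.
HOOKED = {"ɓ", "ɗ", "ƙ", "ƴ"}
DIGRAPHS = ("sh", "ts", "gw", "gy", "ƙw", "kw", "ky", "ƙy")


def segment(word: str) -> list[str]:
    """Split a word into units, keeping the listed digraphs together."""
    word = unicodedata.normalize("NFC", word.lower())
    units, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def lost_hooked_letters(reference: str, hypothesis: str) -> set[str]:
    """Return hooked letters found in the reference but absent from the hypothesis."""
    ref = unicodedata.normalize("NFC", reference.lower())
    hyp = unicodedata.normalize("NFC", hypothesis.lower())
    return {ch for ch in HOOKED if ch in ref and ch not in hyp}
```

A transcript in which ɗ has been flattened to a plain d would be flagged by `lost_hooked_letters`, which is the kind of silent loss that a generic ASCII cleanup step can introduce.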

Every submission in our pipeline is tagged with:

→ Full transcription text
→ Gender & age group
→ Dialect region
→ Phonetic metadata

Random data collection is not a dataset. STRUCTURE IS.
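The four tags listed above could be modelled as a single record with a completeness check. The sketch below is an assumption about shape only: the field names, the age buckets, and the `problems` method are hypothetical and are not taken from Dialectra's pipeline.

```python
from dataclasses import dataclass

# Hypothetical age buckets; the post names "age group" but not its values.
AGE_GROUPS = ("18-25", "26-40", "41-60", "60+")


@dataclass(frozen=True)
class Submission:
    """One tagged recording. Field names and types are assumptions."""
    audio_path: str
    transcription: str
    gender: str
    age_group: str
    dialect_region: str
    phonetic_tags: tuple = ()

    def problems(self) -> list:
        """List every tag that is missing or outside the expected values."""
        found = []
        if not self.transcription.strip():
            found.append("transcription is empty")
        if not self.gender.strip():
            found.append("gender is empty")
        if self.age_group not in AGE_GROUPS:
            found.append("age_group is not a known bucket")
        if not self.dialect_region.strip():
            found.append("dialect_region is empty")
        if not self.phonetic_tags:
            found.append("phonetic_tags is empty")
        return found
```

A record with every tag filled in returns an empty list, so a collection step could refuse anything for which `problems()` is non-empty.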

All our datasets go through human-in-the-loop verification. Our annotators aren't just listening for audio quality; they're checking phonetic accuracy, transcription fidelity, and dialect consistency for every single submission.

And we still don't stop there.

After human review, every dataset runs through WER and CER measurement to catch anything the human pass might have missed.
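For readers unfamiliar with the two metrics, WER and CER are both edit distances divided by the length of the reference, counted over words and characters respectively. The sketch below is a plain-Python illustration under two assumptions: the reference is the human-verified transcript and the hypothesis is some model's output (the post does not say what is compared), and spaces are counted as characters for CER.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference has no words")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference is empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

On a toy word list built from the examples in the post, replacing ɗ with d in one of four words costs a full word error but only a single character error, which is why the two metrics are worth reading together.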

We believe in two layers. No compromises. Because we're not building a data collection platform.

We're building the infra layer for African voice AI.

The foundation that didn't exist until now.

6 days ago

Today we're making part of our Yoruba speech datasets publicly available for the community.

It's a small release compared to what we've collected so far, but we believe African language AI advances faster when researchers, developers, students, and startups have access to quality datasets.

Over the last few months, Dialectra has grown to more than 500,000 voice samples and over 500 hours of speech data across multiple collection workflows.

One of the most exciting contributors to that growth has been Dialect Connect, our conversational speech platform that captures how people actually speak, not just scripted recordings.

We're committed to releasing datasets where it makes sense and where contributor permissions allow.

For organizations, research teams, and enterprises looking for larger datasets, custom collections, dialect-specific corpora, conversational speech data, transcription projects, or benchmarking support, our team has a structured onboarding process and would be happy to discuss your requirements.

We're not only collecting speech data; we're building the infrastructure layer for African voice AI.

HF: https://t.co/Htqb6j2gKe

6 days ago

@jeffreyon_

6 days ago

@Abdulhamid0912 How do you think Dialectra can help you?

6 days ago

We released less than 0.05% of our Hausa speech dataset.

Within days: researchers, startups, enterprises, and builders were in our inbox.

That's not a coincidence.

High-quality African speech data is one of the most valuable layers in AI right now and almost nobody is building it seriously.

Dialectra is.

Enterprise access: https://t.co/4389JazlEJ

And next up: our first Hausa TTS model, trained on Dialectra data, available via API.

The real work is just beginning.

6 days ago

We're preparing to release our first Hausa TTS model, trained on Dialectra datasets and designed to give builders, developers, and Hausa content creators access to real Hausa voice AI through an API.

6 days ago

@Abba_kakaa: One interesting thing happened after we open-sourced a very small portion of our datasets on Hugging Face.

We're talking about less than 0.05% of what we've collected so far.

Within days, we started receiving messages from researchers, startups, enterprises, and builders interested in partnerships, collaborations, and even purchasing access to larger datasets.

That was a strong signal.

It confirmed something we've believed from the beginning: high-quality African speech data remains one of the most valuable and underserved layers in the AI ecosystem.

As interest continues to grow, Dialectra has a structured onboarding process for enterprise users, research teams, and organizations looking to work with our datasets and speech infrastructure. Getting in touch is straightforward through our platform. https://t.co/eCLj0KHPQu

What's even more exciting is what's coming next.

We're preparing to release our first Hausa TTS model, trained on Dialectra datasets and designed to give builders, developers, and Hausa content creators access to real Hausa voice AI through an API.

Work done not finish.

_dialectra retweeted

9 days ago

If you're working on ASR, TTS, or conversational AI for African languages, come find us on Hugging Face. The data is there. Let's build together.

9 days ago

@creptosolutions Curiosity brings simplicity in Dialectra

9 days ago

Today, Dialectra releases its first open Hausa voice dataset on Hugging Face for free.

This is a small but meaningful step toward something bigger: dialect-aware AI infrastructure for African languages.

Hausa is one of the most widely spoken languages in Africa, with over 70 million speakers across Nigeria, Niger, Ghana, and beyond. Yet most voice AI systems fail Hausa speakers not because the technology can't work, but because the right data doesn't exist.

Dialectra is building that data layer.

This first dataset includes:

→ Dialect-tagged audio recordings from native Hausa speakers
→ Rich metadata: region, speaker profile, speaking style, quality score
→ Benchmark-ready structure for ASR and TTS development

We're making it available to researchers, AI teams, and developers who are building voice technology that actually works for African users.

This is just the beginning. More languages, more dialects, more infrastructure coming soon.

Dataset: https://t.co/02zqaBPuI6
Learn more: https://t.co/4389JazlEJ
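Metadata such as region and quality score is what lets a user carve a benchmark subset out of a release like the one announced above. The following is a hedged sketch over plain dictionaries: the column names `region` and `quality_score`, the threshold, and the function `balanced_subset` are assumptions for illustration and may not match the published dataset's schema.

```python
from collections import defaultdict


def balanced_subset(rows, min_quality: float, per_region: int) -> list:
    """Keep rows at or above min_quality, capped at per_region rows per region."""
    kept = defaultdict(list)
    for row in rows:
        # Quality is checked first, so low-quality rows never create a region entry.
        if row["quality_score"] >= min_quality and len(kept[row["region"]]) < per_region:
            kept[row["region"]].append(row)
    # Regions are emitted in sorted order so the result is reproducible.
    return [row for region in sorted(kept) for row in kept[region]]
```

Capping each region keeps one heavily sampled dialect from dominating an evaluation set, which matters when the aim is to compare performance across dialects rather than on the corpus as a whole.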

_dialectra retweeted

Crypto Solutions 🕊️

@creptosolutions

9 days ago

This is a solid step in the right direction. African AI progress will depend heavily on datasets like this - not just models. Hausa being properly represented means real users can finally be included, not ignored. If this momentum continues, we’re not just building AI for Africa… we’re building AI with Africa.

9 days ago

@omarladan_ This is corpus V1...🧐

9 days ago

@fuchaweb3 🧐

11 days ago

@Abba_kakaa @ONDINigeria Dialectra is doing it. From the founder to the contributors, thank you.

11 days ago

We have moved to the next step as a startup 🤝 Thank you for believing in us.

11 days ago

@Abba_kakaa: Excited to share that @_dialectra has been selected for the iHatch @ONDINigeria Cohort 5 Startup Incubation Programme 🇳🇬

Less than a year ago, we started with a simple belief: African languages deserve better representation in AI.

When two of the biggest national incubators, @iDICEStartup Startup Bridge Founders Lab and iHatch, select your solution in less than a month, it's another reminder that we're moving in the right direction.

Thank you to our contributors, partners, team, and everyone supporting the mission.

A lot more work ahead.

13 days ago

Slowly building the team behind Dialectra.

Happy to welcome a new AI/ML Engineer as we continue laying the foundations for African speech AI.

We've spent months building datasets, annotation workflows, benchmarking systems, and speech infrastructure.

Now it's time to push deeper into models, evaluation, and product development.

Welcome to the journey @nababa_h

_dialectra retweeted

13 days ago

Most people talk about model accuracy when it comes to AI.

We spend a lot of time thinking about what happens before the model ever sees the data.

At https://t.co/eCLj0KHPQu our WER pipeline isn't just a final evaluation score. Every voice submission goes through the following:

→ transcription
→ human review
→ normalization
→ annotation
→ verification
→ benchmark testing

This helps us catch errors early and continuously improve dataset quality before training begins.

The goal isn't simply to achieve a lower WER. The goal is to understand "WHY" errors happen across different dialects, accents, speech patterns, and conversational contexts.

The infrastructure layer matters more than most people think.
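Asking why errors happen across dialects usually starts with breaking one aggregate score into per-group scores. The sketch below assumes each utterance has already been scored and tagged with a group label; the tuple layout and the name `wer_by_group` are hypothetical, and the grouping key could just as well be accent or speaking style.

```python
from collections import defaultdict


def wer_by_group(results) -> dict:
    """Aggregate WER per group.

    results: iterable of (group, word_errors, reference_words) tuples,
    one per utterance. Errors and reference words are summed per group
    before dividing, so long utterances carry more weight than short ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, word_errors, reference_words in results:
        errors[group] += word_errors
        words[group] += reference_words
    # Groups with no reference words are skipped to avoid dividing by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}
```

A gap between two groups in the returned dictionary does not explain itself, but it points to which recordings are worth sending back for human review first.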

_dialectra retweeted

Msageer

@Msageer_

14 days ago

https://t.co/CiP1j8Ypqc

14 days ago

Thanks Msageer for the clear and powerful article! It perfectly explains why we’re focused on building real dialect-aware data for African languages and beyond. If you want AI that actually understands every voice, jump in at https://t.co/V8CWYjEHsC and start contributing. Let’s build this together.

Msageer

@Msageer_

14 days ago

https://t.co/CiP1j8Ypqc

_dialectra retweeted

18 days ago

Founders supporting Founders. Drop what you are building. I will help you with users, feedback, a breakdown post about the project/product, and of course funds. I will go first: https://t.co/czWIKAxbEY 👇👇
