AI & Machine Learning | Data Engineering | 2026

From raw audio to production-ready AI training data in 3 weeks

How we built Beruniy — Uzbekistan's first verified multi-speaker speech dataset engineered for commercial AI training

The problem

Uzbek is a low-resource language

AI companies building voice recognition, text-to-speech, and NLP systems had no reliable, production-grade Uzbek training data. Existing datasets were small and unverified, and they required months of cleaning before they could be used. There was no commercial-grade option.

What we built

A dataset designed for production

Multi-speaker dataset sourced from native Uzbek podcast audio

Verified and segmented with speaker diarisation

Standardised metadata format — ready for ML pipelines immediately

Zero additional cleaning required by client teams
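
As an illustration of what a standardised, pipeline-ready metadata format for diarised segments could look like, here is a minimal sketch in Python. The `Segment` schema, its field names, and the JSON Lines layout are assumptions made for this example; they are not Beruniy's published format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Segment:
    """One diarised speech segment (hypothetical schema, for illustration only)."""
    audio_path: str   # path or object key of the source audio file
    speaker_id: str   # speaker label assigned during diarisation
    start_s: float    # segment start, in seconds
    end_s: float      # segment end, in seconds
    text: str         # verified transcript of the segment

    def validate(self) -> None:
        # Reject segments with impossible time spans or missing labels.
        if self.start_s < 0 or self.end_s <= self.start_s:
            raise ValueError(f"bad time span: {self.start_s}-{self.end_s}")
        if not self.speaker_id or not self.text.strip():
            raise ValueError("speaker_id and text must be non-empty")


def to_jsonl(segments):
    """Serialise validated segments as JSON Lines, one record per line."""
    lines = []
    for seg in segments:
        seg.validate()
        # ensure_ascii=False keeps Uzbek characters readable in the manifest.
        lines.append(json.dumps(asdict(seg), ensure_ascii=False))
    return "\n".join(lines)


def from_jsonl(payload):
    """Parse JSON Lines back into validated Segment records."""
    segments = []
    for line in payload.splitlines():
        if line.strip():
            seg = Segment(**json.loads(line))
            seg.validate()
            segments.append(seg)
    return segments
```

A training pipeline could call `from_jsonl` on a manifest file's contents and fail fast on any malformed record, which is one way a dataset can reach client teams without further cleaning.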

Project delivery: 3 weeks
Verified voice coverage: Multi-speaker
Additional cleaning needed: 0 hrs

Tech stack

Modern infrastructure for massive data

Python, PyTorch, AWS S3, custom audio pipeline

The result

Immediate deployment

Beruniy launched as the reference Uzbek speech dataset for AI companies operating in Central Asia. The dataset is immediately deployable in production AI training pipelines — no preprocessing, no guesswork.
