AI & Machine Learning · Data Engineering · 2026

From raw audio to production-ready AI training data in 3 weeks

How we built Beruniy — Uzbekistan's first verified multi-speaker speech dataset engineered for commercial AI training

The problem

Uzbek is a low-resource language

AI companies building voice recognition, text-to-speech, and NLP systems for Uzbek had no reliable, production-grade training data. Existing datasets were small and unverified, and they required months of cleaning before they could be used. There was no commercial-grade option.

What we built

A dataset designed for production

Multi-speaker dataset sourced from native Uzbek podcast audio

Verified and segmented with speaker diarisation

Standardised metadata format — ready for ML pipelines immediately

Zero additional cleaning required by client teams
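
To make "standardised metadata" concrete, here is a minimal sketch of what one diarised segment record and a basic validation pass could look like. The case study does not publish Beruniy's actual schema, so every field name (`audio_path`, `speaker_id`, `start_s`, `end_s`, `transcript`), the JSON Lines layout, and the helper functions are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarised utterance. All field names here are hypothetical."""
    audio_path: str   # path or object key of the source audio file
    speaker_id: str   # speaker label assigned during diarisation
    start_s: float    # segment start within the source file, in seconds
    end_s: float      # segment end within the source file, in seconds
    transcript: str   # transcript text for the segment

    def validate(self) -> None:
        """Raise ValueError if the record is obviously malformed."""
        if not self.audio_path or not self.speaker_id:
            raise ValueError("audio_path and speaker_id must be non-empty")
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("expected 0 <= start_s < end_s")
        if not self.transcript.strip():
            raise ValueError("transcript must be non-empty")


def parse_manifest(lines):
    """Parse JSON Lines text (one record per line) into validated Segments."""
    segments = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        segment = Segment(**json.loads(line))
        segment.validate()
        segments.append(segment)
    return segments


def hours_per_speaker(segments):
    """Sum segment durations per speaker and convert seconds to hours."""
    totals = {}
    for seg in segments:
        duration = seg.end_s - seg.start_s
        totals[seg.speaker_id] = totals.get(seg.speaker_id, 0.0) + duration
    return {speaker: secs / 3600 for speaker, secs in totals.items()}
```

A consuming team could run `parse_manifest(open("manifest.jsonl", encoding="utf-8"))` (the file name is also hypothetical) as a first sanity check, then use `hours_per_speaker` to confirm speaker balance before training.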

3 weeks: project delivery

Multi-speaker: verified voice coverage

0 hrs: additional cleaning needed

Tech stack

Modern infrastructure for massive data

Python, PyTorch, AWS S3, custom audio pipeline

The result

Immediate deployment

Beruniy launched as the reference Uzbek speech dataset for AI companies operating in Central Asia. The dataset is immediately deployable in production AI training pipelines — no preprocessing, no guesswork.

