How we built Beruniy — Uzbekistan's first verified multi-speaker speech dataset engineered for commercial AI training
AI companies building speech recognition, text-to-speech, and NLP systems for Uzbek had no reliable, production-grade training data. Existing datasets were small and unverified, and they required months of cleaning before they could be used. There was no commercial-grade option.
Multi-speaker dataset sourced from native Uzbek podcast audio
Verified and segmented with speaker diarisation
Standardised metadata format — ready for ML pipelines immediately
Zero additional cleaning required by client teams
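To make "standardised metadata" concrete, here is a minimal sketch of what one diarised segment record and a basic sanity check could look like on the client side. The field names (audio_path, speaker_id, start_s, end_s, transcript) and the helper functions are hypothetical illustrations, not Beruniy's published schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarised utterance. All field names here are hypothetical."""
    audio_path: str   # source audio file for the segment
    speaker_id: str   # label assigned by speaker diarisation
    start_s: float    # segment start, in seconds
    end_s: float      # segment end, in seconds
    transcript: str   # verified text for the segment


def validate(segment: Segment) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    if segment.start_s < 0:
        problems.append("negative start time")
    if segment.end_s <= segment.start_s:
        problems.append("end time is not after start time")
    if not segment.speaker_id:
        problems.append("missing speaker id")
    if not segment.transcript.strip():
        problems.append("empty transcript")
    return problems


def speech_seconds_by_speaker(segments: list[Segment]) -> dict[str, float]:
    """Sum segment durations per speaker, e.g. to check speaker balance."""
    totals: dict[str, float] = {}
    for seg in segments:
        duration = seg.end_s - seg.start_s
        totals[seg.speaker_id] = totals.get(seg.speaker_id, 0.0) + duration
    return totals
```

With a fixed record shape like this, a training pipeline can reject malformed rows at load time and compute per-speaker statistics without any dataset-specific parsing.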
Beruniy launched as the reference Uzbek speech dataset for AI companies operating in Central Asia. The dataset can go straight into production AI training pipelines, with no preprocessing and no guesswork.