Building Tomorrow’s Sound: AUA’s Ongoing Success in Expanding Armenian Language Dataset

YEREVAN, Armenia — On December 5, the American University of Armenia (AUA) Zaven P. & Sonia Akian College of Science and Engineering (CSE) successfully concluded the initial phase of the Armenian Language Speech-to-Text Data Collection Challenge. With the support of NVIDIA, the participants who recorded and validated the most speech data were recognized with special prizes for their active engagement in the initiative. Following the success of the preliminary stage, the challenge is open to anyone interested in contributing their voice online, thereby significantly enriching the Armenian language database.

The core idea of the initiative is to expand the Armenian language dataset, enabling scientists and students to effectively train new models using the NVIDIA NeMo toolkit, an open-source framework for conversational AI. The contributions are documented on Mozilla’s Common Voice website, an open-source database that periodically releases collected datasets under a free-to-use license (Creative Commons Zero, CC0), with the next release scheduled for mid-December.

Initially, the Armenian voice data amounted to 5 hours, significantly less than English (at 3,400 hours) and Georgian (at 154 hours). Thanks to the challenge, the voice data has quadrupled within just a few days, reaching 20 hours to date. However, a minimum of 200 hours of data would be necessary for the creation of new models.

Dr. Habet Madoyan, chair of the Bachelor of Science in Data Science program at CSE, remarked on the initiative’s current results and outlined future plans for reaching the goal: “As we witness the exponential growth of the Armenian voice dataset from 5 hours to 20 hours to date, we recognize the immense potential of our community. Our journey doesn’t end here. Beyond the current success lies a horizon where our collective voices, reaching well beyond the 200-hour milestone, will redefine the technological landscape for the Armenian language. Together, we are not just building datasets; we are solving an Armenian problem and shaping a future where our language will be more resilient in the digital era.”

As the initiative progresses, integrating the Armenian language into new technological advancements will become feasible, ensuring its sustained relevance among languages used in technology. “NVIDIA is a committed participant within the global ecosystem of companies and researchers that are working to advance speech AI technology. Our open-source project NeMo is an important part of this effort,” said Nikolay Karpov, senior research scientist at NVIDIA. “We hope to see a future in which every language, no matter how common and uncommon, can be accessed within speech AI solutions.”

As the database continues to grow, individual contributions can significantly enrich it, as every voice matters. Detailed steps for contributing voice recordings can be found on the official website.

Founded in 1991, the American University of Armenia (AUA) is a private, independent university located in Yerevan, Armenia, affiliated with the University of California, and accredited by the WASC Senior College and University Commission in the United States. AUA provides local and international students with Western-style education through top-quality undergraduate, graduate, and certificate programs, promotes research and innovation, encourages civic engagement and community service, and fosters democratic values.