Vietnamese Automatic Speech Recognition: Self-Supervised and Semi-Supervised Learning Techniques Combination
EasyChair Preprint 11660
5 pages • Date: January 2, 2024

Abstract
Vietnamese speech recognition is attracting growing interest and investment from researchers and organizations. When only a small amount of training data is available, self-supervised models have outperformed supervised models in speech recognition. In this study, I combine two learning methods, self-supervised learning and semi-supervised learning, to address the speech recognition problem. For self-supervised learning, I use a HuBERT model, which combines offline clustering with a BERT-like prediction loss. I then apply the Gradient Mask technique to the HuBERT model to perform semi-supervised learning. The VLSP 2022 organizers provided approximately 500 hours of unlabeled data and 50 hours of labeled data for training. The proposed approach ranked third on the ASR-T1 test, with a Syllable Error Rate (SyER) of 14.28%.

Keyphrases: pseudo-labeling, self-supervised learning, speech recognition
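The abstract reports its result as a Syllable Error Rate but does not define the metric. The sketch below shows one common way such a rate could be computed, assuming SyER is the syllable-level analogue of word error rate: the edit distance between the reference and hypothesis syllable sequences, divided by the number of reference syllables. The function name `syllable_error_rate` and the whitespace tokenization are illustrative assumptions, and the official VLSP 2022 scoring may normalize text differently.

```python
def syllable_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace-separated syllables / reference length.

    Assumption: Vietnamese syllables are separated by whitespace, so
    str.split() is used as the tokenizer. This is an illustrative sketch,
    not the official VLSP scoring script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one syllable")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis syllables.
    prev = list(range(len(hyp) + 1))
    for i, ref_syl in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference syllables
        for j, hyp_syl in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_syl == hyp_syl else 1)
            deletion = prev[j] + 1      # reference syllable missing in hypothesis
            insertion = curr[j - 1] + 1  # extra syllable in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of three reference syllables is dropped by the hypothesis.
    print(syllable_error_rate("tôi đi học", "tôi học"))
```

Under this definition, a SyER of 14.28% would correspond to roughly one syllable-level error (substitution, deletion, or insertion) for every seven reference syllables.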