Vietnamese Automatic Speech Recognition: Self-Supervised and Semi-Supervised Learning Techniques Combination

EasyChair Preprint 11660

5 pages • Date: January 2, 2024

Abstract

Vietnamese speech recognition is attracting growing interest and investment from researchers and organizations. When only a small amount of training data is available, self-supervised models have outperformed supervised models in speech recognition. In this study, I explore combining two learning methods, self-supervised learning and semi-supervised learning, to solve the speech recognition problem. For self-supervised learning, I use a HuBERT model, which combines offline clustering with a BERT-like prediction loss. I then apply the Gradient Mask technique to the HuBERT model to perform semi-supervised learning. The VLSP 2022 organizers provided approximately 500 hours of unlabeled data and 50 hours of labeled data for training. The proposed approach ranks third on the ASR-T1 test, with a Syllable Error Rate (SyER) of 14.28%.
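
For readers unfamiliar with the metric, the Syllable Error Rate quoted above can be sketched as follows. This is a minimal illustration, not the VLSP 2022 scoring script: it assumes SyER is computed like word error rate, as the Levenshtein distance between reference and hypothesis syllable sequences divided by the number of reference syllables, and that syllables are the whitespace-separated tokens of Vietnamese text. The function name `syllable_error_rate` is hypothetical.

```python
def syllable_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over syllables divided by reference length (assumed SyER definition)."""
    # Assumption: Vietnamese syllables are separated by whitespace.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one syllable")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis syllables.
    prev = list(range(len(hyp) + 1))
    for i, ref_syl in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_syl in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_syl == hyp_syl else 1)
            deletion = prev[j] + 1       # reference syllable missing from hypothesis
            insertion = curr[j - 1] + 1  # extra syllable in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under these assumptions, a hypothesis that gets one of three reference syllables wrong scores 1/3, and a corpus-level figure would typically sum the edit distances and reference lengths over all utterances before dividing.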

Keyphrases: pseudo-labeling, self-supervised learning, speech recognition

Links:

https://easychair.org/publications/preprint/vxmk

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:11660,
  author    = {Duong Trinh},
  title     = {Vietnamese Automatic Speech Recognition: Self-Supervised and Semi-Supervised Learning Techniques Combination},
  howpublished = {EasyChair Preprint 11660},
  year      = {EasyChair, 2024}}
