
2024 | Original Paper | Book chapter

Accurate Semi-supervised Automatic Speech Recognition via Multi-hypotheses-Based Curriculum Learning

Authors: Junghun Kim, Ka Hyun Park, U Kang

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore


Abstract

How can we accurately transcribe speech signals into text when only a portion of them are annotated? ASR (Automatic Speech Recognition) systems are extensively used in many real-world applications, including automatic translation systems and transcription services. Due to the exponential growth of available speech data without annotations and the significant cost of manual labeling, semi-supervised ASR approaches have garnered attention. Such scenarios include transcribing videos on streaming platforms, where a vast amount of content is uploaded daily but only a fraction of it is transcribed manually. Previous approaches to semi-supervised ASR use a pseudo-labeling scheme to incorporate unlabeled examples during training. However, their effectiveness is limited because they ignore the uncertainty of the pseudo labels when using them as supervision for unlabeled instances. In this paper, we propose MOCA, an accurate framework for semi-supervised ASR. MOCA generates multiple hypotheses for each speech instance to account for the uncertainty of the pseudo label. Furthermore, MOCA considers the varying degrees of uncertainty in pseudo labels across speech instances, enabling robust training on the uncertain dataset. Extensive experiments on real-world speech datasets show that MOCA successfully improves the transcription performance of previous ASR models.
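The core ideas sketched in the abstract — weighting multiple decoding hypotheses by confidence and ordering unlabeled utterances from certain to uncertain — can be illustrated with a minimal Python sketch. This is not the authors' implementation; all function names are hypothetical, and it assumes each unlabeled utterance comes with the log-scores of its n-best hypotheses (e.g., from beam-search decoding).

```python
import math

def hypothesis_weights(log_scores):
    """Softmax over the (log-)scores of the n-best hypotheses, so each
    pseudo label contributes proportionally to the model's confidence."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def uncertainty(weights):
    """Shannon entropy of the hypothesis distribution: low entropy means
    a confident pseudo label, high entropy an uncertain one."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def curriculum_order(nbest_scores_per_utt):
    """Sort unlabeled utterances from most to least certain, so training
    can start with reliable pseudo labels and gradually add noisy ones."""
    ents = [(uncertainty(hypothesis_weights(scores)), i)
            for i, scores in enumerate(nbest_scores_per_utt)]
    return [i for _, i in sorted(ents)]

# Utterance 0 has one dominant hypothesis; utterance 1 is ambiguous,
# so the curriculum schedules utterance 0 first.
order = curriculum_order([[0.0, -5.0, -5.0], [0.0, -0.1, -0.2]])
```

The entropy-based ordering is one simple way to realize a certainty-first curriculum; the paper's actual uncertainty measure and schedule may differ.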


Metadata
Title
Accurate Semi-supervised Automatic Speech Recognition via Multi-hypotheses-Based Curriculum Learning
Authors
Junghun Kim
Ka Hyun Park
U Kang
Copyright year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2262-4_4
