2024 | Original Paper | Book Chapter

Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling

Authors: Bokyeong Yoon, Yoonsang Han, Gordon Euhyun Moon

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very computationally demanding. Prior efforts to sparsify the Transformer have used either a fixed pattern or a data-driven approach to reduce the number of operations involved in computing multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from inevitable problems, including potential loss of essential sequence features and an increase in model size. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our sparsification approach significantly reduces the computational complexity and memory footprint of the Transformer during training. We develop efficient GPU implementations of the layer-wise sparsified attention algorithm and demonstrate that our SPION achieves up to a 2.78\(\times \) speedup over existing state-of-the-art sparse Transformer models while maintaining high evaluation quality.
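The abstract describes the mechanism only at a high level, so the following is a minimal, hypothetical Python sketch of the general idea rather than SPION's actual algorithm: the magnitudes of the attention scores are smoothed with a small convolution filter, and a flood fill then grows connected regions outward from high-score seeds to form a sparsity mask. The function names, the 3\(\times \)3 averaging filter, and the quantile thresholds are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' implementation): build a sparsity mask
# for one attention head/layer by (1) smoothing the score magnitudes with a
# small convolution filter and (2) flood-filling from high-score seeds.
import numpy as np
from collections import deque

def conv2d_same(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Naive 'same'-padded 2D filtering used to smooth the score map."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def flood_fill_mask(score: np.ndarray, seed_q: float = 0.99,
                    keep_q: float = 0.90) -> np.ndarray:
    """Grow a boolean mask from the strongest entries (seeds) to their
    4-connected neighbours whose smoothed score stays above a lower bar.
    Thresholds are assumed quantiles, not values from the paper."""
    seed_thr = np.quantile(score, seed_q)
    keep_thr = np.quantile(score, keep_q)
    mask = np.zeros(score.shape, dtype=bool)
    frontier = deque(map(tuple, np.argwhere(score >= seed_thr)))
    while frontier:
        i, j = frontier.popleft()
        if mask[i, j]:
            continue
        mask[i, j] = True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if (0 <= ni < score.shape[0] and 0 <= nj < score.shape[1]
                    and not mask[ni, nj] and score[ni, nj] >= keep_thr):
                frontier.append((ni, nj))
    return mask

# Usage: smooth |QK^T|-like scores, then flood-fill a layer-wise sparse pattern.
rng = np.random.default_rng(0)
attn_scores = np.abs(rng.standard_normal((64, 64)))
smoothed = conv2d_same(attn_scores, np.ones((3, 3)) / 9.0)
mask = flood_fill_mask(smoothed)
print(f"kept {mask.mean():.1%} of attention entries")
```

Only the entries selected by such a mask would then be computed in the sparsified attention, which is where the reduction in computation and memory during training would come from.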

Metadata
Title
Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling
Authors
Bokyeong Yoon
Yoonsang Han
Gordon Euhyun Moon
Copyright year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2253-2_13