
2021 | Original Paper | Book Chapter

JUWELS Booster – A Supercomputer for Large-Scale AI Research

Authors: Stefan Kesselheim, Andreas Herten, Kai Krajsek, Jan Ebert, Jenia Jitsev, Mehdi Cherti, Michael Langguth, Bing Gong, Scarlet Stadtler, Amirpasha Mozaffari, Gabriele Cavallaro, Rocco Sedona, Alexander Schug, Alexandre Strube, Roshni Kamath, Martin G. Schultz, Morris Riedel, Thomas Lippert

Published in: High Performance Computing

Publisher: Springer International Publishing


Abstract

In this article, we present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center. With its system architecture, most importantly its large number of powerful Graphics Processing Units (GPUs) and its fast InfiniBand interconnect, it is an ideal machine for large-scale Artificial Intelligence (AI) research and applications. We detail its system architecture, parallel and distributed model training, and benchmarks indicating its outstanding performance. We illustrate its potential for research applications by presenting large-scale AI research highlights from various scientific fields that require such a facility.


Footnotes
3
PyTorch supports automatic differentiation for tensors distributed across computational devices via the remote procedure call (RPC) protocol [9]. However, the RPC framework does not match the performance of dedicated communication backends such as NCCL or MPI.
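The communication backends mentioned in the footnote all center on one collective operation: averaging gradients across workers via an all-reduce. A minimal, framework-free sketch of that averaging step is shown below; the function name and list-of-floats representation are illustrative assumptions, whereas real backends such as NCCL or MPI operate on GPU tensors with optimized ring or tree algorithms.

```python
# Illustrative sketch (not from the chapter): the all-reduce averaging step
# that backends such as NCCL or MPI perform on gradients in data-parallel
# training. All names are hypothetical; real systems operate on GPU tensors.

def allreduce_mean(worker_grads):
    """Average per-worker gradient vectors as an all-reduce would.

    worker_grads: list of equal-length lists of floats, one per worker.
    Returns the averaged gradient that every worker ends up holding.
    """
    n_workers = len(worker_grads)
    dim = len(worker_grads[0])
    # Reduce: element-wise sum across all workers.
    total = [sum(g[i] for g in worker_grads) for i in range(dim)]
    # Broadcast: every worker receives the mean.
    return [x / n_workers for x in total]

# Example: two workers computed gradients on different data shards.
avg = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
print(avg)  # [2.0, 3.0]
```

After this step each worker applies the identical averaged gradient, which keeps all model replicas synchronized; the performance gap noted in the footnote comes from how efficiently this collective is implemented, not from the arithmetic itself.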
 
References
1. Intel Math Kernel Library. Reference Manual. Intel Corporation (2009)
5. Agarwal, S., Wang, H., Venkataraman, S., Papailiopoulos, D.: On the utility of gradient compression in distributed training systems. ArXiv abs/2103.00543 (2021)
6. Amodei, D., Hernandez, D., Sastry, G., Clark, J., Brockman, G., Sutskever, I.: AI and compute. Technical report, OpenAI Blog (2018)
9. Birrell, A.D., Nelson, B.J.: Implementing remote procedure calls. ACM Trans. Comput. Syst. 2(1), 39–59 (1984)
10. Brown, T., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020)
12. Canty, M.: Image Analysis, Classification and Change Detection in Remote Sensing: With Algorithms for ENVI/IDL and Python, 3rd edn. Taylor & Francis, New York (2014). ISBN: 9781466570375
13. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029 (2020)
14. Cherti, M., Jitsev, J.: Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images. arXiv preprint arXiv:2106.00116 (2021)
15. Chetlur, S., et al.: cuDNN: efficient primitives for deep learning (2014)
16. Cohen, J.P., Morrison, P., Dao, L., Roth, K., Duong, T.Q., Ghassemi, M.: COVID-19 image data collection: prospective predictions are the future. J. Mach. Learn. Biomed. Imaging (2020)
18. Dago, A.E., Schug, A., Procaccini, A., Hoch, J.A., Weigt, M., Szurmant, H.: Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis. Proc. Natl. Acad. Sci. 109(26), E1733–E1742 (2012)
23. Ginsburg, B., et al.: Stochastic gradient methods with layer-wise adaptive moments for training of deep networks (2020)
25. Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour (2018)
26. Götz, M., et al.: HeAT - a distributed and GPU-accelerated tensor framework for data analytics. In: Proceedings of the 19th IEEE International Conference on Big Data, pp. 276–288. IEEE, December 2020
27.
29. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 37, pp. 448–456. PMLR, Lille, France, 7–9 July 2015. http://proceedings.mlr.press/v37/ioffe15.html
33. Kolesnikov, A., et al.: Big transfer (BiT): general visual representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision - ECCV 2020, pp. 491–507. Springer, Cham (2020)
34. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
35. Kurth, T., et al.: Exascale deep learning for climate analytics. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 649–660. IEEE (2018)
37. Lee, A.X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S.: Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523 (2018)
39. Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. arXiv e-prints arXiv:1711.00436, November 2017
40. Lorenzo, P.R., Nalepa, J., Ramos, L., Ranilla, J.: Hyper-parameter selection in deep neural networks using parallel particle swarm optimization. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion (2017)
41. Mattson, P., et al.: MLPerf: an industry standard benchmark suite for machine learning performance. IEEE Micro 40(2), 8–16 (2020)
44. Orhan, E., Gupta, V., Lake, B.M.: Self-supervised learning through the eyes of a child. In: Advances in Neural Information Processing Systems, vol. 33 (2020)
46. Patton, R.M., et al.: Exascale deep learning to accelerate cancer research. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 1488–1496. IEEE (2019)
47. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 512–519, June 2014. https://doi.org/10.1109/CVPRW.2014.131
49. Ren, J., et al.: ZeRO-Offload: democratizing billion-scale model training (2021)
50. Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Huff, K., Bergstra, J. (eds.) Proceedings of the 14th Python in Science Conference (SciPy 2015), pp. 130–136 (2015)
51. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
52.
53. Schug, A., Weigt, M., Onuchic, J.N., Hwa, T., Szurmant, H.: High-resolution protein complexes from integrating genomic information with molecular simulation. Proc. Natl. Acad. Sci. 106(52), 22124–22129 (2009)
55.
56. Shallue, C.J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., Dahl, G.E.: Measuring the effects of data parallelism on neural network training. J. Mach. Learn. Res. 20, 1–49 (2019)
57. Shi, X., et al.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems (2015)
58. Sriram, A., et al.: COVID-19 deterioration prediction via self-supervised representation learning and multi-image prediction. arXiv preprint arXiv:2101.04909 (2021)
59. Stodden, V., et al.: Enhancing reproducibility for computational methods. Science 354(6317), 1240–1241 (2016)
63. Uguzzoni, G., Lovis, S.J., Oteri, F., Schug, A., Szurmant, H., Weigt, M.: Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis. Proc. Natl. Acad. Sci. 114(13), E2662–E2671 (2017)
67. Weigt, M., White, R.A., Szurmant, H., Hoch, J.A., Hwa, T.: Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. 106(1), 67–72 (2009)
68. Zerihun, M.B., Pucci, F., Peter, E.K., Schug, A.: pydca v1.0: a comprehensive software for direct coupling analysis of RNA and protein sequences. Bioinformatics 36(7), 2264–2265 (2020)
69. Zerihun, M.B., Pucci, F., Schug, A.: CocoNet: boosting RNA contact prediction by convolutional neural networks. bioRxiv (2020)
70. Zhang, D., et al.: The AI Index 2021 annual report. Technical report, AI Index Steering Committee, Human-Centered AI Institute, Stanford University, Stanford, CA (2021)
Metadata
Title
JUWELS Booster – A Supercomputer for Large-Scale AI Research
Authors
Stefan Kesselheim
Andreas Herten
Kai Krajsek
Jan Ebert
Jenia Jitsev
Mehdi Cherti
Michael Langguth
Bing Gong
Scarlet Stadtler
Amirpasha Mozaffari
Gabriele Cavallaro
Rocco Sedona
Alexander Schug
Alexandre Strube
Roshni Kamath
Martin G. Schultz
Morris Riedel
Thomas Lippert
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-90539-2_31
