
2021 | Original Paper | Book Chapter

JUWELS Booster – A Supercomputer for Large-Scale AI Research

Authors: Stefan Kesselheim, Andreas Herten, Kai Krajsek, Jan Ebert, Jenia Jitsev, Mehdi Cherti, Michael Langguth, Bing Gong, Scarlet Stadtler, Amirpasha Mozaffari, Gabriele Cavallaro, Rocco Sedona, Alexander Schug, Alexandre Strube, Roshni Kamath, Martin G. Schultz, Morris Riedel, Thomas Lippert

Published in: High Performance Computing

Publisher: Springer International Publishing


Abstract

In this article, we present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center. With its system architecture, most importantly its large number of powerful Graphics Processing Units (GPUs) and its fast InfiniBand interconnect, it is an ideal machine for large-scale Artificial Intelligence (AI) research and applications. We detail its system architecture, parallel and distributed model training, and benchmarks indicating its outstanding performance. We illustrate its potential for research applications by presenting large-scale AI research highlights from various scientific fields that require such a facility.


Footnotes
3
PyTorch supports automatic differentiation for tensors distributed across computational devices via the remote procedure call (RPC) protocol [9]. However, the RPC framework does not match the performance of dedicated communication backends such as NCCL or MPI.
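The communication backends mentioned in the footnote all center on one collective operation: averaging gradients across workers via an all-reduce. A minimal, framework-free sketch of that averaging step is shown below; the function name and list-of-floats representation are illustrative assumptions, whereas real backends such as NCCL or MPI operate on GPU tensors with optimized ring or tree algorithms.

```python
# Illustrative sketch (not from the chapter): the all-reduce averaging step
# that backends such as NCCL or MPI perform on gradients in data-parallel
# training. All names are hypothetical; real systems operate on GPU tensors.

def allreduce_mean(worker_grads):
    """Average per-worker gradient vectors as an all-reduce would.

    worker_grads: list of equal-length lists of floats, one per worker.
    Returns the averaged gradient that every worker ends up holding.
    """
    n_workers = len(worker_grads)
    dim = len(worker_grads[0])
    # Reduce: element-wise sum across all workers.
    total = [sum(g[i] for g in worker_grads) for i in range(dim)]
    # Broadcast: every worker receives the mean.
    return [x / n_workers for x in total]

# Example: two workers computed gradients on different data shards.
avg = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
print(avg)  # [2.0, 3.0]
```

After this step each worker applies the identical averaged gradient, which keeps all model replicas synchronized; the performance gap noted in the footnote comes from how efficiently this collective is implemented, not from the arithmetic itself.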
 
References
1. Intel Math Kernel Library. Reference Manual. Intel Corporation (2009)
5. Agarwal, S., Wang, H., Venkataraman, S., Papailiopoulos, D.: On the utility of gradient compression in distributed training systems. ArXiv abs/2103.00543 (2021)
6. Amodei, D., Hernandez, D., Sastry, G., Clark, J., Brockman, G., Sutskever, I.: AI and compute. Technical report, OpenAI Blog (2018)
9. Birrell, A.D., Nelson, B.J.: Implementing remote procedure calls. ACM Trans. Comput. Syst. 2(1), 39–59 (1984)
10. Brown, T., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020)
12. Canty, M.: Image Analysis, Classification and Change Detection in Remote Sensing: With Algorithms for ENVI/IDL and Python, 3rd edn. Taylor & Francis, New York (2014). ISBN: 9781466570375
13. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029 (2020)
14. Cherti, M., Jitsev, J.: Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images. arXiv preprint arXiv:2106.00116 (2021)
15. Chetlur, S., et al.: cuDNN: efficient primitives for deep learning (2014)
16. Cohen, J.P., Morrison, P., Dao, L., Roth, K., Duong, T.Q., Ghassemi, M.: COVID-19 image data collection: prospective predictions are the future. J. Mach. Learn. Biomed. Imaging (2020)
18. Dago, A.E., Schug, A., Procaccini, A., Hoch, J.A., Weigt, M., Szurmant, H.: Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis. Proc. Natl. Acad. Sci. 109(26), E1733–E1742 (2012)
23. Ginsburg, B., et al.: Stochastic gradient methods with layer-wise adaptive moments for training of deep networks (2020)
25. Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour (2018)
26. Götz, M., et al.: HeAT - a distributed and GPU-accelerated tensor framework for data analytics. In: Proceedings of the 19th IEEE International Conference on Big Data, pp. 276–288. IEEE, December 2020
27.
29. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 37, pp. 448–456. PMLR, Lille, France, 7–9 July 2015. http://proceedings.mlr.press/v37/ioffe15.html
33. Kolesnikov, A., et al.: Big transfer (BiT): general visual representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision - ECCV 2020, pp. 491–507. Springer, Cham (2020)
34. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
35. Kurth, T., et al.: Exascale deep learning for climate analytics. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 649–660. IEEE (2018)
37. Lee, A.X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S.: Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523 (2018)
39. Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. arXiv e-prints arXiv:1711.00436, November 2017
40. Lorenzo, P.R., Nalepa, J., Ramos, L., Ranilla, J.: Hyper-parameter selection in deep neural networks using parallel particle swarm optimization. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion (2017)
41. Mattson, P., et al.: MLPerf: an industry standard benchmark suite for machine learning performance. IEEE Micro 40(2), 8–16 (2020)
44. Orhan, E., Gupta, V., Lake, B.M.: Self-supervised learning through the eyes of a child. In: Advances in Neural Information Processing Systems, vol. 33 (2020)
46. Patton, R.M., et al.: Exascale deep learning to accelerate cancer research. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 1488–1496. IEEE (2019)
47. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 512–519, June 2014. https://doi.org/10.1109/CVPRW.2014.131
49. Ren, J., et al.: ZeRO-Offload: democratizing billion-scale model training (2021)
50. Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Huff, K., Bergstra, J. (eds.) Proceedings of the 14th Python in Science Conference (SciPy 2015), pp. 130–136 (2015)
51. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
52.
53. Schug, A., Weigt, M., Onuchic, J.N., Hwa, T., Szurmant, H.: High-resolution protein complexes from integrating genomic information with molecular simulation. Proc. Natl. Acad. Sci. 106(52), 22124–22129 (2009)
55.
56. Shallue, C.J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., Dahl, G.E.: Measuring the effects of data parallelism on neural network training. J. Mach. Learn. Res. 20, 1–49 (2019)
57. Shi, X., et al.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems (2015)
58. Sriram, A., et al.: COVID-19 deterioration prediction via self-supervised representation learning and multi-image prediction. arXiv preprint arXiv:2101.04909 (2021)
59. Stodden, V., et al.: Enhancing reproducibility for computational methods. Science 354(6317), 1240–1241 (2016)
63. Uguzzoni, G., Lovis, S.J., Oteri, F., Schug, A., Szurmant, H., Weigt, M.: Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis. Proc. Natl. Acad. Sci. 114(13), E2662–E2671 (2017)
67. Weigt, M., White, R.A., Szurmant, H., Hoch, J.A., Hwa, T.: Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. 106(1), 67–72 (2009)
68. Zerihun, M.B., Pucci, F., Peter, E.K., Schug, A.: pydca v1.0: a comprehensive software for direct coupling analysis of RNA and protein sequences. Bioinformatics 36(7), 2264–2265 (2020)
69. Zerihun, M.B., Pucci, F., Schug, A.: CocoNet: boosting RNA contact prediction by convolutional neural networks. bioRxiv (2020)
70. Zhang, D., et al.: The AI Index 2021 annual report. Technical report, AI Index Steering Committee, Human-Centered AI Institute, Stanford University, Stanford, CA (2021)
Metadata
Title
JUWELS Booster – A Supercomputer for Large-Scale AI Research
Authors
Stefan Kesselheim
Andreas Herten
Kai Krajsek
Jan Ebert
Jenia Jitsev
Mehdi Cherti
Michael Langguth
Bing Gong
Scarlet Stadtler
Amirpasha Mozaffari
Gabriele Cavallaro
Rocco Sedona
Alexander Schug
Alexandre Strube
Roshni Kamath
Martin G. Schultz
Morris Riedel
Thomas Lippert
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-90539-2_31
