
2021 | Book

High Performance Computing

ISC High Performance Digital 2021 International Workshops, Frankfurt am Main, Germany, June 24 – July 2, 2021, Revised Selected Papers


About this book

This book constitutes the refereed post-conference proceedings of 9 workshops held at the 35th International ISC High Performance 2021 Conference, in Frankfurt, Germany, in June-July 2021:

Second International Workshop on the Application of Machine Learning Techniques to Computational Fluid Dynamics and Solid Mechanics Simulations and Analysis; HPC-IODC: HPC I/O in the Data Center Workshop; Compiler-assisted Correctness Checking and Performance Optimization for HPC; Machine Learning on HPC Systems; 4th International Workshop on Interoperability of Supercomputing and Cloud Technologies; 2nd International Workshop on Monitoring and Operational Data Analytics; 16th Workshop on Virtualization in High-Performance Cloud Computing; Deep Learning on Supercomputers; 5th International Workshop on In Situ Visualization.

The 35 papers included in this volume were carefully reviewed and selected. They cover all aspects of research, development, and application of large-scale, high performance experimental and commercial systems. Topics include high-performance computing (HPC), computer architecture and hardware, programming models, system software, performance analysis and modeling, compiler analysis and optimization techniques, software sustainability, scientific applications, and deep learning.

Table of Contents

Frontmatter
Correction to: An Explainable Model for Fault Detection in HPC Systems

The chapter was inadvertently published with a spelling error in the first author’s name. It has been corrected to “Martin Molan”.

Martin Molan, Andrea Borghesi, Francesco Beneventi, Massimiliano Guarrasi, Andrea Bartolini

Open Access

Correction to: Machine-Learning-Based Control of Perturbed and Heated Channel Flows

Chapter “Machine-Learning-Based Control of Perturbed and Heated Channel Flows” was previously published non-open access. It has now been changed to open access under a CC BY 4.0 license and the copyright holder updated to ‘The Author(s)’.

Mario Rüttgers, Moritz Waldmann, Wolfgang Schröder, Andreas Lintermann

Second International Workshop on the Application of Machine Learning Techniques to Computational Fluid Dynamics and Solid Mechanics Simulations and Analysis

Frontmatter

Open Access

Machine-Learning-Based Control of Perturbed and Heated Channel Flows

A reinforcement learning algorithm is coupled to a thermal lattice-Boltzmann method to control flow through a two-dimensional heated channel narrowed by a bump. The algorithm is allowed to change the disturbance factor of the bump and receives feedback in terms of the pressure loss and temperature increase between the inflow and outflow region of the channel. It is trained to modify the bump such that both fluid mechanical properties are rated equally important. After a modification, a new simulation is initialized using the modified geometry and the flow field computed in the previous run. The thermal lattice-Boltzmann method is validated for a fully developed isothermal channel flow. After 265 simulations, the trained algorithm predicts an averaged disturbance factor that deviates by less than 1% from the reference solution obtained from 3,400 numerical simulations using a parameter sweep over the disturbance factor. The error is reduced to less than 0.1% after 1,450 simulations. A comparison of the temperature, pressure, and streamwise velocity distributions of the reference solution with the solution after 1,450 simulations, taken along the line of the maximum streamwise velocity component, shows only negligible differences. The presented method is hence a valid approach for avoiding expensive parameter space explorations and promises to be effective in supporting shape optimizations for more complex configurations, e.g., in finding optimal nasal cavity shapes.
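
The control loop sketched below is only a hedged illustration of the setup the abstract describes, not the authors' code: a simple value-based agent picks a discrete disturbance factor, a cheap analytic stand-in replaces the thermal lattice-Boltzmann solver, and the reward weights pressure loss and temperature increase equally. The `run_simulation` placeholder and the sign convention in the reward are assumptions.

```python
# Toy value-based control loop in the spirit of the abstract (not the paper's
# implementation): the flow solver is replaced by an analytic stand-in.
import random

def run_simulation(factor):
    """Hypothetical stand-in for one thermal LBM run; returns (pressure loss, dT)."""
    pressure_loss = 1.0 + 4.0 * factor ** 2
    temperature_increase = 2.0 * factor
    return pressure_loss, temperature_increase

def reward(pressure_loss, temperature_increase):
    # Both fluid-mechanical properties rated equally important (weights 0.5);
    # the sign convention here is illustrative only.
    return 0.5 * temperature_increase - 0.5 * pressure_loss

factors = [i / 10 for i in range(11)]          # candidate disturbance factors
values = {f: 0.0 for f in factors}             # running value estimates
counts = {f: 0 for f in factors}

for episode in range(265):                     # 265 episodes, as in the abstract
    if random.random() < 0.1 or episode < len(factors):
        f = random.choice(factors)             # explore
    else:
        f = max(values, key=values.get)        # exploit current best estimate
    dp, dT = run_simulation(f)
    r = reward(dp, dT)
    counts[f] += 1
    values[f] += (r - values[f]) / counts[f]   # incremental mean update

print("learned disturbance factor:", max(values, key=values.get))
```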

Mario Rüttgers, Moritz Waldmann, Wolfgang Schröder, Andreas Lintermann
Novel DNNs for Stiff ODEs with Applications to Chemically Reacting Flows

Chemically reacting flows are common in engineering, such as hypersonic flow, combustion, explosions, manufacturing processes, and environmental assessments. For combustion, the number of reactions can be significant (over 100), and because chemical reactions consume the bulk of the CPU time (over 99%), a large number of flow and combustion problems are presently beyond the capabilities of even the largest supercomputers. Motivated by this, novel Deep Neural Networks (DNNs) are introduced to approximate stiff ODEs. Two approaches are compared, i.e., learning either the solution or the derivative of the solution to these ODEs. These DNNs are applied to multiple species and reactions common in chemically reacting flows. Experimental results show that it is helpful to account for the physical properties of species while designing DNNs. The proposed approach is shown to generalize well.

Thomas S. Brown, Harbir Antil, Rainald Löhner, Fumiya Togashi, Deepanshu Verma
Lettuce: PyTorch-Based Lattice Boltzmann Framework

The lattice Boltzmann method (LBM) is an efficient simulation technique for computational fluid mechanics and beyond. It is based on a simple stream-and-collide algorithm on Cartesian grids, which is easily compatible with modern machine learning architectures. While it is becoming increasingly clear that deep learning can provide a decisive stimulus for classical simulation techniques, recent studies have not addressed possible connections between machine learning and LBM. Here, we introduce Lettuce, a PyTorch-based LBM code with a threefold aim. Lettuce enables GPU-accelerated calculations with minimal source code, facilitates rapid prototyping of LBM models, and enables integrating LBM simulations with PyTorch’s deep learning and automatic differentiation facility. As a proof of concept for combining machine learning with the LBM, a neural collision model is developed, trained on a doubly periodic shear layer and then transferred to a different flow, a decaying turbulence field. We also exemplify the added benefit of PyTorch’s automatic differentiation framework in flow control and optimization. To this end, the spectrum of a forced isotropic turbulence is maintained without further constraining the velocity field. The source code is freely available from https://github.com/lettucecfd/lettuce.
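
To illustrate why the stream-and-collide algorithm maps so naturally onto PyTorch tensors, the sketch below implements a generic D2Q9 BGK lattice Boltzmann step with periodic boundaries. It is a minimal stand-alone example and does not use Lettuce's actual API (see the repository above for that); grid size, relaxation time, and the uniform initial flow are illustrative choices.

```python
import torch

# D2Q9 lattice: discrete velocities e and weights w.
e = torch.tensor([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
                  [1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=torch.float32)
w = torch.tensor([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, u):
    # rho: (X, Y), u: (2, X, Y) -> equilibrium populations f_eq: (9, X, Y).
    eu = torch.einsum('qd,dxy->qxy', e, u)
    uu = (u * u).sum(dim=0)
    return w[:, None, None] * rho * (1 + 3 * eu + 4.5 * eu**2 - 1.5 * uu)

def stream_and_collide(f, tau=0.6):
    # Periodic streaming along the discrete velocities, then BGK collision.
    f = torch.stack([torch.roll(f[q], shifts=(int(e[q, 0]), int(e[q, 1])),
                                dims=(0, 1)) for q in range(9)])
    rho = f.sum(dim=0)
    u = torch.einsum('qd,qxy->dxy', e, f) / rho
    return f - (f - equilibrium(rho, u)) / tau

# Uniform flow on a 64x64 periodic grid; runs on GPU if tensors are moved to CUDA.
rho0 = torch.ones(64, 64)
u0 = torch.zeros(2, 64, 64)
u0[0] = 0.05
f = equilibrium(rho0, u0)
for _ in range(100):
    f = stream_and_collide(f)
```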

Mario Christopher Bedrunka, Dominik Wilde, Martin Kliemank, Dirk Reith, Holger Foysi, Andreas Krämer
Reservoir Computing in Reduced Order Modeling for Chaotic Dynamical Systems

The mathematical concept of chaos was introduced by Edward Lorenz in the early 1960s while attempting to represent atmospheric convection through a two-dimensional fluid flow with an imposed temperature difference in the vertical direction. Since then, chaotic dynamical systems have been accepted as the foundation of the meteorological sciences and represent an indispensable testbed for weather and climate forecasting tools. Operational weather forecasting platforms rely on costly partial differential equation (PDE)-based models that run continuously on high performance computing architectures. Machine learning (ML)-based low-dimensional surrogate models can be viewed as a cost-effective alternative to such high-fidelity simulation platforms. In this work, we propose an ML method based on Reservoir Computing - Echo State Neural Network (RC-ESN) to accurately predict evolutionary states of chaotic systems. We start with the baseline Lorenz-63 and 96 systems and show that RC-ESN is extremely effective in consistently predicting time series using Pearson’s cross correlation similarity measure. RC-ESN can accurately forecast Lorenz systems for many Lyapunov time units into the future. In a practical numerical example, we applied RC-ESN combined with space-only proper orthogonal decomposition (POD) to build a reduced order model (ROM) that produces sequential short-term forecasts of pollution dispersion over the continental USA region. We use GEOS-CF simulated data to assess our RC-ESN ROM. Numerical experiments show reasonable results for such a highly complex atmospheric pollution system.
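
As a rough illustration of the reservoir-computing idea, the sketch below trains a minimal echo state network in NumPy: a fixed random reservoir is driven by an input signal and only the linear readout is fitted by ridge regression. This is not the authors' RC-ESN; the reservoir size, spectral radius, and the sinusoidal toy signal (standing in for a chaotic series such as a Lorenz-63 component) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, leak, rho_target, ridge = 500, 1.0, 0.9, 1e-6

W_in = rng.uniform(-0.5, 0.5, (n_res, 1))           # input weights (fixed)
W = rng.uniform(-0.5, 0.5, (n_res, n_res))          # reservoir weights (fixed)
W *= rho_target / np.max(np.abs(np.linalg.eigvals(W)))  # rescale spectral radius

def run_reservoir(inputs):
    """Drive the reservoir with a 1D signal and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = (1 - leak) * x + leak * np.tanh(W @ x + W_in @ np.atleast_1d(u))
        states.append(x.copy())
    return np.array(states)

# Toy signal standing in for a chaotic time series.
t = np.linspace(0, 60, 3000)
series = np.sin(t) * np.cos(0.37 * t)

X = run_reservoir(series[:-1])                       # reservoir states
Y = series[1:]                                       # one-step-ahead targets
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)  # ridge readout
prediction = X @ W_out                               # one-step predictions
print("training error:", np.mean((prediction - Y) ** 2))
```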

Alberto C. Nogueira Jr., Felipe C. T. Carvalho, João Lucas S. Almeida, Andres Codas, Eloisa Bentivegna, Campbell D. Watson
Film Cooling Prediction and Optimization Based on Deconvolution Neural Network

For film cooling in high pressure turbines, it is vital to predict the temperature distribution on the blade surface downstream of the cooling hole. This temperature distribution depends on the interaction between the hot mainstream and the coolant jet. Deep learning techniques have been widely applied in predicting physical problems such as complex fluid dynamics. A theoretical model based on a Deconvolutional Neural Network (Deconv NN) was developed to model the non-linear and high-dimensional mapping between the coolant jet parameters and the surface temperature distribution. Computational Fluid Dynamics (CFD) was utilized to provide data for training the models. The input of the model includes the blowing ratio, density ratio, hole inclination angle, hole diameter, etc. A comparison of accuracy against different methods and data set sizes shows that the Deconv NN is capable of predicting the film cooling effectiveness on the surface in the validation group with a quoted error (QE) of less than 0.62%. With rigorous testing and validation, the predicted results are found to be in good agreement with the results from CFD. Finally, the Sparrow Search Algorithm (SSA) is applied to optimize the coolant jet parameters using the validated neural networks. The results of the optimization show that the film cooling effectiveness has been successfully improved, with a QE of 7.35% when compared with the reference case.
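
The sketch below shows, in a generic and hedged form, what a deconvolution-based surrogate of this kind can look like in PyTorch: a few transposed-convolution layers decode a small vector of coolant-jet parameters into a 2D surface field. The layer sizes, the four-parameter input, and the 32x32 output resolution are illustrative assumptions, not the paper's architecture.

```python
import torch
from torch import nn

class DeconvSurrogate(nn.Module):
    """Generic decoder: jet parameters -> 2D field on the blade surface."""
    def __init__(self, n_params=4):
        super().__init__()
        self.fc = nn.Linear(n_params, 64 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # 8x8
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),   # 32x32
        )

    def forward(self, params):
        x = self.fc(params).view(-1, 64, 4, 4)
        return self.decoder(x)

model = DeconvSurrogate()
params = torch.tensor([[1.0, 1.5, 35.0, 0.5]])   # illustrative jet parameters
field = model(params)                            # predicted surface field
print(field.shape)                               # torch.Size([1, 1, 32, 32])
```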

Yaning Wang, Shirui Luo, Wen Wang, Guocheng Tao, Xinshuai Zhang, Jiahuan Cui
Turbomachinery Blade Surrogate Modeling Using Deep Learning

Recent work has shown that deep learning provides an alternative solution as an efficient function approximation technique for airfoil surrogate modeling. In this paper we present the feasibility of convolutional neural network (CNN) techniques for aerodynamic performance evaluation. The CNN approach enables designers to fully utilize the ability of computers and statistics to interrogate and interpolate the nonlinear relationship between shapes and flow quantities, and to rapidly perform a thorough optimization of the wide design space. The principal idea behind the current effort is to uncover the latent constructs and underlying cross-sectional relationships among the shape parameters, categories of flow field features, and quantities of interest in turbomachinery blade design. The proposed CNN method is shown to automatically detect essential features and to estimate the pressure loss and deviation effectively, much faster than a CFD solver.

Shirui Luo, Jiahuan Cui, Vignesh Sella, Jian Liu, Seid Koric, Volodymyr Kindratenko
A Data-Driven Wall-Shear Stress Model for LES Using Gradient Boosted Decision Trees

With the recent advances in machine learning, data-driven strategies could augment wall modeling in large eddy simulation (LES). In this work, a wall model based on gradient boosted decision trees is presented. The model is trained to learn the boundary layer of a turbulent channel flow so that it can be used to make predictions for significantly different flows where the equilibrium assumptions are valid. The methodology of building the model is presented in detail, and the experiment conducted to choose the training data is described. The trained model is tested a posteriori on a turbulent channel flow and the flow over a wall-mounted hump. The results from the tests are compared with those of an algebraic equilibrium wall model, and the performance is evaluated. The results show that the model has succeeded in learning the boundary layer, proving the effectiveness of our methodology of data-driven model development, which is extendable to complex flows.
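
As a generic sketch of the idea (not the authors' features, data, or model), the snippet below fits a gradient boosted regressor that maps off-wall flow quantities to a wall-shear-stress target. The synthetic samples and the crude laminar-like surrogate target are purely illustrative stand-ins for channel-flow training data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 5000
u_offwall = rng.uniform(0.1, 1.0, n)      # sampled velocity at the matching height
y_offwall = rng.uniform(0.01, 0.1, n)     # wall distance of the sampling point
nu = 1e-4                                 # kinematic viscosity (toy value)
tau_wall = nu * u_offwall / y_offwall     # illustrative surrogate target

X = np.column_stack([u_offwall, y_offwall])
model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
model.fit(X, tau_wall)

# Query the trained wall model for one off-wall sample.
print(model.predict([[0.5, 0.05]]))
```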

Sarath Radhakrishnan, Lawrence Adu Gyamfi, Arnau Miró, Bernat Font, Joan Calafell, Oriol Lehmkuhl
Nonlinear Mode Decomposition and Reduced-Order Modeling for Three-Dimensional Cylinder Flow by Distributed Learning on Fugaku

Nonlinear modes of the three-dimensional flow field around a cylinder were extracted by distributed learning on Fugaku. Mode decomposition is an approach used to decompose flow fields into physically important flow structures known as modes. In this study, convolutional neural network-based mode decomposition was applied to the three-dimensional flow field. However, because this process is costly in terms of calculation and memory usage even for a small flow field problem, the enormous computational and memory resources of the supercomputer Fugaku were employed. A hybrid parallelism method combining the distribution of the network structure (model parallelism) and of the input data (data parallelism) using up to 10,500 nodes on Fugaku was employed for learning. Further, we constructed a reduced-order model to predict the time evolution of the latent vector, using long short-term memory networks. Finally, we compared the reproduced flow field of the model with that of the original full-order model. In addition, we evaluated the execution performance of the learning process. Using a single core memory group, the whole learning process achieves 129.50 GFLOPS, 7.57% of the single-precision floating-point arithmetic peak performance; notably, the convolution calculation for backward-propagation achieves 1103.09 GFLOPS, which is 65.39% of the peak. Furthermore, in a weak scaling test, the whole learning process attains 72.9% efficiency on 25,250 nodes (1,212,000 cores) relative to 750 nodes, with a sustained performance of 7.8 PFLOPS; in particular, the convolution calculation for backward-propagation reaches 113 PFLOPS (66.2% of the peak performance).

Kazuto Ando, Keiji Onishi, Rahul Bale, Makoto Tsubokura, Akiyoshi Kuroda, Kazuo Minami
Using Physics-Informed Enhanced Super-Resolution Generative Adversarial Networks to Reconstruct Mixture Fraction Statistics of Turbulent Jet Flows

This work presents the full reconstruction of coarse-grained turbulence fields in a planar turbulent jet flow by a deep learning framework for large-eddy simulations (LES). Turbulent jet flows are characterized by complex phenomena such as intermittency and external interfaces. These phenomena are strictly non-universal, and conventional LES models have shown only limited success in modeling turbulent mixing in such configurations. Therefore, a deep learning approach based on physics-informed enhanced super-resolution generative adversarial networks (Bode et al., Proceedings of the Combustion Institute, 2021) is utilized to reconstruct turbulence and mixture fraction fields from coarse-grained data. The usability of the deep learning model is validated by applying it to data obtained from direct numerical simulations (DNS) with more than 78 billion degrees of freedom. It is shown that statistics of the mixture fraction field can be recovered from coarse-grained data with good accuracy.

Michael Gauding, Mathis Bode

HPC I/O in the Data Center

Frontmatter
Toward a Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis

One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs. While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns among 100,000 jobs, e.g., a class of jobs that is not performing well. Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify jobs related to such a blueprint. This allows staff to understand the usage of the exhibited behavior better and to assess the optimization potential. In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we sketch a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. Practically, we apply several previously developed time series algorithms. A study is conducted to explore the effectiveness of the approach by investigating related jobs for a reference job. The data stem from DKRZ’s supercomputer Mistral and include more than 500,000 jobs executed over more than six months of operation. Our analysis shows that the strategy and algorithms bear the potential to identify similar jobs, but more testing is necessary.
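
The snippet below is a deliberately simplified sketch of temporal similarity ranking, not the paper's algorithms: each job is represented by a fixed-length time series of one I/O metric, series are z-normalized, and jobs are ranked by Euclidean distance to the reference job. Equal-length segments and the toy job data are assumptions.

```python
import numpy as np

def znorm(ts):
    """Z-normalize a time series; the epsilon guards constant series."""
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean()) / (ts.std() + 1e-12)

def similarity_ranking(reference, jobs):
    """Rank jobs by Euclidean distance of their normalized I/O time series."""
    ref = znorm(reference)
    dists = {job_id: np.linalg.norm(znorm(ts) - ref) for job_id, ts in jobs.items()}
    return sorted(dists.items(), key=lambda kv: kv[1])

# Toy example: three jobs with hypothetical I/O-throughput segments.
jobs = {
    "job_a": [1, 2, 8, 9, 2, 1],
    "job_b": [1, 2, 7, 9, 3, 1],
    "job_c": [5, 5, 5, 5, 5, 5],
}
print(similarity_ranking([1, 2, 8, 8, 2, 1], jobs))  # job_a and job_b rank first
```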

Julian Kunkel, Eugen Betke
H3: An Application-Level, Low-Overhead Object Store

H3 is an embedded object store, backed by a high-performance key-value store. H3 provides a user-friendly object API, similar to Amazon’s S3, but is especially tailored for use in “converged” Cloud-HPC environments, where HPC applications expect the underlying storage services to meet strict latency requirements, even for high-level object operations. By embedding the object store in the application, thus avoiding the REST layer, we show that data operations gain significant performance benefits, especially for smaller objects. Additionally, H3’s pluggable back-end architecture allows adapting the object store’s scale and performance to a variety of deployment requirements. H3 supports several key-value stores, ranging from in-memory services to distributed, RDMA-based implementations. The core of H3 is H3lib, a C library with Python and Java bindings. The H3 ecosystem also includes numerous utilities and compatibility layers: the H3 FUSE filesystem allows object access using file semantics, the CSI H3 implementation uses H3 FUSE for attaching H3-backed persistent volumes in Docker and Kubernetes, while an S3proxy plug-in offers an S3 protocol-compatible endpoint for legacy applications.

Antony Chazapis, Efstratios Politis, Giorgos Kalaentzis, Christos Kozanitis, Angelos Bilas

Compiler-assisted Correctness Checking and Performance Optimization for HPC

Frontmatter
Automatic Partitioning of MPI Operations in MPI+OpenMP Applications

The new MPI 4.0 standard includes a new chapter about partitioned point-to-point communication operations. These partitioned operations allow multiple actors of one MPI process (e.g. multiple threads) to contribute data to one communication operation. These operations are designed to mitigate current problems in multithreaded MPI programs, with some work suggesting a substantial performance benefit (up to 26%) when using these operations compared to their existing non-blocking counterparts. In this work, we explore the possibility for the compiler to automatically partition sending operations across multiple OpenMP threads. For this purpose, we developed an LLVM compiler pass that partitions MPI sending operations across the different iterations of OpenMP for loops. We demonstrate the feasibility of this approach by applying it to 2D stencil codes, observing very little overhead while the correctness of the codes is sustained. Therefore, this approach facilitates the usage of these new additions to the MPI standard for existing codes. Our code is available on GitHub: https://github.com/tudasc/CommPart

Tim Jammer, Christian Bischof
heimdallr: Improving Compile Time Correctness Checking for Message Passing with Rust

Message passing is the foremost parallelization method used in high-performance computing (HPC). Parallel programming in general, and message passing especially, strongly increases the complexity and susceptibility to errors of programs. The de-facto standard technologies used to realize message passing applications in HPC are MPI with C/C++ or Fortran code. These technologies offer high performance but do not come with many compile-time correctness guarantees and are quite error-prone. This paper presents our work on a message passing library implemented in Rust that focuses on compile-time correctness checks. In our design, we apply Rust’s memory and concurrency safety features to a message passing context and show how common error classes from MPI applications can be avoided with this approach. Problems with the type safety of transmitted messages can be mitigated through the use of generic programming concepts at compile time and completely detected at runtime using data serialization methods. Our library is able to use Rust’s memory safety features to achieve data buffer safety for non-blocking message passing operations at compile time. A performance comparison between our proof-of-concept implementation and MPI is included to evaluate the practicality of our approach. While the performance of MPI could not be beaten, the results are still promising. Moreover, we are able to achieve clear improvements in the aspects of correctness and usability.

Michael Blesel, Michael Kuhn, Jannek Squar
Potential of Interpreter Specialization for Data Analysis

Scientists frequently implement data analyses in high-level programming languages such as Python, Perl, Lua, and R. Many of these languages are inefficient due to the overhead of being dynamically typed and interpreted. In this paper, we report the potential performance improvement of domain-specific interpreter specialization for data analysis workloads and evaluate how the characteristics of data analysis workloads affect the specialization, both positively and negatively. Assisted by compilers, we specialize the Lua and CPython interpreters at source level, using the script being interpreted and the data types observed during interpretation as invariants, for five common tasks from real data analysis workloads. Through experiments, we measure 9.0–39.6% performance improvement for Lua and 11.0–17.2% performance improvement for CPython for benchmarks that perform data loading, histogram computation, data filtering, data transformation, and dataset shuffling. This specialization does not include misspeculation checks of data types at possible type conversion code that may be necessary for other workloads. We report the details of our evaluation and present a semi-automatic method for specializing the interpreters.

Wei He, Michelle Mills Strout
Refactoring for Performance with Semantic Patching: Case Study with Recipes

Development of an HPC simulation code may take years of a domain scientist’s work. Over that timespan, the computing landscape evolves: efficient programming best practices change, APIs of performance libraries change, etc. A moment then comes when the entire codebase requires a thorough performance lift. In the luckiest case, the required intervention is limited to a few hot loops. In practice, much more is needed. This paper describes an activity of programmatic refactoring of ≈200k lines of C code by means of source-to-source translation. The context is that of so-called high-level support provided to the domain scientist community by an HPC service center. The motivation of this short paper is the immediate reuse potential of these techniques.

Michele Martone, Julia Lawall
Negative Perceptions About the Applicability of Source-to-Source Compilers in HPC: A Literature Review

A source-to-source (S2S) compiler is a type of translator that accepts the source code of a program written in a programming language as its input and produces an equivalent source code in the same or a different programming language. S2S techniques are commonly used to enable fluent translation between high-level programming languages, to perform large-scale refactoring operations, and to facilitate instrumentation for dynamic analysis. Negative perceptions about S2S’s applicability in High Performance Computing (HPC) are studied and evaluated here. This is a first study that brings to light reasons why scientists do not use source-to-source techniques for HPC. The primary audience for this paper is those considering S2S technology in their HPC application work.

Reed Milewicz, Peter Pirkelbauer, Prema Soundararajan, Hadia Ahmed, Tony Skjellum

Machine Learning on HPC Systems

Frontmatter
Automatic Tuning of Tensorflow’s CPU Backend Using Gradient-Free Optimization Algorithms

Modern deep learning (DL) applications are built using DL libraries and frameworks such as TensorFlow and PyTorch. These frameworks have complex parameters, and tuning them to obtain good training and inference performance is challenging for typical users, such as DL developers and data scientists. Manual tuning requires deep knowledge of the user-controllable parameters of DL frameworks as well as the underlying hardware. It is a slow and tedious process, and it typically delivers sub-optimal solutions. In this paper, we treat the problem of tuning parameters of DL frameworks to improve training and inference performance as a black-box optimization problem. We then investigate the applicability and effectiveness of Bayesian optimization, a genetic algorithm, and the Nelder-Mead simplex for tuning the parameters of TensorFlow’s CPU backend. While prior work has already investigated the use of the Nelder-Mead simplex for a similar problem, it does not provide insights into the applicability of other, more popular algorithms. Towards that end, we provide a systematic comparative analysis of all three algorithms in tuning TensorFlow’s CPU backend on a variety of DL models. Our findings reveal that Bayesian optimization performs best on the majority of models. There are, however, cases where it does not deliver the best results.
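
The sketch below illustrates the black-box framing in a hedged way; it is not the paper's tooling. Each candidate setting of TensorFlow's CPU thread-pool knobs is applied through environment variables in a fresh process and scored by the wall time of a short training run. The environment variable names, the tiny grid, and the `train_small_model.py` script are assumptions for illustration; the paper searches this kind of objective with Bayesian optimization, a genetic algorithm, and Nelder-Mead rather than a grid.

```python
import itertools
import os
import subprocess
import sys
import time

def objective(intra, inter):
    """Score one configuration by timing a short training script in a fresh
    process, so the thread settings take effect before TensorFlow starts.
    The variable names below are assumed knobs, not an exhaustive list."""
    env = dict(os.environ,
               TF_NUM_INTRAOP_THREADS=str(intra),
               TF_NUM_INTEROP_THREADS=str(inter),
               OMP_NUM_THREADS=str(intra))
    start = time.perf_counter()
    subprocess.run([sys.executable, "train_small_model.py"], env=env, check=True)
    return time.perf_counter() - start          # lower wall time is better

# Exhaustive search over a tiny grid, purely to show the objective interface.
candidates = list(itertools.product([1, 2, 4, 8], [1, 2]))
best = min(candidates, key=lambda c: objective(*c))
print("best (intra, inter) thread counts:", best)
```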

Derssie Mebratu, Niranjan Hasabnis, Pietro Mercati, Gaurit Sharma, Shamima Najnin
MSM: Multi-stage Multicuts for Scalable Image Clustering

Correlation Clustering, also called the minimum cost Multicut problem, is the process of grouping data by pairwise similarities. It has proven to be effective on clustering problems where the number of classes is unknown. However, not only is the Multicut problem NP-hard, but an undirected graph G with n vertices representing single images also has at most $$\frac{n(n-1)}{2}$$ edges, thus making it challenging to implement correlation clustering for large datasets. In this work, we propose Multi-Stage Multicuts (MSM) as a scalable approach for image clustering. Specifically, we solve minimum cost Multicut problems across multiple distributed compute units. Our approach not only allows solving problem instances that are too large to fit into the shared memory of a single compute node, but also achieves significant speedups while preserving the clustering accuracy. We evaluate our proposed method on the CIFAR10 and CelebA image datasets. Furthermore, we also provide a proof for the theoretical speedup.

Kalun Ho, Avraam Chatzimichailidis, Margret Keuper, Janis Keuper
OmniOpt – A Tool for Hyperparameter Optimization on HPC

Hyperparameter optimization is a crucial task in numerous applications of numerical modelling techniques. Methods as diverse as classical simulations and the great variety of machine learning techniques used nowadays require an appropriate choice of their hyperparameters (HPs). While for classical simulations, calibration to measured data by numerical optimization techniques has a long tradition, the HPs of neural networks are often chosen by a mixture of grid search, random search, and manual tuning. In the present study the expert tool “OmniOpt” is introduced, which allows optimizing the HPs of a wide range of problems, ranging from classical simulations to different kinds of neural networks. The emphasis is on versatility and flexibility for the user in terms of the applications and the choice of the HPs to be optimized. Moreover, the optimization procedure, usually a very time-consuming task, is performed in a highly parallel way on the HPC system Taurus at TU Dresden. To this end, a Bayesian stochastic optimization algorithm (TPE) has been implemented on the Taurus system and connected to a user-friendly graphical user interface (GUI). In addition to the automatic optimization service, there is a variety of tools for analyzing and graphically displaying the results of the optimization. The application of OmniOpt to a practical problem from materials science is presented as an example.
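
OmniOpt itself is driven through a service and GUI on Taurus; the snippet below is only a stand-alone sketch of the underlying idea, Tree-structured Parzen Estimator (TPE) optimization, using the hyperopt library directly. The search space, the budget of 50 evaluations, and the `train_and_score` toy objective (standing in for a user's simulation or network training) are assumptions.

```python
import math
from hyperopt import fmin, hp, tpe

def train_and_score(params):
    """Hypothetical objective: returns a 'validation loss' for a configuration."""
    lr, units = params["lr"], int(params["units"])
    # Toy analytic loss with a minimum near lr ~ 3e-3 and units ~ 96.
    return (math.log10(lr) + 2.5) ** 2 + ((units - 96) / 64) ** 2

space = {
    "lr": hp.loguniform("lr", math.log(1e-5), math.log(1e-1)),
    "units": hp.quniform("units", 16, 256, 16),
}

# TPE sequentially proposes configurations based on past evaluations.
best = fmin(fn=train_and_score, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```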

Peter Winkler, Norman Koch, Andreas Hornig, Johannes Gerritzen
Parallel/Distributed Intelligent Hyperparameters Search for Generative Artificial Neural Networks

This article presents a parallel/distributed methodology for the intelligent search of the hyperparameter configuration for generative artificial neural networks (GANs). Finding the configuration that best fits a GAN for a specific problem is challenging because GANs simultaneously train two deep neural networks; thus, in general, GANs have more configuration parameters than other deep learning methods. The proposed system applies the iterated racing approach, taking advantage of parallel/distributed computing for the efficient use of resources for configuration. The main results of the experimental evaluation performed on the MNIST dataset show that the parallel system is able to use the GPU efficiently, achieving a high level of parallelism and reducing the computational wall clock time by 78%, while providing results comparable to the sequential hyperparameter search.

Mathias Esteban, Jamal Toutouh, Sergio Nesmachnow
Machine Learning for Generic Energy Models of High Performance Computing Resources

This article presents a study of the generalization capabilities of forecasting techniques for empirical energy consumption models of high performance computing resources. This is a relevant subject, considering the large energy utilization of modern supercomputing facilities. Different energy models are built, considering several forecasting techniques and using information from the execution of a benchmark over different hardware. A cross-evaluation is performed, and the training information of each model is gradually extended with information about other hardware. Each model is analyzed to evaluate how new information impacts the prediction capabilities. The main results indicate that neural network approaches achieve the highest-quality results when the training data of the models is expanded with minimal information from new scenarios.

Jonathan Muraña, Carmen Navarrete, Sergio Nesmachnow

Fourth International Workshop on Interoperability of Supercomputing and Cloud Technologies

Frontmatter
Automation for Data-Driven Research with the NERSC Superfacility API

The Superfacility API brings automation to the use of High Performance Computing (HPC) systems. Our aim is to enable scientists to reliably automate their interactions with computational resources at the National Energy Research Scientific Computing Center (NERSC), removing human intervention from the process of transferring, analyzing, and managing data. In this paper, we describe the science use cases that drive the API design, our schema of API endpoints, and implementation details and considerations, including authentication and authorization. We also discuss future plans, working toward our vision of supporting entirely automated experiment-network-HPC workflows.
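
The snippet below is a hedged sketch of the kind of programmatic access the abstract describes: authenticated REST calls that check system status before moving data or submitting work, so workflows need no human intervention. The base URL, endpoint paths, token handling, and payload format shown here are illustrative assumptions, not the documented Superfacility API.

```python
import requests

# Assumed base URL and bearer token (obtained elsewhere, e.g. via OAuth
# client credentials); both are placeholders, not the official specification.
BASE = "https://api.nersc.gov/api/v1.2"
TOKEN = "..."
headers = {"Authorization": f"Bearer {TOKEN}"}

# Poll system availability before transferring data or submitting jobs.
status = requests.get(f"{BASE}/status", headers=headers, timeout=30)
print(status.json())

# Hypothetical job-submission call, guarded by the status check.
if status.ok:
    job = requests.post(f"{BASE}/compute/jobs/perlmutter",
                        headers=headers,
                        data={"job": "#SBATCH ... (batch script contents)"},
                        timeout=30)
    print(job.json())
```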

Deborah J. Bard, Mark R. Day, Bjoern Enders, Rebecca J. Hartman–Baker, John Riney III, Cory Snavely, Gabor Torok
A Middleware Supporting Data Movement in Complex and Software-Defined Storage and Memory Architectures

Among the broad variety of challenges that arise from workloads in a converged HPC and Cloud infrastructure, data movement is of paramount importance, especially on oncoming exascale systems featuring multiple tiers of memory and storage. While the focus has, for years, been primarily on optimizing computations, the importance of improving data handling on such architectures is now well understood. As optimization techniques can be applied at different stages (operating system, run-time system, programming environment, and so on), a middleware providing a uniform and consistent data awareness becomes necessary. In this paper, we introduce a novel memory- and data-aware middleware called Maestro, designed for data orchestration.

Christopher Haine, Utz-Uwe Haus, Maxime Martinasso, Dirk Pleiter, François Tessier, Domokos Sarmany, Simon Smart, Tiago Quintino, Adrian Tate

Second International Workshop on Monitoring and Operational Data Analytics

Frontmatter
An Operational Data Collecting and Monitoring Platform for Fugaku: System Overviews and Case Studies in the Prelaunch Service Period

After a seven-year development process, the supercomputer Fugaku was officially launched as the successor to the K computer in March 2021. During this development process, we upgraded various system components and the data center infrastructure for official service in Fugaku. It was also necessary to upgrade the K computer’s operational data collection/monitoring platform for use in Fugaku. As a result, we are now in the process of developing and deploying an operational data collection/monitoring platform based on a three-tier pipeline architecture. In the first stage, the HPC system produces various types of log/metric data that are used to identify and monitor troubleshooting issues. Additionally, several thousand sensors operated by the building management system (BMS) generate metrics for power supply and cooling equipment. In the second stage, we aggregate the data into time-series databases, and then visualize the results via a dashboard in the third stage. The dashboard provides an interactive interface to the various data of the HPC system and the data center infrastructure. During the course of this project, we resolved some issues found in the previous K computer platform. By using the redundant cores of the A64FX to allocate agents, we determined that the new platform takes less than 20 s to collect metrics from over 150k compute nodes and write them to persistent storage. This paper introduces the design of the system architecture, reports on the current state of the platform renewal project, and provides overviews of two use cases encountered during the prelaunch service period.

Masaaki Terai, Keiji Yamamoto, Shin’ichi Miura, Fumiyoshi Shoji
An Explainable Model for Fault Detection in HPC Systems

Large supercomputers are composed of numerous components that risk breaking down or behaving in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for the system’s resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that takes advantage of holistic data centre monitoring, system administrator node-status labeling, and an explainable model for fault detection in supercomputing nodes. The proposed model aims at classifying the different states of the computing nodes thanks to the labeled data describing the supercomputer’s behaviour, data which is typically collected by system administrators but not integrated into a holistic monitoring infrastructure for data center automation. In comparison to other methods, the one proposed here is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.
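
The snippet below is a generic sketch of this kind of workflow, not the authors' model: node states labeled by system administrators are classified from monitoring features, and permutation importances provide a simple form of explanation. The feature names and the synthetic data are purely illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
features = ["cpu_temp", "fan_speed", "mem_errors", "power_draw"]  # assumed signals
X = rng.normal(size=(2000, len(features)))
# Synthetic "faulty" label driven mostly by memory errors and CPU temperature.
y = (X[:, 2] + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=2000) > 1).astype(int)

clf = RandomForestClassifier(n_estimators=200).fit(X, y)
imp = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean),
                          key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")   # which signals drive the fault predictions
```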

Martin Molan, Andrea Borghesi, Francesco Beneventi, Massimiliano Guarrasi, Andrea Bartolini

Sixteenth Workshop on Virtualization in High-Performance Cloud Computing

Frontmatter
A Scalable Cloud Deployment Architecture for High-Performance Real-Time Online Applications

We study high-performance Real-Time Online Interactive Applications (ROIAs), with use cases like product configurators in the Configure-Price-Quote market, e-learning, multiplayer online gaming, and digital twins of production facilities for the Industry 4.0 market. While core components of ROIAs, e.g., interactive real-time 3D rendering, still widely run on local devices, it is very desirable to run them on cloud resources to benefit from the advantages of cloud computing, e.g., the accessibility and higher quality provided by high-performance compute resources. In this paper, we design and implement a novel cloud service deployment architecture for ROIAs, which addresses three major challenges: meeting high Quality of Service (QoS) requirements, auto-scalability, and resource usage optimization. Compared to previous work, our deployment approach is based on the concept of session slots, which combines a high level of QoS with the economic use of resources like CPU, GPU, and memory. We describe a prototype implementation of a ROIA use case, a car configurator running on a Kubernetes cluster. Experimental evaluation demonstrates that our architecture avoids the traffic and latency bottleneck of a classical cloud load balancer, provides significantly more efficient resource usage, and autoscales well.

Sezar Jarrous-Holtrup, Folker Schamel, Kerstin Hofer, Sergei Gorlatch
Leveraging HW Approximation for Exploiting Performance-Energy Trade-offs Within the Edge-Cloud Computing Continuum

Today, the need for real-time analytics and faster decision-making mechanisms has led to the adoption of hardware accelerators, such as GPUs and FPGAs, within the edge-cloud computing continuum. Moreover, the need for energy-efficient yet performant solutions both in the edge and the cloud has led to the rise of approximate computing as a promising paradigm, where “acceptable errors” are introduced in error-tolerant applications, thus providing significant power-saving gains. In this work, we leverage approximate computing to exploit performance-energy trade-offs of FPGA-accelerated kernels with faster design time through an extended source-to-source HLS compiler based on the Xilinx Vitis framework. We introduce a novel programming interface that operates at a high level of abstraction, thus enabling automatic optimizations to the existing HLS design flow and supporting both embedded and cloud devices through a common API. We evaluate our approach over three different applications from the DSP and machine learning domains and show that, on average, decreases of 27% and 28% in power consumption, 61% and 69% in DSP utilization, and 7% in clock period are achieved for the Alveo U200 and ZCU104 FPGA platforms.

Argyris Kokkinis, Aggelos Ferikoglou, Dimitrios Danopoulos, Dimosthenis Masouros, Kostas Siozios
Datashim and Its Applications in Bioinformatics

Bioinformatics pipelines depend on shared POSIX filesystems for their input, output, and intermediate data storage. Containerization makes it more difficult for the workloads to access these shared file systems. In our previous study, we were able to run both ML and non-ML pipelines on Kubeflow successfully; however, the storage solutions were complex and less than optimal. In this article, we introduce a new concept of a Dataset and its corresponding resource as a native Kubernetes object. We have implemented the concept with a new framework, Datashim, which takes care of all the low-level details about data access in Kubernetes pods. Its pluggable architecture is designed for the development of caching, scheduling, and governance plugins. Together, they manage the entire lifecycle of the custom resource Dataset. We use Datashim to serve data from object stores to both ML and non-ML pipelines on Kubeflow. We feed training data into ML models directly with Datashim instead of downloading it to local disks, which makes the input scalable. We have enhanced the durability of training metadata by storing it in a dataset, which also simplifies the setup of TensorBoard, independent of the notebook server. For the non-ML pipeline, we have simplified the 1000 Genomes Project pipeline with datasets injected into the pipeline dynamically. We have now established a new resource type, Dataset, to represent the concept of a data source on Kubernetes, with our novel framework Datashim managing its lifecycle.

Yiannis Gkoufas, David Yu Yuan, Christian Pinto, Panagiotis Koutsovasilis, Srikumar Venugopal
FaaS and Curious: Performance Implications of Serverless Functions on Edge Computing Platforms

Serverless is an emerging paradigm that greatly simplifies the usage of cloud resources, providing unprecedented auto-scaling, simplicity, and cost-efficiency features. Thus, more and more individuals and organizations adopt it to increase their productivity and focus exclusively on the functionality of their application. Additionally, the cloud is expanding towards the deep edge, forming a continuum with which the event-driven nature of the serverless paradigm seems to make a perfect match. The extreme heterogeneity introduced, in terms of the diverse hardware resources and frameworks available, requires systematic approaches for evaluating serverless deployments. In this paper, we propose a methodology for evaluating serverless frameworks deployed on hybrid edge-cloud clusters. Our methodology focuses on key performance knobs of the serverless paradigm and applies a systematic way of evaluating these aspects in hybrid edge-cloud environments. We apply our methodology to three open-source serverless frameworks, namely OpenFaaS, OpenWhisk, and Lean OpenWhisk, and provide key insights regarding their performance implications on resource-constrained edge devices.

Achilleas Tzenetopoulos, Evangelos Apostolakis, Aphrodite Tzomaka, Christos Papakostopoulos, Konstantinos Stavrakakis, Manolis Katsaragakis, Ioannis Oroutzoglou, Dimosthenis Masouros, Sotirios Xydis, Dimitrios Soudris
Differentiated Performance in NoSQL Database Access for Hybrid Cloud-HPC Workloads

In recent years, the demand for cloud-based high-performance computing applications and services has grown in order to sustain the computational and statistical challenges of big-data analytics scenarios. In this context, there is a growing need for reliable large-scale NoSQL data stores capable of efficiently serving mixed high-performance and interactive cloud workloads. This paper deals with the problem of designing such a NoSQL database service: to this purpose, a set of modifications to the popular MongoDB software are presented. The modified MongoDB lets clients submit individual requests or even carry out whole sessions at different priority levels, so that higher-priority requests are served with shorter and less variable response times than lower-priority requests. Experimental results carried out on two big multi-core servers using synthetic workload scenarios demonstrate the effectiveness of the proposed approach in providing differentiated performance levels, highlighting the trade-offs available between the maximum achievable throughput for the platform and the response-time reduction for higher-priority requests.

Remo Andreoli, Tommaso Cucinotta

Deep Learning on Supercomputers

Frontmatter
JUWELS Booster – A Supercomputer for Large-Scale AI Research

In this article, we present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center. With its system architecture, most importantly its large number of powerful Graphics Processing Units (GPUs) and its fast interconnect via InfiniBand, it is an ideal machine for large-scale Artificial Intelligence (AI) research and applications. We detail its system architecture, parallel, distributed model training, and benchmarks indicating its outstanding performance. We exemplify its potential for research application by presenting large-scale AI research highlights from various scientific fields that require such a facility.

Stefan Kesselheim, Andreas Herten, Kai Krajsek, Jan Ebert, Jenia Jitsev, Mehdi Cherti, Michael Langguth, Bing Gong, Scarlet Stadtler, Amirpasha Mozaffari, Gabriele Cavallaro, Rocco Sedona, Alexander Schug, Alexandre Strube, Roshni Kamath, Martin G. Schultz, Morris Riedel, Thomas Lippert

Fifth International Workshop on In Situ Visualization

Frontmatter
In Situ Visualization of WRF Data Using Universal Data Junction

An in situ co-processing visualization pipeline based on the Universal Data Junction (UDJ) library and Inshimtu is presented and used for processing data from Weather Research and Forecasting (WRF) simulations. For the common case of analyzing just a number of fields during the simulation, UDJ transfers and redistributes the data in approximately 6% of the time needed by WRF for an MPI-IO output of all variables, upon which a previous method with Inshimtu is based. The relative cost of transport and redistribution compared to I/O remains approximately constant up to the highest considered node count, without obvious impediments to scaling further.

Aniello Esposito, Glendon Holst
Catalyst Revised: Rethinking the ParaView in Situ Analysis and Visualization API

As in situ analysis goes mainstream, ease of development, deployment, and maintenance becomes essential, perhaps more so than raw capabilities. In this paper, we present the design and implementation of Catalyst, an API for in situ analysis using ParaView, which we refactored with these objectives in mind. Our implementation combines design ideas from in situ frameworks and HPC tools, like Ascent and MPICH.

Utkarsh Ayachit, Andrew C. Bauer, Ben Boeckel, Berk Geveci, Kenneth Moreland, Patrick O’Leary, Tom Osika
Fides: A General Purpose Data Model Library for Streaming Data

Data models are required to provide the semantics of the underlying data stream for in situ visualization. In this paper we describe a set of metrics for such a data model that are useful in meeting the needs of the scientific community for visualization. We then present Fides, a library that provides a schema for the VTK-m data model, and uses the ADIOS middleware library for access to streaming data. We present four use cases of Fides in different scientific workflows, and provide an evaluation of each use case against our metrics.

David Pugmire, Caitlin Ross, Nicholas Thompson, James Kress, Chuck Atkins, Scott Klasky, Berk Geveci
Including in Situ Visualization and Analysis in PDI

The goal of this work was to integrate in situ capabilities into the general-purpose code-coupling library PDI [1], using the simulation code Alya as an example. An open design is adopted so that these capabilities can later be extended to other simulation codes that use PDI. An in transit solution was chosen to separate the simulation as much as possible from the analysis and visualization. To implement this, ADIOS2 is used for data transport. However, to prevent too strong a commitment to one tool, SENSEI is interposed between the simulation and ADIOS2, as well as in the in-transit endpoint between ADIOS2 and the visualization software. This allows a user who wants a different solution to implement it easily. Visualization with ParaView Catalyst was chosen as the default for the time being.

Christian Witzler, J. Miguel Zavala-Aké, Karol Sierociński, Herbert Owen
Backmatter
Metadata
Title
High Performance Computing
Edited by
Heike Jagode
Hartwig Anzt
Hatem Ltaief
Piotr Luszczek
Copyright Year
2021
Electronic ISBN
978-3-030-90539-2
Print ISBN
978-3-030-90538-5
DOI
https://doi.org/10.1007/978-3-030-90539-2
