Keywords:
Computer Science
Communication Middlewares
Optimized Design
Process Mapping
HPC
Multi-core
Many-core
Parallel Computing
High-End Systems
Cloud Computing
Zero-copy Communication
Shared Address Space Communication
Emerging Architectures
MPI
Abstract:
Modern High-Performance Computing (HPC) systems are enabling scientists from different research domains, such as astrophysics, climate simulation, computational fluid dynamics, and drug discovery, to model and simulate computation-heavy problems at different scales. In recent years, the resurgence of Artificial Intelligence (AI), particularly Deep Learning (DL) algorithms, has been made possible by the evolution of these HPC systems. The diversity of applications, ranging from traditional scientific computing to the training and inference of neural networks, is driving the evolution of processor and interconnect technologies as well as communication middlewares. Today's multi-petaflop HPC systems are powered by dense multi-/many-core architectures, and this trend is expected to continue for next-generation systems. The rapid adoption of these high core-density architectures in current- and next-generation HPC systems, driven by emerging application trends, is putting more emphasis on middleware designers to optimize various communication primitives to meet the diverse needs of applications. While these novelties in processor architectures have led to increased on-chip parallelism, they come at the cost of causing traditional designs employed by communication middlewares to suffer from higher intra-node communication costs. Tackling the computation and communication challenges that accompany these dense multi-/many-cores requires special design considerations. Scientific and AI applications that rely on such large-scale HPC systems to achieve higher performance and scalability often use the Message Passing Interface (MPI), Partitioned Global Address Space (PGAS) models, or a hybrid of both as the underlying communication substrate. These applications use various communication primitives (e.g., point-to-point, collectives, RMA) and often employ custom data layouts (e.g., derived datatypes), spending a fair amount of time in communication and synchronization. The pe