TY - GEN
T1 - Optimizing the Bruck Algorithm for Non-uniform All-to-all Communication
AU - Fan, Ke
AU - Gilray, Thomas
AU - Pascucci, Valerio
AU - Huang, Xuan
AU - Micinski, Kristopher
AU - Kumar, Sidharth
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/6/27
Y1 - 2022/6/27
AB - In MPI, the collective routines MPI_Alltoall and MPI_Alltoallv play an important role in facilitating all-to-all inter-process data exchange. MPI_Alltoallv is a generalization of MPI_Alltoall that supports the exchange of non-uniform distributions of data. Popular implementations of MPI, such as MPICH and OpenMPI, implement MPI_Alltoall using a combination of techniques such as the spread-out algorithm and the Bruck algorithm. Spread-out has linear complexity in the process count P, whereas Bruck has logarithmic complexity; a selection between the two is made at runtime based on the data block size. However, MPI_Alltoallv is typically implemented using only variants of the spread-out algorithm, and therefore misses out on the performance benefits that the log-time Bruck algorithm offers, especially for smaller data loads. In this paper, we first implement and empirically evaluate all existing variants of the Bruck algorithm for uniform and non-uniform data loads; this forms the basis for our own Bruck-based non-uniform all-to-all algorithms. In particular, we develop two open-source implementations, padded Bruck and two-phase Bruck, that efficiently generalize the Bruck algorithm to non-uniform all-to-all data exchange. We empirically validate the techniques on three supercomputers, Theta, Cori, and Stampede, using both microbenchmarks and two real-world applications: graph mining and program analysis. We perform weak and strong scaling studies for a range of average message sizes, degrees of imbalance, and distribution schemes, and demonstrate that our techniques outperform Cray's vendor-optimized MPI_Alltoallv by as much as 50% for some workloads and scales.
KW - alltoallv
KW - bruck algorithm
KW - collective communication
KW - mpi
UR - http://www.scopus.com/inward/record.url?scp=85134163367&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134163367&partnerID=8YFLogxK
U2 - 10.1145/3502181.3531468
DO - 10.1145/3502181.3531468
M3 - Conference contribution
AN - SCOPUS:85134163367
T3 - HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
SP - 172
EP - 184
BT - HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 31st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2022
Y2 - 27 June 2022 through 30 June 2022
ER -
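
Note (not part of the cited work): for readers unfamiliar with the routine the abstract centers on, the following is a minimal illustrative C sketch of a non-uniform all-to-all exchange through the standard MPI_Alltoallv interface. The send counts and payload values are made up for illustration; the paper's padded Bruck and two-phase Bruck algorithms target this same non-uniform exchange pattern, not this particular code.

/* Illustrative sketch: non-uniform all-to-all with MPI_Alltoallv.
 * Counts are arbitrary (rank r sends r + dest + 1 ints to each dest). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendcounts = malloc(nprocs * sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *sdispls    = malloc(nprocs * sizeof(int));
    int *rdispls    = malloc(nprocs * sizeof(int));
    for (int dest = 0; dest < nprocs; dest++)
        sendcounts[dest] = rank + dest + 1;  /* non-uniform distribution */

    /* Exchange counts first so every rank knows how much it will receive. */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    /* Build displacement arrays and total buffer sizes. */
    int sendtotal = 0, recvtotal = 0;
    for (int i = 0; i < nprocs; i++) {
        sdispls[i] = sendtotal;  sendtotal += sendcounts[i];
        rdispls[i] = recvtotal;  recvtotal += recvcounts[i];
    }

    int *sendbuf = malloc(sendtotal * sizeof(int));
    int *recvbuf = malloc(recvtotal * sizeof(int));
    for (int i = 0; i < sendtotal; i++)
        sendbuf[i] = rank;  /* payload: sender's rank, easy to verify */

    /* The non-uniform all-to-all exchange itself. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank 0 received %d ints in total\n", recvtotal);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}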