TY - GEN
T1 - Optimizing the Bruck Algorithm for Non-uniform All-to-all Communication
AU - Fan, Ke
AU - Gilray, Thomas
AU - Pascucci, Valerio
AU - Huang, Xuan
AU - Micinski, Kristopher
AU - Kumar, Sidharth
N1 - Funding Information:
This work was funded in part by NSF RII Track-4 award 2132013. We are thankful to the ALCF’s Director’s Discretionary (DD) program for providing us with compute hours to run our experiments on the Theta Supercomputer located at the Argonne National Laboratory.
Publisher Copyright:
© 2022 ACM.
PY - 2022/6/27
Y1 - 2022/6/27
N2 - In MPI, the collective routines MPI_Alltoall and MPI_Alltoallv play an important role in facilitating all-to-all inter-process data exchange. MPI_Alltoallv is a generalization of MPI_Alltoall, supporting the exchange of non-uniform distributions of data. Popular implementations of MPI, such as MPICH and OpenMPI, implement MPI_Alltoall using a combination of techniques such as the spread-out algorithm and the Bruck algorithm. Spread-out has a linear complexity in P, compared to Bruck's logarithmic complexity (P: process count); a selection between these two techniques is made at runtime based on the data block size. However, MPI_Alltoallv is typically implemented using only variants of the spread-out algorithm, and therefore misses out on the performance benefits that the log-time Bruck algorithm offers (especially for smaller data loads). In this paper, we first implement and empirically evaluate all existing variants of the Bruck algorithm for uniform and non-uniform data loads; this forms the basis for our own Bruck-based non-uniform all-to-all algorithms. In particular, we developed two open-source implementations, padded Bruck and two-phase Bruck, that efficiently generalize the Bruck algorithm to non-uniform all-to-all data exchange. We empirically validate the techniques on three supercomputers: Theta, Cori, and Stampede, using both microbenchmarks and two real-world applications: graph mining and program analysis. We perform weak and strong scaling studies for a range of average message sizes, degrees of imbalance, and distribution schemes, and demonstrate that our techniques outperform Cray's vendor-optimized MPI_Alltoallv by as much as 50% for some workloads and scales.
AB - In MPI, the collective routines MPI_Alltoall and MPI_Alltoallv play an important role in facilitating all-to-all inter-process data exchange. MPI_Alltoallv is a generalization of MPI_Alltoall, supporting the exchange of non-uniform distributions of data. Popular implementations of MPI, such as MPICH and OpenMPI, implement MPI_Alltoall using a combination of techniques such as the spread-out algorithm and the Bruck algorithm. Spread-out has a linear complexity in P, compared to Bruck's logarithmic complexity (P: process count); a selection between these two techniques is made at runtime based on the data block size. However, MPI_Alltoallv is typically implemented using only variants of the spread-out algorithm, and therefore misses out on the performance benefits that the log-time Bruck algorithm offers (especially for smaller data loads). In this paper, we first implement and empirically evaluate all existing variants of the Bruck algorithm for uniform and non-uniform data loads; this forms the basis for our own Bruck-based non-uniform all-to-all algorithms. In particular, we developed two open-source implementations, padded Bruck and two-phase Bruck, that efficiently generalize the Bruck algorithm to non-uniform all-to-all data exchange. We empirically validate the techniques on three supercomputers: Theta, Cori, and Stampede, using both microbenchmarks and two real-world applications: graph mining and program analysis. We perform weak and strong scaling studies for a range of average message sizes, degrees of imbalance, and distribution schemes, and demonstrate that our techniques outperform Cray's vendor-optimized MPI_Alltoallv by as much as 50% for some workloads and scales.
KW - alltoallv
KW - bruck algorithm
KW - collective communication
KW - mpi
UR - http://www.scopus.com/inward/record.url?scp=85134163367&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134163367&partnerID=8YFLogxK
U2 - 10.1145/3502181.3531468
DO - 10.1145/3502181.3531468
M3 - Conference contribution
AN - SCOPUS:85134163367
T3 - HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
SP - 172
EP - 184
BT - HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 31st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2022
Y2 - 27 June 2022 through 30 June 2022
ER -