TY - GEN
T1 - A scalable method for predicting network performance in heterogeneous clusters
AU - Katramatos, Dimitrios
AU - Chapin, Steve J.
PY - 2005
Y1 - 2005
N2 - An important requirement for the effective scheduling of parallel applications on large heterogeneous clusters is a current view of system resource availability. Maintaining such a view is a time-consuming problem, potentially O(N^2). Although CPU availability is relatively easy to monitor, interconnecting network bandwidth varies not only with network topology, but also with message size and even with respect to the load of the communicating nodes. This paper describes a method for predicting a cluster's network performance for the purpose of scheduling parallel applications. The method generates a cluster-specific network model which can predict the latency of communications between any pair of nodes in linear time and under any computational and/or communication load conditions. The paper also presents the models generated for the Centurion cluster at the University of Virginia and the Orange Grove cluster at Syracuse University. A study of the prediction accuracy of the method under various load conditions by comparison to experimental measurements indicates an average prediction error of approximately 5% with the maximum encountered prediction error of less than 9%.
AB - An important requirement for the effective scheduling of parallel applications on large heterogeneous clusters is a current view of system resource availability. Maintaining such a view is a time-consuming problem, potentially O(N^2). Although CPU availability is relatively easy to monitor, interconnecting network bandwidth varies not only with network topology, but also with message size and even with respect to the load of the communicating nodes. This paper describes a method for predicting a cluster's network performance for the purpose of scheduling parallel applications. The method generates a cluster-specific network model which can predict the latency of communications between any pair of nodes in linear time and under any computational and/or communication load conditions. The paper also presents the models generated for the Centurion cluster at the University of Virginia and the Orange Grove cluster at Syracuse University. A study of the prediction accuracy of the method under various load conditions by comparison to experimental measurements indicates an average prediction error of approximately 5% with the maximum encountered prediction error of less than 9%.
UR - http://www.scopus.com/inward/record.url?scp=33846981127&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33846981127&partnerID=8YFLogxK
U2 - 10.1109/ISPAN.2005.11
DO - 10.1109/ISPAN.2005.11
M3 - Conference contribution
AN - SCOPUS:33846981127
SN - 0769525091
SN - 9780769525099
T3 - Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN
SP - 8
EP - 15
BT - Proceedings - 8th International Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN 2005
PB - IEEE Computer Society
T2 - 8th International Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN 2005
Y2 - 7 December 2005 through 9 December 2005
ER -