TY - GEN
T1 - Predicting Application Failure in Cloud
T2 - 1st IEEE International Conference on Cognitive Computing, ICCC 2017
AU - Islam, Tariqul
AU - Manivannan, Dakshnamoorthy
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/9/7
Y1 - 2017/9/7
AB - Despite employing architectures designed for high service reliability and availability, cloud computing systems do experience service outages and performance slowdowns. In addition, large-scale cloud systems experience failures in their hardware and software components, which often result in node and application (e.g., job and task) failures. Therefore, to build a reliable cloud system, it is important to understand and characterize the observed failures. The goal of this work is to identify the key features that correlate with application failures in the cloud and to present a failure prediction model that can correctly predict the outcome of a task or job before it actually finishes, fails, or gets killed. To accomplish this, we perform a failure characterization study of the Google cluster workload trace. Our analysis reveals that failed and killed jobs consume a significant amount of resources. We further explore the potential for failure prediction in cloud applications so that we can reduce the waste of resources by better managing the jobs and tasks that ultimately fail or get killed. For this, we propose a prediction method based on a special type of Recurrent Neural Network (RNN) named the Long Short-Term Memory (LSTM) network to identify application failures in the cloud. It takes resource usage measurements or performance data for each job and task as input, and the goal is to predict their termination status (e.g., failed or finished). Our algorithm can predict task failures with 87% accuracy and achieves a true positive rate of 85% and a false positive rate of 11%.
UR - http://www.scopus.com/inward/record.url?scp=85032305496&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032305496&partnerID=8YFLogxK
U2 - 10.1109/IEEE.ICCC.2017.11
DO - 10.1109/IEEE.ICCC.2017.11
M3 - Conference contribution
AN - SCOPUS:85032305496
T3 - Proceedings - 2017 IEEE 1st International Conference on Cognitive Computing, ICCC 2017
SP - 24
EP - 31
BT - Proceedings - 2017 IEEE 1st International Conference on Cognitive Computing, ICCC 2017
A2 - Maglio, Paul P.
A2 - Chou, Wu
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 June 2017 through 30 June 2017
ER -