TY - GEN
T1 - FaCS
T2 - 6th IEEE International Conference on Cyber Security and Cloud Computing and 5th IEEE International Conference on Edge Computing and Scalable Cloud, CSCloud/EdgeCom 2019
AU - Islam, Tariqul
AU - Manivannan, Dakshnamoorthy
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/6
Y1 - 2019/6
AB - Large-scale cloud datacenters often experience reduced performance and service outages. Due to the inherent complexity, heterogeneity, and multitenant architecture of these datacenters, applications (i.e., jobs and tasks) running on them are susceptible to various types of failures. In this paper, we first characterize the application failures in the Google cluster trace and then propose a prediction model that forecasts the termination status of a task. We then introduce a task scheduler that dynamically reschedules tasks based on the predicted results. This proactive fault-tolerant scheduler improves system reliability and ensures timely execution of the applications. Simulation results show that our scheduler substantially reduces the makespan and failure rates of tasks while also balancing load. Moreover, early prediction combined with quick scheduling adjustment improves overall resource utilization and reduces resource wastage.
KW - Failure Prediction
KW - Fault-Tolerance
KW - Job and Task Scheduler
KW - Long Short-Term Memory Network
UR - http://www.scopus.com/inward/record.url?scp=85074145124&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074145124&partnerID=8YFLogxK
U2 - 10.1109/CSCloud/EdgeCom.2019.00010
DO - 10.1109/CSCloud/EdgeCom.2019.00010
M3 - Conference contribution
AN - SCOPUS:85074145124
T3 - Proceedings - 6th IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2019 and 5th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2019
SP - 1
EP - 6
BT - Proceedings - 6th IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2019 and 5th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2019
A2 - Qiu, Meikang
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 June 2019 through 23 June 2019
ER -