FaCS: Toward a Fault-Tolerant Cloud Scheduler Leveraging Long Short-Term Memory Network

Tariqul Islam, Dakshnamoorthy Manivannan

Research output: Chapter in Book/Entry/Poem › Conference contribution

3 Scopus citations

Abstract

Large-scale cloud datacenters often experience reduced performance and service outages. Due to the inherent complexity, heterogeneity, and multitenant architecture of these datacenters, applications (i.e., jobs and tasks) running on them are susceptible to various types of failures. In this paper, we first characterize the application failures in the Google cluster trace and then propose a prediction model that forecasts the termination status of a task. We then introduce a task scheduler that dynamically reschedules tasks based on the predicted results. This proactive fault-tolerant scheduler improves system reliability and ensures timely execution of the applications. Simulation results show that our scheduler substantially reduces the makespan and failure rate of tasks while balancing load. Moreover, early prediction combined with quick scheduling adjustment improves overall resource utilization and reduces resource wastage.

Original language: English (US)
Title of host publication: Proceedings - 6th IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2019 and 5th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2019
Editors: Meikang Qiu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-6
Number of pages: 6
ISBN (Electronic): 9781728116600
State: Published - Jun 2019
Externally published: Yes
Event: 6th IEEE International Conference on Cyber Security and Cloud Computing and 5th IEEE International Conference on Edge Computing and Scalable Cloud, CSCloud/EdgeCom 2019 - Paris, France
Duration: Jun 21 2019 to Jun 23 2019

Publication series

Name: Proceedings - 6th IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2019 and 5th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2019

Conference

Conference: 6th IEEE International Conference on Cyber Security and Cloud Computing and 5th IEEE International Conference on Edge Computing and Scalable Cloud, CSCloud/EdgeCom 2019
Country/Territory: France
City: Paris
Period: 6/21/19 to 6/23/19

Keywords

  • Failure Prediction
  • Fault-Tolerance
  • Job and Task Scheduler
  • Long Short-Term Memory Network

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Safety, Risk, Reliability and Quality
