An energy-aware fault tolerant scheduling framework for soft error resilient cloud computing systems

Yue Gao, Sandeep K. Gupta, Yanzhi Wang, Massoud Pedram

Research output: Chapter in Book/Entry/Poem > Conference contribution

19 Scopus citations

Abstract

For modern high performance systems, aggressive technology and voltage scaling has drastically increased their susceptibility to soft errors. At the grand scale of cloud computing, it is clear that soft error induced failures will occur far more frequently, but it is unclear how to effectively apply current error detection and fault tolerance techniques at scale. In this paper, we focus on energy-aware fault tolerant scheduling in public, multi-user cloud systems, and explore the three-way tradeoff between reliability (in terms of soft error resiliency), performance and energy. Through a systematically optimized resource allocation, error detection approach selection, virtual machine placement, spatial/temporal redundancy augmentation and task scheduling process, the cloud service provider can achieve high error coverage and fault tolerance confidence while minimizing global energy costs under user deadline constraints. Our scheduling algorithm includes a static scheduling phase that operates on task graph based workload inputs prior to execution, and a light-weight dynamic scheduler that migrates tasks during execution in case of excessive reexecutions. All schedules are evaluated on a runtime simulation engine that (1) mimics the performance fluctuations in cloud systems, and (2) supports the injection of arbitrary fault patterns. Compared to current virtual machine or task replication techniques, we are able to reduce overall application failure rates by over 50% with approximately 76% total energy overhead.

Original language: English (US)
Title of host publication: Proceedings - Design, Automation and Test in Europe, DATE 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Print): 9783981537024
State: Published - 2014
Externally published: Yes
Event: 17th Design, Automation and Test in Europe, DATE 2014 - Dresden, Germany
Duration: Mar 24 2014 - Mar 28 2014

Publication series

Name: Proceedings - Design, Automation and Test in Europe, DATE
ISSN (Print): 1530-1591

Other

Other: 17th Design, Automation and Test in Europe, DATE 2014
Country/Territory: Germany
City: Dresden
Period: 3/24/14 - 3/28/14

ASJC Scopus subject areas

  • General Engineering
