TY - GEN
T1 - An energy-aware fault tolerant scheduling framework for soft error resilient cloud computing systems
AU - Gao, Yue
AU - Gupta, Sandeep K.
AU - Wang, Yanzhi
AU - Pedram, Massoud
PY - 2014
Y1 - 2014
N2 - For modern high performance systems, aggressive technology and voltage scaling has drastically increased their susceptibility to soft errors. At the grand scale of cloud computing, it is clear that soft error induced failures will occur far more frequently, but it is unclear as to how to effectively apply current error detection and fault tolerance techniques in scale. In this paper, we focus on energy-aware fault tolerant scheduling in public, multi-user cloud systems, and explore the three-way tradeoff between reliability (in terms of soft error resiliency), performance and energy. Through a systematically optimized resource allocation, error detection approach selection, virtual machine placement, spatial/temporal redundancy augmentation and task scheduling process, the cloud service provider can achieve high error coverage and fault tolerance confidence while minimizing global energy costs under user deadline constraints. Our scheduling algorithm includes a static scheduling phase that operates on task graph based workload inputs prior to execution, and a light-weight dynamic scheduler that migrates tasks during execution in case of excessive reexecutions. All schedules are evaluated on a runtime simulation engine that (1) mimics the performance fluctuations in cloud systems, and (2) supports the injection of arbitrary fault patterns. Compared to current virtual machine or task replication techniques, we are able to reduce overall application failure rates by over 50% with approximately 76% total energy overhead.
AB - For modern high performance systems, aggressive technology and voltage scaling has drastically increased their susceptibility to soft errors. At the grand scale of cloud computing, it is clear that soft error induced failures will occur far more frequently, but it is unclear as to how to effectively apply current error detection and fault tolerance techniques in scale. In this paper, we focus on energy-aware fault tolerant scheduling in public, multi-user cloud systems, and explore the three-way tradeoff between reliability (in terms of soft error resiliency), performance and energy. Through a systematically optimized resource allocation, error detection approach selection, virtual machine placement, spatial/temporal redundancy augmentation and task scheduling process, the cloud service provider can achieve high error coverage and fault tolerance confidence while minimizing global energy costs under user deadline constraints. Our scheduling algorithm includes a static scheduling phase that operates on task graph based workload inputs prior to execution, and a light-weight dynamic scheduler that migrates tasks during execution in case of excessive reexecutions. All schedules are evaluated on a runtime simulation engine that (1) mimics the performance fluctuations in cloud systems, and (2) supports the injection of arbitrary fault patterns. Compared to current virtual machine or task replication techniques, we are able to reduce overall application failure rates by over 50% with approximately 76% total energy overhead.
UR - http://www.scopus.com/inward/record.url?scp=84903842258&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84903842258&partnerID=8YFLogxK
U2 - 10.7873/DATE2014.107
DO - 10.7873/DATE2014.107
M3 - Conference contribution
AN - SCOPUS:84903842258
SN - 9783981537024
T3 - Proceedings -Design, Automation and Test in Europe, DATE
BT - Proceedings - Design, Automation and Test in Europe, DATE 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th Design, Automation and Test in Europe, DATE 2014
Y2 - 24 March 2014 through 28 March 2014
ER -