TY - GEN
T1 - Reliable state monitoring in cloud datacenters
AU - Meng, Shicong
AU - Iyengar, Arun K.
AU - Rouvellou, Isabelle M.
AU - Liu, Ling
AU - Lee, Kisung
AU - Palanisamy, Balaji
AU - Tang, Yuzhe
PY - 2012
Y1 - 2012
N2 - State monitoring is widely used for detecting critical events and abnormalities in distributed systems. As the scale of such systems grows and the degree of workload consolidation increases in Cloud data centers, node failures and performance interferences, especially transient ones, become the norm rather than the exception. Hence, distributed state monitoring tasks are often exposed to impaired communication caused by such dynamics on different nodes. Unfortunately, existing distributed state monitoring approaches are often designed under the assumption of always-online distributed monitoring nodes and reliable inter-node communication. As a result, these approaches often produce misleading results, which in turn cause problems for Cloud users who rely on state monitoring results to perform automatic management tasks such as auto-scaling. This paper introduces a new state monitoring approach that tackles this challenge by exposing and handling communication dynamics such as message delay and loss in Cloud monitoring environments. Our approach delivers two distinct features. First, it quantitatively estimates the accuracy of monitoring results to capture uncertainties introduced by messaging dynamics. This feature helps users distinguish trustworthy monitoring results from those that deviate heavily from the truth, and it significantly improves monitoring utility compared with simple techniques that invalidate all monitoring results generated in the presence of messaging dynamics. Second, our approach adapts to non-transient messaging issues by reconfiguring distributed monitoring algorithms to minimize monitoring errors. Our experimental results show that, even under severe message loss and delay, our approach consistently improves monitoring accuracy and, when applied to Cloud application auto-scaling, outperforms existing state monitoring techniques in its ability to correctly trigger dynamic provisioning.
KW - Cloud Monitoring
KW - Distributed Thresholds
KW - Message Delay and Loss
KW - Reliability
KW - State Monitoring
UR - http://www.scopus.com/inward/record.url?scp=84866769714&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866769714&partnerID=8YFLogxK
U2 - 10.1109/CLOUD.2012.10
DO - 10.1109/CLOUD.2012.10
M3 - Conference contribution
AN - SCOPUS:84866769714
SN - 9780769547558
T3 - Proceedings - 2012 IEEE 5th International Conference on Cloud Computing, CLOUD 2012
SP - 951
EP - 958
BT - Proceedings - 2012 IEEE 5th International Conference on Cloud Computing, CLOUD 2012
T2 - 2012 IEEE 5th International Conference on Cloud Computing, CLOUD 2012
Y2 - 24 June 2012 through 29 June 2012
ER -