Technical Report

Threshold-Based Mechanisms to Discriminate Transient from Intermittent Faults


A. Bondavalli, S. Chiaradonna, F. Di Giandomenico, F. Grandoni


Abstract

A class of count-and-threshold mechanisms, collectively named α-count, is presented; these mechanisms discriminate between transient and intermittent faults in computing systems. Transient fault discrimination has long been pursued in commercial systems, where threshold-based techniques have been in practical use for several years. The present work aims to make count-and-threshold schemes more useful by analysing their behaviour and exploring their effects on the system. A mathematically defined structure, simple enough to be analysed with standard tools, is adopted. α-count is equipped with internal parameters that can be tuned to suit environmental variables such as the transient fault rate and intermittent fault occurrence patterns. An extensive behaviour analysis is carried out for two embodiments of the scheme, both under the usual assumption of exponentially distributed fault rates and with more realistic fault patterns.
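
The general idea of a count-and-threshold scheme can be illustrated with a minimal sketch. The update rule used here (add 1 to a score on each error signal, multiply the score by a decay factor on each error-free observation, raise an alarm when the score reaches a threshold) is an assumption made for illustration, not a definition taken from this abstract. The names `CountAndThreshold`, `decay`, `threshold` and `first_alarm` are hypothetical, as are the parameter values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CountAndThreshold:
    """Illustrative count-and-threshold filter (assumed update rule)."""
    decay: float = 0.5       # hypothetical tuning parameter, 0 <= decay <= 1
    threshold: float = 3.0   # hypothetical tuning parameter
    score: float = 0.0

    def observe(self, error: bool) -> bool:
        """Update the score with one observation; return True on alarm."""
        if error:
            self.score += 1.0          # each error signal raises the score
        else:
            self.score *= self.decay   # error-free steps let it fade
        return self.score >= self.threshold


def first_alarm(signals: Iterable[bool],
                decay: float = 0.5,
                threshold: float = 3.0) -> Optional[int]:
    """Return the index of the first observation that triggers an alarm,
    or None if the threshold is never reached."""
    detector = CountAndThreshold(decay=decay, threshold=threshold)
    for step, error in enumerate(signals):
        if detector.observe(error):
            return step
    return None


# Widely spaced errors (transient-like pattern): the score fades between hits.
sparse = ([True] + [False] * 19) * 10
# Closely spaced errors (intermittent-like pattern): the score accumulates.
burst = [True, False, True, True, True]

print(first_alarm(sparse))  # no alarm expected
print(first_alarm(burst))   # alarm once the score reaches the threshold
```

With these assumed values, isolated errors never accumulate enough score to raise an alarm, whereas a burst of closely spaced errors does. The decay factor and the threshold play the role of the tunable internal parameters mentioned above.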
