IEI/CNR, Pisa, Italy, Internal Report B4-33, December 1996.


Discriminating Fault Rate and Persistency to Improve Fault Treatment


A. Bondavalli*, S. Chiaradonna**, F. Di Giandomenico** and F. Grandoni**

 
*  CNUCE/CNR, via S.Maria 36, 56126 Pisa, Italy.
   Ph. +39 50 593 111, Fax +39 50 904052, E-mail A.Bondavalli@cnuce.cnr.it

** IEI/CNR, via S.Maria 46, 56126 Pisa, Italy.
   Ph. +39 50 593 400, Fax +39 50 554342, E-mail (digiandomenico, grandoni)@iei.pi.cnr.it
 


Abstract

This paper addresses the consolidated identification of faults, distinguished as transient or permanent/intermittent, through the definition of a fault identification mechanism called α-count. The goal is to allow continued use of parts hit by transient faults, which, with proper handling, may lead to better overall system performance. Discriminating transient faults is especially important in dependability-critical applications where replacing or repairing failed components is costly, difficult or impossible (as on computer-guided space probes). α-count tries to balance two conflicting requirements: keeping in the system those components that have experienced only transient faults, and quickly removing those affected by permanent or intermittent faults. The delay in spotting faulty components and the probability of improperly blaming correct ones are evaluated as α-count's figures of merit. The approach is compared with some heuristics developed to deal with the same problem.
 

Keywords: Fault Persistency Discrimination, Fault Treatment, Scoring Functions, Threshold-based Identification, Modelling and Evaluation.
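
The abstract describes a scoring function compared against a threshold, but does not state its update rule. As a rough illustration only, the sketch below assumes one plausible form: the score of a component grows by one each time an error is signalled and decays by a constant factor after each error-free step, and the component is flagged once the score reaches a threshold. The names ScoreCounter, decay, threshold and steps_to_flag, and the numeric values, are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ScoreCounter:
    """Hypothetical count-and-threshold score for a single component."""
    decay: float = 0.5      # assumed factor in [0, 1] applied after an error-free step
    threshold: float = 3.0  # assumed score at which the component is flagged
    score: float = 0.0

    def update(self, error_observed: bool) -> bool:
        """Record one judgement; return True if the component is now flagged."""
        if error_observed:
            self.score += 1.0         # each signalled error raises the score
        else:
            self.score *= self.decay  # error-free steps let old errors fade
        return self.score >= self.threshold


def steps_to_flag(signals: Iterable[bool],
                  decay: float = 0.5,
                  threshold: float = 3.0) -> Optional[int]:
    """Return the 1-based step at which the component is flagged, or None."""
    counter = ScoreCounter(decay=decay, threshold=threshold)
    for step, error_observed in enumerate(signals, start=1):
        if counter.update(error_observed):
            return step
    return None


# Sparse, isolated errors (transient-like): the score keeps decaying.
print(steps_to_flag([True, False, False, False] * 5))
# Errors at every step (permanent-like): the threshold is reached quickly.
print(steps_to_flag([True] * 10))
```

Under these assumed parameters, raising the threshold or lowering the decay factor makes it less likely that a correct component is blamed for isolated transients, at the price of a longer delay before a permanently faulty one is removed. This is the trade-off the abstract refers to as the mechanism's figures of merit.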
 


For more information on this paper/report contact: S. Chiaradonna