Fault isolation in object oriented control systems

Poster from 2004 ISIS workshop

The poster is available as a PDF file.

Selected publications

Fault Isolation in Object Oriented Control Systems, M. Larsson, I. Klein, D. Lawesson, U. Nilsson, In Proc 4th IFAC Symposium On Fault Detection Supervision and Safety for Technical Processes (IFAC SAFEPROCESS 2000), Jun 2000.

Model-Checking Based Fault Isolation in UML, D. Lawesson, U. Nilsson, I. Klein, 12th International Workshop on Principles of Diagnosis (DX 2001), March 2001.

Fault Isolation using Process Algebra Models, D. Lawesson, U. Nilsson, I. Klein, 13th International Workshop on Principles of Diagnosis (DX 2002), May 2002.

Fault Isolation Using Automatic Abstraction To Avoid State Space Explosion, D. Lawesson, U. Nilsson, I. Klein, In Workshop on Model Checking and Artificial Intelligence (MoChArt-2003), Acapulco. 2003.

Fault Isolation in Discrete Event Systems by Observational Abstraction, D. Lawesson, U. Nilsson, I. Klein, In Proc of 42nd IEEE Conf on Decision and Control (CDC), Maui Hawaii. 2003.

Background

The project, which started in fall 1997, is carried out jointly by the Division of Automatic Control and the Theoretical Computer Science Laboratory in cooperation with ABB Robotics. Participants from the Division of Automatic Control are Magnus Larsson (working at ABB Robotics as of April 2000) and Inger Klein, and from the Theoretical Computer Science Laboratory Dan Lawesson and Ulf Nilsson.

The project is motivated by a real industrial robot control system developed by ABB Robotics. The system is large (on the order of a million lines of code), concurrent, object oriented in its architecture, and configurable, supporting different types of robots and cell configurations. The object oriented architecture makes it easier to cope with complexity and facilitates maintenance and reuse; a fundamental guideline in object oriented design is to base objects on the principle of encapsulation. An often conflicting design goal is the need to suppress unnecessary, propagating error messages and ultimately give the operator a concise picture of a fault situation. Since the system is also time- and safety-critical, the first priority in case of a failure is to bring the system to a safe state. Hence alarms that go off are logged and can be analyzed only when the system has come to a standstill.

Our work has two main objectives. On the one hand, we want to devise a method that can be used for operator support. The aim is then to single out the error message that explains the actual cause of the failure, or possibly an unobservable critical event that explains the observations. That is, we aim to discard error messages that are definitely effects of other error messages, while isolating the error messages (or critical events) that explain all other messages. On the other hand, our method can also be used at design time, to find out whether the error log design is sufficient, that is, whether enough error messages are produced to isolate all faults.

PROJECT OVERVIEW

In the first phase of the project (1997-2000) we studied a structural approach to fault isolation, proposing a fault isolation scheme that introduces an extra layer between the operator and the core control system. The purpose of the intermediate layer was to post-process the fault information from the system so that the operator receives clear and concise fault information, without violating the encapsulation and modularity of the object oriented architecture (see [3]). The post-processing amounts to finding a cause-effect relation between the error messages generated in a fault scenario, and then choosing the most significant error message(s) according to this relation. To infer the cause-effect relation we rely essentially on two pieces of information:

The information provided by the error messages available (without order) in the log. We proposed an error message design with three types of error message: (1) those signaling problems internal to an object, and those reporting problems in the interaction with another object, which may be either known (2.1) or unknown (2.2).
A structural model of the control system in the form of UML (Unified Modeling Language) class diagrams and task diagrams.

The basic idea of our approach is to use the partial information provided by the error messages, supplemented with structural information, to build a so-called explanation graph, with the aim of singling out the error message that explains all other error messages.
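As a rough illustration of the idea, the following sketch builds a small explanation graph and looks for a message from which all others can be reached. It is not the algorithm of [2] or [4]: the log format, the object names, the `USES` relation and the rule for drawing "explains" edges are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical log entries: (message id, reporting object, kind, blamed object).
# kind "internal" stands for type (1), "known" for (2.1), "unknown" for (2.2).
LOG = [
    ("m1", "Motor", "internal", None),
    ("m2", "Axis", "known", "Motor"),
    ("m3", "Robot", "unknown", None),
]

# Hypothetical structural model: the objects each object interacts with.
USES = {"Robot": {"Axis"}, "Axis": {"Motor"}, "Motor": set()}


def explanation_graph(log, uses):
    """Return a map: message id -> set of message ids it may explain."""
    by_object = defaultdict(list)
    for mid, obj, _kind, _blamed in log:
        by_object[obj].append(mid)
    explains = defaultdict(set)
    for mid, obj, kind, blamed in log:
        if kind == "known":
            suspects = {blamed}              # the message names the other object
        elif kind == "unknown":
            suspects = uses.get(obj, set())  # fall back on the structural model
        else:
            suspects = set()                 # internal problems blame nobody
        for suspect in suspects:
            for cause in by_object[suspect]:
                explains[cause].add(mid)
    return explains


def roots(log, explains):
    """Messages from which every message in the log is reachable."""
    ids = [entry[0] for entry in log]
    result = []
    for start in ids:
        seen, stack = {start}, [start]
        while stack:
            for nxt in explains.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen == set(ids):
            result.append(start)
    return result


print(roots(LOG, explanation_graph(LOG, USES)))
```

In this toy log the message from Motor explains the one from Axis, which in turn explains the one from Robot, so `m1` is the only root. With cyclic dependencies several messages can reach all others, and such a search no longer singles out a unique root.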

A prototype implementation of the approach has been completed and tested on the ABB Robotics industrial robot control system. The approach is presented in [2], [4] and Magnus Larsson's Ph.D. thesis [1]. A patent application has been filed with the Swedish patent office (PRV) [5].

The structural method accurately and efficiently isolates the root error message in most scenarios supplied by ABB Robotics, but may fail in case of cyclic dependencies in the error log or in the class diagrams. The method also relies on there being only one instance of each major system class. In the second phase (2000 to present), described in some detail below, we have extended the structural approach by introducing an additional behavioral model of the system, in the form of UML state charts, allowing a more precise fault isolation process in those cases where the structural method fails to single out a unique message.

RESULTS AND DEVELOPMENT SINCE 2000

The main focus in the second phase of the project has been to extend the structural approach to fault isolation using behavioral methods (more precisely, UML state machines) and class instances rather than classes. In [6] we introduced the concept of a strong root candidate: an event that is known to have occurred and for which there is a run, consistent with the log, where this event is the first critical event.
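The definition can be made concrete on a toy model by explicit enumeration. The sketch below is only an illustration, not the method of [6]: the transition system, the event names, the choice of standstill states, and the reading of "known to have occurred" as "appears in the log" are assumptions. A run is taken to be consistent with the log when it ends in a standstill state and has written exactly the logged set of messages.

```python
# Hypothetical model: state -> list of (event, next state).
TRANSITIONS = {
    "running":  [("cable_break", "degraded"), ("motor_fault", "halting")],
    "degraded": [("motor_fault", "halting")],
    "halting":  [("axis_error", "stopped")],
    "stopped":  [],
}
CRITICAL = {"cable_break", "motor_fault"}  # assumed critical events
LOGGED = {"motor_fault", "axis_error"}     # events that write a message;
                                           # cable_break is unobservable
INITIAL = "running"
FINAL = {"stopped"}                        # assumed standstill states


def strong_root_candidates(log):
    """Logged events that are the first critical event of some consistent run."""
    log = frozenset(log)
    # A configuration is (state, messages written so far, first critical event).
    start = (INITIAL, frozenset(), None)
    seen, stack, found = {start}, [start], set()
    while stack:
        state, logged, first = stack.pop()
        if state in FINAL and logged == log and first in log:
            found.add(first)
        for event, nxt in TRANSITIONS[state]:
            new_logged = (logged | {event}) if event in LOGGED else logged
            if not new_logged <= log:
                continue  # this run writes a message that is not in the log
            new_first = first
            if first is None and event in CRITICAL:
                new_first = event
            config = (nxt, new_logged, new_first)
            if config not in seen:
                seen.add(config)
                stack.append(config)
    return found


print(strong_root_candidates({"motor_fault", "axis_error"}))
```

For the log containing `motor_fault` and `axis_error`, two runs are consistent: one where `motor_fault` is the first critical event and one where the unobservable `cable_break` precedes it. Only `motor_fault` is known to have occurred, so it is the single strong root candidate. The search visits a finite set of configurations, so it also terminates on models with cycles.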

In a series of papers [6-10] we propose an approach to fault isolation based on model checking to locate strong root candidates (if they exist). The property of being a strong root candidate is expressed in the temporal logic CTL (normally used for verification), and we use an existing model checker to single out the strong root candidates.

However, a main obstacle in model checking is the so-called state-space explosion: the number of global system states typically increases exponentially with the number of subsystems. Techniques have been proposed to stretch the limits of model checking (e.g. symbolic model checking and partial order reduction). In our case, however, we do not solve a general model checking problem but a more specific one. Consequently there are more efficient abstraction mechanisms for our particular problem, and we propose such a method in [11-13]. The general idea is that we are only interested in the correlation between the first critical event and the set of messages that are logged during the execution. Hence, we can not only abstract away details about parallel object interleavings, as in partial order reduction, but also ignore the order of messages and any dynamics that, in the global system model, do not change the set of messages sent or the order of critical events. For example, cyclic behavior where no critical events occur can be abstracted to a single state. We perform this abstraction before applying model checking, which reduces the state space considerably and makes it easier to check for strong root candidates.
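The example of collapsing cyclic behavior can be sketched as follows. This is a simplified illustration and not the abstraction of [11-13]: it assumes a single transition system in which an event counts as "silent" when it is neither critical nor logged, and it merges states that can reach each other through silent events only. All names are hypothetical.

```python
def silent_reach(transitions, silent, start):
    """States reachable from start using silent events only."""
    seen, stack = {start}, [start]
    while stack:
        for event, nxt in transitions[stack.pop()]:
            if event in silent and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def collapse_silent_cycles(transitions, silent):
    """Merge states that lie on a common cycle of silent events."""
    reach = {s: silent_reach(transitions, silent, s) for s in transitions}
    # Representative of a state: smallest name among the states that it
    # reaches and that reach it back through silent events.
    rep = {s: min(t for t in reach[s] if s in reach[t]) for s in transitions}
    merged = {r: set() for r in set(rep.values())}
    for state, edges in transitions.items():
        for event, nxt in edges:
            if event in silent and rep[state] == rep[nxt]:
                continue  # silent step inside a merged state: drop it
            merged[rep[state]].add((event, rep[nxt]))
    return merged


# Hypothetical model: a polling loop with no critical events, and one fault.
MODEL = {
    "idle":    [("poll", "busy")],
    "busy":    [("done", "idle"), ("motor_fault", "stopped")],
    "stopped": [],
}
print(collapse_silent_cycles(MODEL, silent={"poll", "done"}))
```

Here the loop between `idle` and `busy` collapses into one state (named `busy`, the smaller of the two names), leaving a single `motor_fault` transition to `stopped`. The set of messages and the order of critical events along any run are unchanged, which is the property the abstraction must preserve.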

The result produced by our method is a table that maps all possible message logs to the corresponding strong root candidates. The table, called the fault isolation table, can of course be used for fault isolation: given a log and the fault isolation table, the strong root candidates can be found simply by table lookup. Its primary use, though, is in diagnosability analysis. The table partitions all possible system runs into equivalence classes of runs with the same logged messages, and each class corresponds to a row in the table. If such a row has several strong root candidates, we conclude that the runs in the corresponding class are not diagnosable. If an error message is redundant, this will be evident from the table. If it depends on some other message, the two will only appear in certain configurations in consistent logs. The exponential size of the table means that it is generally not feasible to use it explicitly for systems with a large set of logged events. In that case, abstractions of the table can be computed and presented to a user, for example the set of table rows that indicate non-diagnosability.
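The two uses of the table, lookup and diagnosability analysis, can be illustrated with a small sketch. The run data below is invented for the example, and the table is built by brute force from an explicit list of runs rather than by the abstraction and model checking described above; a strong root candidate is again assumed to be a first critical event that appears in the log.

```python
from collections import defaultdict

# Hypothetical runs: (first critical event, set of messages logged by the run).
RUNS = [
    ("motor_fault", {"motor_fault", "axis_error"}),
    ("cable_break", {"motor_fault", "axis_error"}),  # unobservable first event
    ("motor_fault", {"motor_fault", "axis_error", "brake_error"}),
    ("brake_error", {"motor_fault", "axis_error", "brake_error"}),
]


def fault_isolation_table(runs):
    """Map each possible log (as a frozenset) to its strong root candidates."""
    table = defaultdict(set)
    for first_critical, logged in runs:
        row = table[frozenset(logged)]  # creates the row even if it stays empty
        if first_critical in logged:    # the event must be known to have occurred
            row.add(first_critical)
    return dict(table)


def non_diagnosable_rows(table):
    """Logs for which more than one strong root candidate remains."""
    return [log for log, candidates in table.items() if len(candidates) > 1]


TABLE = fault_isolation_table(RUNS)

# Fault isolation by table lookup.
print(TABLE[frozenset({"motor_fault", "axis_error"})])

# Diagnosability analysis: rows with several candidates.
print(non_diagnosable_rows(TABLE))
```

In this toy table the two-message log has a single candidate, while the three-message log has two and is therefore reported as non-diagnosable.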

We have developed a prototype tool, StateTracer, that takes a description of a system as input and produces a fault isolation table as output along with visualizations of all merged objects. The system description is given in UML.

PUBLICATIONS

Publications 1997-2000

[1]	M. Larsson, Behavioral and Structural Model Based Approaches to Discrete Diagnosis, Ph.D. thesis, 1999.
[2]	M. Larsson, I. Klein, D. Lawesson, U. Nilsson, Fault Isolation in Object Oriented Control Systems, In Proc 4th IFAC Symposium On Fault Detection Supervision and Safety for Technical Processes (IFAC SAFEPROCESS 2000), Jun 2000.
[3]	M. Larsson, I. Klein, D. Lawesson, U. Nilsson, The Need for Fault Isolation in Object-Oriented Control Systems, Report LiTH-ISY-R-2098, May 1999.
[4]	M. Larsson, I. Klein, D. Lawesson, U. Nilsson, Model Based Fault Isolation for Object-Oriented Control Systems, Report LiTH-ISY-R-2205, Nov 1999.
[5]	M. Larsson, P. Eriksson, Fault Isolation in Process, Swedish Patent Application No. 9904008-1, Nov 1999.

Publications after Sept 2000

[6]	D. Lawesson, Towards Behavioral Model Fault Isolation for Object Oriented Control Systems, Licentiate Thesis no. 863, March 2001.
[7]	D. Lawesson, U. Nilsson, I. Klein, Model-Checking Based Fault Isolation in UML, 12th International Workshop on Principles of Diagnosis (DX 2001), March 2001.
[8]	D. Lawesson, U. Nilsson, I. Klein, Model-Checking Based Fault Isolation in UML, Report LiTH-ISY-R-2336, Feb 2001.
[9]	D. Lawesson, U. Nilsson, I. Klein, Fault Isolation using Process Algebra Models, 13th International Workshop on Principles of Diagnosis (DX 2002), May 2002.
[10]	D. Lawesson, U. Nilsson, I. Klein, Fault Isolation using Process Algebra Models, Workshop on Model Checking and Artificial Intelligence (MoChArt-2002), July 2002.
[11]	D. Lawesson, U. Nilsson, I. Klein, Model Checking Based Fault Isolation Using Automatic Abstraction, In 14th Intl Workshop of Principles of Diagnosis (DX 2003), Washington DC. 2003.
[12]	D. Lawesson, U. Nilsson, I. Klein, Fault Isolation Using Automatic Abstraction To Avoid State Space Explosion, In Workshop on Model Checking and Artificial Intelligence (MoChArt-2003), Acapulco. 2003.
[13]	D. Lawesson, U. Nilsson, I. Klein, Fault Isolation in Discrete Event Systems by Observational Abstraction, In Proc of 42nd IEEE Conf on Decision and Control (CDC), Maui Hawaii. 2003.
[14]	D. Lawesson, U. Nilsson, I. Klein, Model Checking Based Fault Isolation Using Automatic Abstraction, Report LiTH-ISY-R-2637, Oct 2004.
[15]	D. Lawesson, U. Nilsson, I. Klein, Fault Isolation in Discrete Event Systems by Observational Abstraction, Report LiTH-ISY-R-2638, Oct 2004.