Services Monitoring with Probabilistic Fault Detection: Challenges and Future Prospects
- gooddealbmunste197
- Aug 12, 2023
- 6 min read
Failures are usually the result of system errors that derive from faults in the system. However, a fault does not necessarily cause a system error if the erroneous system state is transient and can be 'corrected' before an error arises. Likewise, an error does not necessarily lead to a system failure if it is corrected by built-in error detection and recovery mechanisms.
Services Monitoring with Probabilistic Fault Detection
Removing X% of the faults in a system will not necessarily improve its reliability by X%. Program defects may lie in rarely executed sections of the code and so may never be encountered by users; removing them does not affect the perceived reliability. Users also adapt their behavior to avoid system features that fail for them, so a program with known faults may still be perceived as reliable by its users.
Reliability is a measurable system attribute, so non-functional reliability requirements may be specified quantitatively. These define the number of failures that are acceptable during normal use of the system, or the time for which the system must be available. Functional reliability requirements define system and software functions that avoid, detect, or tolerate faults in the software, ensuring that these faults do not lead to system failure. Software reliability requirements may also be included to cope with hardware failure or operator error.
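Two of the most common quantitative reliability metrics mentioned above are availability and probability of failure on demand (POFOD). A minimal sketch of both, with purely hypothetical figures:

```python
# Quantitative reliability metrics, sketched with hypothetical numbers.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational:
    MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def pofod(failures: int, demands: int) -> float:
    """Probability of failure on demand: observed failures
    divided by the number of demands for service."""
    return failures / demands

# A system with an MTBF of 990 h and an MTTR of 10 h is 99% available.
print(availability(990, 10))   # 0.99
print(pofod(2, 1000))          # 0.002
```

A requirement such as "at most 2 failures per 1000 demands" can then be checked directly against measured figures.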
A protection system is a specialized system associated with some other control system that can take emergency action if a failure occurs, e.g. a system to stop a train if it passes a red light, or a system to shut down a reactor if temperature or pressure is too high. Protection systems independently monitor the controlled system and its environment. If a problem is detected, the protection system issues commands to take emergency action, shutting the system down to avoid a catastrophe. Protection systems are redundant in that they include monitoring and control capabilities that replicate those in the control software, and they should be diverse, using different technology from the control software. Because they are simpler than the control system, more effort can be spent on their validation and dependability assurance. The aim is to ensure a low probability of failure on demand for the protection system.
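The essential shape of such a monitor is very small, which is exactly why it can be validated thoroughly. A minimal sketch, using the reactor example from the text (the limit values are invented for illustration):

```python
# A minimal protection-system sketch: an independent monitor that only
# decides whether to issue an emergency shutdown. The limits below are
# hypothetical illustration values, not real reactor parameters.

SHUTDOWN_TEMP_C = 350.0      # assumed temperature limit
SHUTDOWN_PRESSURE_MPA = 15.0  # assumed pressure limit

def protection_check(temperature_c: float, pressure_mpa: float) -> str:
    """Return the emergency action, or 'ok' if no limit is exceeded."""
    if temperature_c > SHUTDOWN_TEMP_C or pressure_mpa > SHUTDOWN_PRESSURE_MPA:
        return "shutdown"
    return "ok"

print(protection_check(320.0, 12.0))  # ok
print(protection_check(360.0, 12.0))  # shutdown
```

Note that the monitor shares no logic with the control software: it reads the sensors and applies fixed limits, which keeps it simple enough for a high assurance level.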
The Receiver Autonomous Integrity Monitoring (RAIM) method is an effective means of providing timely integrity monitoring for users. To overcome the misjudgments caused by gross errors interfering with the least squares algorithm, this paper proposes a RAIM method based on M-estimation for multiconstellation GNSS. For five scenarios (BDS, GPS/BDS, and GPS/BDS/GLONASS at the current stage, plus the future Beidou Global Navigation Satellite System and the future GPS/BDS/GLONASS/Galileo system), the new RAIM method is compared with the traditional least squares method by simulation. The simulation results show that, as the number of constellations increases, RAIM availability, fault detection probability, and fault identification probability all improve. Under the same simulation conditions, the fault detection and identification probabilities based on M-estimation are higher than those based on least squares estimation, and M-estimation is more sensitive to minor deviations than least squares estimation.
Therefore, the unit weight error of the pseudorange residual vector is calculated from the sum of squares of the pseudorange residuals. When the system is operating normally, the pseudorange residuals are small, and the a posteriori unit weight error is also small. When the deviation of the measured pseudorange is large, the unit weight error becomes larger, and this needs to be detected. Assuming that there is no fault, each component of the distance residual vector V is an independent, normally distributed random error with mean zero and variance σ₀². Because the residual sensitivity matrix is a real symmetric matrix whose rank equals the degrees of freedom of the solution (n − 4 for n visible satellites in the single-constellation case), statistical distribution theory gives that VᵀV/σ₀² obeys a chi-square distribution with that number of degrees of freedom; if there is a fault, the mean of the distance residual vector is not zero, and VᵀV/σ₀² obeys a noncentral chi-square distribution with the same degrees of freedom [9]. Therefore, VᵀV/σ₀² can be used as a test statistic. Let the test statistic be T = VᵀV/σ₀².
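The detection step can be sketched in a few lines of Python. The residuals, σ₀, and the threshold below are all illustrative; in practice the threshold is the chi-square quantile for the chosen false alarm probability and degrees of freedom:

```python
from math import fsum

def detection_statistic(residuals, sigma0):
    # T = V^T V / sigma0^2: chi-square distributed under the no-fault
    # hypothesis (degrees of freedom = satellites minus unknowns).
    return fsum(v * v for v in residuals) / (sigma0 ** 2)

def fault_detected(residuals, sigma0, threshold):
    # threshold is the chi-square quantile for the chosen false alarm
    # probability, e.g. about 13.8 for 2 degrees of freedom and a
    # false alarm probability of 1e-3.
    return detection_statistic(residuals, sigma0) > threshold

# Six satellites, hypothetical residuals in metres.
print(fault_detected([0.4, -0.3, 0.5, -0.2, 0.1, 0.3], 0.5, 13.8))  # False
print(fault_detected([0.4, -0.3, 5.0, -0.2, 0.1, 0.3], 0.5, 13.8))  # True
```

The second call shows the effect of a single gross pseudorange error: it dominates the sum of squares and pushes the statistic well past the threshold.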
Each of the n visible satellites yields a test statistic dᵢ. Given a total false alarm probability P_FA, the false alarm probability allotted to each test statistic is P_FA/n, and the identification threshold can be calculated from this per-satellite false alarm probability, where λ is the noncentrality parameter and σ₀² is the a priori variance of the pseudorange residual. From this we can calculate the identification threshold Tᵢ corresponding to each test statistic dᵢ. Comparing each test statistic with its identification threshold, if |dᵢ| > Tᵢ, the i-th satellite is declared faulty and should be excluded. The corresponding detection threshold is likewise calculated from the corresponding false alarm probability, after which fault detection and fault identification are carried out.
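The per-satellite identification step can be sketched similarly. Here dᵢ = vᵢ / (σ₀·√Sᵢᵢ) uses the diagonal of the residual sensitivity matrix S; the residuals, S diagonal, and threshold below are illustrative values, not from the paper:

```python
from math import sqrt

def identify_faulty(residuals, s_diag, sigma0, threshold):
    """Per-satellite identification: d_i = v_i / (sigma0 * sqrt(S_ii))
    is standard normal under the no-fault hypothesis; threshold is the
    normal quantile for the per-satellite false alarm rate P_FA / n.
    Returns the indices of satellites flagged as faulty."""
    faulty = []
    for i, (v, s) in enumerate(zip(residuals, s_diag)):
        d = v / (sigma0 * sqrt(s))
        if abs(d) > threshold:
            faulty.append(i)
    return faulty

# Four satellites, hypothetical residuals (m) and S diagonal entries;
# satellite 2 carries a gross error.
print(identify_faulty([0.2, -0.1, 4.0, 0.15],
                      [0.5, 0.4, 0.6, 0.5],
                      sigma0=0.5, threshold=3.29))  # [2]
```

Once identified, the flagged satellite is excluded and the position solution is recomputed with the remaining measurements.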
As can be seen from Figure 1, both methods can be used for multiconstellation fault detection, but under the same circumstances the fault detection probability based on M-estimation is significantly higher than that based on least squares. Since M-estimation amplifies the pseudorange residuals of the failed satellite in the test statistic, it is more sensitive than least squares to minor deviations; that is, the correct warning probability of least squares for minor deviations is lower than that of M-estimation.
This paper first introduced the multiconstellation RAIM method based on traditional least squares and derived the test statistics and threshold calculations for fault detection and fault identification. It then proposed a multiconstellation RAIM method based on M-estimation and derived the corresponding fault detection and identification process. Finally, for five scenarios (BD2, GPS/BD2, and GPS/BD2/GLONASS at the current stage, and the Beidou Global Navigation Satellite System and the GPS/BDS/GLONASS/Galileo system in the future), the RAIM method based on M-estimation was compared with the traditional least squares RAIM method by simulation. The simulation results show that the availability of the RAIM method increases with the number of constellations, as do the fault detection and fault identification probabilities. Under the same conditions, the fault detection and identification probabilities of the M-estimation method are higher than those of the least squares method, and M-estimation is more sensitive to minor deviations than least squares estimation.
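The robustness advantage of M-estimation over least squares is easy to demonstrate on a toy problem. The sketch below uses the Huber estimator, one common M-estimator (the excerpt does not specify which influence function the paper uses), to estimate a location parameter in the presence of a single gross error; the data and tuning constant k are illustrative:

```python
# A minimal Huber M-estimation sketch: iteratively reweighted estimate
# of a location parameter. Large residuals get down-weighted (w = k/|r|),
# so a single gross error cannot dominate the result the way it
# dominates a least squares mean.

def huber_location(data, k=1.345, iters=50):
    est = sum(data) / len(data)  # least squares (mean) as the start
    for _ in range(iters):
        w = [1.0 if abs(x - est) <= k else k / abs(x - est) for x in data]
        est = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
    return est

data = [9.8, 10.1, 10.0, 9.9, 30.0]   # one gross error among values near 10
print(sum(data) / len(data))          # least squares mean: 13.96
print(huber_location(data))           # robust estimate: ~10.3
```

The gross error drags the least squares estimate far from the bulk of the data, while the Huber estimate stays close to it; this is the same mechanism by which M-estimation keeps a faulty satellite's residual visible in the RAIM test statistic instead of smearing it across all residuals.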
The aforementioned diagnosis tools are not recommended for in situ diagnosis of a deployed operational WSN, because the sensor nodes in these approaches continuously generate a great deal of traffic that consumes communication, computation, and energy resources. Moreover, integrating these complex debugging tools with the application program at each sensor node complicates system implementation. Similarly, active fault diagnosis imposes a heavy traffic overhead on the network, because a large amount of information about specific control commands or status is transferred to the sink, as in Sympathy [26] and Clairvoyant [27,28,29]. These active fault diagnosis approaches focus on determining and tracking software faults on the sensor nodes, which tends to place the network under a heavy traffic burden. To minimise network traffic, passive fault diagnosis schemes have been suggested and practised as a solution (see Figure 2).
Passive fault diagnosis is motivated by the limitations of the techniques described above. Several ongoing sea monitoring projects [30,31,32] illustrate its importance. One such project built a working prototype WSN consisting of tens of nodes floating on the surface of the sea to collect scientific information such as depth, surrounding illumination, and pollution. In recent deployment tests it was observed that abnormal energy depletion never occurs in controlled laboratory settings; such abnormalities surface only because of the multi-hop router component, which computes an optimised routing tree in the highly unstable environment of the sea and must also contend with a high degree of delay and packet loss at the sink node. Moreover, fast and accurate determination of the root cause of a detected problem is required before further action can be taken, such as sending reboot messages to some nodes or physically examining suspicious links.
Zhiyang et al. [57] presented a fault diagnosis model based on [58,59,60]. In this model, each node forms a cluster with its neighbouring nodes; a node is called a neighbour of another node if it is within its transmission range. The diagnostic operation is performed periodically within the cluster by comparing a node's result with those of its neighbours. The diagnostic protocol is divided into the following five steps:
The diagnosis session ends after a Timer2 message is received from the node. The pseudo-code of the protocol is shown in Algorithm 1; in that code, 1-lb is used to send a message to a node's neighbours. The protocol distinguishes itself from existing approaches by two advantages. First, it has no embedded agent sending status information to the sink, and it works in a local area, i.e. a node and its neighbouring nodes, so it is claimed to impose little computational cost on the sensor nodes and little communication overhead on the network. Second, it is a proactive protocol and is suitable for applications that do not have continuous connections. Harte et al. [61] introduced a novel technique for fault detection in WSNs that uses a tree-based heuristic reasoning technique. It infers the cause of faults by diagnosing the state of links and nodes, as well as other factors, to recommend the most effective action based on status information collected in real time. The drawback of this method is that its periodic sampling overloads the network under heavy traffic load.
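The neighbour-comparison idea behind such cluster diagnosis protocols can be sketched as follows. This is not Algorithm 1 itself (which is not reproduced in the text), only an illustration of the core comparison step: a node is suspected faulty when its reading disagrees with the median of its neighbours by more than a tolerance.

```python
# A hedged sketch of comparison-based neighbour diagnosis in a WSN
# cluster. All names, readings, and the tolerance are illustrative.

def diagnose(readings, neighbours, tolerance=2.0):
    """readings: node_id -> sensed value.
    neighbours: node_id -> list of neighbour ids (nodes in
    transmission range). Returns the ids of suspected faulty nodes."""
    faulty = []
    for node, value in readings.items():
        peers = sorted(readings[n] for n in neighbours[node])
        median = peers[len(peers) // 2]
        if abs(value - median) > tolerance:
            faulty.append(node)
    return faulty

# Four nodes in mutual range; node 2's reading disagrees with the rest.
readings = {0: 20.1, 1: 20.4, 2: 35.0, 3: 19.8}
neighbours = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
print(diagnose(readings, neighbours))  # [2]
```

Because each node only compares against its own neighbourhood, no per-node status report has to travel to the sink, which is the source of the low communication overhead claimed for this class of protocol.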