Availability

Availability is the proportion of time a system is in a functioning condition. It is closely linked to the notions of performance, reliability and robustness.
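As a rough illustration of "proportion of time", the sketch below computes availability from observed uptime and downtime, and the downtime budget implied by a target. The function names and the one-year default period are assumptions made for this example.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed period during which the system was functioning."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("observation period must be positive")
    return uptime_hours / total


def allowed_downtime_hours(target: float, period_hours: float = 365 * 24) -> float:
    """Downtime budget implied by an availability target over a period."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_hours


# 10 hours of downtime in a year of 8760 hours
print(f"{availability(8750.0, 10.0):.4%}")
# Budget for a "three nines" (99.9%) target over one year
print(f"{allowed_downtime_hours(0.999):.2f} h per year")
```

A 99.9% target over a year leaves roughly 8.76 hours of downtime, which is why each additional "nine" is so much harder to reach.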

Robustness = Fault Tolerance + Recoverability

Scenario

Source: Internal or external to the system
Stimulus:
- Omission: a component fails to respond to an input
- Crash: a component repeatedly suffers omission faults
- Timing: a component responds too late/early
- Response: a component responds with an incorrect value
Artifact: Processor, process, communication channel, storage
Environment: Normal mode / safe mode, first failure, repeated failure
Response: Fault detection and isolation, recovery, user/system notifications
Measure: Availability percentage/intervals, time to detect/repair
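The four stimulus types above can be told apart mechanically. The following is a minimal sketch, assuming each reply is recorded with its value and response time; `Observation`, `classify`, `is_crash` and the threshold of three consecutive omissions are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """One observed reply to a request; all fields are hypothetical."""
    value: Optional[int]        # None means no reply was received
    elapsed_s: Optional[float]  # time taken to reply, None if no reply


def classify(obs: Observation, expected: int,
             earliest_s: float, latest_s: float) -> str:
    """Map one observation to a stimulus type from the scenario."""
    if obs.value is None or obs.elapsed_s is None:
        return "omission"   # no response at all
    if not earliest_s <= obs.elapsed_s <= latest_s:
        return "timing"     # responded too early or too late
    if obs.value != expected:
        return "response"   # responded in time, but with a wrong value
    return "ok"


def is_crash(history: List[str], threshold: int = 3) -> bool:
    """Treat `threshold` consecutive omissions as a crash (assumed rule)."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(h == "omission" for h in recent)
```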

Tactics

Fault detection

Ping/echo: send a request to a component and expect a reply within a time limit
Monitoring: display errors in real time (like car warning lights)
Heartbeat: send a health-check signal at a fixed frequency
Timestamp: detect inconsistencies in a sequence of events.
Sanity check: data validation (e.g. checksum).
Voting: find outliers by comparing data coming from multiple sources (e.g. against their mean).
Exception detection: variable length, data-type, division by 0, etc.
Self-test: a component runs a procedure to check itself for correct operation. The test must be isolated so that it is not influenced by the rest of the system.
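To make the heartbeat tactic concrete, here is a minimal sketch of a monitor that flags components that have gone silent. The class name, the timeout value and the injectable clock are assumptions for this example, not part of any particular framework.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Flags components whose last heartbeat is older than `timeout_s`."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the monitor can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def beat(self, component: str) -> None:
        """Record a heartbeat from `component` at the current time."""
        self.last_seen[component] = self.clock()

    def suspects(self) -> List[str]:
        """Return components that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


# Usage with a fake clock: "cache" stops sending heartbeats after t=0.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: now[0])
monitor.beat("db")
monitor.beat("cache")
now[0] = 4.0
monitor.beat("db")
now[0] = 8.0
print(monitor.suspects())  # ['cache']
```

Ping/echo differs mainly in direction: the monitor sends the request and waits for the reply, whereas with a heartbeat the component reports on its own.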

Fault recovery

Preparation & Repair

Redundancy: data duplication (hot spare, RAID, etc.)
Exception handling
Rollback: revert to the last known functional state
Upgrades: patches and fixes on one, many or all nodes
Retry: sometimes, repeating the operation (or rebooting) solves the problem, especially in unstable domains (network, sensors, …)
Ignore: sometimes, an error is not critical enough to act on
Degradation: sacrifice components or features
Reconfiguration
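Retry and degradation often work together: retry a transient failure a few times, then fall back to a reduced result. The sketch below assumes exponential backoff and a caller-supplied fallback; `retry`, its parameters and `flaky_read` are hypothetical names for illustration.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T],
          attempts: int = 3,
          base_delay_s: float = 0.1,
          retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
          fallback: Optional[Callable[[], T]] = None,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `operation`, doubling the delay each time; degrade to `fallback` if given."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                if fallback is not None:
                    return fallback()  # degradation: serve a reduced result
                raise                  # no fallback: let the caller handle it
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")


# Usage: a sensor read (hypothetical) that fails twice before succeeding.
calls = {"n": 0}


def flaky_read() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("sensor not ready")
    return "21.5 C"


print(retry(flaky_read, sleep=lambda s: None))  # 21.5 C
```

Only exception types listed in `retry_on` are retried; anything else propagates immediately, since retrying a programming error rarely helps.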

Reintroduction

Shadowing: run a recovered component in a limited mode (e.g. safe boot mode on an OS) before fully reintroducing it
State synchronization: back up and replicate the functional state before a crash, so that it can be restored afterwards
Escalating restart: reactivate components step-by-step, with different levels of granularity
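An escalating restart can be sketched as a loop over restart levels, from the finest granularity to the coarsest, stopping as soon as a health check passes. The level names and the `is_healthy` callback below are placeholders invented for this example.

```python
from typing import Callable, List, Optional, Tuple

# Each level is (name, restart_action); both are hypothetical placeholders.
RestartLevel = Tuple[str, Callable[[], None]]


def escalating_restart(levels: List[RestartLevel],
                       is_healthy: Callable[[], bool]) -> Optional[str]:
    """Apply restart levels from finest to coarsest until the health check passes.

    Returns the name of the level that restored service, or None if none did.
    """
    for name, restart in levels:
        restart()
        if is_healthy():
            return name
    return None


# Usage: restarting a thread is not enough here, restarting the process is.
state = {"healthy": False}
log: List[str] = []


def restart_thread() -> None:
    log.append("thread")      # does not fix the fault in this example


def restart_process() -> None:
    log.append("process")
    state["healthy"] = True   # the coarser restart clears the fault


def reboot_node() -> None:
    log.append("node")


levels = [("thread", restart_thread),
          ("process", restart_process),
          ("node", reboot_node)]
print(escalating_restart(levels, lambda: state["healthy"]))  # process
```

Because the loop stops at the first level that works, the most disruptive restart (here, rebooting the node) is only attempted when the cheaper ones have failed.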

Fault prevention

Removal from service: taking components offline
Transactions: cancel all changes on error (e.g. ACID)
Prediction: anticipate faults (e.g. high workload over time)
Increase competence set: widen the set of states a component is able to handle (e.g. treat a previously fatal condition as a normal case)
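The transaction tactic can be illustrated with Python's standard `sqlite3` module, where a connection used as a context manager commits on success and rolls back if an exception escapes. The `accounts` table and the `transfer` function are assumptions made for this sketch.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` between accounts; either both updates apply or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # The credit to bob succeeds, then the debit violates the CHECK constraint.
    transfer(conn, "alice", "bob", 150)
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back, including bob's credit

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]
```

The failed transfer leaves no partial state behind, which is what prevents the error from turning into an inconsistent balance later.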
