The Spectrum of Failure Models
Failures are inevitable in distributed systems and can appear in various ways. They might be transient, or they might persist for a long period.
Failure models give us a framework to reason about the impact of failures and the possible ways to deal with them.
The failure models below form a spectrum. The difficulty of dealing with a failure increases as we move from fail-stop, through crash, omission, and temporal failures, to Byzantine failures.
Fail-stop
In this type of failure, a node in the distributed system halts permanently. However, the other nodes can still detect that the node has failed by trying to communicate with it.
From the perspective of someone who builds distributed systems, fail-stop failures are the simplest and most convenient to handle.
Crash
In this type of failure, a node in the distributed system halts silently, and the other nodes can’t detect that the node has stopped working.
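To make the difference between fail-stop and crash failures concrete, here is a minimal sketch of a heartbeat-based detector. It is not part of the original text: the class name `HeartbeatDetector`, the `timeout` parameter, and the node names are all hypothetical. Note that the detector only *suspects* a silent node. Under the crash model, silence alone can't distinguish a halted node from a slow one.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Hypothetical detector: suspects nodes that have been silent too long."""
    timeout: float  # assumed maximum gap (seconds) between heartbeats
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember the most recent time we heard from this node.
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        # A node is suspected, not proven failed, once it exceeds the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=3.0)
detector.record_heartbeat("node-a", now=0.0)
detector.record_heartbeat("node-b", now=0.0)
detector.record_heartbeat("node-b", now=2.0)  # node-a has gone silent
suspects = detector.suspected(now=4.0)
```

In this run only `node-a` ends up in `suspects`, because it has been silent for longer than the assumed timeout.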
Omission failures
In omission failures, a node fails to send or receive messages. There are two types of omission failures. If the node fails to send a response to an incoming request, it's said to be a send omission failure. If the node fails to receive the request and thus can't acknowledge it, it's said to be a receive omission failure.
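The two omission types can be illustrated with a small simulation. This is only a sketch under assumed names (`Node`, `receive_omission`, `send_omission`, and `handle` are hypothetical). It shows that the caller sees the same thing in both cases, a missing reply, while the state of the node differs.

```python
from typing import Optional


class Node:
    """Hypothetical node that can be configured to drop messages."""

    def __init__(self, receive_omission: bool = False, send_omission: bool = False):
        self.receive_omission = receive_omission
        self.send_omission = send_omission
        self.processed = []  # requests the node actually handled

    def handle(self, request: str) -> Optional[str]:
        if self.receive_omission:
            return None  # the request never reaches the node
        self.processed.append(request)
        if self.send_omission:
            return None  # the request was handled, but the reply is lost
        return f"ack:{request}"


healthy = Node()
deaf = Node(receive_omission=True)
mute = Node(send_omission=True)

replies = [n.handle("put x") for n in (healthy, deaf, mute)]
```

Both faulty nodes return no reply, but `mute.processed` contains the request while `deaf.processed` is empty. A caller that simply retries therefore has to account for the request possibly having been applied already.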
Temporal failures
In temporal failures, the node generates correct results, but delivers them too late to be useful. This failure could be due to bad algorithms, a bad design strategy, or a loss of synchronization between the processor clocks.
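A temporal failure can be expressed as a deadline check on an otherwise correct result. The sketch below is illustrative only: the `Response` type, the `classify` function, and the deadline values are assumptions, not something defined in the text.

```python
from dataclasses import dataclass


@dataclass
class Response:
    value: int         # result returned by the node
    elapsed_ms: float  # time the node took to deliver it


def classify(response: Response, expected: int, deadline_ms: float) -> str:
    """Label a response, treating a correct but late answer as a temporal failure."""
    if response.value != expected:
        return "wrong result"
    if response.elapsed_ms > deadline_ms:
        return "temporal failure"
    return "ok"


on_time = classify(Response(value=42, elapsed_ms=80.0), expected=42, deadline_ms=100.0)
late = classify(Response(value=42, elapsed_ms=250.0), expected=42, deadline_ms=100.0)
```

Here `late` carries the correct value, yet it is still classified as a failure because it missed the assumed 100 ms deadline.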
Byzantine failures
In Byzantine failures, the node exhibits arbitrary behavior, such as transmitting arbitrary messages at arbitrary times, producing wrong results, or stopping midway. This mostly happens due to an attack by a malicious entity or a software bug. A Byzantine failure is the most challenging type of failure to deal with.
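One common way to tolerate a node that returns wrong results is to ask several replicas and accept only a value that enough of them agree on. The text does not describe this technique, so treat the following as a sketch under stated assumptions: at most `f` replicas are faulty, correct replicas give identical answers to the same request, and each reply is known to come from a distinct replica. The function name `vote` is hypothetical.

```python
from collections import Counter
from typing import Optional


def vote(replies: list, f: int) -> Optional[str]:
    """Return a value reported by at least f + 1 replicas, or None.

    With at most f faulty replicas, f + 1 matching replies must include
    at least one reply from a correct replica.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None


accepted = vote(["42", "42", "42", "17"], f=1)  # one replica lies
undecided = vote(["42", "17"], f=1)             # not enough agreement
```

In the first call three replicas agree, so the single arbitrary reply is outvoted. In the second call no value reaches `f + 1` matching replies, so the caller can't safely accept either one.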