Fault Tolerance

What is fault tolerance?

Real-world, large-scale applications run hundreds of servers and databases to serve billions of user requests and store large volumes of data. These applications need a mechanism that keeps data safe and avoids recomputing computationally intensive tasks by eliminating any single point of failure (SPOF).

Fault tolerance refers to a system's ability to keep operating even if one or more of its components fail. Here, components can be software or hardware. Designing a system that is 100 percent fault-tolerant is very difficult in practice.

Let's discuss two important properties that make fault tolerance a necessity.

Availability means the system is accessible 24/7, so that it can receive every client's request.

Reliability means the system responds to every client's request by taking the specified action.

Fault tolerance techniques

Failures occur at the hardware or software level and eventually affect the data. Fault tolerance can be achieved through many approaches, depending on the structure of the system. Let's discuss the techniques that are significant and suitable for most designs.

Replication

One of the most widely used techniques is replication-based fault tolerance. With this technique, we can replicate both services and data. We can swap out failed nodes for healthy ones and a failed data store for its replica. A large service can make this switch (failover) transparently, without impacting end customers.

We create multiple copies of our data in separate storage. Whenever the data changes, every copy must be updated to stay consistent, and keeping replicas up to date is challenging. When a system needs strong consistency, we can update the replicas synchronously. However, this reduces the availability of the system, because a write cannot complete until every replica has applied it. When we can tolerate eventual consistency, we can update the replicas asynchronously instead, which results in stale reads until all replicas converge. Thus, there is a trade-off between the two approaches: when failures such as network partitions occur, we compromise either on availability or on consistency, a reality that is outlined in the CAP theorem.
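The difference between the two update strategies can be illustrated with a minimal in-memory sketch. The `Replica` and `ReplicatedStore` classes and their methods are hypothetical names invented for this example; a real system would replicate over a network and would also have to handle replica failures, which this sketch leaves out.

```python
class Replica:
    """An in-memory key-value store standing in for one copy of the data."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class ReplicatedStore:
    """Hypothetical primary with replicas, updated synchronously or asynchronously."""

    def __init__(self, replica_count, synchronous):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(replica_count)]
        self.synchronous = synchronous
        self.pending = []  # updates not yet applied to the replicas (async mode)

    def write(self, key, value):
        self.primary.apply(key, value)
        if self.synchronous:
            # Strong consistency: the write returns only after every replica has it.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Eventual consistency: acknowledge now, propagate later.
            self.pending.append((key, value))

    def read_from_replica(self, index, key):
        return self.replicas[index].data.get(key)

    def propagate(self):
        """Apply queued updates in order so that all replicas converge."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.apply(key, value)
        self.pending.clear()


sync_store = ReplicatedStore(replica_count=2, synchronous=True)
sync_store.write("x", 1)
sync_read = sync_store.read_from_replica(0, "x")       # replica is already up to date

async_store = ReplicatedStore(replica_count=2, synchronous=False)
async_store.write("x", 1)
stale_read = async_store.read_from_replica(0, "x")     # replica has not seen the write yet
async_store.propagate()
converged_read = async_store.read_from_replica(0, "x")
```

In the asynchronous store, the read issued before `propagate()` misses the new value, which is the stale read described above. The synchronous store never shows this gap, but each write has to wait for all replicas.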

Checkpointing

Checkpointing is a technique that saves the system's state in stable storage so that it can be retrieved after a failure caused by errors or service disruptions. Checkpoints are taken in stages, at different time intervals. When a distributed system fails, we can load the most recently computed data from the previous checkpoint and resume from there instead of recomputing everything from scratch.
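As a rough single-process illustration of this idea, the sketch below periodically saves the progress of a long computation to a file and resumes from the last saved state after a simulated crash. The function names, the JSON file format, and the `fail_at` parameter are assumptions made for this example only; a distributed system would additionally need to coordinate checkpoints across processes.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def sum_of_squares(numbers, path, checkpoint_every=3, fail_at=None):
    """Sum the squares of `numbers`, checkpointing every few items."""
    state = load_checkpoint(path)
    for index in range(state["next_index"], len(numbers)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += numbers[index] ** 2
        state["next_index"] = index + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


numbers = list(range(10))
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    sum_of_squares(numbers, checkpoint_path, fail_at=7)   # crashes partway through
except RuntimeError:
    pass

resumed_from = load_checkpoint(checkpoint_path)["next_index"]
result = sum_of_squares(numbers, checkpoint_path)         # resumes from the checkpoint
```

The second call does not start from index 0: it picks up at the index stored in the last checkpoint, so only the work done after that checkpoint is repeated.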

Each process in a system is checkpointed individually, in such a way that the checkpoints together represent a global state of the system's actual execution. Depending on that state, we can divide checkpointing into two types:

Consistent state: A state is consistent when all the individual processes of a system have the same view of the shared state or of the sequence of events that have occurred in the system. Snapshots taken in a consistent state hold coherent data and represent a possible situation of the system. For a checkpoint to be consistent, the following criteria are typically met:
All updates to data that were completed before the checkpoint are saved. Any updates that were still in progress are rolled back as if they had never started.
Checkpoints include all the messages that have been sent or received up until the checkpoint. No messages are in transit (in-flight), which avoids cases of missing messages.
Relationships and dependencies between system components and their states match what would be expected during normal operation.
Inconsistent state: This is a state where there are discrepancies between the saved states of different processes of a system. In other words, the checkpoints across different processes are not coherent and coordinated.

Let's look at an example to better understand consistent and inconsistent states. Consider three processes, \(i\), \(j\), and \(k\). Two messages, \(m_1\) and \(m_2\), are exchanged between the processes. In addition, one snapshot/checkpoint is saved for each process, represented by \(C_{1,i}\), \(C_{1,j}\), and \(C_{1,k}\), where 1 is the sequence number of the snapshot for that process and the lowercase letter identifies the process itself.

Figure: Checkpointing in a consistent state (left) and an inconsistent state (right).

In the illustration on the left, the first checkpoints at processes \(j\) and \(i\) are consistent because \(m_1\) is sent and received after the checkpoints. In contrast, in the right-hand illustration, the first checkpoint at process \(j\) has no record of \(m_1\), while the first checkpoint at process \(i\) has recorded the reception of \(m_1\). Therefore, it's an inconsistent state.

The left-hand illustration also represents a consistent state because the processes are not communicating when the system takes its checkpoints. On the right side, the processes are exchanging messages while the system takes its checkpoints.
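The check described in this example can be sketched in a few lines. The sketch assumes that each process numbers its local events and that a checkpoint records every event up to a given local number. The event numbers, the direction of each message, and the names `Message`, `orphan_messages`, `in_flight_messages`, and `is_consistent` are all made up for illustration and do not reproduce the figure exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str
    sender: str
    send_time: int      # sender's local event number when the message is sent
    receiver: str
    receive_time: int   # receiver's local event number when it is received


def orphan_messages(checkpoints, messages):
    """Messages that a checkpoint records as received but none records as sent."""
    return [
        m.name for m in messages
        if m.receive_time <= checkpoints[m.receiver]
        and m.send_time > checkpoints[m.sender]
    ]


def in_flight_messages(checkpoints, messages):
    """Messages recorded as sent but not yet recorded as received."""
    return [
        m.name for m in messages
        if m.send_time <= checkpoints[m.sender]
        and m.receive_time > checkpoints[m.receiver]
    ]


def is_consistent(checkpoints, messages):
    """Apply the criteria above: no orphan messages and no in-flight messages."""
    return (not orphan_messages(checkpoints, messages)
            and not in_flight_messages(checkpoints, messages))


# Assumed scenario: m_1 goes from j to i, m_2 goes from j to k.
messages = [
    Message("m1", sender="j", send_time=3, receiver="i", receive_time=4),
    Message("m2", sender="j", send_time=5, receiver="k", receive_time=6),
]

# Left-hand case: every checkpoint is taken before m_1 is sent or received.
left = {"i": 2, "j": 2, "k": 2}
# Right-hand case: i checkpoints after receiving m_1, j before sending it.
right = {"i": 5, "j": 2, "k": 2}

left_consistent = is_consistent(left, messages)
right_consistent = is_consistent(right, messages)
right_orphans = orphan_messages(right, messages)
```

With these assumed numbers, the right-hand set of checkpoints is rejected because `m1` appears as received at \(i\) without ever having been sent according to \(j\)'s checkpoint.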

In summary, checkpointing is an important technique in fault-tolerant systems, but the checkpoints must form a consistent global state. Otherwise, recovery may fail or restore mismatched data.