Reliability

What is reliability?

Reliability is the probability that the service will perform its functions for a specified time. It measures how the service performs under varying operating conditions.

We often use mean time between failures (MTBF) and mean time to repair (MTTR) as metrics to measure reliability.

\[ \text{MTBF} = \frac{\text{Total Elapsed Time} - \text{Sum of Downtime}}{\text{Total Number of Failures}} \]

\[ \text{MTTR} = \frac{\text{Total Maintenance Time}}{\text{Total Number of Repairs}} \]

(We strive for a higher MTBF value and a lower MTTR value.)
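As a rough illustration of the two formulas, the sketch below computes MTBF and MTTR from a small incident log. The `Incident` record, the function names, and the sample numbers are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One failure: hours the service was down and hours spent on the repair."""
    downtime_hours: float
    repair_hours: float


def mtbf(total_elapsed_hours: float, incidents: list[Incident]) -> float:
    """(Total elapsed time - sum of downtime) / total number of failures."""
    if not incidents:
        raise ValueError("MTBF is undefined when no failures were observed")
    uptime = total_elapsed_hours - sum(i.downtime_hours for i in incidents)
    return uptime / len(incidents)


def mttr(incidents: list[Incident]) -> float:
    """Total maintenance time / total number of repairs."""
    if not incidents:
        raise ValueError("MTTR is undefined when no repairs were performed")
    return sum(i.repair_hours for i in incidents) / len(incidents)


# Hypothetical 30-day window (720 hours) with three failures.
incidents = [Incident(2.0, 1.5), Incident(1.0, 1.0), Incident(3.0, 2.0)]
print(mtbf(720.0, incidents))  # (720 - 6) / 3 = 238.0
print(mttr(incidents))         # 4.5 / 3 = 1.5
```

In this made-up window the service fails on average once every 238 hours of operation, and each repair takes 1.5 hours on average.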

Reliability and availability are often confused, but they measure different aspects of system performance. While reliability focuses on how consistently a service operates without failure, availability considers how often it is accessible when needed. Understanding both is crucial because even a highly reliable system can have low availability if downtime or repairs take too long. Let’s explore how these two concepts are interconnected:

Reliability and availability

Reliability and availability are two important metrics for measuring a service's compliance with agreed-upon service level objectives (SLOs).

The measurement of availability is driven by time loss, whereas the frequency and impact of failures drive the measure of reliability. Availability and reliability are essential because they enable the stakeholders to assess the health of the service.

Reliability (R) and availability (A) are two distinct concepts, but they are related. Mathematically, A is a function of R: the value of R can change independently, while the value of A depends on R and on how long the service stays down after each failure. As a result, all four of the following combinations are possible:

low A, low R
low A, high R
high A, low R
high A, high R (desirable)
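One common way to make this dependence concrete is the steady-state formula A = MTBF / (MTBF + MTTR). The text here does not state that formula, so treat it as an assumption of this sketch, along with the use of MTBF as a stand-in for "high" or "low" reliability. The (MTBF, MTTR) pairs are hypothetical values picked to land in each of the four combinations.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical (MTBF, MTTR) pairs in hours, one per combination listed above.
scenarios = {
    "low A, low R": (10.0, 10.0),       # fails often, slow to repair
    "low A, high R": (1000.0, 1000.0),  # fails rarely, but each repair takes very long
    "high A, low R": (10.0, 0.01),      # fails often, but recovers in seconds
    "high A, high R": (1000.0, 0.1),    # fails rarely and recovers quickly
}

for label, (mtbf_hours, mttr_hours) in scenarios.items():
    print(f"{label}: A = {availability(mtbf_hours, mttr_hours):.4f}")
```

Under these assumptions, a service that fails every 10 hours can still be highly available if it recovers within seconds, while a service that fails only once every 1000 hours has poor availability if each repair takes just as long.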

Figure: Availability as a function of reliability

Note: There are many variations of the MTBF metric, such as mean time to failure (MTTF). Usually, we use MTTF instead of MTBF for those cases where a failed component is replaced due to irreparable problems. A bad disk or a failed bulb are examples of irreparable faults where a replacement is required.
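For non-repairable components, MTTF is typically estimated as the average operating time before failure across a set of units. The sketch below assumes every unit in the sample was observed until it failed; the function name and the disk lifetimes are hypothetical.

```python
def mttf(lifetimes_hours: list[float]) -> float:
    """Average operating time until failure across non-repairable units."""
    if not lifetimes_hours:
        raise ValueError("MTTF is undefined without at least one observed failure")
    return sum(lifetimes_hours) / len(lifetimes_hours)


# Hypothetical hours of operation before each of four disks failed and was replaced.
disk_lifetimes = [26_000.0, 31_500.0, 28_200.0, 34_300.0]
print(mttf(disk_lifetimes))  # 120000 / 4 = 30000.0
```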

Point to ponder.

What is the difference between reliability and availability?

Reliability measures how well a system performs its intended operations (functional requirements). We usually quantify it with averages such as mean time to failure (MTTF) and mean time to repair (MTTR).

Availability measures the percentage of time a system accepts requests and responds to clients.

Example 1: A system may be available 90% of the time but reliable only 80% of the time.

Example 2: Suppose we define our “system” as everything inside a data center (hardware and software). Assume this data center suffers a network failure such that no outside traffic can come in and no inside traffic can go out. In this case, instantaneous availability might be zero (because clients cannot reach the service), even though all systems inside the data center are functioning perfectly (instantaneous reliability of 100%).

We use both of them (reliability and availability) in different contexts. For example, storage vendors often quote MTTF for their disks. Most online services use uptime (as a measure of availability) in their SLAs. For example, the uptime of EC2 virtual machines is 99.95%.
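An uptime percentage is easier to reason about once it is converted into a downtime budget. The following sketch does that conversion for the 99.95% figure quoted above; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
def allowed_downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Downtime budget implied by an uptime percentage over a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100.0 - uptime_percent) / 100.0


HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

budget = allowed_downtime_hours(99.95, HOURS_PER_YEAR)
print(f"{budget:.2f} hours per year")  # 8760 * 0.0005 = 4.38
```

In other words, 99.95% uptime still leaves room for roughly four and a half hours of downtime over a year.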
