Skip to content

Chapter 3 Back-of-the-envelope Estimation

In a system design interview, sometimes you are asked to estimate system capacity or performance requirements using a back-of-the-envelope estimation. According to Jeff Dean, Google Senior Fellow, “back-of-the-envelope calculations are estimates you create using a combination of thought experiments and common performance numbers to get a good feel for which designs will meet your requirements” [1].

You need to have a good sense of scalability basics to effectively carry out back-of-the-envelope estimation. The following concepts should be well understood: power of two [2], latency numbers every programmer should know, and availability numbers.

在系统设计面试中,有时会被要求使用“背包估算”来估算系统容量或性能需求。正如Google的高级研究员Jeff Dean所言,“背包估算是一种通过思维实验和常见的性能数值,估算设计是否满足需求的过程”[1]。

为了有效进行背包估算,需要对扩展性基础知识有一个清晰的理解。以下概念需理解透彻:2的幂次方、每位程序员应了解的延迟数值以及可用性数值。

3.1 Power of two

Although data volume can become enormous when dealing with distributed systems, calculation all boils down to the basics. To obtain correct calculations, it is critical to know the data volume unit using the power of 2. A byte is a sequence of 8 bits. An ASCII character uses one byte of memory (8 bits). Below is a table explaining the data volume unit (Table 1).

Power Approximate value Full name Short name
10 1 Thousand 1 Kilobyte 1 KB
20 1 Million 1 Megabyte 1 MB
30 1 Billion 1 Gigabyte 1 GB
40 1 Trillion 1 Terabyte 1 TB
50 1 Quadrillion 1 Petabyte 1 PB

尽管在分布式系统中数据量可能会非常庞大,但计算归根结底还是基础知识。为了获得准确的计算,了解基于2的幂的数据量单位是至关重要的。字节(byte)是8位(bit)的序列,一个ASCII字符使用一个字节的内存。以下是数据量单位表(表1)。

3.2 Latency numbers every programmer should know

Dr. Dean from Google reveals the length of typical computer operations in 2010 [1]. Some numbers are outdated as computers become faster and more powerful. However, those numbers should still be able to give us an idea of the fastness and slowness of different computer operations.

Operation name Time
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 100 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 10,000 ns = 10 µs
Send 2K bytes over 1 Gbps network 20,000 ns = 20 µs
Read 1 MB sequentially from memory 250,000 ns = 250 µs
Round trip within the same datacenter 500,000 ns = 500 µs
Disk seek 10,000,000 ns = 10 ms
Read 1 MB sequentially from the network 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk 30,000,000 ns = 30 ms
Send packet CA (California) ->Netherlands->CA 150,000,000 ns = 150 ms

Notes

ns = nanosecond, µs = microsecond, ms = millisecond

1 ns = 10^-9 seconds

1 µs= 10^-6 seconds = 1,000 ns

1 ms = 10^-3 seconds = 1,000 µs = 1,000,000 ns

A Google software engineer built a tool to visualize Dr. Dean’s numbers. The tool also takes the time factor into consideration. Figures 2-1 shows the visualized latency numbers as of 2020 (source of figures: reference material [3]).

Figure 1

By analyzing the numbers in Figure 1, we get the following conclusions:

  • Memory is fast but the disk is slow.
  • Avoid disk seeks if possible.
  • Simple compression algorithms are fast.
  • Compress data before sending it over the internet if possible.
  • Data centers are usually in different regions, and it takes time to send data between them.

Google的Dean博士在2010年揭示了典型计算机操作的时间长度[1]。尽管随着计算机变得更快,有些数值已经过时,但这些数值依然可以帮助我们理解不同操作的快慢。

说明

ns = 纳秒,µs = 微秒,ms = 毫秒

1 ns = 10^-9 秒

1 µs = 10^-6 秒 = 1,000 ns

1 ms = 10^-3 秒 = 1,000 µs = 1,000,000 ns

Google的一位软件工程师构建了一个工具来可视化Dean博士的延迟数值,并且该工具考虑了时间因素。图2-1展示了截至2020年的延迟数值可视化(图的来源见参考资料[3])。

通过分析图1中的数值,我们得到以下结论:

  • 内存速度快,但磁盘速度慢。
  • 尽量避免磁盘寻道。
  • 简单的压缩算法速度很快。
  • 尽可能在将数据发送到互联网上之前对其进行压缩。
  • 数据中心通常位于不同地区,区域之间传输数据需要时间。

3.3 Availability numbers

High availability is the ability of a system to be continuously operational for a desirably long period of time. High availability is measured as a percentage, with 100% means a service that has 0 downtime. Most services fall between 99% and 100%.

A service level agreement (SLA) is a commonly used term for service providers. This is an agreement between you (the service provider) and your customer, and this agreement formally defines the level of uptime your service will deliver. Cloud providers Amazon [4], Google [5] and Microsoft [6] set their SLAs at 99.9% or above. Uptime is traditionally measured in nines. The more the nines, the better. As shown in Table 3, the number of nines correlate to the expected system downtime.

Availability % Downtime per day Downtime per week Downtime per month Downtime per year
99% 14.40 minutes 1.68 hours 7.31 hours 3.65 days
99.99% 8.64 seconds 1.01 minutes 4.38 minutes 52.60 minutes
99.999% 864.00 6.05 seconds 26.30 seconds 5.26 minutes
99.9999% 86.40 milliseconds 604.80 2.63 seconds 31.56 seconds

高可用性是指系统能够连续运行较长时间的能力。高可用性以百分比衡量,100%代表没有停机时间。大多数服务的可用性介于99%到100%之间。

服务等级协议(SLA)是服务提供商常用的术语,正式定义了服务的正常运行时间水平。亚马逊、谷歌和微软的SLA均设定在99.9%或更高。可用性通常以“几个九”来衡量,越多越好。表3展示了可用性与预期系统停机时间的关系。

3.4 Example: Estimate Twitter QPS and storage requirements

Please note the following numbers are for this exercise only as they are not real numbers from Twitter.

Assumptions:

  • 300 million monthly active users.
  • 50% of users use Twitter daily.
  • Users post 2 tweets per day on average.
  • 10% of tweets contain media.
  • Data is stored for 5 years.

Estimations:

Query per second (QPS) estimate:

  • Daily active users (DAU) = 300 million * 50% = 150 million
  • Tweets QPS = 150 million * 2 tweets / 24 hour / 3600 seconds = ~3500
  • Peek QPS = 2 * QPS = ~7000

We will only estimate media storage here.

  • Average tweet size:
  • tweet_id 64 bytes
  • text 140 bytes
  • media 1 MB
  • Media storage: 150 million * 2 * 10% * 1 MB = 30 TB per day
  • 5-year media storage: 30 TB * 365 * 5 = ~55 PB

假设:

  • 每月活跃用户300百万。
  • 50%的用户每天使用Twitter。
  • 用户每天平均发2条推文。
  • 10%的推文包含媒体。
  • 数据存储5年。

估算:

QPS估算:

  • 每日活跃用户(DAU)= 300百万 * 50% = 150百万
  • 推文QPS = 150百万 * 2推文 / 24小时 / 3600秒 ≈ 3500
  • 峰值QPS = 2 * QPS ≈ 7000

媒体存储:

  • 平均推文大小:
  • tweet_id 64字节
  • 文本140字节
  • 媒体1 MB
  • 媒体存储:150百万 * 2 * 10% * 1 MB = 30 TB每天
  • 5年媒体存储:30 TB * 365 * 5 ≈ 55 PB

3.5 Tips

Back-of-the-envelope estimation is all about the process. Solving the problem is more important than obtaining results. Interviewers may test your problem-solving skills. Here are a few tips to follow:

  • Rounding and Approximation. It is difficult to perform complicated math operations during the interview. For example, what is the result of “99987 / 9.1”? There is no need to spend valuable time to solve complicated math problems. Precision is not expected. Use round numbers and approximation to your advantage. The division question can be simplified as follows: “100,000 / 10”.
  • Write down your assumptions. It is a good idea to write down your assumptions to be referenced later.
  • Label your units. When you write down “5”, does it mean 5 KB or 5 MB? You might confuse yourself with this. Write down the units because “5 MB” helps to remove ambiguity.
  • Commonly asked back-of-the-envelope estimations: QPS, peak QPS, storage, cache, number of servers, etc. You can practice these calculations when preparing for an interview. Practice makes perfect.

Congratulations on getting this far! Now give yourself a pat on the back. Good job!

背包估算的关键在于过程。解决问题比获得结果更重要。面试官可能会考察您的问题解决能力。以下是一些提示:

  • 舍入和近似:在面试中进行复杂数学计算并不容易。使用舍入和近似来简化计算。
  • 写下假设:记录假设,便于参考。
  • 标记单位:写明单位,以消除歧义。
  • 常见估算项:QPS、峰值QPS、存储、缓存、服务器数量等。练习这些计算有助于面试准备。

恭喜您走到这里!现在给自己一个赞,干得好!

Reference materials

[1] J. Dean.Google Pro Tip: Use Back-Of-The-Envelope-Calculations To Choose The Best Design: http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html

[2] System design primer: https://github.com/donnemartin/system-design-primer

[3] Latency Numbers Every Programmer Should Know: https://colin-scott.github.io/personal_website/research/interactive_latency.html

[4] Amazon Compute Service Level Agreement: https://aws.amazon.com/compute/sla/

[5] Compute Engine Service Level Agreement (SLA): https://cloud.google.com/compute/sla

[6] SLA summary for Azure services: https://azure.microsoft.com/en-us/support/legal/sla/summary/