
Key Concepts to Prepare for the System Design Interview

In a System Design Interview, interviewers ask the candidate to design a web-scale application. For example, they might ask the candidate to design the backend of a platform like Instagram, YouTube, or Uber.

Unlike coding interview questions, System Design Interviews are free-form discussions with no right or wrong answers. Instead, the interviewer evaluates the candidate's ability to discuss the different aspects of the system and to assess the solution against requirements that might evolve during the conversation.

The best way to imagine the conversation is that we and our colleagues have been asked to design a large-scale system and are hashing out the details at a whiteboard, making sure we understand the requirements, scope, and constraints before proposing a solution.

So, how do we design a system in an interview if we have never built one in real life? To crack the System Design interview, we’ll need to prepare in four areas:

  1. Fundamental concepts in the System Design interview
  2. Fundamentals of distributed systems
  3. The architecture of large-scale web applications
  4. The design of large-scale distributed systems

Each of these dimensions flows into the next.

Why is it important to prepare strategically?

How we prepare for an interview at Amazon will probably differ from how we’d prepare for one at Slack. While the overall interview process shares similarities across various companies, there are also distinct differences that we must prepare for. This is one of the reasons why preparing strategically is so important. We’ll feel more confident in the long run if we’re intentional and thorough when creating an interview prep plan.

If we don’t know the fundamentals, we won’t be prepared to architect a service; if we don’t know how to put those systems together, we won’t be able to design a specific solution; once we’ve designed large-scale systems, we can apply the lessons learned to enhance our base knowledge.

Let’s look at each of these dimensions.

Fundamental concepts in the System Design interview

In this lesson, we’ll explore some concepts that are important for the System Design interview.

PACELC theorem

The CAP theorem doesn’t answer the question: “What choices does a distributed system have when there are no network partitions?”. The PACELC theorem answers this question.

The PACELC theorem states the following about a system that replicates data:

  • if statement: When there's a partition, a distributed system must trade off between availability and consistency.
  • else statement: When the system runs normally, without partitions, it must trade off between latency and consistency.

The first three letters of the theorem, PAC, are the same as the CAP theorem. The ELC is the extension here. The theorem assumes we maintain high availability by replication. When there’s a failure, the CAP theorem prevails. If there isn’t a failure, we still have to consider the tradeoff between consistency and latency of a replicated system.

Examples of a PC/EC system include BigTable and HBase. They’ll always choose consistency, giving up availability and lower latency. Examples of a PA/EL system include Dynamo and Cassandra. They choose availability over consistency when a partition occurs. Otherwise, they choose lower latency. An example of a PA/EC system is MongoDB. In the case of a partition, it chooses availability but otherwise guarantees consistency.
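
These classifications can be captured in a small lookup table. This is only an illustration of the PA/EL-style shorthand, not a definitive taxonomy; real systems are often configurable along these axes.

```python
# A toy lookup of how some well-known datastores fall under PACELC.
# The classifications mirror the examples in the text.
PACELC = {
    # name: (choice under Partition, choice Else, i.e. normal operation)
    "BigTable":  ("Consistency", "Consistency"),   # PC/EC
    "HBase":     ("Consistency", "Consistency"),   # PC/EC
    "Dynamo":    ("Availability", "Latency"),      # PA/EL
    "Cassandra": ("Availability", "Latency"),      # PA/EL
    "MongoDB":   ("Availability", "Consistency"),  # PA/EC
}

def classify(system):
    partition_choice, else_choice = PACELC[system]
    return f"P{partition_choice[0]}/E{else_choice[0]}"

print(classify("Cassandra"))  # PA/EL
print(classify("HBase"))      # PC/EC
```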

Heartbeat

A heartbeat message is a mechanism that helps us detect failures in a distributed system. If there's a central server, all servers periodically send it a heartbeat message to show that they're still alive and functioning. If there's no central server, each server randomly selects a set of servers and sends that set a heartbeat message every few seconds. This way, if no heartbeat messages are received for a while, the system can suspect a failure or a crash.
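
A minimal sketch of the idea in Python, assuming a central monitor and a made-up five-second timeout: each server records a heartbeat, and any server that has gone quiet for longer than the timeout becomes a suspect.

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat from each server and flags suspects."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout      # seconds of silence before we suspect failure
        self.last_seen = {}         # server_id -> timestamp of last heartbeat

    def record_heartbeat(self, server_id, now=None):
        self.last_seen[server_id] = time.monotonic() if now is None else now

    def suspected_failures(self, now=None):
        now = time.monotonic() if now is None else now
        return [s for s, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout=5.0)
monitor.record_heartbeat("server-a", now=0.0)
monitor.record_heartbeat("server-b", now=0.0)
monitor.record_heartbeat("server-a", now=4.0)   # server-b goes quiet
print(monitor.suspected_failures(now=7.0))      # ['server-b']
```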

AJAX polling

Polling is a standard technique used by most AJAX apps. The idea is that the client repeatedly polls the server for data: it makes a request, waits for the server to respond, and if no data is available, the server returns an empty response and the client tries again after a short delay. This is simple to implement, but when updates are infrequent, the repeated requests waste bandwidth and server resources.
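
A rough sketch of that polling loop, with the HTTP request replaced by a stand-in function (`fetch_updates` is invented here) that returns an empty response until data is ready:

```python
import itertools
import time

def fetch_updates(poll_count, _data_ready_at=3):
    """Stand-in for an AJAX request; a real client would hit an HTTP
    endpoint. Returns None (an "empty response") until data is ready."""
    return "new message" if poll_count >= _data_ready_at else None

def poll(interval=0.0, max_polls=10):
    """Classic polling loop: request, wait, repeat."""
    for attempt in itertools.count(1):
        data = fetch_updates(attempt)
        if data is not None:
            return attempt, data        # got something on this attempt
        if attempt >= max_polls:
            return attempt, None        # give up after max_polls tries
        time.sleep(interval)            # wait before polling again

print(poll())  # (3, 'new message')
```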

HTTP long-polling

With long-polling, the client requests information from the server, but the server may not respond immediately. This technique is sometimes called a hanging GET. If the server doesn't have any data available for the client, it holds the request open until data becomes available, instead of sending an empty response. Once the data is available, a full response is sent to the client. The client then immediately re-requests information, so the server almost always has a waiting request it can use to deliver data in response to an event. Compared to regular polling, this cuts down on empty responses, though the held-open connections still consume server resources under high concurrency.
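
The server side of long-polling can be sketched with a blocking queue: instead of answering immediately, the handler holds the request until data arrives or a timeout expires. The event names and timeouts below are made up for illustration.

```python
import queue
import threading
import time

events = queue.Queue()

def long_poll(timeout=30.0):
    """Server-side handler sketch: block until data arrives (or time out)
    instead of returning an empty response immediately."""
    try:
        return events.get(timeout=timeout)   # hold the "hanging GET" open
    except queue.Empty:
        return None                          # client will simply reconnect

def publish_later():
    time.sleep(0.1)
    events.put("order shipped")              # data becomes available

threading.Thread(target=publish_later).start()
print(long_poll(timeout=5.0))  # blocks ~0.1s, then prints: order shipped
```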

WebSockets

WebSocket provides full-duplex communication channels over a single TCP connection. It provides a persistent connection between a client and a server. Both parties can use this connection to start sending data at any time. The client establishes a connection through a WebSocket handshake. If the process succeeds, the server and client can begin exchanging data in both directions at any time.
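
The handshake mentioned above is specified in RFC 6455: the server proves it understood the upgrade request by hashing the client's `Sec-WebSocket-Key` with a fixed GUID. A small sketch of that step using only the standard library:

```python
import base64
import hashlib

# Magic GUID defined by RFC 6455 for the WebSocket opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept header value the server must
    return for a given Sec-WebSocket-Key sent by the client."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# Sample key from RFC 6455, section 1.3:
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))
# s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```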

Server-sent events (SSEs)

A client can establish a long-term connection with a server using SSEs. The server uses this connection to send data to the client. SSE is one-way, from server to client; if the client wants to send data to the server, it has to use another technology or protocol, such as a regular AJAX request or a WebSocket.
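
On the wire, an SSE stream is plain text in the `text/event-stream` format: optional `event:` and `id:` fields, one or more `data:` lines, and a blank line terminating each event. A small formatter as a sketch (the field values here are invented):

```python
def sse_event(data, event=None, event_id=None):
    """Serialize one server-sent event in the text/event-stream format:
    optional `event:` and `id:` fields, one `data:` line per line of
    payload, terminated by a blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines += [f"data: {line}" for line in data.splitlines()]
    return "\n".join(lines) + "\n\n"

print(sse_event("price=42", event="tick", event_id="7"), end="")
# event: tick
# id: 7
# data: price=42
```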

Fundamentals of distributed systems

As with anything else, it is important to start with the basics. The fundamentals of distributed systems give us a framework for what's possible and what's not in a given system.

We can understand the limitations of specific architectures and the trade-offs needed to achieve particular goals (e.g., consistency vs. write throughput). At the most basic level, we must start with the strengths, weaknesses, and purposes of distributed systems. We need to be able to discuss topics like:

Data durability and consistency

We must understand how the failure and corruption rates of storage solutions affect read and write processes, and what it takes to keep data intact and consistent over time.
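
One common durability technique is to store a checksum alongside each record so that silent corruption can be detected on read. A toy sketch (the record contents are made up):

```python
import hashlib

def store(record):
    """Write path: persist the record together with a checksum so that
    silent corruption can be detected later."""
    return {"data": record, "sha256": hashlib.sha256(record).hexdigest()}

def load(stored):
    """Read path: verify the checksum before trusting the bytes."""
    if hashlib.sha256(stored["data"]).hexdigest() != stored["sha256"]:
        raise IOError("corruption detected: checksum mismatch")
    return stored["data"]

good = store(b"user balance = 100")
assert load(good) == b"user balance = 100"

corrupted = dict(good, data=b"user balance = 999")  # a bit flip on disk
try:
    load(corrupted)
except IOError as e:
    print(e)  # corruption detected: checksum mismatch
```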

Replication

Replication is the key to unlocking data durability and consistency. It covers not only backing up data but also repeating processes at scale.
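
A toy sketch of synchronous primary-backup replication: every write is copied to each replica, so any node can serve the read and the data survives the loss of a node. The node layout here is invented for illustration.

```python
class ReplicatedStore:
    """Toy primary-backup replication: writes go to the primary and are
    copied synchronously to every replica."""

    def __init__(self, num_replicas=2):
        # One primary plus num_replicas backup nodes, each a plain dict.
        self.nodes = [{} for _ in range(1 + num_replicas)]

    def write(self, key, value):
        for node in self.nodes:          # synchronous replication
            node[key] = value

    def read(self, key, node_index=0):
        # Any node can answer the read, since all hold the same data.
        return self.nodes[node_index][key]

store = ReplicatedStore(num_replicas=2)
store.write("session:42", "alice")
print([store.read("session:42", i) for i in range(3)])
# ['alice', 'alice', 'alice']
```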

Partitioning

Also called sharding, partitioning divides data across the different nodes in our system. While replication distributes copies of the data across nodes, partitioning distributes the processing load across nodes, reducing the reliance on pure replication.
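
A minimal hash-partitioning sketch: a stable hash maps each key to a shard. Real systems often use consistent hashing instead, so that adding a shard doesn't reshuffle every key; the keys below are made up.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash. md5 is used here for an
    even spread, not for security."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = ["user:1", "user:2", "user:3", "user:4"]
print({k: shard_for(k, 3) for k in keys})
```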

Consensus

Suppose one of our nodes is in Seattle, another is in Beijing, and another is in London, and a system request arrives at 7:05 a.m. Pacific Daylight Time. Given the travel time of data packets, can this request be recorded and properly synchronized in the remote nodes, and can they all agree on it? This is the problem of consensus: all the nodes need to agree, which prevents faulty processes from running and ensures consistency and replication of data and processes across the system.
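
The agreement step can be reduced to a toy quorum check: a value is chosen only if a strict majority of nodes vote for it. Real protocols such as Paxos and Raft build leader election, terms, and replicated logs on top of this core idea; the node names and values below are invented.

```python
def majority_agrees(votes):
    """Return the value a strict majority of nodes voted for, or None
    if there is no quorum (in which case the system must retry, not guess)."""
    needed = len(votes) // 2 + 1
    tally = {}
    for value in votes.values():
        tally[value] = tally.get(value, 0) + 1
    for value, count in tally.items():
        if count >= needed:
            return value
    return None

votes = {"seattle": "07:05", "beijing": "07:05", "london": "07:06"}
print(majority_agrees(votes))  # 07:05
```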

Distributed transactions

Once we’ve achieved consensus, transactions from applications need to be committed across databases, with fault checks performed by each involved resource. Protocols such as two-phase and three-phase commit coordinate the read, write, and commit steps across the participant nodes.
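
A sketch of two-phase commit, one common protocol for this: the coordinator first asks every participant to prepare (vote), and commits only if all vote yes; otherwise it aborts everywhere. The participant names are made up.

```python
class Participant:
    """One resource (e.g. a database) taking part in a two-phase commit."""

    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.committed = name, healthy, False

    def prepare(self):          # phase 1: vote yes/no
        return self.healthy

    def commit(self):           # phase 2: make the transaction durable
        self.committed = True

    def abort(self):
        self.committed = False

def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

ok = two_phase_commit([Participant("orders"), Participant("inventory")])
bad = two_phase_commit([Participant("orders"), Participant("billing", healthy=False)])
print(ok, bad)  # True False
```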

The architecture of large-scale web applications

We already know that most large-scale applications are web applications. Beyond major consumer platforms like Netflix, Twitter, and Amazon, enterprises are moving away from on-premises systems like Exchange to cloud solutions from Microsoft, Google, and AWS. That’s why it’s good to understand the architecture of such systems.

We need to learn about topics such as:

N-tier applications

Processing happens at various levels in a distributed system. Some processes run on the client, some on the server, and others on yet another server, all within one application. These processing layers are called tiers, and understanding how the tiers interact with each other, and which processes each is responsible for, is part of System Design for the web.

HTTP and REST

HTTP is a foundational protocol on which the entire internet runs. It is the system through which we send every email, stream every Netflix movie, and browse every Amazon listing. REST is a set of design principles for building APIs on top of HTTP, enabling efficient, scalable systems whose components are isolated from each other’s assumptions. Following these principles with an open API makes it easier for others to build on our work or extend our capabilities into their own apps and services.
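
A toy illustration of REST's uniform interface: resources addressed by URL, HTTP methods as the verbs, with a simple dispatch table standing in for a real framework's router. All routes and handlers here are invented.

```python
# In-memory "resource" for the sketch:
users = {"1": {"name": "Ada"}}

def get_user(user_id):
    return (200, users[user_id]) if user_id in users else (404, None)

def delete_user(user_id):
    if user_id not in users:
        return (404, None)
    del users[user_id]
    return (204, None)

# The uniform interface: (HTTP method, resource path) -> handler.
ROUTES = {
    ("GET", "/users/<id>"): get_user,
    ("DELETE", "/users/<id>"): delete_user,
}

def dispatch(method, path):
    prefix, _, user_id = path.rpartition("/")
    handler = ROUTES.get((method, prefix + "/<id>"))
    return handler(user_id) if handler else (404, None)

print(dispatch("GET", "/users/1"))     # (200, {'name': 'Ada'})
print(dispatch("DELETE", "/users/1"))  # (204, None)
print(dispatch("GET", "/users/1"))     # (404, None)
```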

DNS and load balancing

If we have 99 simultaneous users, load balancing through DNS routing can ensure that servers A, B, and C each handle 33 clients, rather than server A being overloaded with 99 while servers B and C sit idle. Routing client requests to the right server, and to the right tier where processing happens, helps ensure system stability. We need to know how to do this.
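
The 99-client example can be sketched with simple round-robin assignment, which is one of the policies a DNS-based load balancer might apply:

```python
import itertools
from collections import Counter

servers = ["A", "B", "C"]
rotation = itertools.cycle(servers)  # round-robin over the server pool

# 99 simultaneous clients, each assigned to the next server in turn:
assignments = Counter(next(rotation) for _ in range(99))
print(dict(assignments))  # {'A': 33, 'B': 33, 'C': 33}
```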

Caching

A cache makes our most frequently requested data and applications accessible to most users at high speeds. The questions for our web application are what needs to be stored in the cache, how we direct traffic to the cache, and what happens when we don’t have what we want in the cache.
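
The three questions above map onto a tiny least-recently-used cache: a fallback to the origin on a miss, and eviction when the cache is full. The origin function and keys are invented for illustration.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny least-recently-used cache. On a miss we fall back to the
    origin (here, a made-up slow lookup) and remember the result,
    evicting the least recently used entry when full."""

    def __init__(self, capacity, origin):
        self.capacity, self.origin = capacity, origin
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)        # mark as recently used
            return self.entries[key]
        self.misses += 1
        value = self.origin(key)                 # cache miss: go to origin
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict the LRU entry
        return value

cache = LRUCache(capacity=2, origin=lambda k: f"page-{k}")
cache.get("home"); cache.get("about"); cache.get("home")
cache.get("pricing")                             # evicts "about"
print(cache.hits, cache.misses, list(cache.entries))
# 1 3 ['home', 'pricing']
```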

Stream processing

Stream processing applies uniform processing to a data stream. If an application has continuous, consistent data passing through it, such as logs, financial transactions, or real-time recommendation events, then stream processing allows efficient use of local resources within the application.
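
A stream pipeline can be sketched with chained generators, which process one record at a time so memory use stays constant however long the stream runs. The log format here is made up.

```python
def parse(lines):
    for line in lines:                     # stage 1: parse each record
        level, _, message = line.partition(" ")
        yield level, message

def only_errors(records):
    for level, message in records:         # stage 2: filter
        if level == "ERROR":
            yield message

log_stream = iter([
    "INFO user logged in",
    "ERROR payment declined",
    "ERROR disk full",
])

# Generators chain into a pipeline that applies the same processing to
# every record as it flows through.
print(list(only_errors(parse(log_stream))))
# ['payment declined', 'disk full']
```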

Design of large-scale distributed systems

This can seem like a lot, but it honestly takes only a few weeks of prep—less if we have a solid foundation to build on.

Once we know the basics of distributed systems and web architecture, it is time to apply this learning and design real-world systems. Finding and optimizing potential solutions to these problems will give us the tools to approach the System Design interview with confidence.

Once we are ready to practice our skills, we can take on sample problems from real-world interviews, along with tips and approaches for building ten different web services.

Summary

The world is more connected than ever, with almost all devices utilizing System Design and distributed systems.

Technical interviews, especially at big tech companies, are leaning more and more toward System Design interview questions. We should be well prepared to tackle any questions that come our way. Common System Design interview questions include designing a URL shortener or a web crawler, understanding the CAP theorem, discussing SQL and NoSQL databases, identifying use cases for various data models, addressing latency issues, constructing algorithms and data structures, and so on.

Consumers and businesses alike are online, and even legacy programs are migrating to the cloud. Distributed systems are the present and future of the software engineering discipline. As System Design Interview questions make up a bigger part of the developer interview, having a working knowledge of distributed systems will pay dividends in our career.