Chapter 2 Scale From Zero To Millions Of Users

Designing a system that supports millions of users is challenging; it is a journey that requires continuous refinement and endless improvement. In this chapter, we build a system that supports a single user and gradually scale it up to serve millions of users. After reading this chapter, you will master a handful of techniques that will help you crack system design interview questions.

2.1 Single server setup

A journey of a thousand miles begins with a single step, and building a complex system is no different. To start with something simple, everything is running on a single server. Figure 1 shows the illustration of a single server setup where everything is running on one server: web app, database, cache, etc.

[Figure 1: Single server setup]

To understand this setup, it is helpful to investigate the request flow and traffic source. Let us first look at the request flow (Figure 2).

[Figure 2: Request flow]

  1. Users access websites through domain names, such as api.mysite.com. Usually, the Domain Name System (DNS) is a paid service provided by 3rd parties and not hosted by our servers.
  2. Internet Protocol (IP) address is returned to the browser or mobile app. In the example, IP address 15.125.23.214 is returned.
  3. Once the IP address is obtained, Hypertext Transfer Protocol (HTTP) [1] requests are sent directly to your web server.
  4. The web server returns HTML pages or JSON response for rendering.
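
The four steps above can be sketched in code. The snippet below is a toy illustration (the DNS table and request builder are hypothetical stand-ins; no real network call is made):

```python
# Sketch of the request flow above. The DNS table is a toy stand-in for a
# third-party DNS provider; nothing here touches the real network.

def resolve(domain, dns_table):
    # Steps 1-2: a DNS lookup returns the server's IP address.
    return dns_table[domain]

def build_http_request(domain, path="/"):
    # Step 3: the browser or mobile app sends an HTTP request to the resolved IP.
    return f"GET {path} HTTP/1.1\r\nHost: {domain}\r\nAccept: application/json\r\n\r\n"

dns_table = {"api.mysite.com": "15.125.23.214"}
ip = resolve("api.mysite.com", dns_table)
request = build_http_request("api.mysite.com", "/users/12")
```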

Next, let us examine the traffic source. The traffic to your web server comes from two sources: web application and mobile application.

  • Web application: it uses a combination of server-side languages (Java, Python, etc.) to handle business logic, storage, etc., and client-side languages (HTML and JavaScript) for presentation.
  • Mobile application: HTTP protocol is the communication protocol between the mobile app and the web server. JavaScript Object Notation (JSON) is commonly used API response format to transfer data due to its simplicity. An example of the API response in JSON format is shown below:

GET /users/12 – Retrieve user object for id = 12:

{
   "id":12,
   "firstName":"John",
   "lastName":"Smith",
   "address":{
      "streetAddress":"21 2nd Street",
      "city":"New York",
      "state":"NY",
      "postalCode":10021
   },
   "phoneNumbers":[
      "212 555-1234",
      "646 555-4567"
   ]
}
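
A client would typically parse this response with a standard JSON library. A minimal Python sketch, using the sample payload above:

```python
import json

# The sample API response from GET /users/12, as received by a client.
payload = '''
{
   "id":12,
   "firstName":"John",
   "lastName":"Smith",
   "address":{
      "streetAddress":"21 2nd Street",
      "city":"New York",
      "state":"NY",
      "postalCode":10021
   },
   "phoneNumbers":[
      "212 555-1234",
      "646 555-4567"
   ]
}
'''

user = json.loads(payload)
full_name = f"{user['firstName']} {user['lastName']}"
```
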
2.2 Database

With the growth of the user base, one server is not enough, and we need multiple servers: one for web/mobile traffic, the other for the database (Figure 3). Separating web/mobile traffic (web tier) and database (data tier) servers allows them to be scaled independently.

[Figure 3: Separate web tier and data tier]

2.2.1 Which databases to use?

You can choose between a traditional relational database and a non-relational database. Let us examine their differences.

Relational databases are also called a relational database management system (RDBMS) or SQL database. The most popular ones are MySQL, Oracle database, PostgreSQL, etc. Relational databases represent and store data in tables and rows. You can perform join operations using SQL across different database tables.

Non-Relational databases are also called NoSQL databases. Popular ones are CouchDB, Neo4j, Cassandra, HBase, Amazon DynamoDB, etc. [2]. These databases are grouped into four categories: key-value stores, graph stores, column stores, and document stores. Join operations are generally not supported in non-relational databases.

For most developers, relational databases are the best option because they have been around for over 40 years and historically, they have worked well. However, if relational databases are not suitable for your specific use cases, it is critical to explore beyond relational databases. Non-relational databases might be the right choice if:

  • Your application requires super-low latency.
  • Your data are unstructured, or you do not have any relational data.
  • You only need to serialize and deserialize data (JSON, XML, YAML, etc.).
  • You need to store a massive amount of data.
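
To make the contrast concrete, the sketch below stores the same user record both ways: split across relational tables and recombined with a SQL join (using Python's built-in sqlite3), and as a single document in a dict that stands in for a document store. This is an illustration of the two models, not a recommendation of either:

```python
import sqlite3

# Relational model: data is split across tables and recombined with a join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.execute("CREATE TABLE phone_numbers (user_id INTEGER, number TEXT)")
conn.execute("INSERT INTO users VALUES (12, 'John')")
conn.executemany("INSERT INTO phone_numbers VALUES (?, ?)",
                 [(12, "212 555-1234"), (12, "646 555-4567")])
rows = conn.execute(
    "SELECT u.first_name, p.number FROM users u "
    "JOIN phone_numbers p ON p.user_id = u.id"
).fetchall()

# Document model: the whole record lives together, so no join is needed.
document_store = {
    "user:12": {"firstName": "John",
                "phoneNumbers": ["212 555-1234", "646 555-4567"]}
}
doc = document_store["user:12"]
```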

2.3 Vertical scaling vs horizontal scaling

Vertical scaling, referred to as "scale up", means the process of adding more power (CPU, RAM, etc.) to your servers. Horizontal scaling, referred to as "scale out", allows you to scale by adding more servers into your pool of resources.

When traffic is low, vertical scaling is a great option, and the simplicity of vertical scaling is its main advantage. Unfortunately, it comes with serious limitations.

  • Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to a single server.
  • Vertical scaling does not have failover and redundancy. If one server goes down, the website/app goes down with it completely.

Horizontal scaling is more desirable for large scale applications due to the limitations of vertical scaling.

In the previous design, users are connected to the web server directly. Users will be unable to access the website if the web server is offline. In another scenario, if many users access the web server simultaneously and it reaches the web server's load limit, users generally experience slower responses or fail to connect to the server. A load balancer is the best technique to address these problems.

2.4 Load balancer

A load balancer evenly distributes incoming traffic among web servers that are defined in a load-balanced set. Figure 4 shows how a load balancer works.

[Figure 4: Load balancer setup]

As shown in Figure 4, users connect to the public IP of the load balancer directly. With this setup, web servers are no longer directly reachable by clients. For better security, private IPs are used for communication between servers. A private IP is an IP address reachable only between servers in the same network; it is unreachable over the internet. The load balancer communicates with web servers through private IPs.

In Figure 4, after a load balancer and a second web server are added, we have successfully solved the failover issue and improved the availability of the web tier. Details are explained below:

  • If server 1 goes offline, all the traffic will be routed to server 2. This prevents the website from going offline. We will also add a new healthy web server to the server pool to balance the load.
  • If the website traffic grows rapidly, and two servers are not enough to handle the traffic, the load balancer can handle this problem gracefully. You only need to add more servers to the web server pool, and the load balancer automatically starts to send requests to them.

Now the web tier looks good, what about the data tier? The current design has one database, so it does not support failover and redundancy. Database replication is a common technique to address those problems. Let us take a look.

2.5 Database replication

Quoted from Wikipedia: “Database replication can be used in many database management systems, usually with a master/slave relationship between the original (master) and the copies (slaves)” [3].

A master database generally only supports write operations. A slave database gets copies of the data from the master database and only supports read operations. All the data-modifying commands like insert, delete, or update must be sent to the master database. Most applications require a much higher ratio of reads to writes; thus, the number of slave databases in a system is usually larger than the number of master databases. Figure 5 shows a master database with multiple slave databases.

[Figure 5: Master database with slave databases]

Advantages of database replication:

  • Better performance: In the master-slave model, all writes and updates happen in master nodes; whereas, read operations are distributed across slave nodes. This model improves performance because it allows more queries to be processed in parallel.
  • Reliability: If one of your database servers is destroyed by a natural disaster, such as a typhoon or an earthquake, data is still preserved. You do not need to worry about data loss because data is replicated across multiple locations.
  • High availability: By replicating data across different locations, your website remains in operation even if a database is offline as you can access data stored in another database server.

In the previous section, we discussed how a load balancer helped to improve system availability. We ask the same question here: what if one of the databases goes offline? The architectural design discussed in Figure 5 can handle this case:

  • If only one slave database is available and it goes offline, read operations will be directed to the master database temporarily. As soon as the issue is found, a new slave database will replace the old one. In case multiple slave databases are available, read operations are redirected to other healthy slave databases. A new database server will replace the old one.
  • If the master database goes offline, a slave database will be promoted to be the new master. All the database operations will be temporarily executed on the new master database. A new slave database will replace the old one for data replication immediately. In production systems, promoting a new master is more complicated as the data in a slave database might not be up to date. The missing data needs to be updated by running data recovery scripts. Although some other replication methods like multi-masters and circular replication could help, those setups are more complicated; and their discussions are beyond the scope of this course. Interested readers should refer to the listed reference materials [4] [5].
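
The read/write routing described in this section can be sketched as follows (a deliberately simplified, in-memory illustration; real replication also involves replication lag, health checks, and failover coordination):

```python
import itertools

class ReplicatedDatabase:
    """Sketch of master/slave query routing (illustrative names only)."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._read_pool = itertools.cycle(self.slaves) if slaves else None

    def execute(self, query):
        q = query.lstrip().upper()
        if q.startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master            # all data-modifying commands go to the master
        if self.slaves:
            return next(self._read_pool)  # reads are spread across the slaves
        return self.master                # no slave available: read from the master

db = ReplicatedDatabase("master", ["slave1", "slave2"])
writer = db.execute("UPDATE users SET name = 'John' WHERE id = 12")
reader = db.execute("SELECT * FROM users WHERE id = 12")
```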

Figure 6 shows the system design after adding the load balancer and database replication.

[Figure 6: Design with load balancer and database replication]

Let us take a look at the design:

  • A user gets the IP address of the load balancer from DNS.
  • A user connects to the load balancer with this IP address.
  • The HTTP request is routed to either Server 1 or Server 2.
  • A web server reads user data from a slave database.
  • A web server routes any data-modifying operations to the master database. This includes write, update, and delete operations.

Now, you have a solid understanding of the web and data tiers, it is time to improve the load/response time. This can be done by adding a cache layer and shifting static content (JavaScript/CSS/image/video files) to the content delivery network (CDN).

2.6 Cache

A cache is a temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly. As illustrated in Figure 6, every time a new web page loads, one or more database calls are executed to fetch data. The application performance is greatly affected by calling the database repeatedly. The cache can mitigate this problem.

2.6.1 Cache tier

The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, a reduced database workload, and the ability to scale the cache tier independently. Figure 7 shows a possible setup of a cache server:

[Figure 7: Cache tier setup]

After receiving a request, a web server first checks if the cache has the available response. If it has, it sends data back to the client. If not, it queries the database, stores the response in cache, and sends it back to the client. This caching strategy is called a read-through cache. Other caching strategies are available depending on the data type, size, and access patterns. A previous study explains how different caching strategies work [6].
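
A minimal sketch of the read-through flow, assuming a plain dict stands in for the cache server and `query_database` is a hypothetical stand-in for the real database call:

```python
# Read-through cache sketch: check the cache first, fall back to the database
# on a miss, and populate the cache for subsequent requests.
db_calls = 0

def query_database(key):
    global db_calls
    db_calls += 1                 # the expensive call we want to avoid repeating
    return f"row-for-{key}"

cache = {}

def get(key):
    if key in cache:              # cache hit: no database work
        return cache[key]
    value = query_database(key)   # cache miss: query the database...
    cache[key] = value            # ...store the response in the cache...
    return value                  # ...and return it to the client

first = get("user:12")    # miss: hits the database
second = get("user:12")   # hit: served from memory
```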

Interacting with cache servers is simple because most cache servers provide APIs for common programming languages. The following code snippet shows typical Memcached APIs:

SECONDS = 1
cache.set('myKey', 'hi there', 3600 * SECONDS)
cache.get('myKey')

2.6.2 Considerations for using cache

Here are a few considerations for using a cache system:

  • Decide when to use cache. Consider using cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data. For instance, if a cache server restarts, all the data in memory is lost. Thus, important data should be saved in persistent data stores.

  • Expiration policy. It is a good practice to implement an expiration policy. Once cached data is expired, it is removed from the cache. When there is no expiration policy, cached data will be stored in the memory permanently. It is advisable not to make the expiration date too short as this will cause the system to reload data from the database too frequently. Meanwhile, it is advisable not to make the expiration date too long as the data can become stale.

  • Consistency: This involves keeping the data store and the cache in sync. Inconsistency can happen because data-modifying operations on the data store and cache are not in a single transaction. When scaling across multiple regions, maintaining consistency between the data store and cache is challenging. For further details, refer to the paper titled “Scaling Memcache at Facebook” published by Facebook [7].

  • Mitigating failures: A single cache server represents a potential single point of failure (SPOF), defined in Wikipedia as follows: “A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working” [8]. As a result, multiple cache servers across different data centers are recommended to avoid SPOF. Another recommended approach is to overprovision the required memory by certain percentages. This provides a buffer as the memory usage increases.

[Figure 8]

  • Eviction Policy: Once the cache is full, any requests to add items to the cache might cause existing items to be removed. This is called cache eviction. Least-recently-used (LRU) is the most popular cache eviction policy. Other eviction policies, such as the Least Frequently Used (LFU) or First in First Out (FIFO), can be adopted to satisfy different use cases.
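
As an illustration of LRU eviction, here is a minimal sketch built on Python's OrderedDict (production caches such as Memcached implement this far more efficiently):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU sketch: evicts the least-recently-used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the LRU entry

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # "a" becomes the most recently used key
cache.set("c", 3)   # cache is full: "b" (least recently used) is evicted
```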

2.7 Content delivery network (CDN)

A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc.

Dynamic content caching is a relatively new concept and beyond the scope of this course. It enables the caching of HTML pages that are based on request path, query strings, cookies, and request headers. Refer to the article mentioned in reference material [9] for more about this. This course focuses on how to use CDN to cache static content.

Here is how CDN works at the high-level: when a user visits a website, a CDN server closest to the user will deliver static content. Intuitively, the further users are from CDN servers, the slower the website loads. For example, if CDN servers are in San Francisco, users in Los Angeles will get content faster than users in Europe. Figure 9 is a great example that shows how CDN improves load time.

[Figure 9: CDN load-time improvement]

Figure 10 demonstrates the CDN workflow.

[Figure 10: CDN workflow]

  1. User A tries to get image.png by using an image URL. The URL’s domain is provided by the CDN provider. The following two image URLs are samples used to demonstrate what image URLs look like on Amazon and Akamai CDNs:
     • https://mysite.cloudfront.net/logo.jpg
     • https://mysite.akamai.com/image-manager/img/logo.jpg

  2. If the CDN server does not have image.png in the cache, the CDN server requests the file from the origin, which can be a web server or online storage like Amazon S3.

  3. The origin returns image.png to the CDN server, which includes optional HTTP header Time-to-Live (TTL) which describes how long the image is cached.
  4. The CDN caches the image and returns it to User A. The image remains cached in the CDN until the TTL expires.
  5. User B sends a request to get the same image.
  6. The image is returned from the cache as long as the TTL has not expired.
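
Steps 2 through 6 can be sketched with a fake clock standing in for real time (the class and TTL value are hypothetical; real CDNs read the TTL from HTTP headers returned by the origin):

```python
# Toy CDN edge server: serve from cache while the TTL is valid, otherwise
# fetch from the origin and re-cache. A dict stands in for origin storage.
ORIGIN = {"image.png": b"<png bytes>"}

class CdnEdgeServer:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.cache = {}           # asset -> (content, time it was cached)
        self.origin_fetches = 0

    def get(self, asset, now):
        if asset in self.cache:
            content, cached_at = self.cache[asset]
            if now - cached_at < self.ttl:
                return content               # step 6: served from the edge cache
        self.origin_fetches += 1             # step 2: cache miss, ask the origin
        content = ORIGIN[asset]
        self.cache[asset] = (content, now)   # steps 3-4: cache with the TTL
        return content

edge = CdnEdgeServer(ttl_seconds=3600)
edge.get("image.png", now=0)       # User A: fetched from the origin
edge.get("image.png", now=100)     # User B: cache hit, TTL not expired
edge.get("image.png", now=5000)    # TTL expired: fetched from the origin again
```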

2.7.1 Considerations of using a CDN

  • Cost: CDNs are run by third-party providers, and you are charged for data transfers in and out of the CDN. Caching infrequently used assets provides no significant benefits, so you should consider moving them out of the CDN.
  • Setting an appropriate cache expiry: For time-sensitive content, setting a cache expiry time is important. The cache expiry time should neither be too long nor too short. If it is too long, the content might no longer be fresh. If it is too short, it can cause repeat reloading of content from origin servers to the CDN.
  • CDN fallback: You should consider how your website/application copes with CDN failure. If there is a temporary CDN outage, clients should be able to detect the problem and request resources from the origin.
  • Invalidating files: You can remove a file from the CDN before it expires by performing one of the following operations:
  • Invalidate the CDN object using APIs provided by CDN vendors.
  • Use object versioning to serve a different version of the object. To version an object, you can add a parameter to the URL, such as a version number. For example, version number 2 is added to the query string: image.png?v=2.
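
A sketch of the versioning approach, using Python's urllib to append the version parameter (the URL is a sample, as above):

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def versioned_url(url, version):
    """Append a version parameter so the CDN treats the URL as a new object."""
    parts = urlsplit(url)
    query = parts.query + ("&" if parts.query else "") + urlencode({"v": version})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

old = "https://mysite.cloudfront.net/image.png"
new = versioned_url(old, 2)   # "https://mysite.cloudfront.net/image.png?v=2"
```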

Figure 11 shows the design after the CDN and cache are added.

[Figure 11: Design after CDN and cache are added]

  1. Static assets (JS, CSS, images, etc.) are no longer served by web servers. They are fetched from the CDN for better performance.
  2. The database load is lightened by caching data.

2.8 Stateless web tier

Now it is time to consider scaling the web tier horizontally. For this, we need to move state (for instance, user session data) out of the web tier. A good practice is to store session data in persistent storage such as a relational database or NoSQL. Each web server in the cluster can then access state data from the databases. This is called a stateless web tier.

2.8.1 Stateful architecture

A stateful server and a stateless server have some key differences. A stateful server remembers client data (state) from one request to the next. A stateless server keeps no state information.

Figure 12 shows an example of a stateful architecture.

[Figure 12: Stateful architecture]

In Figure 12, user A’s session data and profile image are stored in Server 1. To authenticate User A, HTTP requests must be routed to Server 1. If a request is sent to other servers like Server 2, authentication would fail because Server 2 does not contain User A’s session data. Similarly, all HTTP requests from User B must be routed to Server 2; all requests from User C must be sent to Server 3.

The issue is that every request from the same client must be routed to the same server. This can be done with sticky sessions in most load balancers [10]; however, this adds overhead. Adding or removing servers is much more difficult with this approach, and it is also challenging to handle server failures.
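
A toy illustration of the problem: session data pinned to individual servers means a request that lands on the wrong server cannot be authenticated (all names here are hypothetical):

```python
# Stateful web tier sketch: each server holds only its own users' sessions.
sessions = {"userA": "server1", "userB": "server2", "userC": "server3"}
server_state = {
    "server1": {"userA": {"logged_in": True}},
    "server2": {"userB": {"logged_in": True}},
    "server3": {"userC": {"logged_in": True}},
}

def handle_request(user, server):
    state = server_state[server].get(user)
    if state is None:
        return "authentication failed"   # wrong server: no session data here
    return "ok"

ok = handle_request("userA", sessions["userA"])   # must be routed to server1
fail = handle_request("userA", "server2")         # any other server fails
```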

2.8.2 Stateless architecture

Figure 13 shows the stateless architecture.

[Figure 13: Stateless architecture]

In this stateless architecture, HTTP requests from users can be sent to any web servers, which fetch state data from a shared data store. State data is stored in a shared data store and kept out of web servers. A stateless system is simpler, more robust, and scalable.

Figure 14 shows the updated design with a stateless web tier.

[Figure 14: Design with a stateless web tier]

In Figure 14, we move the session data out of the web tier and store it in the persistent data store. The shared data store could be a relational database, Memcached/Redis, NoSQL, etc. Here, a NoSQL data store is chosen because it is easy to scale. After the state data is removed from the web servers, auto-scaling of the web tier, meaning adding or removing web servers automatically based on the traffic load, is easily achieved.
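
A minimal sketch of the stateless arrangement, with a plain dict standing in for the shared session store (Memcached/Redis/NoSQL in practice):

```python
# Stateless web tier sketch: session data lives in a shared store, so any
# server in the pool can authenticate any user.
shared_store = {"userA": {"logged_in": True}}

class WebServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store   # no session state lives on the server itself

    def handle_request(self, user):
        session = self.store.get(user)
        return "ok" if session and session["logged_in"] else "authentication failed"

servers = [WebServer(f"server{i}", shared_store) for i in (1, 2, 3)]
results = [s.handle_request("userA") for s in servers]   # every server succeeds
```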

Your website grows rapidly and attracts a significant number of users internationally. To improve availability and provide a better user experience across wider geographical areas, supporting multiple data centers is crucial.

2.9 Data centers

Figure 15 shows an example setup with two data centers. In normal operation, users are geoDNS-routed, also known as geo-routed, to the closest data center, with a split traffic of x% in US-East and (100 – x)% in US-West. geoDNS is a DNS service that allows domain names to be resolved to IP addresses based on the location of a user.

[Figure 15: Setup with two data centers]

In the event of any significant data center outage, we direct all traffic to a healthy data center. In Figure 16, data center 2 (US-West) is offline, and 100% of the traffic is routed to data center 1 (US-East).

[Figure 16: Failover to a healthy data center]

Several technical challenges must be resolved to achieve a multi-data center setup:

  • Traffic redirection: Effective tools are needed to direct traffic to the correct data center. GeoDNS can be used to direct traffic to the nearest data center depending on where a user is located.
  • Data synchronization: Users from different regions could use different local databases or caches. In failover cases, traffic might be routed to a data center where data is unavailable. A common strategy is to replicate data across multiple data centers. A previous study shows how Netflix implements asynchronous multi-data center replication [11].
  • Test and deployment: With multi-data center setup, it is important to test your website/application at different locations. Automated deployment tools are vital to keep services consistent through all the data centers [11].
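
A toy sketch of geoDNS-style routing with failover (the region mapping and health states are hypothetical; real geoDNS operates at the DNS resolution layer, not in application code):

```python
# Route users to the nearest healthy data center; on an outage, direct all
# traffic to any remaining healthy data center.
DATA_CENTERS = {"US-East": "healthy", "US-West": "healthy"}
NEAREST = {"New York": "US-East", "San Francisco": "US-West"}

def route(user_location):
    preferred = NEAREST[user_location]
    if DATA_CENTERS[preferred] == "healthy":
        return preferred
    # Failover: pick any data center that is still healthy.
    return next(dc for dc, status in DATA_CENTERS.items() if status == "healthy")

normal = route("San Francisco")       # nearest data center: US-West
DATA_CENTERS["US-West"] = "offline"   # significant outage in US-West
failover = route("San Francisco")     # all traffic shifts to US-East
```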

To further scale our system, we need to decouple different components of the system so they can be scaled independently. A message queue is a key strategy employed by many real-world distributed systems to solve this problem.

2.10 Message queue

A message queue is a durable component, stored in memory, that supports asynchronous communication. It serves as a buffer and distributes asynchronous requests. The basic architecture of a message queue is simple. Input services, called producers/publishers, create messages, and publish them to a message queue. Other services or servers, called consumers/subscribers, connect to the queue, and perform actions defined by the messages. The model is shown in Figure 17.

[Figure 17: Message queue model]

Decoupling makes the message queue a preferred architecture for building a scalable and reliable application. With the message queue, the producer can post a message to the queue when the consumer is unavailable to process it. The consumer can read messages from the queue even when the producer is unavailable.

Consider the following use case: your application supports photo customization, including cropping, sharpening, blurring, etc. Those customization tasks take time to complete. In Figure 18, web servers publish photo processing jobs to the message queue. Photo processing workers pick up jobs from the message queue and asynchronously perform photo customization tasks. The producer and the consumer can be scaled independently. When the size of the queue becomes large, more workers are added to reduce the processing time. However, if the queue is empty most of the time, the number of workers can be reduced.
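
The photo-processing example can be sketched with an in-process queue standing in for a real message broker such as RabbitMQ or Kafka (the job fields and worker logic are hypothetical):

```python
import queue
import threading

# Producer/consumer sketch: web servers publish photo jobs to the queue;
# a worker thread consumes and processes them asynchronously.
jobs = queue.Queue()
results = []

def worker():
    while True:
        job = jobs.get()
        if job is None:       # sentinel: no more work to do
            break
        results.append(f"processed {job['photo']} ({job['task']})")
        jobs.task_done()

consumer = threading.Thread(target=worker)
consumer.start()

# Producers can keep publishing even while consumers are busy.
jobs.put({"photo": "cat.png", "task": "crop"})
jobs.put({"photo": "cat.png", "task": "sharpen"})
jobs.put(None)
consumer.join()
```

More workers can be started to drain a growing queue, and fewer kept when the queue is mostly empty, which is exactly the independent scaling described above.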

[Figure 18: Photo processing with a message queue]

2.11 Logging, metrics, automation

When working with a small website that runs on a few servers, logging, metrics, and automation support are good practices but not a necessity. However, now that your site has grown to serve a large business, investing in those tools is essential.

Logging: Monitoring error logs is important because it helps to identify errors and problems in the system. You can monitor error logs at per server level or use tools to aggregate them to a centralized service for easy search and viewing.

Metrics: Collecting different types of metrics helps us gain business insights and understand the health status of the system. The following metrics are useful:

  • Host level metrics: CPU, Memory, disk I/O, etc.
  • Aggregated level metrics: for example, the performance of the entire database tier, cache tier, etc.
  • Key business metrics: daily active users, retention, revenue, etc.

Automation: When a system gets big and complex, we need to build or leverage automation tools to improve productivity. Continuous integration is a good practice, in which each code check-in is verified through automation, allowing teams to detect problems early. In addition, automating the build, test, and deployment process could improve developer productivity significantly.

Adding message queues and different tools

Figure 19 shows the updated design. Due to the space constraint, only one data center is shown in the figure.

  1. The design includes a message queue, which helps to make the system more loosely coupled and failure resilient.
  2. Logging, monitoring, metrics, and automation tools are included.

Screenshot 2024-11-14 at 16.28.42

As the data grows every day, your database gets more overloaded. It is time to scale the data tier.

2.12 Database scaling

There are two broad approaches for database scaling: vertical scaling and horizontal scaling.

2.12.1 Vertical scaling

Vertical scaling, also known as scaling up, means scaling by adding more power (CPU, RAM, disk, etc.) to an existing machine. Some very powerful database servers exist. According to Amazon Relational Database Service (RDS) [12], you can get a database server with 24 TB of RAM. This kind of powerful database server can store and handle lots of data. For example, stackoverflow.com in 2013 had over 10 million monthly unique visitors, but it only had 1 master database [13]. However, vertical scaling comes with some serious drawbacks:

  • You can add more CPU, RAM, etc. to your database server, but there are hardware limits. If you have a large user base, a single server is not enough.
  • There is a greater risk of a single point of failure.
  • The overall cost of vertical scaling is high. Powerful servers are much more expensive.

2.12.2 Horizontal scaling

Horizontal scaling, also known as sharding, is the practice of adding more servers. Figure 20 compares vertical scaling with horizontal scaling.

Screenshot 2024-11-14 at 16.29.25

Sharding separates large databases into smaller, more easily managed parts called shards. Each shard shares the same schema, though the actual data on each shard is unique to the shard.

Figure 21 shows an example of sharded databases. User data is allocated to a database server based on user IDs. Anytime you access data, a hash function is used to find the corresponding shard. In our example, user_id % 4 is used as the hash function. If the result equals 0, shard 0 is used to store and fetch data. If the result equals 1, shard 1 is used. The same logic applies to other shards.

Screenshot 2024-11-14 at 16.29.39
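The modulo routing described above can be sketched in a few lines. The shard hostnames here are illustrative placeholders, not real servers:

```python
# Route a user to one of four shards using the user_id % 4 hash function
# from Figure 21. SHARDS maps each hash result to a (hypothetical) shard.
SHARDS = ["shard0-db", "shard1-db", "shard2-db", "shard3-db"]

def shard_for(user_id: int) -> str:
    # The hash result indexes directly into the shard list, so every
    # query for this user is routed to the same database server.
    return SHARDS[user_id % len(SHARDS)]
```

For example, `shard_for(1659)` hashes to 1659 % 4 = 3, so that user's rows live on shard 3. The weakness of this simple function is that changing the shard count changes nearly every user's assignment, which motivates consistent hashing below.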

Figure 22 shows the user table in sharded databases.

Screenshot 2024-11-14 at 16.29.51

The most important factor to consider when implementing a sharding strategy is the choice of the sharding key. A sharding key (also known as a partition key) consists of one or more columns that determine how data is distributed. As shown in Figure 22, “user_id” is the sharding key. A sharding key allows you to retrieve and modify data efficiently by routing database queries to the correct database. When choosing a sharding key, one of the most important criteria is to choose a key that can evenly distribute data.

Sharding is a great technique to scale the database but it is far from a perfect solution. It introduces complexities and new challenges to the system:

Resharding data: Resharding is needed when 1) a single shard can no longer hold more data due to rapid growth, or 2) certain shards experience shard exhaustion faster than others due to uneven data distribution. When shard exhaustion happens, it requires updating the sharding function and moving data around. Consistent hashing is a commonly used technique to solve this problem.
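Consistent hashing limits how much data moves during resharding: when a shard is added, only the keys that fall between it and its neighbor on the ring are reassigned, instead of nearly all keys as with modulo hashing. A minimal sketch (real systems use battle-tested implementations; the virtual-node count here is an arbitrary choice):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring sketch."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each physical shard gets `vnodes` virtual points on the ring
        # so that keys spread evenly across shards.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the
        # key's hash; wrap around at the end of the ring.
        h = self._hash(key)
        idx = bisect.bisect_left(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Adding a fourth shard to a three-shard ring reassigns only the keys that now land on the new shard's virtual nodes; every other key keeps its old home, so only a fraction of the data has to move.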

Celebrity problem: This is also called a hotspot key problem. Excessive access to a specific shard could cause server overload. Imagine data for Katy Perry, Justin Bieber, and Lady Gaga all end up on the same shard. For social applications, that shard will be overwhelmed with read operations. To solve this problem, we may need to allocate a shard for each celebrity. Each shard might even require further partition.

Join and de-normalization: Once a database has been sharded across multiple servers, it is hard to perform join operations across database shards. A common workaround is to de-normalize the database so that queries can be performed in a single table.

In Figure 23, we shard databases to support rapidly increasing data traffic. At the same time, some of the non-relational functionalities are moved to a NoSQL data store to reduce the database load. Here is an article that covers many use cases of NoSQL [14].

Screenshot 2024-11-14 at 16.30.07

Millions of users and beyond

Scaling a system is an iterative process. Iterating on what we have learned in this chapter could get us far. More fine-tuning and new strategies are needed to scale beyond millions of users. For example, you might need to optimize your system and decouple the system to even smaller services. All the techniques learned in this chapter should provide a good foundation to tackle new challenges. To conclude this chapter, we provide a summary of how we scale our system to support millions of users:

  • Keep web tier stateless
  • Build redundancy at every tier
  • Cache data as much as you can
  • Support multiple data centers
  • Host static assets in CDN
  • Scale your data tier by sharding
  • Split tiers into individual services
  • Monitor your system and use automation tools

Congratulations on getting this far! Now give yourself a pat on the back. Good job!

Reference materials

[1] Hypertext Transfer Protocol: https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol

[2] Should you go Beyond Relational Databases?: https://blog.teamtreehouse.com/should-you-go-beyond-relational-databases

[3] Replication: https://en.wikipedia.org/wiki/Replication_(computing)

[4] Multi-master replication: https://en.wikipedia.org/wiki/Multi-master_replication

[5] NDB Cluster Replication: Multi-Master and Circular Replication: https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-replication-multi-master.html

[6] Caching Strategies and How to Choose the Right One: https://codeahoy.com/2017/08/11/caching-strategies-and-how-to-choose-the-right-one/

[7] R. Nishtala, "Facebook, Scaling Memcache at," 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13).

[8] Single point of failure: https://en.wikipedia.org/wiki/Single_point_of_failure

[9] Amazon CloudFront Dynamic Content Delivery: https://aws.amazon.com/cloudfront/dynamic-content/

[10] Configure Sticky Sessions for Your Classic Load Balancer: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-sticky-sessions.html

[11] Active-Active for Multi-Regional Resiliency: https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b

[12] Amazon EC2 High Memory Instances: https://aws.amazon.com/ec2/instance-types/high-memory/

[13] What it takes to run Stack Overflow: http://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow

[14] What The Heck Are You Actually Using NoSQL For: http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html