Job Scheduling System

Jane is an engineer on the data quality monitoring team. Her team processes large volumes of data periodically.

During the problem review, Jane proposed creating a generic job scheduler that can satisfy the team’s needs. After some quick research, she came up with the following requirements for the system.

a. Must run a weekly batch job that needs to finish by midnight on Tuesday each week. The job processes a week's worth of data (Sunday through Saturday) and usually takes about 8 hours to run.

b. Must run a small job (3-4 minutes of run time) that pulls data from a queue and processes it.

c. Stretch goal: Build a way to create and manage the jobs. We want to see which jobs are currently running, along with the status and execution history of previously run jobs.

d. Stretch goal: Provide a mechanism to alert the team in case a job may not complete within the allotted time (or if the small job queue is too large).

For the weekly jobs, we will start with 1 job, but if we are successful, the system may need to run up to 10k jobs each week.

For the small jobs, we are looking at 500-1000 jobs / day for now, but this will scale to 10k jobs / day.
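Requirements (a) and (d) together imply a simple timing check, which can be sketched as follows. This is only an illustration under stated assumptions: "midnight of Tuesday" is read as the end of Tuesday (00:00 on Wednesday), which the requirements do not actually specify, and all function names and the queue threshold are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: "midnight of Tuesday" is read as the END of Tuesday,
# i.e. 00:00 on Wednesday. The requirements do not say which midnight is meant.
DAYS_FROM_WEEK_START_TO_DEADLINE = 10  # Sunday 00:00 -> Wednesday 00:00 of the following week


def weekly_deadline(week_start: datetime) -> datetime:
    """Deadline for the batch covering the week that starts at `week_start` (a Sunday, 00:00)."""
    return week_start + timedelta(days=DAYS_FROM_WEEK_START_TO_DEADLINE)


def latest_safe_start(deadline: datetime, expected_runtime: timedelta) -> datetime:
    """Latest moment the job can start and still be expected to finish on time."""
    return deadline - expected_runtime


def weekly_job_at_risk(now: datetime, deadline: datetime, remaining: timedelta) -> bool:
    """True if the estimated remaining work would push the job past its deadline."""
    return now + remaining > deadline


def queue_too_large(depth: int, max_depth: int) -> bool:
    """True if the small-job queue has grown past a (hypothetical) threshold."""
    return depth > max_depth


week_start = datetime(2025, 1, 26)       # a Sunday
deadline = weekly_deadline(week_start)   # Wednesday 2025-02-05 00:00
print(latest_safe_start(deadline, timedelta(hours=8)))  # 2025-02-04 16:00:00
```

Under this reading, a job that usually takes 8 hours has to start by 16:00 on Tuesday, and any alerting logic needs an estimate of the remaining run time to evaluate the at-risk condition.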

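Jane is an engineer on the data quality monitoring team. Her team needs to process large volumes of data on a regular basis.

During the problem review, Jane proposed building a general-purpose job scheduling system to meet the team's needs. After some brief research, she put forward the following system requirements:

Basic requirements:
a. Must run a weekly batch job that has to finish before midnight every Tuesday. The job processes a full week of data (Sunday through Saturday) and usually takes about 8 hours to complete.
b. Must run a small job (3-4 minutes of run time) that pulls data from a queue and processes it.

Stretch goals:
c. Build a way to manage jobs, so that the team can see the jobs currently running, their status, and the history of past runs.
d. Provide a mechanism that alerts the team when a job may not finish within the allotted time (or when the small job queue grows too large).

Scalability requirements:

For the weekly jobs, there is only 1 job at first, but if it succeeds, the number may grow to 10,000 jobs per week.

For the small jobs, there are currently about 500-1000 jobs per day, but this will grow to 10,000 jobs per day.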

Initial Design: After a few days of research, Jane came up with an initial design: a simple storage server that stores the jobs, and a scheduler that pulls the jobs and runs them on job runners. The job runners report the status of each job execution back to the storage server. This high-level design meets requirements (a) and (b). The DB may be a conventional relational database or perhaps a NoSQL store.

The design went through a few iterations based on feedback from peers and senior engineers. This design is flexible and can easily be scaled by adding more job runners, which are basic EC2 instances. Jane has not yet figured out how to manage the jobs. This is something she takes up as a follow-up task.
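The flow described above (store jobs, let a scheduler pull the ones that are due, let runners report status back) can be sketched in a few lines. This is a minimal illustration, not Jane's actual design: the `Job` fields, the `JobStatus` values, and the in-memory `JobStore` standing in for the storage server are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Dict, List, Optional


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    """Hypothetical job record; the field names are illustrative only."""
    job_id: str
    command: str
    run_at: datetime
    status: JobStatus = JobStatus.PENDING
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None


class JobStore:
    """In-memory stand-in for the storage server (a real system would use a database)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def add(self, job: Job) -> None:
        self._jobs[job.job_id] = job

    def claim_due(self, now: datetime) -> List[Job]:
        """Scheduler side: mark every pending job that is due as running and return it."""
        due = [j for j in self._jobs.values()
               if j.status is JobStatus.PENDING and j.run_at <= now]
        for job in due:
            job.status = JobStatus.RUNNING
            job.started_at = now
        return due

    def report(self, job_id: str, succeeded: bool, now: datetime) -> None:
        """Runner side: write the outcome of an execution back to the store."""
        job = self._jobs[job_id]
        job.status = JobStatus.SUCCEEDED if succeeded else JobStatus.FAILED
        job.finished_at = now
```

With more than one scheduler instance, the claim step would have to be atomic in the database so that two schedulers cannot pick up the same job; the sketch ignores that concern.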

[Figure: initial design diagram]

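Initial design

After a few days of research, Jane proposed a preliminary design. The system contains a simple storage server that stores the jobs, plus a scheduler that periodically pulls jobs and executes them on job runners. The job runners report job execution status back to the storage server. This high-level design satisfies requirements (a) and (b). The database can be either a relational database or NoSQL.

After feedback from the team, the design kept evolving and gained good scalability: processing capacity can be increased by adding job runners (EC2 instances). However, Jane still has not worked out how to manage the jobs; that is the next problem she will tackle.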

A few iterations later: After spending more time on the design, Jane decides to add a UI layer in the form of an application server. It integrates with a user management system and lets authorized users log in. Based on the user's permissions, it enables creating new jobs, managing job schedules, and viewing the status reported in the database.

One of the key advantages of this design is that the UI layer is isolated from the job schedulers and runners, and each can be independently scaled based on needs. Jane’s design is still not clear about how she plans to add user management and authorization. Ideally, the app server should integrate with the company’s login servers to support single sign-on (SSO) capability to make the application more user-friendly.
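Since the design leaves authorization open, one simple possibility is a role-to-permission lookup in the app server, sketched below. The role names (`viewer`, `operator`, `admin`), the `Action` values, and the `is_allowed` helper are hypothetical; the only thing taken from the design is the set of actions (create jobs, manage schedules, view status). In practice, the roles would likely come from the company's SSO or user management system.

```python
from enum import Enum, auto
from typing import Dict, Iterable, Set


class Action(Enum):
    VIEW_STATUS = auto()
    MANAGE_SCHEDULE = auto()
    CREATE_JOB = auto()


# Hypothetical role model: each role maps to the set of actions it may perform.
ROLE_PERMISSIONS: Dict[str, Set[Action]] = {
    "viewer": {Action.VIEW_STATUS},
    "operator": {Action.VIEW_STATUS, Action.MANAGE_SCHEDULE},
    "admin": {Action.VIEW_STATUS, Action.MANAGE_SCHEDULE, Action.CREATE_JOB},
}


def is_allowed(roles: Iterable[str], action: Action) -> bool:
    """True if any of the user's roles grants the requested action.

    Unknown roles grant nothing, so a typo in a role name fails closed.
    """
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed(["viewer"], Action.CREATE_JOB))    # False
print(is_allowed(["operator"], Action.VIEW_STATUS))  # True
```

Keeping this check in the app server fits the isolation noted above: the schedulers and runners never need to know about users or permissions.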

[Figure: design diagram with the added UI layer]

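Further refined design

After further thought, Jane decided to add a UI layer, an application server, which is used to:

Integrate with a user management system so that authorized users can log in.

Create new jobs, manage job schedules, and view job status, according to the user's permissions.

The advantages of this design are:

The UI layer is separated from the job scheduler and job runners, so each can be scaled independently.

Users can manage jobs conveniently without working directly against the database.

SSO (single sign-on) can be integrated in the future to make the user experience friendlier.

However, the design does not yet specify how user management and access control will be implemented. Ideally, the application server should integrate with the company's login servers and support SSO to improve the user experience.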

---- Pause Here! ---

Before you proceed further, reflect on how the design has evolved. How would you approach this design? What feedback would you give Jane if you were reviewing it?

On the next page, we have included some questions that were asked by engineers who reviewed this design. Read them after you have spent a bit of time with the design above.

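Before reading on, think about how the design has evolved. If you were designing this system, how would you do it? If you were one of the reviewers of Jane's design, what advice would you give?

What follows are some questions raised by the reviewing engineers. Analyze the current design yourself before reading them.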

Think about the following:

As you reviewed the design, were you able to spot the following:

a. Do you think the design meets the functional requirements? Maybe. Stretch goal (d), at least, is clearly not met. How do you think you can extend the design to support it?

b. What are the points of failure and how can we address them?

c. Do you think we need more than 1 app server? 1 DB? 1 job runner? At 1000 jobs / day and 4 minutes each, the small jobs alone add up to 4000 minutes (about 67 hours) of work per day, which a single runner cannot finish in a day. You will need at least 3 runners.

d. What would be your choices for various components?

e. What would it take to build this system?
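The runner estimate in question (c) is plain arithmetic and can be checked with a short sketch. It assumes each runner executes one job at a time and that jobs arrive evenly over the day; the `utilization` parameter is an added assumption that leaves headroom for idle time and retries.

```python
import math


def runners_needed(jobs_per_day: int, minutes_per_job: float, utilization: float = 1.0) -> int:
    """Minimum number of runners, assuming one job at a time per runner.

    `utilization` is the fraction of each day a runner spends doing useful work
    (1.0 means back-to-back jobs for 24 hours, which is optimistic).
    """
    total_minutes = jobs_per_day * minutes_per_job
    minutes_per_runner = 24 * 60 * utilization
    return math.ceil(total_minutes / minutes_per_runner)


print(runners_needed(1000, 4))    # 3  (4000 min / 1440 min per runner-day)
print(runners_needed(10_000, 4))  # 28 (40000 min / 1440 min per runner-day)
```

At the stated target of 10k small jobs / day, the same assumptions give 28 runners, so the runner pool has to grow roughly tenfold along with the load.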

In a typical design interview, you will be evaluated on one or more of the following aspects:

Did you ask clarifying questions to identify system requirements, limits and constraints?
Did you come up with a high-level design, covering high level needs of the system?
Did you connect the design with something you have delivered in the past?
Did you identify examples of components that can be used, like which app server or which scheduler?
Did you identify bottlenecks and challenges?
Did you define the data model for jobs and the reporting output?
Did you articulate your thoughts clearly without needing too much guidance?
Was your approach structured, logical and easy to follow?

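Think about the following questions:

1. Does the design meet the functional requirements? The current design largely meets them, but (d), the alerting mechanism, is not yet implemented. How do you think the design could be extended to support alerting?

2. What are the potential points of failure in the design, and how can they be addressed?

3. Are multiple application servers needed? Multiple databases? Multiple job runners? 1000 small jobs per day at 4 minutes each means more than 60 hours of processing, which cannot be completed in one day. At least 3 job runners are needed.

4. What technology should be chosen for each component?

5. What work is needed to build this system?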

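In a typical system design interview, you will be evaluated on the following:

Did you ask clarifying questions to pin down system requirements, limits and constraints?
Did you design a high-level architecture that covers the core needs of the system?
Did you draw on your own past project experience to propose a sound solution?
Did you identify potential bottlenecks and challenges?
Did you define the job data model and how job execution results are stored?
Was your thinking structured and logical, and did you express it clearly?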