
Design Google Drive

In recent years, cloud storage services such as Google Drive, Dropbox, Microsoft OneDrive, and Apple iCloud have become very popular. In this chapter, you are asked to design Google Drive.

Let us take a moment to understand Google Drive before jumping into the design. Google Drive is a file storage and synchronization service that helps you store documents, photos, videos, and other files in the cloud. You can access your files from any computer, smartphone, or tablet, and easily share them with friends, family, and coworkers [1]. Figures 1 and 2 show what Google Drive looks like in a browser and in the mobile application, respectively.

Figure 1: Google Drive in a web browser, showing the My Drive file list with a "New" button, a left-hand navigation pane (My Drive, Shared with me, Recent, Starred, Trash, Backups), and file entries with owner and last-modified time.
Figure 2: Google Drive mobile app, showing a searchable file list sorted by name, per-entry options menus, and a "+" button for creating new files or folders.

Step 1 - Understand the problem and establish design scope

Designing Google Drive is a big project, so it is important to ask questions to narrow down the scope.

Candidate: What are the most important features?
Interviewer: Upload and download files, file sync, and notifications.

Candidate: Is this a mobile app, a web app, or both?
Interviewer: Both.

Candidate: What are the supported file formats?
Interviewer: Any file type.

Candidate: Do files need to be encrypted?
Interviewer: Yes, files in the storage must be encrypted.

Candidate: Is there a file size limit?
Interviewer: Yes, files must be 10 GB or smaller.

Candidate: How many users does the product have?
Interviewer: 10M DAU.

In this chapter, we focus on the following features:

  • Add files. The easiest way to add a file is to drag and drop it into Google Drive.

  • Download files.

  • Sync files across multiple devices. When a file is added to one device, it is automatically synced to other devices.

  • See file revisions.

  • Share files with your friends, family, and coworkers.

  • Send a notification when a file is edited, deleted, or shared with you.

Features not discussed in this chapter include:

  • Google Docs editing and collaboration. Google Docs allows multiple people to edit the same document simultaneously. This is out of our design scope.

Other than clarifying requirements, it is important to understand non-functional requirements:

  • Reliability. Reliability is extremely important for a storage system. Data loss is unacceptable.

  • Fast sync speed. If file sync takes too much time, users will become impatient and abandon the product.

  • Bandwidth usage. If a product takes a lot of unnecessary network bandwidth, users will be unhappy, especially when they are on a mobile data plan.

  • Scalability. The system should be able to handle high volumes of traffic.

  • High availability. Users should still be able to use the system when some servers are offline, slowed down, or have unexpected network errors.

Back of the envelope estimation

  • Assume the application has 50 million signed up users and 10 million DAU.

  • Users get 10 GB free space.

  • Assume users upload 2 files per day. The average file size is 500 KB.

  • 1:1 read to write ratio.

  • Total space allocated: 50 million * 10 GB = 500 PB

  • QPS for upload API: 10 million * 2 uploads / 24 hours / 3600 seconds = ~ 240

  • Peak QPS = QPS * 2 = 480
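The estimates above can be checked with a quick calculation (Python used here just for the arithmetic):

```python
# Back-of-the-envelope numbers from the assumptions above.
signed_up_users = 50_000_000
dau = 10_000_000
free_space_gb = 10
uploads_per_user_per_day = 2

# Total space allocated: 50 million users * 10 GB each.
total_space_pb = signed_up_users * free_space_gb / 1_000_000  # GB -> PB

# Upload QPS: 10 million DAU * 2 uploads, spread over 24 hours.
upload_qps = dau * uploads_per_user_per_day / (24 * 3600)
peak_qps = upload_qps * 2

print(total_space_pb)      # 500.0 (PB)
print(round(upload_qps))   # 231, i.e. roughly 240
print(round(peak_qps))     # 463, i.e. roughly 480
```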

Step 2 - Propose high-level design and get buy-in

Instead of showing the high-level design diagram from the beginning, we will use a slightly different approach: start with something simple and build everything on a single server, then gradually scale it up to support millions of users. This exercise will refresh your memory about some important topics covered earlier.

Let us start with a single server setup as listed below:

  • A web server to upload and download files.

  • A database to keep track of metadata like user data, login info, files info, etc.

  • A storage system to store files. We allocate 1 TB of storage space.

We spend a few hours setting up an Apache web server, a MySQL database, and a directory called drive/ as the root directory to store uploaded files. Under the drive/ directory is a list of directories, known as namespaces. Each namespace contains all the uploaded files for that user. The filename on the server is kept the same as the original file name. Each file or folder can be uniquely identified by joining the namespace and the relative path.

Figure 3 shows what the /drive directory looks like, collapsed on the left side and expanded on the right side.

Figure 3: The drive/ directory collapsed (left) and expanded (right), showing per-user namespaces such as user1/recipes/chicken_soup.txt, user2/football.mov and sports.txt, and user3/best_pic_ever.png.

APIs

What do the APIs look like? We primarily need 3 APIs: upload a file, download a file, and get file revisions.

1. Upload a file to Google Drive

Two types of uploads are supported:

  • Simple upload. Use this upload type when the file size is small.

  • Resumable upload. Use this upload type when the file size is large and there is a high chance of network interruption.

Here is an example of a resumable upload API:

https://api.example.com/files/upload?uploadType=resumable

Params:

  • uploadType=resumable

  • data: Local file to be uploaded.

A resumable upload is achieved by the following 3 steps [2]:

  • Send the initial request to retrieve the resumable URL.

  • Upload the data and monitor upload state.

  • If the upload is interrupted, resume the upload.
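The three steps above can be sketched in code. The sketch below is an in-memory simulation under assumed semantics: `FakeServer` and its methods are illustrative stand-ins, not a real API.

```python
# Minimal sketch of resumable upload: the client asks the "server" how many
# bytes it already has, then uploads from that offset in chunks.

CHUNK = 4  # tiny chunk size for the demo

class FakeServer:
    def __init__(self):
        self.received = bytearray()

    def current_offset(self):           # step 1: retrieve upload state
        return len(self.received)

    def put_chunk(self, offset, data):  # step 2: upload data
        assert offset == len(self.received)  # chunks must be contiguous
        self.received.extend(data)

def resumable_upload(server, data, fail_after=None):
    """Upload `data`; optionally simulate a network interruption."""
    sent = 0
    offset = server.current_offset()    # step 3: resume from last offset
    while offset < len(data):
        if fail_after is not None and sent >= fail_after:
            return False                # upload interrupted
        server.put_chunk(offset, data[offset:offset + CHUNK])
        offset = server.current_offset()
        sent += 1
    return True

server = FakeServer()
payload = b"hello resumable world"
resumable_upload(server, payload, fail_after=2)  # interrupted mid-way
done = resumable_upload(server, payload)         # resumes, completes
print(done, bytes(server.received) == payload)   # True True
```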

2. Download a file from Google Drive

Example API: https://api.example.com/files/download

Params:

  • path: download file path.

Example params:

{
"path": "/recipes/soup/best_soup.txt"
}

3. Get file revisions

Example API: https://api.example.com/files/list_revisions

Params:

  • path: The path to the file whose revision history you want to get.

  • limit: The maximum number of revisions to return.

Example params:

{
"path": "/recipes/soup/best_soup.txt",
"limit": 20
}
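Since the endpoints above are examples rather than a real API, the sketch below only shows how a client might assemble the requests; `api.example.com` is the placeholder host from the examples, and no network call is made.

```python
import json
from urllib.parse import urlencode

BASE = "https://api.example.com"  # placeholder host from the examples above

def upload_url():
    # Resumable upload: the upload type goes in the query string.
    return f"{BASE}/files/upload?{urlencode({'uploadType': 'resumable'})}"

def list_revisions_body(path, limit):
    # Params for the list_revisions API, serialized as a JSON body.
    return json.dumps({"path": path, "limit": limit})

print(upload_url())
print(list_revisions_body("/recipes/soup/best_soup.txt", 20))
```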

All the APIs require user authentication and use HTTPS. Transport Layer Security (TLS) protects data transfer between the client and backend servers.

Move away from single server

As more files are uploaded, eventually you get the "space full" alert shown in Figure 4.

Figure 4: Storage space alert for /drive: only 10 MB free of 1 TB.

Only 10 MB of storage space is left! This is an emergency, as users can no longer upload files. The first solution that comes to mind is to shard the data so it is stored on multiple storage servers. Figure 5 shows an example of sharding based on user_id.

Figure 5: Sharding based on user_id % 4, routing each user's data to one of four storage servers.

You pull an all-nighter to set up database sharding and monitor it closely. Everything works smoothly again. You have put out the fire, but you are still worried about potential data loss in case of a storage server outage. You ask around, and your backend guru friend Frank tells you that many leading companies like Netflix and Airbnb use Amazon S3 for storage. "Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance" [3]. You decide to do some research to see if it is a good fit.

After a lot of reading, you gain a good understanding of the S3 storage system and decide to store files in S3. Amazon S3 supports same-region and cross-region replication. A region is a geographic area where Amazon Web Services (AWS) has data centers. As shown in Figure 6, data can be replicated within the same region (left side) or across regions (right side). Redundant files are stored in multiple regions to guard against data loss and ensure availability. A bucket is like a folder in file systems.

Figure 6: Same-region replication (left) versus cross-region replication (right) of an S3 bucket from Region A.

After putting files in S3, you can finally have a good night's sleep without worrying about data loss. To prevent similar problems in the future, you decide to research further areas you can improve. Here are a few areas you find:

  • Load balancer: Add a load balancer to distribute network traffic. A load balancer ensures evenly distributed traffic, and if a web server goes down, it will redistribute the traffic.

  • Web servers: After a load balancer is added, more web servers can be added/removed easily, depending on the traffic load.

  • Metadata database: Move the database out of the server to avoid a single point of failure. In the meantime, set up data replication and sharding to meet the availability and scalability requirements.

  • File storage: Amazon S3 is used for file storage. To ensure availability and durability, files are replicated in two separate geographical regions.

After applying the above improvements, you have successfully decoupled web servers, metadata database, and file storage from a single server. The updated design is shown in Figure 7.

Figure 7: Updated design with a load balancer distributing user requests across multiple API servers, which talk to a metadata DB and a separate file storage.

Sync conflicts

For a large storage system like Google Drive, sync conflicts happen from time to time. When two users modify the same file or folder at the same time, a conflict happens. How can we resolve the conflict? Here is our strategy: the first version that gets processed wins, and the version that gets processed later receives a conflict. Figure 8 shows an example of a sync conflict.

Figure 8: A sync conflict: two users upload changes to the same file; the first upload is synced, while the later one is flagged as a conflict.

In Figure 8, user 1 and user 2 try to update the same file at the same time, but user 1's file is processed by our system first. User 1's update operation goes through, but user 2 gets a sync conflict. How can we resolve the conflict for user 2? Our system presents both copies of the same file: user 2's local copy and the latest version from the server (Figure 9). User 2 has the option to merge both files or override one version with the other.

Figure 9: Conflict resolution presents both copies: the server version ("SystemDesignInterview") and the user's conflicted copy ("SystemDesignInterview_user_2_conflicted_copy_2019-05-01").

When multiple users are editing the same document at the same time, it is challenging to keep the document synchronized. Interested readers should refer to the reference materials [4] [5].

High-level design

Figure 10 illustrates the proposed high-level design. Let us examine each component of the system.

Figure 10: High-level design: users reach API servers through a load balancer; API servers use a metadata cache and metadata DB; block servers write to cloud storage (backed by cold storage); a notification service uses long polling to reach clients and an offline backup queue holds changes for offline clients.

User: A user uses the application either through a browser or mobile app.

Block servers: Block servers upload blocks to cloud storage. Block storage, also referred to as block-level storage, is a technology to store data files in cloud-based environments. A file can be split into several blocks, each with a unique hash value stored in our metadata database. Each block is treated as an independent object and stored in our storage system (S3). To reconstruct a file, blocks are joined in a particular order. For the block size, we use Dropbox as a reference: it sets the maximum size of a block to 4 MB [6].

Cloud storage: A file is split into smaller blocks and stored in cloud storage.

Cold storage: Cold storage is a system designed for storing inactive data, meaning files that have not been accessed for a long time.

Load balancer: A load balancer evenly distributes requests among API servers.

API servers: These are responsible for almost everything other than the upload flow. API servers handle user authentication, managing user profiles, updating file metadata, etc.

Metadata database: It stores metadata of users, files, blocks, versions, etc. Please note that files are stored in the cloud and the metadata database only contains metadata.

Metadata cache: Some of the metadata is cached for fast retrieval.

Notification service: It is a publisher/subscriber system that transfers data from the notification service to clients as certain events happen. In our specific case, the notification service notifies relevant clients when a file is added/edited/removed elsewhere so they can pull the latest changes.

Offline backup queue: If a client is offline and cannot pull the latest file changes, the offline backup queue stores the info so changes will be synced when the client is online.

We have discussed the design of Google Drive at the high-level. Some of the components are complicated and worth careful examination; we will discuss these in detail in the deep dive.

Step 3 - Design deep dive

In this section, we will take a close look at the following: block servers, metadata database, upload flow, download flow, notification service, saving storage space, and failure handling.

Block servers

For large files that are updated regularly, sending the whole file on each update consumes a lot of bandwidth. Two optimizations are proposed to minimize the amount of network traffic being transmitted:

  • Delta sync. When a file is modified, only the modified blocks are synced instead of the whole file, using a sync algorithm [7] [8].

  • Compression. Applying compression to blocks can significantly reduce the data size. Blocks are compressed using algorithms chosen by file type. For example, gzip and bzip2 are used to compress text files; different algorithms are needed for images and videos.

In our system, block servers do the heavy lifting for uploading files. They process files passed from clients by splitting each file into blocks, compressing each block, and encrypting it. Instead of uploading the whole file to the storage system, only the modified blocks are transferred.

Figure 11 shows how a block server works when a new file is added.

Figure 11: Block servers split a file into blocks, compress and encrypt each block, and upload the blocks to cloud storage.
  • A file is split into smaller blocks.

  • Each block is compressed using compression algorithms.

  • To ensure security, each block is encrypted before it is sent to cloud storage.

  • Blocks are uploaded to the cloud storage.
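The steps above can be sketched as follows. This is a toy illustration: the block size is shrunk for the demo, and the XOR "cipher" is a placeholder standing in for a real cipher such as AES.

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # Dropbox uses up to 4 MB; tiny here for the demo

def xor_cipher(data, key=0x5A):
    # Placeholder "encryption" for the sketch only; a real system would
    # use an authenticated cipher such as AES-GCM. XOR is its own inverse.
    return bytes(b ^ key for b in data)

def process_file(content):
    """Split -> compress -> encrypt each block, as in Figure 11."""
    blocks = []
    for i in range(0, len(content), BLOCK_SIZE):
        raw = content[i:i + BLOCK_SIZE]
        block_hash = hashlib.sha256(raw).hexdigest()  # id kept in metadata DB
        payload = xor_cipher(zlib.compress(raw))      # what goes to cloud storage
        blocks.append((block_hash, payload))
    return blocks

blocks = process_file(b"hello block servers")
print(len(blocks))  # 5 blocks of up to 4 bytes each
```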

Figure 12 illustrates delta sync, meaning only modified blocks are transferred to cloud storage. Highlighted blocks “block 2” and “block 5” represent changed blocks. Using delta sync, only those two blocks are uploaded to the cloud storage.

Figure 12: Delta sync: of blocks 1 through 10 on the block servers, only the changed blocks are transferred to cloud storage.

Block servers allow us to save network traffic by providing delta sync and compression.
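A minimal sketch of the delta-sync idea, assuming blocks are compared at fixed offsets by hash (real implementations such as rsync use rolling hashes to also handle insertions and deletions):

```python
import hashlib

BLOCK_SIZE = 4  # demo-sized blocks

def block_hashes(content):
    # One SHA-256 hash per fixed-size block.
    return [hashlib.sha256(content[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(content), BLOCK_SIZE)]

def changed_blocks(old, new):
    """Indices of blocks that differ and therefore must be re-uploaded."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or h != old_h[i]]

v1 = b"aaaabbbbccccddddeeee"   # 5 blocks
v2 = b"aaaaBBBBccccddddEEEE"   # blocks 1 and 4 modified
print(changed_blocks(v1, v2))  # [1, 4]
```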

High consistency requirement

Our system requires strong consistency by default. It is unacceptable for a file to be shown differently by different clients at the same time. The system needs to provide strong consistency for metadata cache and database layers.

Memory caches adopt an eventual consistency model by default, which means different replicas might have different data. To achieve strong consistency, we must ensure the following:

  • Data in cache replicas and the master is consistent.

  • Invalidate caches on database writes to ensure the cache and database hold the same value.
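The invalidate-on-write rule above can be sketched with an in-memory cache and database (both plain dicts here, standing in for the metadata cache and metadata DB):

```python
class MetadataStore:
    """The write path invalidates the cache so cache and DB never diverge."""

    def __init__(self):
        self.db = {}      # source of truth (the metadata DB)
        self.cache = {}   # metadata cache

    def read(self, key):
        if key not in self.cache and key in self.db:
            self.cache[key] = self.db[key]        # cache miss: load from DB
        return self.cache.get(key)

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)                 # invalidate on every DB write

store = MetadataStore()
store.write("file:1", {"status": "pending"})
print(store.read("file:1"))                       # loads fresh value from DB
store.write("file:1", {"status": "uploaded"})
print(store.read("file:1"))                       # stale entry was invalidated
```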

Achieving strong consistency in a relational database is easy because it maintains ACID (Atomicity, Consistency, Isolation, Durability) properties [9]. However, NoSQL databases do not support ACID properties by default; they must be programmatically incorporated into the synchronization logic. In our design, we choose relational databases because ACID is natively supported.

Metadata database

Figure 13 shows the database schema design. Please note this is a highly simplified version as it only includes the most important tables and interesting fields.

Figure 13: Simplified database schema with user, device, workspace (namespace), file, file_version, and block tables; one-to-many relationships link user to workspace, workspace to file, file to file_version, and file_version to block.

User: The user table contains basic information about the user such as username, email, profile photo, etc.

Device: The device table stores device info. push_id is used for sending and receiving mobile push notifications. Please note a user can have multiple devices.

Namespace: A namespace is the root directory of a user.

File: The file table stores everything related to the latest version of a file.

File_version: It stores version history of a file. Existing rows are read-only to keep the integrity of the file revision history.

Block: It stores everything related to a file block. A file of any version can be reconstructed by joining all the blocks in the correct order.
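A sketch of reconstructing a file version by joining blocks in block_order; the `block_store` mapping is a hypothetical stand-in for fetching block content from cloud storage:

```python
# Each row mimics a block-table row: block_id, file_version_id, block_order.
block_store = {"b1": b"hello ", "b2": b"wor", "b3": b"ld"}  # fetched blocks

def reconstruct(rows):
    """Join a version's blocks in block_order, as the block table enables."""
    ordered = sorted(rows, key=lambda r: r["block_order"])
    return b"".join(block_store[r["block_id"]] for r in ordered)

rows = [
    {"block_id": "b3", "file_version_id": 7, "block_order": 2},
    {"block_id": "b1", "file_version_id": 7, "block_order": 0},
    {"block_id": "b2", "file_version_id": 7, "block_order": 1},
]
print(reconstruct(rows))  # b'hello world'
```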

Upload flow

Let us discuss what happens when a client uploads a file. To better understand the flow, we draw the sequence diagram as shown in Figure 14.

Figure 14: Upload sequence diagram: a client adds file metadata via the API servers (status "pending" in the metadata DB) while uploading content through block servers to cloud storage; on completion the status becomes "uploaded" and the notification service informs other clients.

In Figure 14, two requests are sent in parallel: add file metadata and upload the file to cloud storage. Both requests originate from client 1.

Add file metadata

1. Client 1 sends a request to add the metadata of the new file.

2. Store the new file metadata in metadata DB and change the file upload status to “pending.”

3. Notify the notification service that a new file is being added.

4. The notification service notifies relevant clients (client 2) that a file is being uploaded.

Upload files to cloud storage

2.1 Client 1 uploads the content of the file to block servers.

2.2 Block servers chunk the file into blocks, compress and encrypt the blocks, and upload them to cloud storage.

2.3 Once the file is uploaded, cloud storage triggers an upload completion callback. The request is sent to the API servers.

2.4 The file status is changed to “uploaded” in the metadata DB.

2.5 Notify the notification service that a file status is changed to “uploaded.”

2.6 The notification service notifies relevant clients (client 2) that a file is fully uploaded.

When a file is edited, the flow is similar, so we will not repeat it.

Download flow

The download flow is triggered when a file is added or edited elsewhere. How does a client know that a file was added or edited by another client? There are two ways:

  • If client A is online while a file is changed by another client, notification service will inform client A that changes are made somewhere so it needs to pull the latest data.

  • If client A is offline while a file is changed by another client, data will be saved to the cache. When the offline client is online again, it pulls the latest changes.

Once a client knows a file is changed, it first requests metadata via API servers, then downloads blocks to construct the file. Figure 15 shows the detailed flow. Note that only the most important components are shown in the diagram due to space constraints.

Image represents a sequence diagram illustrating the download flow. Six components are arranged horizontally: 'Client 2', 'Block servers', 'Cloud storage', 'API Servers', 'Metadata DB', and 'Notification service'. The notification service sends '1. notify changes' to Client 2, which sends '2. get changes' to the API servers. The API servers request '3. get changes' from the metadata DB, receive '4. return changes', and send '5. return changes' back to Client 2. Client 2 then requests '6. download blocks' from the block servers, which request '7. download blocks' from cloud storage, receive '8. blocks', and return '9. blocks' to Client 2, completing the sequence. Dashed lines indicate asynchronous communication; solid lines represent synchronous calls.
Figure 15

1. Notification service informs client 2 that a file is changed somewhere else.

2. Once client 2 knows that new updates are available, it sends a request to fetch metadata.

3. API servers call metadata DB to fetch metadata of the changes.

4. Metadata is returned to the API servers.

5. Client 2 gets the metadata.

6. Once the client receives the metadata, it sends requests to block servers to download blocks.

7. Block servers first download blocks from cloud storage.

8. Cloud storage returns blocks to the block servers.

9. Client 2 downloads all the new blocks to reconstruct the file.
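Steps 6-9 amount to fetching each block named in the metadata and reassembling them in order. A minimal sketch, with an in-memory dict standing in for cloud storage and the compression from the upload path reversed on download (all names are illustrative):

```python
import hashlib
import zlib

cloud_storage = {}  # illustrative stand-in: block_id -> compressed block

def upload_block(block: bytes) -> str:
    """Store a block under its content hash, as in the upload flow."""
    block_id = hashlib.sha256(block).hexdigest()
    cloud_storage[block_id] = zlib.compress(block)
    return block_id

def reconstruct_file(block_ids) -> bytes:
    """Steps 6-9: download each block listed in the file's metadata,
    decompress it, and concatenate the blocks in order."""
    return b"".join(zlib.decompress(cloud_storage[bid]) for bid in block_ids)
```

The metadata returned in step 5 is, in essence, this ordered list of block IDs.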

Notification service

To maintain file consistency, other clients must be informed of any file mutation performed locally in order to reduce conflicts. The notification service is built to serve this purpose. At a high level, the notification service allows data to be transferred to clients as events happen. Here are a few options:

  • Long polling. Dropbox uses long polling [10].

  • WebSocket. WebSocket provides a persistent connection between the client and the server. Communication is bi-directional.

Even though both options work well, we opt for long polling for the following two reasons:

  • Communication for notification service is not bi-directional. The server sends information about file changes to the client, but not vice versa.

  • WebSocket is suited for real-time bi-directional communication such as a chat app. For Google Drive, notifications are sent infrequently with no burst of data.

With long polling, each client establishes a long poll connection to the notification service. When a change to a file is detected, the notification service responds and the client closes the long poll connection. The client then connects to the metadata server to download the latest changes. After a response is received or a connection timeout is reached, the client immediately sends a new request to keep the connection open.
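The client side of this cycle can be sketched as a loop. The transport is abstracted behind a blocking `poll` callable that returns True when the server reports a change before timing out; `poll`, `fetch_changes`, and `apply_changes` are hypothetical names used only for this sketch:

```python
def long_poll_loop(poll, fetch_changes, apply_changes, max_iterations=None):
    """Client-side long-polling loop.

    poll()          blocks until the server reports a change or the
                    connection times out; returns True on a change.
    fetch_changes() pulls the latest metadata from the API servers.
    apply_changes() applies the downloaded changes locally.
    """
    i = 0
    while max_iterations is None or i < max_iterations:
        if poll():                       # connection closed by a change
            apply_changes(fetch_changes())
        i += 1                           # either way, reconnect immediately
```

`max_iterations` exists only to make the sketch testable; a real client loops indefinitely.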

Save storage space

To support file version history and ensure reliability, multiple versions of the same file are stored across multiple data centers. Storage space can be filled up quickly with frequent backups of all file revisions. Three techniques are proposed to reduce storage costs:

  • De-duplicate data blocks. Eliminating redundant blocks at the account level is an easy way to save space. Two blocks are identical if they have the same hash value.

  • Adopt an intelligent data backup strategy. Two optimization strategies can be applied:

  • Set a limit: We can set a limit for the number of versions to store. If the limit is reached, the oldest version will be replaced with the new version.

  • Keep valuable versions only: Some files might be edited frequently. For example, saving every edited version for a heavily modified document could mean the file is saved over 1000 times within a short period. To avoid unnecessary copies, we could limit the number of saved versions. We give more weight to recent versions. Experimentation is helpful to figure out the optimal number of versions to save.

  • Move infrequently used data to cold storage. Cold data is data that has not been active for months or years. Cold storage such as Amazon S3 Glacier [11] is much cheaper than S3.
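The first two techniques can be sketched together: content-hash de-duplication skips a write when an identical block already exists, and a version limit prunes old revisions. The class and function names here are illustrative:

```python
import hashlib

class BlockStore:
    """De-duplicating block store: within an account, an identical
    block (same SHA-256 hash) is stored only once."""
    def __init__(self):
        self.blocks = {}
        self.saved_writes = 0

    def put(self, block: bytes) -> str:
        block_id = hashlib.sha256(block).hexdigest()
        if block_id in self.blocks:
            self.saved_writes += 1       # duplicate block: skip the write
        else:
            self.blocks[block_id] = block
        return block_id

def prune_versions(versions, limit):
    """Keep only the `limit` most recent versions of a file."""
    return versions[-limit:]
```

In practice the version-pruning policy would also weight recent versions more heavily, as described above.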

Failure Handling

Failures can occur in a large-scale system and we must adopt design strategies to address these failures. Your interviewer might be interested in hearing about how you handle the following system failures:

  • Load balancer failure: If a load balancer fails, the secondary would become active and pick up the traffic. Load balancers usually monitor each other using a heartbeat, a periodic signal sent between load balancers. A load balancer is considered as failed if it has not sent a heartbeat for some time.

  • Block server failure: If a block server fails, other servers pick up unfinished or pending jobs.

  • Cloud storage failure: S3 buckets are replicated multiple times in different regions. If files are not available in one region, they can be fetched from different regions.

  • API server failure: It is a stateless service. If an API server fails, the traffic is redirected to other API servers by a load balancer.

  • Metadata cache failure: Metadata cache servers are replicated multiple times. If one node goes down, you can still access other nodes to fetch data. We will bring up a new cache server to replace the failed one.

  • Metadata DB failure:

    • Master down: If the master is down, promote one of the slaves to act as the new master and bring up a new slave node.

    • Slave down: If a slave is down, you can use another slave for read operations and bring up another database server to replace the failed one.

  • Notification service failure: Every online user keeps a long poll connection with the notification server. Thus, each notification server is connected with many users. According to the Dropbox talk in 2012 [6], over 1 million connections are open per machine. If a server goes down, all the long poll connections are lost so clients must reconnect to a different server. Even though one server can keep many open connections, it cannot reconnect all the lost connections at once. Reconnecting with all the lost clients is a relatively slow process.

  • Offline backup queue failure: Queues are replicated multiple times. If one queue fails, consumers of the queue may need to re-subscribe to the backup queue.
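The heartbeat mechanism mentioned for load balancer failure can be sketched as a timestamp table: a peer that has not sent a heartbeat within the timeout is declared failed. The 10-second timeout and all names are assumptions for the sketch:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # assumed: seconds of silence before a peer is "failed"

class HeartbeatMonitor:
    """Illustrative heartbeat tracking, as used between load balancers."""
    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock       # injectable clock, to keep the sketch testable
        self.last_seen = {}

    def record(self, peer):
        """Called whenever a heartbeat arrives from `peer`."""
        self.last_seen[peer] = self.clock()

    def failed_peers(self):
        """Peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return [p for p, t in self.last_seen.items() if now - t > self.timeout]
```

On detecting a failed primary, the secondary would promote itself and pick up the traffic.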

Step 4 - Wrap up

In this chapter, we proposed a system design to support Google Drive. The combination of strong consistency, low network bandwidth, and fast sync makes the design interesting. Our design contains two flows: managing file metadata and file sync. The notification service is another important component of the system. It uses long polling to keep clients up to date with file changes.

Like any system design interview question, there is no perfect solution. Every company has its unique constraints, and you must design a system to fit those constraints. Knowing the tradeoffs of your design and technology choices is important. If there are a few minutes left, you can talk about different design choices.

For example, we can upload files directly to cloud storage from the client instead of going through block servers. The advantage of this approach is that it makes file upload faster because a file only needs to be transferred once to the cloud storage. In our design, a file is transferred to block servers first, and then to the cloud storage. However, the new approach has a few drawbacks:

  • First, the same chunking, compression, and encryption logic must be implemented on different platforms (iOS, Android, Web). It is error-prone and requires a lot of engineering effort. In our design, all that logic is implemented in one centralized place: the block servers.

  • Second, as a client can easily be hacked or manipulated, implementing encryption logic on the client side is not ideal.

Another interesting evolution of the system is moving the online/offline logic into a separate service. Let us call it the presence service. By moving the presence service out of the notification servers, the online/offline functionality can easily be reused by other services.

Congratulations on getting this far! Now give yourself a pat on the back. Good job!

Reference materials

[1] Google Drive: https://www.google.com/drive/

[2] Upload file data: https://developers.google.com/drive/api/v2/manage-uploads

[3] Amazon S3: https://aws.amazon.com/s3

[4] Differential Synchronization: https://neil.fraser.name/writing/sync/

[5] Differential Synchronization YouTube talk: https://www.youtube.com/watch?v=S2Hp_1jqpY8

[6] How We’ve Scaled Dropbox: https://youtu.be/PE4gwstWhmc

[7] Tridgell, A., & Mackerras, P. (1996). The rsync algorithm.

[8] librsync: https://github.com/librsync/librsync

[9] ACID: https://en.wikipedia.org/wiki/ACID

[10] Dropbox security white paper:
https://assets.dropbox.com/www/en-us/business/solutions/solutions/dfb_security_whitepaper.pdf

[11] Amazon S3 Glacier: https://aws.amazon.com/glacier/faqs/