How to Prepare for a Systems Design Interview

Jan 15, 2022

43 Min Read

The terms “architecture” and “system design” may be familiar to you as these are associated with commonly asked questions during developer job interviews, particularly at large IT companies.

This comprehensive guide will teach you basic software architectural ideas to help you prepare for the System Design interview. Because System Design is such a broad subject, this is not an exhaustive treatment. If you’re a junior or mid-level developer, though, this should be plenty; you can use additional resources to delve deeper. At the end of this article, a list of some of the relevant resources will be provided.

Bookmarking this guide is recommended because the topic is split into bite-sized portions. Furthermore, spaced learning and repetition are highly effective methods for learning and remembering information. Therefore, this instruction is broken down into thirteen sections with bite-sized chunks that are simple to spaced-repetition.

Networks and Protocols

“Protocols” is a fancy term with an English meaning that has nothing to do with computer science. It refers to a set of guidelines that governs anything. A type of “official procedure” or “formal manner to do anything.”

People require a network to connect to computers and code that communicate. However, regulations, structure, and agreed-upon procedures are necessary for effective communication. On the other hand, network protocols control how machines and software communicate over a network. The World Wide Web is an example of a network.

You may be familiar with the most common network protocols of the internet era, such as HTTP, TCP/IP, and so on. These fundamentals are briefly explained below. Take your time to consider them.

IP stands for Internet Protocol     

Consider this the foundational layer of protocols. The fundamental protocol explains how practically all communication over internet networks should be carried out. For example, messages are frequently sent via IP in “packets,” tiny bundles of data (216 bytes). Each packet has a basic structure that consists of two parts: the header and the data. The header includes “meta” information about the packet and its contents. This metadata contains information such as the source IP address (from which the packet originates) and the destination IP address (destination of the packet). In addition, you’ll need the “from” and “to” addresses to be able to send information from one location to another.

An IP Address is a numerical identification given to every device connected to a computer network that communicates using the Internet Protocol. There are public and private IP addresses and two versions of the software. Because IPv4 is running out of numerical designations, the new version, known as IPv6, is becoming more widely used. The additional protocols you’ll look at in this post are constructed on top of IP, just like libraries and frameworks are created on top of your preferred programming language.

Transmission Control Protocol (TCP)

TCP is an IP-based protocol. As you may have noticed from other articles, you must first realize why it was developed to properly comprehend what something does. It was established to address a flaw in IP. Because each packet is relatively tiny (216 bytes), data over IP is often transmitted in numerous packets. Multiple packets can result in data corruption due to (A) lost or missing packets and (B) disordered packets. TCP solves both of these problems by ensuring that packets are sent logically.

The packet has a TCP header and the IP header because it is constructed on top of the IP. This TCP header includes packet ordering, number of packets, etc. This guarantees that the data is received reliably on the other end. Because it is constructed on top of IP, it is commonly referred to as TCP/IP. TCP uses a “handshake” to build a connection between the source and destination before transmitting the packets. The connection is formed via packets, in which the source informs the destination that it wishes to initiate a connection, the destination confirms, and the connection is established.

This is essentially what happens when a server “listens” at a port – there is a handshake right before it starts listening, and then the connection is established (listening starts). Similarly, one sends the other a message that the connection is about to be terminated, and the connection is terminated.

HyperText Transfer Protocol (HTTP) 

HTTP is a protocol that is based on TCP/IP and is an abstraction. It introduces the request-response pattern, which is particularly useful for client-server interactions. A client is just a machine or system that requests data, whereas a server is a machine or system that provides the information. For example, a client is a browser, while a server is a web server. When one server requests data from another, the first server becomes a client, whereas this second server becomes the server (tautologies).

Under HTTP, this request-response cycle has its own set of rules, which standardizes how data is transferred across the internet. You don’t even need to bother about IP and TCP at this level of abstraction. Requests and answers with HTTP, on the other hand, have headers and bodies, which contain data that the developer can define. HTTP requests and responses can be viewed as messages with key-value pairs comparable to JavaScript objects and Python dictionaries but not identical.

The content and key-value pairs of HTTP request and response messages are depicted in the link below.

HTTP also includes some “verbs” or “methods,” which are directives that indicate the type of operation undertaken. The most frequent HTTP methods, for example, are “GET,” “POST,” “PUT,” “DELETE,” and “PATCH,” but there are others. Look for the HTTP verb in the start line in the image above.

Storage, Latency & Throughput


The term “storage” refers to the act of storing data. The two primary storage functions are storing and retrieving data required for any software, system, or service you create. Nevertheless, it is not only about storing data; it is also about retrieving it. This is accomplished through the usage of a database. A database is a type of software that allows the storage and retrieval of information.

The two fundamental types of operations, storing and retrieving, are sometimes known as “set, get,” “store, fetch,” “write, read,” and other variations. You’ll use the database to interface with storage, which serves as a gateway for doing these basic actions. The word “storage” can lead to believe that it refers to actual space. When you “store” your bike in the shed, you can count on it being there when you open it again.

However, in the realm of computing, this isn’t always the case. There are two primary forms of storage: “memory” storage and “disk” storage. Disk storage is by far the more reliable and “permanent” of the two options (not genuinely permanent, so often use the word “persistent” storage instead). Disk storage is a type of long-term storage. This indicates that if you save something to disk and then turn off or restart your server, the data will still be there. It will not be misplaced. If you leave data in “Memory,” it will usually be erased when you shut down, restart, or lose power.

These storage methods are included in the computer you use most often. Your RAM is temporary memory storage, while your hard disk is “persistent” Disk storage. If the data you’re keeping records of on a server is only valuable during that server’s session, it’s a good idea to maintain it in memory. This is far more efficient and cost-effective than writing to a permanent database. A single session, for instance, could refer to while a user is logged in and utilizing your website. You might not even have to keep track of bits of data collected during the session after they log out.

Even though anything you choose to keep (such as your checkout history) will be saved in persistent Disk storage, that way, you’ll be able to retrieve that information the following time the user signs in, giving them a consistent experience. This appears to be fairly basic and straightforward, and it is. However, this is a basic overview. It’s possible to get into a lot of trouble with storage. Your head will spin if you look at the variety of storage items and solutions available.

This is attributed to the reason that different use cases necessitate other forms of storage. The key to selecting the correct storage types for your system is to consider various aspects, including your application’s requirements and how users interact with it. Other considerations encompass:

  • Your data’s structure (shape), or
  • What level of availability it requires (how much downtime is acceptable for your storage), or
  • Scalability (how quickly do you need to read and write data, and will these reads and writes act synergistically (at the same time) or sequentially), or
  • Consistency (How consistent is the data throughout your stores if you use distributed storage to prevent downtime)?

These concerns, as well as the conclusions, demand you to analyze your trade-offs carefully. For example, is speed more important than consistency? Is the database required to handle millions of operations per minute or just regular updates? Don’t despair if you don’t understand these terms; these will be explained in sections later.


As you gain more expertise in creating systems to serve the front end of your application, you’ll hear terms like “latency” and “throughput” a lot. They are critical to your application and system’s overall experience and performance. Unfortunately, there’s a propensity to use these terminologies in a larger or out-of-context sense than intended; try to rectify that.

The term “latency” simply refers to the time it takes for anything to happen. What is the time frame? The time it takes for an action to finish or create a result. For instance, data can be moved from one part of the system to another. It’s sometimes referred to as a lag or simply the time it takes to execute an operation.

The “round trip” network request – how long would it have taken for your front-end website (client) to send a demand to your server and receive a response from the server – is the most frequently recognized delay.

When loading a website, you want it to go as quickly and effectively as possible. To put it another way, you want low latency. Low latency means quick lookups. So, identifying a value in an array of items is slower (greater latency) than finding a value in a hash-table (lower latency, because you simply look up the data in “constant” time by using the key). There is no need for iteration.)

Reading from memory is also significantly faster than reading from a disk (read more here). However, both have latency and which form of storage for particular data.

Latency is the inverse of speed in this sense. You desire faster download speeds and less latency. The distance is also a factor in speed (particularly for network calls like HTTP). As a result, the distance between London and another city will affect latency.

 As you might expect, you’d like to develop a system that avoids pinging remote servers, but holding data in memory may not be possible. These tradeoffs enable system design so complex, challenging, and fascinating. Websites that display news stories, for instance, may prioritize uptime and availability overloading speed, but multiplayer online games may necessitate high availability and low latency. These specifications will define the infrastructure design and investment needed to support the system’s unique requirements.


This can be thought of as a machine’s or system’s maximum capacity. It’s commonly used in industries to figure up how much effort an assembly line can complete in an hour, a day, or any other time unit. An assembly line, for example, can put together 20 automobiles per hour, which is its throughput. Likewise, it is the quantity of data that can be carried around in a unit of time in computing. For example, a 512 Mbps internet connection is defined as 512 Mb (megabits) per second of speed.

Consider the webserver for freeCodeCamp. Its throughput is 800,000 requests per second if it receives 1 million requests per second but can only handle 800,000 of them. You may wind up monitoring throughput in bits rather than requests, in which case it’ll be N bits per second. This scenario has a bottleneck since the server can only handle N bits per second, but the requests are higher. As a result, a bottleneck is a system’s constraint. The slightly slower bottleneck in a system determines how fast it runs.

If a server can manage 100 bits per second, another 120 bits per second, and a third only 50 bits per second, the total system will operate at 50 bits per second because that is the constraint – it slows down the other servers in the system. As a result, improving throughput somewhere else than the bottleneck could be a waste of time; instead, you should first focus on the lowest bottleneck.

You can boost throughput by purchasing additional hardware (horizontal scaling), enhancing the capacity and performance of current gear (vertical scaling), or a variety of other methods. Increasing throughput may be a temporary fix; thus, a smart systems designer will consider the best ways to scale the throughput of a particular system. Examples include dividing up requests (or any other type of “load”) and distributing them over various resources, among other things. The most important thing to know is the difference between throughput and limitation or bottleneck and how they affect a system.

Fixing latency and throughput are not stand-alone, uniform approaches, and they are not tied to one another. Instead, they have ramifications and implications throughout the system; thus, it’s critical to comprehend the system and the nature of the demands that will be made on it over time.

System Availability

Software engineers strive to create dependable systems. A reliable system consistently meets a user’s demands whenever that user expresses an interest in having that need fulfilled. Availability is an important part of that reliability. Therefore, it is important to think of availability as a system’s resiliency. A system can be considered fault-tolerant if it can handle failures in the network, database, servers, etc. This makes it accessible. In many ways, a system is the sum of its components, and each part must be highly available if the site or app’s ultimate user experience depends on it.

Quantifying Availability

Experts compute the average amount of time that the system’s primary functionality and operations are available (the uptime) in a specific window of time to assess a system’s availability. Near-perfect availability would be required for most business-critical systems. However, during off-peak hours, systems that sustain highly variable demands and loads with sharp peaks and troughs may be able to get away with significantly poorer availability.

It all relies on how the system is used and what type of system it is. However, even things with low but consistent demand or an implied promise that the system is “on-demand” would generally require excellent availability. For example, consider a place where you can back up your photos. You don’t always need to access and retrieve data from this; it’s primarily for storing information. However, you’d expect it to be available at all times when you check in to download even a single image.

In the context of large e-commerce buying days like Black Friday or Cyber Monday deals, a particular form of availability can be understood. On certain days, demand will rise, and millions will try to get their hands on the deals simultaneously. To sustain enormous loads, a system architecture that is exceptionally reliable and highly available is required. Any outage on the site that will lose money is a commercial argument for high availability. It could also be highly damaging to one’s reputation, mainly if other businesses use the service to provide services. Many companies, including Netflix, will suffer if AWS S3 goes down, which is not good. As a result, uptime is critical to success. It is important to remember that commercial availability numbers are determined annually. Thus 0.1 percent downtime (i.e., availability of 99.9%) equals 8.77 hours per year. As just another outcome, uptimes appear to be exceptionally high. It’s not uncommon to hear phrases like “99.9% uptime” (52.6 minutes of downtime per year). Consequently, uptimes are now commonly referred to as “nines” – the number of nines in the uptime guarantee.

That is inappropriate in today’s world for large-scale or mission-critical services. As a result, “five nines” is now widely regarded as the best availability criterion, as it equates to just over 5 minutes of downtime every year.


Online service providers frequently offer Service Level Agreements/Assurances to keep their services competitive and fulfill market expectations. These are a set of measures that guarantee a certain degree of service. One such metric is 99.999 percent uptime, frequently included in premium memberships.

If a customer’s fundamental purpose for that product warrants the expectation of such a statistic, database and cloud service providers can offer it even on trial or free tiers. In many circumstances, failing to satisfy the SLA will entitle the customer to credits or some other type of compensation due to the provider’s failure. Here’s Google’s SLA for the Maps API as an example. While designing a system, SLAs are an essential element of the overall commercial and technical considerations. For instance, it’s crucial to think about whether or not availability is a critical need for a system component and which components require high availability.

Designing HA

When establishing a high-availability (HA) system, “single points of failure” must be reduced or eliminated. A single point of failure is a component in a system that is the only one capable of causing the undesirable loss of availability. By incorporating ‘redundancy’ into the system, you may avoid single points of failure. Making one or more replacements (i.e., backups) the crucial element for high availability is all about redundancy.

Now, if your app requires users to be verified to access it, and there is only one authentication service and back end, and it fails, your system is no longer viable because that is the single point of failure. You’ve added redundancy and eliminated (or reduced) single points of failure by having two or more services that can handle authentication. You must comprehend and decompose your system into all of its components. Determine which are most likely to form single points of failure, which are not tolerant of such errors, and which parts can withstand them. Because designing HA necessitates compromises, some of which may be costly in terms of time, money, and other resources.


This is a very basic and easy-to-understand strategy for increasing system performance. As a result, caching aids in reducing “latency” in a system. You employ caching as a matter of course in your everyday routines (most of the time). For example, if you live next to a shop, you still want to purchase and store certain staples in your fridge and food pantry. This is known as caching. Of course, you could always go next door and buy these products whenever you want food, but having them in the pantry or fridge cuts down on the time it takes to prepare the meals. That’s what caching is all about.

Caching Common Scenarios

Conversely, if you frequently rely on specific pieces of data in software, you may wish to cache that data to make your app run faster. Because of the latency in performing network requests, it is often faster to get data from memory rather than disk. As a result, many websites are cached in CDNs (especially if the material isn’t updated regularly) to serve the end-user considerably faster. The burden on the backend servers is reduced. A further situation where caching may be beneficial is when your backend must perform computationally complex and time-consuming tasks. For example, caching past results reduces your lookup time from linear O(N) to constant O(1).

Similarly, if your server must make many network requests and API calls to write the data sent to the requester, caching data could decrease the number of network calls and, as a result, the latency. Caching can be set up on the client (for example, in the storage browser), between the client and the server (e.g., CDNs), or on the server itself if your system contains a client (front end) and a server and databases (backend). This would minimize database calls made over the network.

As a practical matter, caching can occur at many places or levels throughout the system and at the hardware (CPU) level.

Managing Out-of-Date Information

You may have observed that the preceding examples are useful for “read” activities. In terms of core principles, write operations are similar to read operations, with the following additional relevant factors:

  • Maintaining the cache and your database in sync is required for write operations.
  • Because there are more operations to complete, this may impact the performance, and specific developments around handling un-synced or “stale” data must be carefully examined.
  • New design concepts may need to be implemented to carry that synchronizing – should it be done synchronously or asynchronously? If executed in order, how often should it happen? In the meanwhile, where does data come from? How often should the cache be refreshed, and so on?
  • To keep cached data fresh and up-to-date, “eviction” or turnover and refreshes are performed. LIFO, FIFO, LRU, and LFU are examples of these approaches.

Then wrap things off with some broad, quasi conclusions. Caching is most effective when used to save static or occasionally changing data and when the sources of change are single actions instead of user-generated processes. When data consistency and freshness are crucial, caching may not be the best answer unless another component in the system efficiently refreshes the caches at intervals that do not interfere with the application’s purpose or user experience.


What is a proxy? Proxy servers are something that most are familiar with. Some PC or Mac products may provide configuration choices for installing and setting proxy servers or accessing “through a proxy.” And then take a look at that comparatively straightforward, extensively used, and crucial piece of technology. Start with the definition of a word in the English language but has nothing to do with computer science.

You can now discard much of that from your head, leaving only one word: “substitute.”

A proxy is often a server in computers, and it is a server that works as a mediator between a client and another server. It’s a piece of code between the client and the server. This is where proxies come into play. A “client” is a process (code) or machine that seeks data from another process or machine, in case you need a refresher or aren’t sure what those terms mean (the “server”). For example, when a browser requests data from a backend server, it is referred to as a client.

When retrieving data from a database, the server might be both a client and a server. The database is then the server, the server is the database’s client, and there is also a server for the front-end client (browser).

The client-server interaction is bi-directional, as you can see from the above. As a result, one entity can function as both the client and the server. A proxy server is a server that receives requests, sends them to another service, and then relays the response it receives from the other service back to the original client. Clients will be referred to as clients, servers will be referred to as servers, and proxies will be referred to as the item in between.

When a client submits a request to a server through a proxy, the proxy may mask the client’s identity – the IP address that comes through in the request to the server may be the proxy, not the originating client. This pattern may be familiar to those who access sites or download things that are typically forbidden (from the torrent network, for example, or sites banned in your country), as it is the foundation around which VPNs are constructed. Before going any further, pointing out that the term proxy typically refers to a “forward” proxy. In the interaction between client and server, a forward proxy acts on behalf of (as a substitute for).

This is different from a reverse proxy, which represents the interests of the server. The proxy lies here between the client and the server, and the data flows will be the same client->proxy->server on a diagram. The primary distinction is that a reverse proxy is intended to replace a server. Clients are frequently unaware that their network request was routed through a proxy, which is then forwarded to the appropriate server (while also doing the same with the server’s answer). The server will not be aware that the client’s request and response are being routed through a proxy in a forward proxy, and the client will not be aware that the request and answer are being routed through a proxy in a reverse proxy.

Proxies have a stealthy feel about them.

However, proxies are helpful in system design, especially for complex systems, and reverse proxies are extremely advantageous. You can assign many jobs to your reverse proxy that you would not want your primary server to handle – which can be a gatekeeper, a security officer, a load-balancer, and all-around support. As a result, proxies can be beneficial, but you may not realize why. You can only truly comprehend anything if you understand why it exists – knowing what it does isn’t enough.

Load Balancing

When you consider the two concepts, load and balance, you may get a sense of what they mean in the realm of computing. When a server receives many requests simultaneously, it may decelerate (throughput reduces, latency rises). It may even fail at some point (no availability). You may either increase the server’s muscle strength (vertical scaling) or add new servers (horizontal scaling). However, now you have to figure out how the revenue requests are spread among the multiple servers – which requests are routed to which services, and how to keep them from becoming overburdened as well? To put it another way, how do you distribute and regulate the request load?

Load balancers come into play. Because this is an overview of ideas and concepts, the interpretations are, by necessity, quite condensed. A load balancer’s task is to sit between the client and the server (though it can be inserted in other places) and figure out how to spread incoming request loads over numerous servers so that the end user’s (client’s) experience is always quick, smooth, and dependable. Load balancers are similar to traffic managers in that they direct traffic. This is done to ensure availability and throughput. When you realize where a load balancer fits into the system’s architecture, you’ll notice that load balancers are reverse proxies. On the other hand, a load balancer can be placed between different exchanges, such as your server and your database.

Server Selection Strategies: A Balancing Act

 Thus, how would the load balancer decide which request traffic to route and allocate? To begin, each time you pick up a server, you must notify your load balancer that there is a potential candidate for traffic routing. Likewise, the load balancer must be informed if a server is removed. The load balancer’s configuration guarantees that it understands how many servers it has in its go-to list and which ones are available. The load balancer can even be maintained on each server’s loading condition, status, availability, ongoing task, etc.

When the load balancer has been configured to recognize which servers it may redirect to, you must choose the appropriate routing plan to guarantee adequate distribution of the available servers. A simple approach would be for the load balancer to pick a server randomly and forward each incoming request to that server. However, as you might expect, randomness can lead to issues like “unbalanced” allocations, in which certain servers are overloaded more than others, badly impacting overall system performance.

Round Robin and Weighted Round Robin.

“Round-robin” is another immediately understandable way. Many people process looping lists in this manner. You begin with the first item on the list, work your way down in order, then loop back up to the top and start working your way down the list again. The load balancer can also do this by looping through available servers in a predetermined order. As a result, the demand is divided fairly evenly throughout your servers in a predictable and easy-to-understand pattern.

You can make the round-robin a little more “fancy” by “weighting” some services above others. For example, each server is given equal weight (suppose all are given a weighting of 1) in a conventional round-robin. When you weight servers separately, you can have those with a lower weighting (say, 0.5 if they’re less powerful) and others with a higher weighting (0.7, 0.9, or even 1).

The overall traffic will then be divided following those weights and distributed to the servers with power proportional to the volume of requests.

Server selection depending on load

Highly advanced load balancers can calculate the current capacity, productivity, and loads of the servers in their ready list and assign proactively based on actual loads and computations to determine which servers will have the highest throughput, lowest latency, and so on. This would be accomplished by monitoring each server’s performance and determining which ones can and cannot manage incoming requests.

Selection based on IP Hashing

You can instruct your load balancer to hash incoming requests’ IP addresses and use the hash value to determine which server to send them to. For example, if you have five servers available, the hash function will return one of five hash values, indicating that one of the servers will be nominated to execute the request. Where you want requests from a specific country or region to acquire data from a server that is especially suitable to satisfy the needs of that country or region, or where your servers cache requests so they can be processed quickly, IP hash-based routing can be highly effective.

Towards the latter case, you’ll want to make sure the request is sent to a server that has already cached the same request, as this will enhance processing and response time. However, suppose your load balancer does not routinely send identical requests to the same server. In that case, you will wind up with servers re-doing work that was already completed in a prior request to another server, and you will end up losing the optimization that comes with caching data.

Selection based on a path or a service

You can also instruct the load balancer to redirect requests according to their “path,” function, or service. For example, requests to load the “Bouquets on Special” may be sent to one server, while credit card payments may be routed to another server, for example, if you’re buying flowers from an internet flower shop. Perhaps if one out of every twenty visitors buys flowers, you may use a smaller server to process payments and a larger one to handle all browsing traffic.

Mixed Bag

And, like with anything, higher and more sophisticated levels of intricacy are possible. You can have several load balancers, each with its server selection strategy. And if your system is particularly large and heavily used, you could need load balancers for load balancers.

Finally, you add elements to the system until your performance is matched to your demands (your needs may appear flat, slowly increase over time, or spike). VPNs (for forwarding proxies) and load-balancing (for reverse proxies) have been mentioned, but more instances are here.

Consistent Hashing

Hashing in the context of load balancing is one of the more difficult concepts to grasp. As a result, it has its section. To comprehend this, you must first understand how hashing works on a conceptual level. Hashing, in a nutshell, turns an input into a fixed-size value, most commonly an integer value (the hash).

One of the most important aspects of a successful hashing algorithm or function is that it must be deterministic, which is a theory that proposes that when identical inputs are supplied into the function, the function will produce identical outputs. Thus, if you pass in the string “Code” (case sensitive) and the function generates a hash of 11002, this must generate “11002” as an integer each time you pass in “Code.” And if you type “code” into the box, it generates a varying number (consistently). It’s not the end of the line if the hashing algorithm generates the same hash for several inputs; there are techniques to deal with this. The wider the spectrum of distinct inputs, the more likely it is. A “collision” occurs when more than one input deterministically yields the same result.

Apply this to server routing and directed requests with this in mind. For example, assume you have five servers to distribute loads among. Hashing incoming requests (maybe by IP address or some other client detail) and then generating hashes for each request is a simple way. The modulo operator is then applied to that hash, with the right operand being the number of servers.

For instance, consider the following pseudo-code for your load balancers: 

As you can observe, the hashing function gives a wide range of possible values, and when you use the modulo operator, you get a narrower range of integers that correspond to the server number.

You will undoubtedly receive different requests that map to a similar server, which is acceptable as long as the total allocation among all servers is “uniform.”

Adding Additional Servers and Dealing with Failing Servers

As such, what if one of the servers to which you are delivering traffic goes down? The hashing function (refer to the pseudo-code excerpt above) believes there are five servers, and the mod operator generates a range of 0-4. However, now that one of the servers has failed, you only have four servers, still delivering traffic to it.

On the other hand, you could create a sixth server, but it will not receive any traffic because the mod operator is 5, and it would never return a figure that includes the recently introduced sixth server. 

You notice that the server number changes after the mod is applied (but not for requests #1 and #3 in this example – but that’s just because the numbers played out that manner in this case).

As a result, half of all queries (perhaps more in other cases) are now sent to new servers entirely and can no longer benefit from recently cached data on the servers.

Request #4, for instance, used to go to Server E, but now it flows to Server C. Because the request is now traveling to Server C, all the cached data for request#4 on Server E is useless. You can have a similar issue if one of your servers dies, but the mod function continues to send requests to it.

In this small system, it seems insignificant. However, this is a terrible result on a big-scale system—# SystemDesignFail.

A simple hashing-to-allocate system cannot grow or cope with failures.

Consistent hashing is a popular solution.

Unsurprisingly, that word description will not be enough at this point. Visually, consistent hashing is best grasped. However, the goal of this post thus far has been to give you a sense of the problem, what it is, why it occurs, and what the flaws in a basic remedy may be. So consider that at all times. As previously noted, the main issue with naive hashing is that when (A) a server fails, traffic is still routed to it, and (B) you introduce a new server, the allocations might be drastically modified, resulting in the loss of past cache benefits.

When it comes to consistent hashing, there are two crucial things to remember:

 1.     Consistent hashing does not solve the issues, particularly B. However, it significantly reduces the problems. You might question what the big deal is with consistent hashing, given that the underlying negative remains – sure, but to a lot lesser level, which is a significant advantage in extremely large-scale systems.

2.     Consistent hashing employs a hash function on the server and incoming requests. As a result, the outputs are limited to a specific range of values (continuum). This is a critical detail.

Please bear this in mind when you watch the video below that explains consistent hashing, as the benefits may not be immediately apparent. However, this video (see link below) is highly recommended because it encapsulates these themes without overly detailed.

Suppose you’re experiencing difficulties grasping why this method is vital in load balancing. In that case, it is recommended to take a pause, return to the load balancing section, and then re-reading this. Unless you’ve met the problem in your work, it’s very unusual for all of this to appear incredibly abstract.


Having briefly discussed how different storage solutions (databases) are built to serve a variety of distinct use-cases, with some being more specialized for specific activities than others. Databases can be divided into two categories at a high level: relational and non-relational.

Relational Databases

A relational database is one in which the interconnections between the items contained in the database are rigorously enforced. These connections are often made feasible by demanding that each such thing (referred to as an “entity”) be represented in the database as a structured table with zero or more rows (“records,” “entry”) and one or more columns (“attributes, “fields”). You may verify that each item/entry/record has the appropriate data by imposing such a structure on it. It improves consistency and the capacity to form close connections between the entities.

This structure can be seen in the table below that records “Baby” (entity) data. Each record (“entry”) in the table has four fields containing information on the baby. This is a typical relational database design (schema—a formalized entity structure).

The most important aspect of relational databases is that they are extensively structured and imposed with all entities. This structure is enforced by ensuring that data entered into the table follows it. It is impossible to add a height field to a table if the schema does not allow it.

SQL (Structured Query Language) is a database querying language supported by most relational databases. This programming language was created to interact with the data in a structured (relational) database. The two aspects are so closely related that a relational database is commonly referred to as a “SQL database” (and sometimes pronounced as a “sequel” database).

In general, SQL (relational) databases handle more complicated queries (integrating many fields, filters, and conditions) than non-relational databases. These queries are handled by the database, which returns comparable results.

Numerous SQL database supporters contend that without it, you’d have to fetch all of the data and then have the server or client load it “in memory” and implement the filtering conditions – which is fine for small sets of data but would wreak havoc on performance for a large, complex dataset with massive amounts of data and rows. However, as you will discover when you learn about NoSQL databases, this is not always the case.

The PostgreSQL (commonly referred to as “Postgres”) database is a popular and well-liked example of a relational database.


A combination of features that specify the transactions a decent relational database will support is ACID transactions. ACID means “Atomic, Consistent, Isolation, Durable.”  A transaction is a read or write operation with a database.

When a single transaction consists of multiple operations, atomicity requires the database to ensure that if one operation fails, the entire transaction (all operations) fails as well. It’s a case of “all or nothing.” If the transaction succeeds, you’ll know that all of the sub-operations succeeded as well, and if an operation fails, you’ll know that all of the operations that went with it failed as well.

If a single transaction involves reading from two tables and writing to three, for example, each of the actions taken will fail, and the transaction as a whole will fail. This implies that none of the operations should be completed. You don’t want even one of the three write transactions to succeed because it will “dirty” your databases’ data.

Consistency. Each transaction in a database must be valid according to the rules set in the database, and when the database changes state (some information has changed), the change must be valid and not corrupt the data. Each transaction changes the database’s state from one valid to another. The hereunder is an example of consistency: each “read” operation obtains the most relevant “write” operation results.

Isolation. This means that you can conduct several transactions on a database “concurrently” (at the same time). Still, the database will end up in a state that seems like each operation was run sequentially ( in a sequence, like a queue of operations). So maybe “Isolation” is a poor descriptor of the notion, although ACCD is more difficult to pronounce than ACID.

Durability. “durability” refers to the assurance that it will stay there indefinitely once data is recorded in a database. It will be “permanent,” meaning it will be stored on a disk rather than in “memory.”

Non-relational databases                

A non-relational database, on the other hand, has a less restrictive, or to put it another way, more flexible data structure. The information is usually expressed as “key-value” pairs. An array (list) of “key-value” pair objects, for example, would be a straightforward method to express this:

Non-relational databases, often known as “NoSQL” databases, are useful when you don’t want or need to store data in a consistent format.

NoSQL database properties are sometimes referred to as a BASE, just like ACID properties. The term “basically available” means that the system is guaranteed to be available. The term “soft state” refers to a system’s state that can vary over time even when no input is provided. Unless new inputs are received, the system will become consistent over a (concise) period, according to eventual consistency.

These databases are considerably faster, simple, and easy to use because they store data in a hash-table-like structure at their core. They are ideal for utilization cases such as caching, environment variables, configuration files, and session state. In addition, they are ideal for use in memory (e.g., Memcached) and persistent storage because of their versatility (e.g., DynamoDb).

Other “JSON-like” databases, such as the well-known MongoDb, are called document databases, and they are all “key-value” stores at their heart.

Database Indexing

Because this is a complex subject, skimming the surface to give you a high-level summary of what you’ll need for systems design interviews will still be helpful. Imagine a 100-million-row database table. The primary purpose of this table is to look up one or two values in each entry. You’d have to iterate over the table to get the values for a single row. It would take a long time if it were the very last record. Indexing is a method of quickly getting to a record with matching values rather than going through each row. Indexes are a type of data structure added to a database to make it easier to search for specified qualities (fields).

So, if the census bureau has 120 million entries with names and ages, and you frequently ought to obtain lists of people in a particular age group, you’d index the database using the age property. Although indexing is a key element of relational databases, it is also commonly found in non-relational databases. The benefits of indexing are thus theoretically available for both sorts of databases, which is incredibly beneficial in reducing lookup times.

Sharding and Replication

While these terms may sound like they belong in a bio-terrorism film, you’re more likely to encounter them in the context of database scaling daily.

Duplicating (making copies of, replicating) your database is called replication. You may recall it when explaining about availability.

You’d thought about the advantages of incorporating redundancy in a system to ensure high availability. Replication ensures database redundancy if one fails. However, since the replicas are supposed to have the same data, it poses the challenge of synchronizing data. Asynchronous (simultaneously as the changes to the primary database) or synchronous (at a later time) replication on write and update activities to a database is possible.

The permissible time interval between synchronizing the main and replica databases depends entirely on your requirements; if state consistency between the two databases is critical, replication must be quick. You should also make sure that if the write to the replica fails, the write to the main database fails as well (atomicity).

But what happens when you have so much data that merely replicating it solves availability concerns but not throughput or latency (speed) difficulties? You might want to consider “chunking down” your data into “shards” at this stage. This is also known as partitioning your data (which is not the same as partitioning your hard disk).

Data sharding divides a large database into smaller databases. Depending on your data structure, you can determine how you wish to shard it. For example, it might be as easy as saving every 5 million rows in a different shard, or it could be more complex, depending on your data, requirements, and regions served.

Leader Election

Return to servers for a bit more sophisticated discussion. You already know about the availability principle and how redundancy is one technique to improve availability. You’ve also gone over some practical issues regarding routing requests to redundant server clusters.

However, with this type of setup, where numerous servers are doing a similar task, there may be instances when just one server needs to take the lead. For example, you might want to ensure that only one server is in charge of updating a third-party API. Multiple updates from various servers could make things difficult or increase expenses for the third party. Therefore, you must decide which primary server will be responsible for the update in this situation. The procedure is known as leader election.

When numerous servers are in a cluster to ensure redundancy, they can be configured to have only one leader amongst themselves. They’d also notice when the leader server went down and appoint a new one to reach its destination. The concept is straightforward, but the harm seems to be in the details. The most challenging component is guaranteeing that the servers’ data, state, and functions are “in sync.” There’s still the possibility that certain outages will cause one or two servers to become separated from the rest of the network, for example. Engineers leverage some of the fundamental ideas from blockchain to determine consensus values for the cluster of servers in this example. In other circumstances, a consensus method is employed to provide an “agreed on” value that all servers can utilize in their logic to determine which server is the leader.

Leader Election is widely used with software like etcd, which is a key-value store that uses Leader Election and a consensus mechanism to provide both high availability and good consistency (a valuable and unique combination).

As a result, developers can rely on etcd’s leader election design in their systems to produce a leader election. This is accomplished by maintaining a key-value pair that indicates the current leader in a service like etcd. Because etcd is readily accessible and stable, your system can always rely on that key-value pair to include the final “source of truth” server in your cluster, which is the current elected leader.

Polling, Streaming, Sockets

It’s critical to understand the fundamental principles that support today’s technology, including continuous updates, push notifications, streaming services, and real-time data. One of the two ways below must be used to keep data in your application updated on a regular or immediate basis.


 This one is straightforward. If you look at the Wikipedia entry, you’ll notice that it’s pretty detailed. Instead, search up the definition in the dictionary, especially in the context of computer science. Keep that basic principle in mind.

Simply put, polling is when your client sends a network request to a server requesting updated data. These requests are often issued at regular intervals, such as 5 seconds, 15 seconds, 1 minute, or any other frequency that your use case requires.

Polling every few seconds isn’t nearly real-time, and it comes with the following drawbacks, particularly if you have a million or more concurrent users:

  • Network queries that are practically constant (not ideal for the client)
  • Inbound requests are almost constant (not optimal for server loads – 1 million+ requests per second)

Polling frequently is inefficient and ineffective, and polling is best employed when short gaps in data updates aren’t a concern for your application. If you’re making an Uber clone, for illustration, you might have the driver-side app submit driver location data every 5 seconds and the rider-side app poll for the driver’s whereabouts every 5 seconds.


The problem of continual polling is solved through streaming. However, if you need to visit the server frequently, you should use web-sockets. This is a network communication protocol that uses TCP to communicate. It creates a specialized two-way channel (socket) between a client and a server, similar to an exposed hotline between two endpoints.

Except for traditional TCP/IP communication, these sockets are “long-lived,” which means that instead of numerous separate requests to the server, a single request to the server opens up this hotline for the two-way movement of data. The socket connection between the machines will be long-lived until either side closes it or down the network. From explaining IP, TCP, and HTTP, you may recall that each request-response cycle is accomplished by transmitting “packets” of data. Web-sockets imply a single request-response exchange (rather than a cycle). This opens up the channel for two data sent in a “stream.”

The main difference between polling and all “normal” IP-based communication is that. In contrast, polling involves the client making periodic requests for data (“pulling”); streaming consists of the client being “on standby” and waiting for the server to “push” data with its way. When the server sends out new data, the client is always on the lookout for it. As a result, if the data changes frequently, it becomes a “stream,” which may be more appropriate for the user’s demands.

When using collaborative coding IDEs, for example, when one user types something, it can appear on the other’s screen, which is done via web-sockets because real-time cooperation is desired. It would be awful if what is being entered appeared on your screen after you tried to type the same thing or after you waited three minutes wondering what was up to. Consider online multiplayer games as an example of a good use case for streaming game data among players.

In conclusion, the use case dictates whether polling or streaming is used. You should generally stream if your data is “real-time,” while polling may be a decent option if you’re okay with a lag (even 15 seconds is still a lag). However, it all depends on how many concurrent users you have and whether they anticipate data to be available immediately. Apache Kafka is a popular example of a streaming service.

Endpoint Protection

When designing large-scale systems, it’s critical to protect your system from too many actions that aren’t truly required to use the system. Of course, that’s an abstract statement. But consider this: how many times have you frantically pushed a button in the hopes of making the system more responsive? Imagine if each of those button presses sent a message to a server, which then attempted to process them all. If the system’s throughput is poor for some reason (for example, if a server is under extraordinary load), each of those clicks would have slowed it down even more because it needs to process them all.

It’s not always about keeping the system safe. Because operations are a component of your service, you may want to limit them at times. For example, you might have used free tiers on third-party API services that limit you to 20 queries every 30 minutes. So if you make 21 or 300 requests in 30 minutes, the server will stop processing your requests after the first 20.

This is known as rate-limiting. A server can limit the number of actions attempted by a client in a specific window of time by applying rate-limiting. Users, requests, timings, payloads, and other factors can all be used to compute a rate limit. When a time limit is exceeded in a time frame, the server will often return an error for the remainder of that window.

Now you might be thinking that endpoint “protection” is a bit of a stretch. You’re simply limiting the endpoint’s ability to return something to the user. However, it also serves as a safeguard when the user (client) is hostile, such as when a bot is crashing your endpoint. Why would something like that happen? Because flooding a server with additional requests than it can process is a method used by unscrupulous individuals to take that server down, hence that service away forever. A Denial of Service (D0S) operation is exactly that.

While rate-limiting can help you defend against DoS attacks, it won’t protect you from a more sophisticated type of DoS attack known as a distributed DoS. The term “distribution” simply means that the attack comes from several seemingly unrelated clients. Unfortunately, there is no way to tell which ones are operated by the same malicious actor. As a result, other approaches must be used to defend against such coordinated, widespread attacks.

However, rate-limiting is beneficial and popular for less frightening use-cases, such as the API restriction discussed here. Given how rate-limiting works, since the server must first verify the limit conditions and execute them if necessary, you should consider what sort of data structure and database you’d like to employ to make those checks as fast as possible so that processing the request isn’t slowed. Furthermore, if you put it within the server, you must be sufficient to assure that all requests from a specific client will be directed to that server to enforce the limitations properly.

 It’s common to utilize a distinct Redis service that resides outside the server but retains the user’s details in memory. It can also rapidly evaluate whether such a user is within their acceptable boundaries to handle scenarios such as these. Rate limitation can be as complex as the rules you intend to apply. Still, the basics and most typical use-cases should be covered in the preceding section.

Messaging & Pub-Sub

It is critical to transmit information between the components and services that make up large-scale and distributed systems to work cohesively and seamlessly. However, networks that depend on networks share the same flaw: they’re fragile. Networks go down all the time, and it’s not uncommon for them to do so. When networks fail, system components are unable to interact with one another, which can cause the system to degrade (in the best case) or fail (in the worst situation) (worst case). As a result, distributed systems require strong strategies to ensure that communication persists or recovers where it broke off, even when components in the system “arbitrary partition” (i.e., fail).

As an example, suppose you’re trying to book airplane tickets. You receive a good deal, select your seats, validate your reservation, and even pay with your credit card. You’re now waiting for the PDF of your ticket to appear in your inbox. You patiently wait for it to arrive, but it never does. There was a system failure somewhere that wasn’t appropriately addressed or recovered. A booking system will frequently interface with airline and pricing APIs to manage the actual flight selection, fare summary, date and time of departure, and so on. All of this happens while you navigate the site’s booking interface. However, it is not required to email you the PDF tickets until a few minutes later. Instead, the UI might essentially state that your booking has been completed and that the tickets will be delivered to your mailbox shortly. This is a fair and usual user experience for bookings because the payment and receiving of tickets do not require to occur simultaneously – the two events might occur at different times. A system like this would require messaging to guarantee that the service (server endpoint) that asynchronously generates the PDF is notified of a verified, paid-for booking and all the details, so the PDF can be auto-generated and forwarded to you. However, if that message method fails, the email service will be unaware of your reservation, and no ticket will be issued.

Publisher / Subscriber Messaging

This is an extensively used messaging paradigm (model). The basic premise is that publishers’ messages are ‘published,’ and subscribers subscribe to messages. To provide more specificity, messages can be assigned to a certain “subject,” similar to a category. These topics are similar to dedicated “channels” or pipes, with each pipe handling only communications related to a single topic.

Subscribers decide a topic they want to be notified of new communications and messages on that topic. The benefit of this method is that it allows the publisher and subscriber to be entirely decoupled, which means they don’t need to know about each other. Instead, the publisher makes an announcement, and the subscriber listens for announcements on issues they are interested in.

A server frequently publishes messages, and there are usually numerous topics (channels) to which they are published. A subscriber to a given topic is a person who is interested in that topic. There is no explicit link between the server (publisher) and the subscriber (another server). The only interactions are between the publisher and the topic and between the topic and the subscriber. The messages in the topic are simply data that must be delivered, and they can take any form you require. Publisher, Subscriber, Topics, and Messages are the four players in Pub/Sub.

Better than a database

So, what’s the point of this? Why not just save everything to a database and consume it from there? Because each message correlates to a task that must be completed depending on the data contained in that message, you’ll need a system to queue the messages. So, in the ticketing example, storing 100 people’s information in a database doesn’t address the problem of emailing those 100 people. It only keeps track of 100 transactions. Communication, task sequencing, and message persistence are all handled by Pub/Sub systems. As a direct consequence, the system can provide valuable features such as “at least once” delivery (messages will not be lost), persistent storage, message ordering, “try-again,” “re-playability,” and so on. Simply storing messages in a database will not assist you in ensuring that they are delivered (consumed) and acted upon to finish the task properly.

A subscriber may consume the same message many times if the network goes down for a brief period, and the subscriber consumes the message but does not notify the publisher. As a result, the publisher will just resend the message to the subscriber. That is why the guarantee reads “at least once” rather than “once and only once.” Because networks are inherently unstable, this is inevitable in distributed systems. This can cause problems if the message causes an action on the subscriber’s end, which changes the database’s state (in the overall application). What if a single operation is repeated several times and the application’s state changes each time?

Controlling Outcomes: Is it better to have one or multiple outcomes?

Idempotency is the solution to this new difficulty, and it’s a crucial notion but difficult to grasp the first few times you look at it. Because it is a notion that can appear complex (especially when reading the Wikipedia entry), here is a user-friendly simplification from StackOverflow:

An idempotent operation in computing has no effective impact when called several times with identical input parameters.

As more than just a result, if a subscriber processes a message two or three times, the overall state of the application is the same as it was after the first time the message was processed. You wouldn’t want to spend 3X the ticket price if, at the end of ordering your flight tickets and after entering your credit card details, you clicked “Pay Now” three times because the system was slow, right? Idempotency is required to ensure that each subsequent click does not result in a second transaction and a second charge to your credit card. On the other hand, you can leave the same comment on your best friend’s newsfeed N times.

They will all appear as distinct comments, which, aside from being bothersome, isn’t necessarily a bad thing. Another example is offering “claps” on Medium posts; each clap is designed to add to the total number of claps, rather than being a single clap. Idempotency is not required in the last two instances but the payment example.

There are many different kinds of messaging systems, and the one you choose is determined by the problem you’re trying to address. For example, people frequently refer to “event-based” architecture, which means that the system processes actions depending on signals concerning “events” (such as paying for tickets) (like emailing the ticket). The most regularly referenced services are Apache Kafka, RabbitMQ, Google Cloud Pub/Sub, and AWS SNS/SQS.

Smaller Essentials


Your system will amass a large amount of data over time. The majority of this information is highly valuable. It can provide you with an overview of your system’s health, effectiveness, and issues. It can also provide you with useful information about who uses your system, how they use it, how frequently they use it, which areas are used more or less regularly, and so on. This data is helpful for product development, optimization techniques, and analytics. It’s also beneficial for debugging, not just when you’re logging into your console all through development but also when you’re looking for faults in your test and production settings. As a result, logs aid in auditing and traceability.

When logging, start by thinking of it as many successive events. This transforms the data into time-series data, and the materials and databases you use should be precisely intended to function with it.


After logging, this is the second approach. It provides an answer to the issue, “What should I do with all of this logging data?” First, you keep an eye on it and examine it. Then, you create or even use software and applications that scan through that data and present you with dashboards, charts, and other sentient representations.

You can plug in some other tools developed with that data structure and area of focus by storing the data in a specific database built to manage this type of data (time-series data).


You should also set up a mechanism to notify you when something significant happens when you’re proactively monitoring. For example, you’re tracking specific indicators that may generate an alert if they move overly high or overly low, just as stock prices reach over a specific ceiling or below a certain threshold. Likewise, if response times (latency) or errors and failures exceed an “acceptable” level, set up alerts for them.

The essential to great logging and monitoring is to make sure your data is reasonably consistent across the board. Working with inconsistent data might lead to missing fields, disrupting analytical tools, or limiting the logging’s benefits.


As suggested, here are some valuable resources:

  1. Introduction to Systems Design by Tushar Roy
  2. SQL vs. NoSQL: The Difference
  3. Github System Design Primer with concepts, diagrams, and study materials
  4. Gaurav Sen’s relevant YouTube content/tutorials

Hopefully, you found this lengthy instruction valuable and helpful!


Stay Connected with the Latest