Why Do Some Tech Organizations Place Such a Heavy Emphasis on Data Structures, Algorithms, and Distributed System Design?
This is a question that every software engineer has thought about at some point.
I'll try to explain why I believe that, while this emphasis is completely unnecessary for some companies, it is vital for a select few.
So, why do we need to learn them?
The answer is scalability.
In layman's terms, if you are developing software for end users and your APIs take longer than 500 ms to return a response, you will eventually fail. As an end user, nobody wants to wait more than a second for a response to their request.
Consider the scenario where you are tasked with creating a notification system for Cortana. It will send notifications, such as sports updates, to the user's registered devices.
We'll watch how the design turns on its head once your daily active user (DAU) count climbs by a few orders of magnitude.
The notification system's functional and non-functional requirements are as follows:
- It must be as real-time as possible, with a delay of no more than 200 milliseconds.
- The user can subscribe to N different topics (finance, sports, NYT, etc.)
- It must be highly available, or users may miss critical notifications.
Case 1. When your number of daily active users (DAU) < 10K
- A batch job running periodically on the same server can scan the database for users and their associated devices and fan out the notifications (a minimal sketch follows this list).
- No complicated architecture is required. A single mid-sized server can host the read/write APIs, the database, and the cron jobs.
- There are few points of failure in the design, so it's highly available; and since there are no internal HTTP calls, latency is low and the system stays real time.
- A simple, sweet, and elegant design.
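To make this concrete, here is a minimal sketch of the single-server design, assuming a hypothetical two-table schema and a stubbed send_push in place of a real push provider; none of these names or records come from an actual system:

```python
import sqlite3

# Everything lives on one box: the database, the fanout logic, and the
# cron job. Schema, data, and names are illustrative assumptions.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE subscriptions (user_id INTEGER, topic TEXT);
    CREATE TABLE devices (user_id INTEGER, token TEXT);
    INSERT INTO subscriptions VALUES (1, 'sports'), (2, 'sports');
    INSERT INTO devices VALUES (1, 'device-a'), (2, 'device-b');
""")

def send_push(token: str, message: str) -> None:
    # Stand-in for a real push provider call (APNs, FCM, etc.).
    print(f"push -> {token}: {message}")

def fanout(topic: str, message: str) -> None:
    # A full table scan per run is perfectly fine at < 10K DAU.
    rows = db.execute(
        "SELECT d.token FROM subscriptions s"
        " JOIN devices d ON d.user_id = s.user_id"
        " WHERE s.topic = ?",
        (topic,),
    )
    for (token,) in rows:
        send_push(token, message)

# In production this call would sit inside a cron job on the same server.
fanout("sports", "Match update: 2-1 at half time")
```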
Case 2. Now your number of daily active users (DAU) > 10M
Now, the previous design will not work:
- Higher latency: scanning a database with millions of records to find users, their associated devices, etc. would take a prohibitive amount of time, and the service would no longer be real time. A notification system needs to be as real-time as possible; stock prices, Sensex data, etc. may change every minute, and delivering those updates even a little late makes the notification stale and useless to subscribers.
- A single server cannot hold all the moving components together: hosting all the read and write APIs on one server degrades performance. The read API could consume most of the connections, making the write API wait longer to finish its operations.
- A single database holding all the records adds latency to every query; our benchmark is under 200 ms.
- Overall, at this scale the design has too many points of failure.
How can we design it for scale?
- You need separate API servers for read and write requests. Assuming a read-to-write ratio of 100:1, hosting both APIs on the same server would add latency to writes, as the server stays busy handling the far more numerous reads.
- Cluster management becomes significant. Storing user data in data centers closer to users' locations makes retrieval faster, and failing over without impacting latency when a data center goes down is also critical.
- Write APIs that are asynchronous rather than blocking.
- Use a queue management system (message queues, task queues, etc.) for background tasks such as cron jobs and the notification fanout itself (see the queue-based sketch after this list).
- A cache server is required to hold recently used records and reduce load on the database; minimizing database cold starts to avoid latency spikes is also important.
- Normalized vs. denormalized schemas, sharding vs. federation, and SQL vs. NoSQL become important discussions, not just theory topics (see the sharding sketch after this list).
- Separate read and write replicas of the database reduce the load on any single database instance.
- The CAP theorem (consistency, availability, partition tolerance) and eventual consistency are no longer theory topics; they have to be designed for in practice.
(Image: general data flow in a typical scalable high-level design, shown just to illustrate a few of the moving components.)
- There are many points of failure in this kind of architecture pipeline, so a robust failover mechanism is a must; remember, our system has to be highly available at all times.
- Each of these bullet points deserves a broader discussion; each is a complete topic in itself, with its own options and best practices.
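To make the asynchronous write path and the read-through cache concrete, here is a minimal sketch. The in-process queue.Queue stands in for a real message broker (Kafka, RabbitMQ, SQS, etc.), and plain dictionaries stand in for the cache server and the database; every name and record below is an assumption for the demo, not part of any real system:

```python
import queue
import threading

# The write API's only job is to enqueue and return immediately;
# workers do the fanout in the background.
notification_queue: "queue.Queue[tuple[str, str]]" = queue.Queue()

# Read-through cache for device lookups: check the cache first,
# fall back to the (simulated) database, then populate the cache.
device_cache: dict[str, list[str]] = {}
FAKE_DB = {"sports": ["device-1", "device-2"], "finance": ["device-3"]}

def devices_for_topic(topic: str) -> list[str]:
    if topic in device_cache:            # cache hit: no DB round trip
        return device_cache[topic]
    devices = FAKE_DB.get(topic, [])     # cache miss: hit the database
    device_cache[topic] = devices        # populate for next time
    return devices

def publish(topic: str, message: str) -> None:
    # Non-blocking write path: enqueue and return.
    notification_queue.put((topic, message))

def worker() -> None:
    while True:
        topic, message = notification_queue.get()
        for device in devices_for_topic(topic):
            print(f"push -> {device}: {message}")  # stand-in for APNs/FCM
        notification_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
publish("sports", "Match update: 2-1 at half time")
notification_queue.join()  # wait for the fanout to drain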
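The sharding bullet can be made concrete too: route each user's records to one of N shards by hashing a stable key. This is the simplest modulo scheme, purely for illustration; the shard names are made up, and production systems usually prefer consistent hashing so that adding a shard doesn't remap most keys:

```python
import hashlib

# Hypothetical shard names; route each user to one of them.
SHARDS = ["users-db-0", "users-db-1", "users-db-2", "users-db-3"]

def shard_for(user_id: str) -> str:
    # A stable hash (not Python's randomized hash()) keeps routing
    # deterministic across processes and restarts.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for uid in ["alice", "bob", "carol"]:
    print(uid, "->", shard_for(uid))
```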
You see the importance
You cannot build these things without solid fundamentals and real knowledge of designing large-scale systems. Companies try to find out how well a candidate will fit the role by assessing the candidate's problem-solving skills: whether they write clean, object-oriented code using design patterns, and whether they pick the right data structures and algorithms.
System design is required for senior positions and less so for entry-level ones, but knowing the foundations of system design is always beneficial.
Good Program = Right Data Structure + Right Algorithm
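As a toy illustration of that equation (using nothing beyond the standard library, with arbitrary made-up sizes): checking whether a user subscribes to a topic is O(n) per lookup with a list but O(1) on average with a set:

```python
import timeit

# 100K topics: a list scan is O(n) per membership check; a set is O(1).
topics_list = [f"topic-{i}" for i in range(100_000)]
topics_set = set(topics_list)

# Worst case for the list: the target sits at the very end.
print("list:", timeit.timeit(lambda: "topic-99999" in topics_list, number=1_000))
print("set: ", timeit.timeit(lambda: "topic-99999" in topics_set, number=1_000))
```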
Why does it make sense for a few companies?
- Companies with a larger user base know that at some point you will have to take part in these kinds of discussions, so they make sure to evaluate you on these parameters right in the interview.
- Companies like the Fab 4, who started this culture of 4–6 rounds focused heavily on data structures, algorithms, and system design, want to make sure they are hiring someone who can code for that level of scalability. As we have seen above, when the user base is in the millions, even a simple design turns upside down. I have been part of teams that had to rework their existing monolithic, heavyweight APIs because those APIs started failing at just 50 concurrent users; nobody noticed until the user count got there.
- And if such a team doesn't have members who can design the system to scale, the product will end up failing because of scalability issues.
True, most engineers have never had to build anything like that. Most simply use one of the established platforms, such as Azure, GCP, or AWS, which provide the necessary abstractions so that engineers can focus on business logic rather than the underlying challenges of building those platforms.
We have a saying at Microsoft: “1 millisecond delay equals 1 million customers lost”
I recently participated in a design discussion about how to optimise failover latency when a specific data centre goes down. So, when you have control over even very small parts of the stack, it's critical to understand the implications of distributed system design.
When you are sitting in a design meeting and people are talking in terms of 100 million active users, even the best of designs sometimes fail.
Why does it not make sense for the majority of companies?
- You don’t need a Swiss army knife to cut onions.
- You shouldn't judge candidates blindly on these criteria when the work is completely different.
- Instead, evaluate candidates on matching skill sets and on the topics covered by the team they are going to join. If a JavaScript developer is required, assess the candidate on web development fundamentals. For backend developers not working on a large-scale platform, the basics of writing robust APIs, schema design, and clean code are sufficient.
- You don't need to evaluate a candidate on in-depth algorithms and system design if the team is building something on top of a framework or platform. They may never work on those aspects, because the platform they use handles them.
- I've met many engineers who aren't very good at data structures or system design but are far above average at whatever they do. These things can always be learned when a task requires them. So, don't be discouraged if it all looks overwhelming at first.
- Top companies receive far too many applications each day, so having an extremely high hiring bar makes sense. They can afford to lose a few good candidates, but it is extremely expensive for smaller companies to lose good candidates because they do not have the luxury of receiving a large number of applications. I’ve seen businesses struggle to find candidates for face-to-face interviews.