Naturally, over the last 15 years tech and tech interview processes have changed significantly – though a lot of fundamentals have stayed the same… or, at least, working on standards I like to believe they have stayed the same. After leaving Mozilla, I found myself having to return to the “marketplace” to look for a job, which meant undertaking FAANG-style interviews.
- Buy the book System Design Interview, by Alex Xu.
- Practice is critical. Practice with a friend or colleague… easier said than done.
- If you don’t have great handwriting, use a digital document instead of paper.
- Set up your workspace for the interview (graph paper, ruler, pencil).
- Expect the unexpected – and try to keep control of the interview.
- It can take about 2+ weeks to prepare (e.g., 1 hour per night + 4-6 hours over two weekends).
- Grab my cheat sheet.
- You will be fine.
One interview type I’d never done was a “system design” interview: as the name implies, this is the kind of interview where you work together with an interviewer to design some kind of system… these can range from, say, “let’s build YouTube” to “let’s build a vending machine” or anything in between.
The purpose is to figure out the scope of the problem, tease out assumptions, illustrate/design the major components, identify key design issues, and then overcome those issues by using various established technological/theoretical tools, techniques, or strategies… or, where possible, throw a bunch of money/people/computers at a problem.
The design challenges are, of course, all hypothetical – the fun part of these interviews is really the collaborative aspect, and if you approach it not so much as an interview, but more as a “let’s build something really cool”, these interviews can be a lot of fun.
Before continuing, I’d strongly recommend the book System Design Interview, by Alex Xu. It provides a broad description of real world systems, and they are described as if in an interview. This post is based heavily on what I learned from that book, and I can’t praise it enough – even as a general book to have around next time you want to build any sort of distributed web site or service or you are just wondering “how is it that <insert favourite website or service> doesn’t come crashing down with so many users?”, or “no wonder X came crashing down, they probably didn’t do Y!”.
Back to the interview…
The typical system design interview is about 40-45 minutes, which is not very much time. It’s your responsibility to manage that time and get to a reasonable end state where you have some kind of theoretically operational system by the end of the 45 minutes. In System Design Interview, Alex lays out a framework for the interview. That wasn’t working exactly for me, so I came up with my own breakdown (I include a PDF that I use during my own interviews, which is reduced down to a single page). This is how I chose to break up the interview and allot the time:
- Step 0 – Set up your interview space
- Step 1 – Scope of problem (5 minutes)
- Step 2 – Assumptions (5 minutes)
- Step 3 – Draw Components (10 minutes)
- Step 4 – Identify key issues (5 mins)
- Step 5 – Redesign for key issues (15 mins)
- Step 6 – Wrap up (5 mins)
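If it helps to see the time budget as numbers, here is a minimal Python sketch that turns the steps above into a minute-by-minute plan. The step names and durations come straight from the list; the `schedule` helper is just a name I made up for illustration.

```python
# Time budget for the interview, in minutes per step (Step 0 happens beforehand).
STEPS = [
    ("Scope of problem", 5),
    ("Assumptions", 5),
    ("Draw components", 10),
    ("Identify key issues", 5),
    ("Redesign for key issues", 15),
    ("Wrap up", 5),
]


def schedule(steps, start=0):
    """Return (name, start_minute, end_minute) for each step, in order."""
    plan = []
    clock = start
    for name, minutes in steps:
        plan.append((name, clock, clock + minutes))
        clock += minutes
    return plan


plan = schedule(STEPS)
total = plan[-1][2]  # end minute of the last step

for name, begin, end in plan:
    print(f"{begin:>2}-{end:>2} min  {name}")
print("total minutes:", total)
```

The steps add up to 45 minutes, so there is no slack: if one step runs over, the extra time has to come out of another one (most likely the redesign step).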
Step 0 – Set up your interview space
The most important thing before the interview is setting up your interview space or room.
Here are the things I used for my interview:
- Graph paper
- Multiple pencils
- Interview structure “cheat sheet” (see end of document)
I also propped up my laptop so the camera was at a better angle (which looks more professional… you don’t want people looking up your nose during the interview). This only really works if you intend to do all your design on the graph paper.
If the lighting in the space is poor, you might also consider using a “selfie ring”, which is an LED ring light. However, if it’s going to make it difficult to see things, don’t use it… or use it to bounce light off some other surface onto your face.
Step 1 – Scope of the problem (5 minutes)
First things first! Figure out what you actually have to build. The interviewer will probably just tell you (e.g., a Twitter clone).
What specific product/feature/service are you setting out to build? And more importantly, why? This helps tease out who we are solving the problem for (i.e., who are the users).
Based on the above, the system may generally have a range of users: end-users, administrators, content producers, etc. So we want to figure out what parts of the system are exposed where, and how those things are accessed.
In particular, we want to know the following (and, remember, you need to gather a ton of information here in the first 5 minutes!):
- How many users are there? Like, there may be 1,000,000+ daily active users, but say 50 administrators, and 30,000 content producers or whatever.
- How does each type of user interface with the system?
- Is it on the Web? Native app? Or via a REST API? Something else?
- How frequently do users need to access the system? E.g., many times an hour, like GitHub… or maybe just once a month or less (e.g., random government website).
- What’s the expectation of growth in users? So, is this a new system or a system that is already built that we need to support? What’s the growth rate (e.g. 1000 users a day/month/year)?
- Is there a peak usage hour? (e.g., most users are on the West Coast of Wherever)
- Are there any special requirements? (e.g., it needs to work under water, it must handle X number of transactions, etc).
- Really important: are there super users or “celebrities”, i.e., super “nodes” that can unbalance the system or that may require additional computational resources? Or a node in the system that draws a lot of users (a Kim Kardashian or Ryan Reynolds, let’s say)?
- Are we limited by technology stack? Or can we use whatever we want?
- Can we leverage existing infrastructure? This is really useful, because if we don’t need to build something, we get it mostly “for free”. For example, there might already be a distributed user database, and federated login might be handled for us, so we don’t need to build it: “we authenticate with GitHub/Google/Apple/Whatever, and done!”
- Lastly, for this part, are there any constraints/key tradeoffs we need to make? Budget restrictions? (e.g., if we have “FAANG money” we can go a little bit more nuts VS if we only have 3 servers we can use).
Step 2 – Assumptions (5 minutes)
Next we gather the assumptions and some hard requirements… this is again really about teasing out constraints.
- Gather the maximums, which really depend on whatever you are building. E.g., we don’t expect more than X number of people doing Y.
- Are there any caching/freshness requirements?
- Any “deal breakers”?
- Figure out, with your interviewer, if there might be an optimal way to organise and manage the data
- What are the availability and reliability requirements? Like, what if you need to take the server down for 10 minutes… is that ok? Does it need 99% uptime? How much data redundancy is needed?
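To make the uptime question concrete, here is a rough sketch that converts an availability target into a downtime budget. The function name and the list of targets are my own choices for illustration; the arithmetic is just the fraction of the period you are allowed to be down.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Minutes of downtime permitted in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Illustrative targets only; agree on the real one with your interviewer.
for target in (99.0, 99.9, 99.99):
    per_year = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {per_year:.1f} min/year ({per_year / 60:.1f} hours)")

# The same target over a 30-day month.
monthly_budget = allowed_downtime_minutes(99.0, period_days=30)
print(f"99% over 30 days allows {monthly_budget:.0f} minutes")
```

At 99% the budget is about 87.6 hours a year, so a 10 minute outage fits easily. Each extra nine cuts the budget by a factor of ten, which is usually where redundancy starts to matter.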
Ok, so at this point, you have a pretty good idea of what you are building, and who your users are. You know what the constraints are.
Now is probably a good time to check with your interviewer (as your co-creator) if they think there is anything critical you might have missed. You don’t want to jump into designing something where you might have overlooked something obvious in the moment. Remember, you are probably pretty pumped with adrenaline at this point and you are flying pretty fast, so make sure you use your interviewer to keep you in check.
Step 3 – Draw Components (10 minutes)
This step basically involves diagramming what the major components of the system are, as well as where the users sit in relation to those components (see Tools below!). Normally, you would draw the components on a whiteboard, but thanks to Covid, you might be stuck using digital tools. Personally, I prefer to draw on paper – I find the digital tools too clumsy given the limited time… but if you are fast with them, then by all means, use whatever works for you!
At this juncture, you might need to do rough “back of the envelope” calculations… just go full “Silicon Valley” here: this whole process is a little silly and a little fun. In the real world, you would have a huge committee of people you would be coordinating with to figure out all of this stuff. It’s also ok to say that (i.e., “I’d probably go and get input from the team that knows more about this, to make sure the numbers are accurate”).
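Here is roughly what a back-of-the-envelope calculation can look like, sketched in Python. Only the 1,000,000 daily active users figure echoes the example from Step 1; the requests per user, peak factor, writes per user, and bytes per write are made-up assumptions you would replace with numbers agreed with your interviewer.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(daily_active_users, requests_per_user_per_day):
    """Average requests per second, spread evenly over a day."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps, peak_factor):
    """Peak load as a simple multiple of the average."""
    return avg_qps * peak_factor


def storage_per_year_gb(writes_per_day, bytes_per_write):
    """New data written per year, in gigabytes (1 GB = 10**9 bytes)."""
    return writes_per_day * bytes_per_write * 365 / 1e9


dau = 1_000_000                     # from the Step 1 example
avg = average_qps(dau, requests_per_user_per_day=20)   # assumption: 20 requests/user/day
peak = peak_qps(avg, peak_factor=3)                    # assumption: peak is 3x average
storage = storage_per_year_gb(
    writes_per_day=dau * 2,         # assumption: 2 writes/user/day
    bytes_per_write=1_000,          # assumption: about 1 KB per write
)

print(f"average: {avg:.0f} requests/s, peak: {peak:.0f} requests/s")
print(f"storage growth: {storage:.0f} GB/year")
```

With these made-up inputs you land at a couple of hundred requests per second on average and 730 GB of new data a year. The exact numbers matter less than showing how you got to them.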
Check reliability: look at the reliability of the components. What happens if one of the components stops working? Can you add redundancy? Do you need to, and why? The decisions should be justified by what you agreed to in steps 1 and 2. This is also why you want to make sure you co-created the system. It makes both you and the interviewer accountable.
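One way to reason about redundancy is to multiply availabilities. The sketch below assumes component failures are independent (a big assumption in real life) and uses a made-up figure of 99% per component for a hypothetical load balancer, app server, and database chain.

```python
from math import prod


def series(*availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def redundant(availability, copies):
    """At least one of N independent copies must be up."""
    return 1 - (1 - availability) ** copies


# Hypothetical chain: load balancer -> app server -> database, 99% each.
single_chain = series(0.99, 0.99, 0.99)

# Same chain, but with two app servers and two database replicas.
with_replicas = series(0.99, redundant(0.99, 2), redundant(0.99, 2))

print(f"no redundancy:   {single_chain:.4%}")
print(f"with redundancy: {with_replicas:.4%}")
```

Chaining three 99% components drops you to roughly 97%. Adding replicas helps, but the one remaining unreplicated component (here, the load balancer) then caps the whole system, which is a handy way to spot single points of failure.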
Finally, identify “future things” that might make the system better: “we could apply machine learning here to do X… We could use solar panels over there to make this more sustainable…”. However, you want to treat these things as out of scope.
Step 4 – Identify key issues (5 mins)
At this point you should have a fairly ok system, with the key inputs, outputs, and users diagrammed. It’s time to discover where the system is deficient by running through various scenarios (and even attack scenarios, for security/reliability purposes).
In this phase, don’t fix things as you go! Just identify the issues and you will fix them in step 5. You will want to triage particular issues, and use this opportunity to identify parts you actually know how to fix in the remainder of the interview.
You can start by identifying any bottlenecks, so consider:
- Bandwidth, throughput, latency and…
- Read, write, synchronise operations
- And where are you going to have to make tradeoffs.
Are there single points of failure in your system (e.g., DNS server, lol hi Facebook!)? Can you add redundancy?
What about quality of service (QoS) requirements (e.g., is it cool if you drop down to lower quality video)? What’s tolerable?
And what about more weird situations, like (un)reliability of clocks? Do they affect anything? This is important for “realtime” systems.
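If clocks do come up, one well-known way to sidestep unreliable wall clocks is a Lamport logical clock, which orders events with counters instead of timestamps. This is my own choice of illustration, not something the interview requires, and the class below is a minimal sketch rather than a production implementation.

```python
class LamportClock:
    """Logical clock: orders events with a counter instead of wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event happened: advance the counter."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event too; the returned value travels with the message."""
        return self.tick()

    def receive(self, message_time):
        """On receipt, jump past both our own time and the sender's stamp."""
        self.time = max(self.time, message_time) + 1
        return self.time


# Two hypothetical servers, a and b.
a, b = LamportClock(), LamportClock()
a.tick()                    # some local work on a
stamp = a.send()            # a sends a message to b
b_time = b.receive(stamp)   # b receives it
print("sent at", stamp, "received at", b_time)
```

The receive is always stamped later than the send, no matter what the machines’ real clocks say. That is usually enough to talk about ordering without going down a clock-synchronisation rabbit hole.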
Before proceeding, get agreement that you haven’t missed anything critical.
Step 5 – Redesign for key issues (15 mins)
This is your chance to shine: pick a few areas to deep-dive into, using all your know-how. This is where you really need to show your knowledge of particular systems. Depending on your area, this might be backend stuff (using an object store or putting stuff in memory to make it really fast), front-end (e.g., making the UI nice and responsive), etc.
Step 6 – Wrap up (5 mins)
Finally, reflect on what you’ve created together with your interviewer. You should have a pretty good sense of where you both are at. There might be things neither of you were able to solve, and maybe silly things came up – but it’s a good opportunity to tease out what, if anything specific, the interviewer was looking for. For example, maybe they were looking for a particular caching strategy, or “sharding” servers in some particular way, like geographically or by “celebrity” user, or maybe it was some clever strategy for pre-rendering content, and so on.
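As an example of the kind of thing an interviewer might be fishing for, here is a minimal sketch of sharding by user with a special case for “celebrity” accounts. The shard names, the number of shards, and the `DEDICATED` table are all hypothetical.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of regular shards

# Hypothetical table of "celebrity" users that get their own shard.
DEDICATED = {"celebrity_user": "shard-hot-0"}


def shard_for(user_id, num_shards=NUM_SHARDS, dedicated=DEDICATED):
    """Pick a shard for a user: dedicated shard if listed, hash-based otherwise."""
    if user_id in dedicated:
        return dedicated[user_id]
    # A stable hash, so the same user always lands on the same shard.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % num_shards
    return f"shard-{index}"


for user in ("alice", "bob", "celebrity_user"):
    print(user, "->", shard_for(user))
```

Plain modulo hashing like this moves most users around if you change the number of shards. If that comes up, it is a good moment to mention consistent hashing.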
Tools
Depending on your field, when drawing the components, you will make use of the usual suspects, for example… for a website, you will obviously have an HTTP server. But as the number of users goes up, or the server needs to do particular tasks, you need to start thinking about scaling: “horizontal” (more servers!) VS “vertical scaling” (more CPU/memory/faster disks).
However, from there, you need to know about more in-depth tooling, and when to deploy any of:
- Worker threads: what should they be processing?
- Message queues: a queue where you can put tasks, which are then handled async without overloading the available resources (e.g., compressing videos)
- Database – relational or NoSQL/GraphDB, and what’s best for particular types of content or data.
- CDN, and where you might want them in various geographical locations, and what content will they hold and for how long.
- Other external services – again, stuff you can buy or leverage.
- Load balancers.
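- Servers (i.e., “shards”)
- And “CAP”: consistency <=vs=> availability <=vs=> partition tolerance – you can’t have all three at once. In practice, network partitions will happen in a distributed system, so the real choice is usually between consistency and availability while a partition lasts.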
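To make the worker threads and message queue items a bit more concrete, here is a small sketch using Python’s standard `queue` and `threading` modules. It is an in-process stand-in for a real message broker, and `run_workers` and `fake_compress` are names I made up for illustration.

```python
import queue
import threading


def run_workers(tasks, handler, num_workers=2):
    """Put tasks on a queue and let a fixed pool of worker threads drain it."""
    task_queue = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()
            try:
                if task is None:  # sentinel: no more work for this worker
                    return
                outcome = handler(task)
                with lock:
                    results.append(outcome)
            finally:
                task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for task in tasks:
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)  # one sentinel per worker
    task_queue.join()
    for thread in threads:
        thread.join()
    return results


def fake_compress(video_name):
    """Stand-in for slow work such as compressing a video."""
    return f"{video_name}.compressed"


done = run_workers(["a.mp4", "b.mp4", "c.mp4"], fake_compress)
print(sorted(done))
```

The point of the queue is that producers can keep adding tasks while only `num_workers` of them run at once, so a burst of uploads does not overload the available resources. The results come back in whatever order the workers finish.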