When I interview people, I like to start out easy and then work towards harder questions. One of my favorite interview questions came from a time that I was working at Tellme, a company that built a Voice Recognition IVR platform that is still in use today by large companies like E*Trade, Merrill Lynch, Orbitz, etc (Tellme was acquired by Microsoft and then was later spun off again as 24/7).
Imagine you have a large file and you want to read a random line from it.
That’s the question. One of the reasons I like this question so much is that it is really simple. Almost everyone can answer it in one way or another. I’ve found, however, that a lot of people actually struggle with this question. They start thinking about memory efficiency and processing time and start dreaming up a way to determine how big the file is so that they can pick a random spot in the file (and perhaps read backwards to find the last newline). The problem is that there is a complexity to this question that may not be intuitively obvious.
Lines in a file don’t necessarily have to be the same length. Some lines can be short, some can be long. In fact, you could have a scenario where 90% of the file is one line and the remaining 10% are the other lines. You could pick a random spot in the file, but 9 times out of 10, you are going to pick that one big line, and that is most certainly NOT a random line. This also brings up the point that when I say random, I mean that the probability of picking any given line is equal.
One thing I’m really trying to determine when I ask this is do you approach a problem by first making sure it is even possible to solve? A lot of projects get tossed in the trash because in the end, it really wasn’t possible to solve the problem (or at least not the way that was tried). I would at least like to hear the simple solution of reading every line into memory and then picking a line — I didn’t specify that you had limited memory or time.
If they do start with the reading of the whole file approach, I’m obviously happy but now I ask them to do it without reading the whole file into memory. It’s only one constraint, and there are solutions that are still fairly simple, such as going through the file once to figure out how many lines there are and then going through a second time to the random line you pick.
If they get that, now I ask can you do it only reading through the file once. Now the constraints have gotten harder. It is hard to do things in linear time — anyone who has taken algorithms knows that if you find a way to do something in linear time, you are doing really well. Most of the time it is more like n * log(n) or or worse.
This one, however, is quite possible. Imagine the file has 1 line. Obviously you are going to pick that 1 line all of the time. If it has two lines, the chance that you pick the first line is 1/2. Four lines it becomes 1/4. You might have already guessed where this is going. The chance that I pick any line in the file is 1/n, but clearly I don’t know how many lines are in the file, so how do I keep it random?
First, I pick the first line, because if the file only has one line, that’s definitely the line I’m going to pick. If there is a second line, I pick a random number between 1 and 2 (like flipping a coin) and if it is 1, I pick the new line, otherwise I keep the first. If there is a third line, I now pick a random number between 1 and 3 and if it is 1, I switch to the new line, otherwise I keep the previous line picked . This continues until the end of the file.
The proof that this works is pretty simple with a n+1 proof. It’s obviously true with 1 line and 2 lines. Assume it is true for n, and we’ll prove it for n + 1. Up to n lines, we assume that I’ve correctly picked a random line from the n lines. Now with one extra line, there is a 1/(n+1) chance that I will replace the line that I picked with this new line, and so it is true. Another way to think of it is this: there is a 1/n chance that I pick a given line, but if that line isn’t picked, there is now a 1/(n-1) chance that I pick one of the remaining lines, if I don’t pick the n-1 line, there is a 1/(n-2) chance that I pick one of the remaining lines, etc. So although it seems counter-intuitive that the probability goes down with each additional line, it actually works out correctly (unfortunately, probabilities are often counter-intuitive).
It can be hard to come up with this solution, and often I will meet talented developers who don’t give this answer. It’s not a deal-breaker for me. A deal-breaker for me would be over-thinking it and coming up with a convoluted solution that doesn’t accurately solve the problem. First, give me a solution. Then we can talk about the problems with that solution and come up with a better one. That’s engineering.
Speaking of great interview questions, here is another one that I got once:
On a 32-bit system, detect a circular linked list using only one pointer (no additional memory).
This would be my ultimate question, but I would start it like this:
How do you detect a circular linked list?
One of the easiest solutions is to use two pointers, and you move one pointer down the list twice as fast as the other. If the list is circular, the two pointers will eventually meet. How can you do it with only one?
I was asked this when interviewing for a company that worked on small devices with not a lot of memory, and you often have to use special tricks to conserve memory because you don’t have a lot of it to spare. This is kind of a tricky question because it relies on knowing how memory is addressed in a 32-bit system. Take a look at a list of memory addresses:
What’s one thing that every address has in common, aside from the fact that they are hex and start with 0x? All of them end in 00. Why is that?
The reason is that 32-bit memory addresses are aligned to 4-byte boundaries. If you try to reference a memory address that is not aligned to one of these boundaries, you will get a bus error, which simply means that you are incorrectly trying to read memory.
So how does this help us? It helps us because we know that those last bits will always be 0, which means we can store something there and then mask it out when reading the address to see what it points to. So, you basically take your one pointer and go down the list, marking each next pointer in the list by setting the final bit of the address to 1. When we move to the next item in the list, we set our pointer to the address but also set that last bit back to 0 in our pointer. If we ever encounter a next pointer with the last bit set to 1, we know the list is circular.
These are my favorite kinds of interview questions. Not to stress people’s brains with ridiculous algorithms or teasers that are more about some trick than actual engineering. I like to see how people solve progressively harder problems. Simple solutions will get you going fast, and then you can improve incrementally. If I have learned anything from Agile, it is that you get the best solutions when you focus on minimal viable product.
Comment below on interesting interview questions you have used!
Originally published on March 27, 2018.