If you work in data science, or are trying to, statistics is probably the biggest and most intimidating area of knowledge you need to develop. The goal of this post is to reduce what you need to know to a finite number of concrete ideas, techniques, and equations.

Of course, that’s an ambitious goal. If you plan to be in data science for the long term, I’d still expect you to keep learning statistical concepts and techniques throughout your career. But what I’m aiming to do is provide a baseline so that you can get through your interviews and start practicing data science as quickly and painlessly as possible. I’ll also end each section with key terms and resources for further reading. Let’s dive in.

### Probability

Probability underpins statistics and often comes up in interviews. It’s worth learning the basics, not just so you can make it past the typical probability brain teasers that interviewers like to ask, but also because it’ll enhance and solidify your understanding of statistics.

Probability is about random processes. The classic examples are flipping coins and rolling dice: probability gives you a framework for determining things like the number of 6s you’d expect to roll over a certain number of throws, or the likelihood of flipping 10 fair coins without a single heads coming up. While these examples might seem pretty abstract, the same ideas are important for analyzing human behavior and other domains that deal with non-deterministic processes, which makes them crucial to data scientists.
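To make those two classic examples concrete, here is a minimal sketch in Python. The function names are my own, and the code assumes a fair six-sided die and independent fair coin flips.

```python
from fractions import Fraction


def expected_sixes(throws: int) -> Fraction:
    """Expected number of 6s in `throws` rolls of a fair six-sided die."""
    # Each roll shows a 6 with probability 1/6, and expectations add up.
    return throws * Fraction(1, 6)


def prob_no_heads(flips: int) -> Fraction:
    """Probability that `flips` independent fair coin flips all land tails."""
    # Independent events multiply: (1/2) * (1/2) * ... `flips` times.
    return Fraction(1, 2) ** flips


print(expected_sixes(60))  # 10
print(prob_no_heads(10))   # 1/1024
```

Using `Fraction` instead of floats keeps the answers exact, which is handy when you want to check a whiteboard calculation.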

The approach I favor for learning or re-learning probability is to start with combinatorics, which provides some intuition on how random processes behave, and then move on to how we derive the rules of expectation and variance from those processes. Being comfortable with these topics should let you pass a typical data science interview.
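As a rough illustration of how counting leads to expectation and variance, the sketch below enumerates every outcome of rolling two fair dice and computes both quantities directly from their definitions. The two-dice setup is just an example I picked, not a required interview topic.

```python
from fractions import Fraction
from itertools import product

# Combinatorics step: list all 36 equally likely (die1, die2) outcomes
# and record the sum shown on each.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
p = Fraction(1, len(sums))  # each outcome has probability 1/36

# Expectation: probability-weighted average of the values.
mean = sum(p * s for s in sums)

# Variance: probability-weighted average squared distance from the mean.
variance = sum(p * (s - mean) ** 2 for s in sums)

print(mean, variance)  # 7 35/6
```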

To prepare specifically for the type of probability questions you’re likely to get asked, I’d find some example questions (this is a reasonable list but there are many others too) and work through them on a whiteboard. Practice making probability trees to help visualize and think through the problems.
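A probability tree can also be written down as code, which is a useful way to check your whiteboard answer. The sketch below uses a hypothetical urn problem of my own (3 red and 2 blue balls, two draws without replacement), with each root-to-leaf path of the tree stored as a dictionary entry.

```python
from fractions import Fraction

# Hypothetical urn: 3 red and 2 blue balls, two draws without replacement.
# Each key is one path through the tree (first draw, second draw); each
# value is the product of the branch probabilities along that path.
tree = {
    ("red", "red"): Fraction(3, 5) * Fraction(2, 4),
    ("red", "blue"): Fraction(3, 5) * Fraction(2, 4),
    ("blue", "red"): Fraction(2, 5) * Fraction(3, 4),
    ("blue", "blue"): Fraction(2, 5) * Fraction(1, 4),
}

# The leaves of a complete tree must sum to 1.
assert sum(tree.values()) == 1

# P(second draw is red): add up every path that ends in red.
p_second_red = sum(prob for path, prob in tree.items() if path[1] == "red")

# P(first was red | second is red): the matching path divided by that total.
p_first_red_given_second_red = tree[("red", "red")] / p_second_red

print(p_second_red)                  # 3/5
print(p_first_red_given_second_red)  # 1/2
```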

### Probability Distributions

Intimately related to the topics above are probability distributions. A probability distribution is a function that describes how likely it is that a single observation of a random variable equals a particular value or falls within a range of values. In other words, for any given random process there is both a set of possible values and a likelihood that a single draw from the process will take on each of those values. Probability distributions provide the likelihood for all possible values of a given process.
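One way to see this is to tabulate a distribution by hand. The sketch below builds the binomial distribution for the number of heads in 10 fair coin flips; the choice of the binomial and of these parameters is mine, purely for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k: int, n: int, p: Fraction) -> Fraction:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, Fraction(1, 2)

# The distribution assigns a likelihood to every possible value 0..n.
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

# Those likelihoods cover all outcomes, so they sum to exactly 1.
assert sum(pmf.values()) == 1

print(pmf[0])                        # 1/1024  (a particular value)
print(pmf[5])                        # 63/256
print(sum(pmf[k] for k in (4, 5, 6)))  # 21/32  (a range of values)
```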

As with probability, knowing distributions is a prerequisite to understanding inferential and predictive statistics, but you might also get interview questions specifically about them. The most typical example is: you have a process that behaves like X, so what distribution would you use to model that process? Answering these types of questions is just a matter of mapping random processes to a sensible distribution, and this blog post does a great job of explaining just how to do that.
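As one example of that mapping: counts of events that arrive independently at a roughly constant average rate are commonly modeled with a Poisson distribution. The sketch below assumes a made-up rate of 3 events per hour; both the scenario and the number are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when `rate` events are expected."""
    return rate**k * exp(-rate) / factorial(k)


rate = 3.0  # hypothetical: an average of 3 events per hour

p_none = poisson_pmf(0, rate)   # probability of a completely quiet hour
p_at_least_one = 1 - p_none     # complement: at least one event

print(round(p_none, 4), round(p_at_least_one, 4))
```

Other common pairings worth knowing are a single yes/no trial (Bernoulli), the number of successes in a fixed number of trials (binomial), and the waiting time between Poisson events (exponential).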

### Prediction and Machine Learning

Lastly, we come to prediction. This is the stuff that a lot of people are most excited about — it includes topics as diverse as image recognition, video recommendations, web search, and text translation. Obviously this is a huge area, but I’m assuming you’re interviewing for a more generalist position, in which case expertise in any of the areas will not be assumed.

Instead you want to be able to take any particular prediction problem an interviewer throws at you and provide a reasonable approach to start solving it. Mostly, this means being ready to discuss how you’d pick a model, assess that model’s effectiveness, and then improve on the model. When interviewing, I’d break down the problem into those three steps.

When choosing a model, you mostly want to base your decision on the following: the type and distribution of the outcome variable, the nature of the relationship between dependent and independent variables, the amount of data you have, and the desired level of interpretability. Again, there are no right answers here (though there are often wrong ones), so you just want to be able to have an intelligent discussion about the decisions you’d make and the tradeoffs they imply.

You might also get asked about what kind of features you’d want to include as independent variables (predictors) in your model. This is primarily an exercise in domain knowledge: it’s more about understanding the industry and which pieces of data are likely to predict the outcome of interest than about statistics. The discussion might also drift into feature engineering, which would involve having some intuition on how and when to transform your variables, and data-driven ways of selecting your predictors (e.g. regularization, dimensionality reduction, and automated feature selection).
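To give a feel for what regularization does, here is a minimal sketch of ridge regression with a single predictor, where the penalty `lam` shrinks the fitted slope toward zero. The data are made up, the function name is my own, and in practice you would reach for a library rather than this hand-rolled version.

```python
def ridge_slope(x, y, lam):
    """Slope of a one-predictor ridge fit (data centered, intercept unpenalized)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # lam = 0 gives the ordinary least-squares slope; larger lam shrinks it.
    return sxy / (sxx + lam)


# Made-up data with a slope of roughly 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_slope(x, y, lam), 3))
```

The tradeoff to be ready to discuss is that a stronger penalty lowers variance (less overfitting) at the cost of some bias in the coefficients.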

Assessing a model is a relatively straightforward art that involves holdout data sets used to validate your model and mitigate any overfitting issues. The wiki on this topic is probably sufficient for a baseline. Additionally, you want to be familiar with the different evaluation metrics: accuracy, ROC curves, confusion matrices, etc. This stuff is much less open-ended and I wouldn’t expect you to have to go into microscopic detail about it. A cursory understanding of why holdout sets are necessary and of the pros and cons of different evaluation metrics should suffice.
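The mechanics of a holdout set and a couple of these metrics fit in a few lines. The sketch below is a plain-Python illustration with hypothetical helper names and made-up binary labels; it is not tied to any particular library.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and hold out a fraction the model never trains on."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts


def accuracy(cm):
    """Share of all predictions that were correct."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 75 25

# Made-up holdout labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
print(cm)            # {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
print(accuracy(cm))  # 0.75
```

Accuracy is easy to explain but can mislead when one class is rare, which is one reason the confusion matrix and ROC curves are worth knowing alongside it.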

The third step would be improvement. Mostly this is just a rehash of the feature engineering topics, and the decision about whether it’s necessary to collect more data. When interviewing, make sure your first stab at a model leaves room for you to make improvements — otherwise you’ll have a hard time answering the inevitable follow-up on how you could make it better.