Machine Learning Questions
 Supervised
 Classification & Regression
 Unsupervised: clustering, feature learning, dimension reduction (PCA), autoencoder, GAN

Semisupervised
 Generative, can be either supervised or unsupervised?
 Discriminative
1. What’s the tradeoff between bias and variance?
 bias: difference between the expected/average prediction and the actual value. High bias means the model failed to capture the relations between features and labels.
 variance: how sensitive to fluctuations in training set. High variance means the model is very complex and is overfitting the training data.
low variance  high variance  

low bias  great  overfitting 
high bias  underfitting  ? 
As model becomes more complex, bias will get reduced while variance will rise.
Example: Bias and Variance decomposition in MSE
$x_1, x_2, …, x_n$ and $y_1, y_2, …, y_n$. Assume a function $y = f(x) + \epsilon$, where the noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$. We want to find a function $\hat f$ to approximate $f$. We use mean square error as target loss.
\[\begin{aligned} E[(y  \hat f(x))^2] & = E [y^2  2 y \hat f + \hat f^2]\\ & = \big(E[y^2]  E[y]^2 + E[y]^2 \big)  2 E[y \hat f] + \big(E[\hat f^2]  E[\hat f]^2 + E[\hat f]^2\big)\\ & = Var[y] + Var[\hat f] + E[y]^2 + E[\hat f]^2  2 E[y \hat f]\\ & = \sigma^2 + Var[\hat f] + (f^2  2f \cdot E[\hat f] + E[\hat f]^2)\\ & = \sigma^2 + Var[\hat f] + (E[\hat f]  f)^2\\ & = \sigma^2 + Var[\hat f] + Bias[\hat f]^2 \end{aligned}\]Recall:
\[\begin{aligned} Var(X) & = E[(X E[X])^2]\\ & = E[X^2  2XE[x] + E[x]^2]\\ & = E[X^2]  2E[X]E[X] + E[X]^2\\ & = E[X^2]  E[X]^2\\ \end{aligned}\]According to the given conditions, since $f$ is deterministic, then $E[f] = f$. So $E[y] = E[f + \epsilon] = f$.
And $E[\epsilon] = 0$, $Var[\epsilon] = \sigma^2$. So $Var[\epsilon] = E[\epsilon^2]  E[\epsilon]^2 = \sigma^2 \rightarrow E[\epsilon] = \sigma^2$.
Therefore $Var[y] = E[(y  E[y])^2] = E[(y  f)^2] = E[\epsilon^2] = \sigma^2$
2. What is the difference between supervised and unsupervised machine learning?
 Supervised learning requires labels.
 Unsupervised learning doesn’t require labels.
3. How is KNN different from kmeans clustering?
KNN is supervised/discriminative and classification. Kmeans is unsupervised.
4. Explain how a ROC curve works.
Actual (True)  Actual (False)  

Pred (True)  TP  FP (Type I) 
Pred (False)  FN (Type II)  TN 
 xaxis: FP
 yaxis: TP
 Rank by predictive values
 Linear interpolation
5. Define Precision and Recall.
Precision $P = \frac{TP}{TP+FP}$
Recall $R = \frac{TP}{TP+FN}$
6. What is Bayes’ Theorem? How is it useful in a machine learning context?
$P(AB) = \frac{P(BA) P(A)}{P(B)}$
$\text{posterior} = \frac{\text{contidional prob} \times \text{prior}}{\text{evidence}}$
?Bayes’ Theorem gives you the posterior probability of an event given what is known as prior knowledge.
7. Why is “Naive” Bayes naive?
$p(C_kx) = \frac{p(C_k) p(xC_k)}{p(x)}$
$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$
Because it assumes all the feature dimensions are independent with each other, which is $p(x_i  x_{i+1}, …, x_n  C_k) = p(x_i  C_k)$.
Therefore we have
\[\begin{aligned} p(C_k  x_1, ..., x_n) & \propto p(C_k, x_1, ..., x_n)\\ & = p(C_k) p(x_1  C_k) \cdot ... \cdot p(x_n  C_k)\\ & = p(C_k) \prod_{i=1}^n p(x_i  C_k) \end{aligned}\]The answer is
\[\begin{aligned} y & = \underset{c_k}{argmax} \,\, p(y=C_k) \prod_{i=1}^n p(x_i  C_k) \end{aligned}\]8. Explain the difference between L1 and L2 regularization.
 L1 is $\lambda w_1$, use when sparse.
 L2 is $\lambda w_2$, tends to spread errors among all terms.
Details in here.
 L1 can be thought of as a Bayesian regression with a Laplacian prior.
 L2 corresponds to a Gaussian prior.
9. What’s your favorite algorithm, and can you explain it to me in less than a minute?
Logistic regression. Replace meas squared error with sigmoid function. (Why we use sigmoid function, and how it can reflect logistic regression can be found link).
Recall the sigmoid function is $g(z) = \frac{1}{1+e^{z}}$, and $g’(z) = g(z) \cdot (1g(z))$.
假设label是0/1，$h_\theta(x) = g(\theta^T x) = \frac{1}{1+e^{\theta^T x}}$。
\[\begin{aligned} p(y=1x;\theta) & = h_\theta(x)\\ p(y=0x;\theta) & = 1h_\theta(x) \end{aligned}\]将y=0和1的情况可以整合一下，有$p(yx;\theta) = (h_\theta(x))^y (1h_\theta(x))^{1y}$。
由此，m个独立的data point上的likelihood就是
\[\begin{aligned} L(\theta) & = \prod_{i=1}^m p(y_i  x_i; \theta)\\ & = \prod_{i=1}^m (h_\theta(x_i))^y_i (1h_\theta(x_i))^{1y_i} \end{aligned}\]Therefore, the loglikelihood is
\[\begin{aligned} \ell(\theta) & = \log L(\theta)\\ & = \sum_{i=1}^m \,\, y_i \log h_\theta(x_i) + (1y_i) \log (1  h_\theta(x_i)) \end{aligned}\]To maximize $\ell(\theta)$ is to make the firstorder derivation, then apply gradient descent. So we have
\[\begin{aligned} \frac{\partial}{\partial \theta_j} \ell(\theta) & = \frac{\partial}{\partial \theta_j} \sum_{i=1}^m \,\, y_i \log h_\theta(x_i) + (1y_i) \log (1  h_\theta(x_i))\\ & = \sum_{i=1}^m \,\, \big( y_i \frac{1}{h(\theta^T x_i)}  (1y_i) \frac{1}{1h(\theta^T x_i)} \big) \frac{\partial h(\theta^T x_i)}{\partial \theta_j}\\ & = \sum_{i=1}^m \,\, \big( y_i \frac{1}{h(\theta^T x_i)}  (1y_i) \frac{1}{1h(\theta^T x_i)} \big) \cdot h(\theta^T x_i) (1h(\theta^T x_i)) \frac{\partial \theta^T x_i}{\partial \theta_j}\\ & = \sum_{i=1}^m \,\, \big( y_i  h(\theta^T x_i) \big) x_{i,j} \end{aligned}\]10. What’s the difference between Type I and Type II error?
 Type I error is a false positive.
 Type II error is false negative.
11. What’s a Fourier transform?
??
12. What’s the difference between probability and likelihood?
Their value can be same: $L(\theta  y) = p(y  \theta)$, but they have different meanings/different emphasis.
 Probability, $p(y  \theta)$, is a function of $y$, treating $\theta$ as known/observed.
 If $\theta$ is not random variable, then $p(y  \theta)$ is the (parameterized) probability of $y$ given model parameters $\theta$, which can also be written as $p(y;\theta)$ or $p_\theta(y)$.
 If $\theta$ is random variable, as Bayesian statistics, $p(y\theta) = p(y, \theta) / p(\theta)$ is conditional probability.
 Likelihood, $L(\theta  y)$ is a function of $\theta$, which assumes $y$ as observed. Our target is to find a certain aggisnment $\hat \theta$ for $\theta$ that can maximize $p(y\theta)$. Because it’s a function of $\theta$, then it would become clearer if we write $L(\theta  y)$. And we call the method to find $\hat \theta$ (to maximize $L(\hat \theta  y)$) as maximizes likelihood.
So the term likelihood is just shorthand to refer to the probability $p(y\theta)$ for some data $y$ that results from assigning different values to $\theta$. (比如，在参数空间找最优$\theta$，侧重在$\theta$)
13. What is deep learning, and how does it contrast with other machine learning algorithms?
Deep learning: use backpropagation and certain principles from neuronscience to more accurately model larget sets of unlabelled or semistructured data. It benefits from learning data representations, as opposed to taskspecific algorithms. It can be supervised, unsupervised, and semisupervised.
Deep learning structures like deep neural networks, deep belief networks, and recurrent neural networks have been widely used.
14. What’s the difference between a generative and discriminative model?
 判别模型直接判断
 生成模型，生成指的是 underlying distribution on latent variables.
生成模型进行生成的过程之后，还是会通过Bayes，然后判别的方法进行prediction.

在训练/学习阶段，inference，是学习的P(X,Y) = P(X Y) P(Y) 
在测试阶段，还是通过条件概率，对新的X计算P(Y X)  包含信息比较全，能做其他事

??
Both are supervised.
 Generative: joint distribution $P(X,Y)$.
Examples: naive Bayes, HMM, …
 Special cases:
 AE: No.
 VAE: Yes, the decoder part (reparameterization trick) can generate new samples based on input noise.
 GAN: Yes.
 Tips: models can be considered as “generative” when the input latent variable has probability distribution with it.
 Special cases:
 Discriminative: conditional distribution $P(YX)$. Examples: kNN, perceptron, RF, LR, SVM, NN…
Can also check this.
15. What crossvalidation technique would you use on a time series dataset?
Do the data split based on temporal relation.
16. How is a decision tree pruned?
Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model.
Pruning can happen bottomup and topdown, with approaches such as reduced error pruning and cost complexity pruning.
 Prepruning: pruning during tree construction.
 Stop when the entropy cannot get reduced. To avoid overfitting, set up a threshold. If the entropy is smaller than this threshold, then it will keep branching.
 Postpruning: pruning after the decision tree is built up.
 reduced error pruning (错误率降低剪枝): starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected then the change is kept. While somewhat naive, reduced error pruning has advantage of simplicity and speed.
 cost complexity pruning(代价复杂度剪枝): generate a series of trees $T_0, T_1, \cdots, T_m$ where $T_0$ is the initial tree and $T_m$ is the root alone (?).
At step $i$, the tree is created by removing a subtree from tree $i1$ and replacing it with a leaf node with value chosen as in the tree building algorithm.
The subtree that is removed is chosen as follows:
 Define the error rate of tree $T$ over data set $S$ as $err(T,S)$.
 The subtree that minimizes $\frac{err(prune(T,t), S)  err(T,S)}{leaves(T)  leaves(prune(T,t))}$ is chosen for removal. Once the series of trees has been created, the best tree is chosen by generalized accuracy as measured by a training set or crossvalidation.
Entropy:
Entropy is the expectation of information. Let $x_i$ be one class, then the information of $x_i$ is
\[l(x_i) = \log_2 p(x_i)\]where $p(x_i)$ is the probability of $x_i$. And entropy is
\[H =  \sum_{i=1}^n p(x_i) \log p(x_i)\]17. Which is more important to you– model accuracy, or model performance?
视具体情况而定。 首先Accuracy本身也可以作为performance的一部分。 但是当数据highly skewed的时候，accuracy作为eval method会带来偏差。
18. What’s the F1 score? How would you use it?
$\frac{2}{F1} = \frac{1}{P} + \frac{1}{R}$
$F1 = \frac{2TP}{2TP+FP+FN}$
How to use it?: You would use it in classification tests where true negatives don’t matter much.
19. How would you handle an imbalanced dataset?
如果数据本身是balance的，那么
 收集更多数据
 resample dataset to correct the imbalances
 try different algorithms
如果数据本身是imbalance
 训练的时候增加class weight
20. When should you use classification over regression?
You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (ex: If you wanted to know whether a name was male or female rather than just how correlated they were with male and female names.)
21. Name an example where ensemble techniques might be useful.
Ensemble techniques use a combination of learning algorithms to optimize better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data).
 bagging (bootstrap aggregating) example: random forest.
Details see here.
22. How do you ensure you’re not overfitting with a model?
 Keep the model simpler. Reduce variance by taking into account fewer variables and parameters, thereby remove some of the noise in the training data.
 Use crossvalidation.
 Apply regularization.
 Do early stopping on a holdout validation dataset.
23. What evaluation approaches would you work to gauge the effectiveness of a machine learning model?
First split the dataset into training and test sets, or perhaps use crossvalidation techniques to further segment the dataset into composite sets of training and test sets within the data.
Then implement a choice selection of performance metrics, like
 Accuracy
 Confusion matrix
 Precision
 Recall
 F1 score
 ROC AUC
 PR AUC
What’s important here is to demonstrate that you understand the nuances of
 how a model is measured
 how to choose the right performance measures for the right situations
24. How would you evaluate a logistic regression model?
You have to demonstrate an understanding of what the typical goals of a logistic regression are (classification, prediction etc.) and bring up a few examples and use cases.
25. What’s the “kernel trick” and how is it useful?
It’s a function mapping from input space to feature space, and from low dimension to higher dimension. The benefit is that we don’t have to explicitly calculate the coordinates of points within that dimension; instead kernel function compute the inner products between the images of all pairs of data in a feature space. Therefore it is computationally cheaper than the explicit calculation of said coordinates.
26. How do you handle missing or corrupted data in a dataset?
You could find missing/corrupted data in a dataset and
 drop those rows or columns
 (imputation) decide to replace them with another value
27. Do you have experience with Spark or big data tools for machine learning?
Yes.
 FB intern
28. Pick an algorithm. Write the psuedocode for a parallel implementation.
distributed deep learning (pytorch)
29. What are some differences between a linked list and an array?
 An array is an ordered collection of objects.
 A linked list is a series of objects with pointers that direct how to process them sequentially.
30. Describe a hash table.
A hash table is a data structure that produces an associative array. A key is mapped to certain values through the use of a hash function. They are often used for tasks such as database indexing.
Details in here.
31. Which data visualization libraries do you use? What are your thoughts on the best data visualization tools?
Seaborn. The best visualization depends on the final task.
What’s important here is to define your views on how to properly visualize data and your personal preferences when it comes to tools. Popular tools include R’s ggplot, Python’s seaborn and matplotlib, and tools such as Plot.ly and Tableau.
32. How would you implement a recommendation system for our company’s users?
SVD
33. How can we use your machine learning skills to generate revenue?
for company: 产生收入
This is a tricky question. The ideal answer would demonstrate knowledge of what drives the business and how your skills could relate.
We can start from the development product, apply machine learning for better products/service. For example, if you were interviewing for musicstreaming startup Spotify, you could remark that your skills at developing a better recommendation model would increase user retention, which would then increase revenue in the long run.
34. What do you think of our current data process?
Currently we don’t have a unified framework for data fetching. Because it may vary among different tasks and problems. Just thinking wildly, maybe we can consider applying a machine learning system to do this.
My friend had a talk with Yangqing Jia, when he was trying to create cafe2. The biggest problem they had at that time was: where is the data.
35. What are the last machine learning papers you’ve read?
Deep learning understanding:
Ben Rench: Understanding Deep Learning Requires Rethinking Generalization
Ben Rench: Do CIFAR10 Classifiers Generalize to CIFAR10?
36. Do you have research experience in machine learning?
Yes, CV.
37. What are your favorite use cases of machine learning models?
When we need to find underlying patterns, which are hidden inside the data.
38. How would you approach the “Netflix Prize” competition?
Competetion like this, always apply ensemble method.
39. Where do you usually source datasets?
Machine learning interview questions like these try to get at the heart of your machine learning interest. Somebody who is truly passionate about machine learning will have gone off and done side projects on their own, and have a good idea of what great datasets are out there. If you’re missing any, check out Quandl for economic and financial data, and Kaggle’s Datasets collection for another great list.
40. How do you think Google is training data for selfdriving cars?
Machine learning interview questions like this one really test your knowledge of different machine learning methods, and your inventiveness if you don’t know the answer. Google is currently using recaptcha to source labelled data on storefronts and traffic signs. They are also building on training data collected by Sebastian Thrun at GoogleX — some of which was obtained by his grad students driving buggies on desert dunes!
41. How would you simulate the approach AlphaGo took to beat Lee Sidol at Go?
The Nature paper describes how this was accomplished with “MonteCarlo tree search with deep neural networks that have been trained by supervised learning, from human expert games, and by reinforcement learning from games of selfplay.”