Intro to Bayesian Methods

Frequentists use maximum likelihood estimation:

$\hat \theta = \arg\max_\theta P(X | \theta)$

Bayesians use Bayes' theorem:

$P(\theta | X) = \frac{P(X | \theta)P(\theta)}{P(X)}$

How to define a model

Bayesian network

Nodes: random variables. Edges: direct influence.

Model: joint probability over all variables

$\prod_X P(X | \mathrm{Parents}(X))$

Naive Bayes classifier

$P(c, f_1, f_2, …, f_n) = P(c) \prod_i P(f_i | c)$

Notice: a Bayesian network cannot contain directed cycles. To handle cyclic dependencies, we can either merge the cycle's variables into one joint random variable, or use Markov Random Fields (MRFs).
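The Naive Bayes factorization above can be sketched numerically. This is a minimal illustration with made-up binary features and class priors (all numbers are assumptions, not from the notes): the joint is $P(c)\prod_i P(f_i | c)$, and normalizing over classes gives the posterior.

```python
import numpy as np

# Toy Naive Bayes: P(c, f_1..f_n) = P(c) * prod_i P(f_i | c).
# Classes, features, and probabilities are invented for illustration.
p_c = np.array([0.6, 0.4])            # P(c) for classes 0 and 1
p_f_given_c = np.array([              # P(f_i = 1 | c), shape (n_classes, n_features)
    [0.9, 0.2, 0.5],
    [0.1, 0.7, 0.5],
])

def joint(c, features):
    """P(c, f_1, ..., f_n) under the Naive Bayes factorization."""
    # For f_i = 1 use P(f_i=1|c); for f_i = 0 use its complement.
    likelihoods = np.where(features == 1, p_f_given_c[c], 1 - p_f_given_c[c])
    return p_c[c] * likelihoods.prod()

def posterior(features):
    """P(c | f_1, ..., f_n), normalizing the joint over classes."""
    joints = np.array([joint(c, features) for c in range(len(p_c))])
    return joints / joints.sum()

x = np.array([1, 0, 1])
print(posterior(x))
```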

Linear Regression

Multivariate normal:

$\mathcal{N}(x | \mu, \Sigma) = \frac{1}{\sqrt{|2\pi \Sigma|}} \exp[-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)]$
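The density formula translates directly into code. A minimal sketch (the mean and covariance values are arbitrary assumptions); at $x = \mu$ the density reduces to $1/\sqrt{|2\pi\Sigma|}$, which makes a handy sanity check.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density N(x | mu, Sigma) computed term-by-term from the formula."""
    d = x - mu
    norm = np.sqrt(np.linalg.det(2 * np.pi * Sigma))       # sqrt(|2*pi*Sigma|)
    # solve(Sigma, d) computes Sigma^{-1} d without forming the inverse
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

mu = np.zeros(2)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.0, 0.0]), mu, Sigma))   # density at the mean
```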

Classical linear regression minimizes the squared-error loss (MSE):

$\mathcal{L}(w) = \sum_i (w^T x_i - y_i)^2 $

$P(w, y | X) = P(y | X, w) P(w)$

assume: $P(y | w, X) = \mathcal{N}(y | w^T X, \sigma^2 I)$, and $P(w) = \mathcal{N}(w | 0, \gamma^2 I)$

Our goal is to calculate $P(w | y, X) = \frac{P(y, w | X)}{ P(y | X)}$ and maximize it w.r.t. $w$ (the MAP estimate).

Since the denominator does not depend on $w$, this is equivalent to maximizing $P(w, y | X)$ w.r.t. $w$.

Taking the log does not change the maximizer:

$ \log P(w, y | X) $

$ = \log{P(y | X, w)} + \log{P(w)} $

$ = \log{\frac{1}{\sqrt{|2\pi \sigma^2 I|}} \exp[-\frac{1}{2} (y - w^T X)^T (\sigma^2 I)^{-1} (y - w^T X)]} + \log{\frac{1}{\sqrt{|2\pi \gamma^2 I|}} \exp[-\frac{1}{2} w^T (\gamma^2 I)^{-1} w]}$

Dropping the constant terms that do not depend on $w$:

$ = -\frac{1}{2 \sigma^2} (y-w^TX)^T(y-w^TX) - \frac{1}{2 \gamma^2} w^T w $

$ = -\frac{1}{2 \sigma^2} || y - w^T X ||_2^2 - \frac{1}{2 \gamma^2} ||w||_2^2$

We maximize this w.r.t. $w$.

$ \Longleftrightarrow \min_w || y - w^T X ||_2^2 + \lambda ||w||_2^2$ with $\lambda = \sigma^2 / \gamma^2$, which is exactly ridge (L2-regularized) regression.
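The equivalence can be checked numerically. A minimal sketch, assuming synthetic data and arbitrary noise/prior scales $\sigma, \gamma$; note the code uses the rows-as-data convention $y = Xw + \varepsilon$ (the notes write $w^T X$), under which the MAP/ridge solution has the closed form $(X^TX + \lambda I)^{-1}X^Ty$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 data points (rows), 3 features
w_true = np.array([1.0, -2.0, 0.5])   # assumed ground-truth weights
sigma, gamma = 0.1, 1.0               # assumed noise and prior scales
y = X @ w_true + sigma * rng.normal(size=50)

# MAP under the Gaussian likelihood and prior = ridge with lambda = sigma^2 / gamma^2
lam = sigma**2 / gamma**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)   # close to w_true, slightly shrunk toward zero by the prior
```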

Latent Variable Models

Probabilistic Clustering

In a GMM, the validation likelihood does not keep increasing as we add more and more clusters, so it can guide the choice of cluster count.

In K-means, the objective keeps improving as clusters are added, even on validation data.

Gaussian Mixture Model

In the example, suppose each point comes from one of three Gaussian distributions.

$p(x | \theta) = \pi_1 \mathcal{N}(x | \mu_1, \Sigma_1) + \pi_2 \mathcal{N}(x | \mu_2, \Sigma_2) + \pi_3 \mathcal{N}(x | \mu_3, \Sigma_3)$

In other words, $\theta = \{ \pi_1, \pi_2, \pi_3, \mu_1, \mu_2, \mu_3, \Sigma_1, \Sigma_2, \Sigma_3 \}$

Training GMM

For $N$ data points, we want to get

$\max_\theta p(X | \theta) $

$ = \prod_{i=1}^N p(x_i | \theta) $

$= \prod_{i=1}^N (\pi_1 \mathcal{N}(x_i | \mu_1, \Sigma_1) + \pi_2 \mathcal{N}(x_i | \mu_2, \Sigma_2) + \pi_3 \mathcal{N}(x_i | \mu_3, \Sigma_3))$


$ \pi_1 + \pi_2 + \pi_3 = 1 ; \pi_k \ge 0$

$ \Sigma_k \succ 0$ (positive definite), which is hard to guarantee with SGD.
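The objective above is easy to evaluate for fixed parameters. A minimal 1-D sketch (two components instead of three, with assumed data and parameters): the log-likelihood is $\sum_i \log \sum_k \pi_k \mathcal{N}(x_i | \mu_k, \sigma_k^2)$.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_log_likelihood(X, pis, mus, vars_):
    """log prod_i sum_k pi_k N(x_i | mu_k, var_k) for 1-D data."""
    # comp[i, k] = pi_k * N(x_i | mu_k, var_k)
    comp = np.stack([pi * gaussian_pdf(X, mu, v)
                     for pi, mu, v in zip(pis, mus, vars_)], axis=1)
    return np.log(comp.sum(axis=1)).sum()

# Assumed toy data and parameters: two clusters around -2 and +2.
X = np.array([-2.1, -1.9, 0.0, 1.8, 2.2])
pis = np.array([0.5, 0.5])
mus = np.array([-2.0, 2.0])
vars_ = np.array([1.0, 1.0])
print(gmm_log_likelihood(X, pis, mus, vars_))
```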

Training GMM

Introduce latent variable

$p(x | \theta) = \pi_1 \mathcal{N}(x | \mu_1, \Sigma_1) + \pi_2 \mathcal{N}(x | \mu_2, \Sigma_2) + \pi_3 \mathcal{N}(x | \mu_3, \Sigma_3)$

$t \to x$: the latent variable $t$ indicates which normal distribution each data point $x$ comes from.

Therefore we have

$p(t=c | \theta) = \pi_c$

$p(x | t=c, \theta) = \mathcal{N}(x | \mu_c, \Sigma_c)$

$p(x | \theta) = \sum_c p(x | t=c, \theta) p(t=c | \theta)$

This is exactly the same as the first formula.

Expectation Maximization

If the sources $t$ are known, i.e. we know which distribution each data point comes from, it is easy to estimate $\theta$.

Conversely, if we know the distributions, we can infer the sources $t$.

Given: $p(x | t=1, \theta) = \mathcal{N}(x | -2, 1)$

Find: $p(t=1 | x, \theta)$

And by Bayes' theorem, $p(t=1 | x, \theta) = \frac{p(x | t=1, \theta) p(t=1 | \theta)}{Z}$, where $Z = \sum_c p(x | t=c, \theta)\, p(t=c | \theta)$.
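This posterior over sources is the quantity the E-step computes. A minimal 1-D sketch with two assumed components (the first matching the $\mathcal{N}(-2, 1)$ example above): multiply prior by likelihood, then normalize.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Two assumed 1-D components; the first matches N(-2, 1) from the example.
pis = np.array([0.5, 0.5])
mus = np.array([-2.0, 2.0])
vars_ = np.array([1.0, 1.0])

def responsibilities(x):
    """p(t=c | x, theta) = pi_c * N(x | mu_c, var_c) / Z."""
    joint = pis * gaussian_pdf(x, mus, vars_)
    return joint / joint.sum()   # normalize by Z = sum over components

print(responsibilities(-2.0))    # a point near the first component
```

For a point at $x = -2$, almost all of the posterior mass lands on the first component, as expected.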

Chicken and egg problem :)

Expectation Maximization Algorithm