Intro to Bayesian Methods

Frequentists use maximum likelihood estimation:

$\hat \theta = \arg\max_\theta P(X | \theta)$

Bayesians use Bayes' theorem:

$P(\theta | X) = \frac{P(X | \theta)P(\theta)}{P(X)}$

How to define a model

Bayesian network

Nodes: random variables. Edges: direct influence.

Model: joint probability over all variables

$\prod_X P(X | \mathrm{Parents}(X))$

Naive Bayes classifier

$P(c, f_1, f_2, …, f_n) = P(c) \prod_i P(f_i | c)$

Notice: a Bayesian network cannot contain directed cycles. To handle cyclic dependencies, we can either merge the cycle's variables into one joint random variable, or use Markov Random Fields (MRFs).
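The Naive Bayes factorization above can be sketched numerically. This is a minimal illustration with made-up binary features and class priors (all numbers are assumptions, not from the notes): the joint is $P(c)\prod_i P(f_i | c)$, and normalizing over classes gives the posterior.

```python
import numpy as np

# Toy Naive Bayes: P(c, f_1..f_n) = P(c) * prod_i P(f_i | c).
# Classes, features, and probabilities are invented for illustration.
p_c = np.array([0.6, 0.4])            # P(c) for classes 0 and 1
p_f_given_c = np.array([              # P(f_i = 1 | c), shape (n_classes, n_features)
    [0.9, 0.2, 0.5],
    [0.1, 0.7, 0.5],
])

def joint(c, features):
    """P(c, f_1, ..., f_n) under the Naive Bayes factorization."""
    # For f_i = 1 use P(f_i=1|c); for f_i = 0 use its complement.
    likelihoods = np.where(features == 1, p_f_given_c[c], 1 - p_f_given_c[c])
    return p_c[c] * likelihoods.prod()

def posterior(features):
    """P(c | f_1, ..., f_n), normalizing the joint over classes."""
    joints = np.array([joint(c, features) for c in range(len(p_c))])
    return joints / joints.sum()

x = np.array([1, 0, 1])
print(posterior(x))
```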

Linear Regression

Multivariate normal:

$\mathcal{N}(x | \mu, \Sigma) = \frac{1}{\sqrt{|2\pi \Sigma|}} \exp[-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)]$
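The density formula translates directly into code. A minimal sketch (the mean and covariance values are arbitrary assumptions); at $x = \mu$ the density reduces to $1/\sqrt{|2\pi\Sigma|}$, which makes a handy sanity check.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density N(x | mu, Sigma) computed term-by-term from the formula."""
    d = x - mu
    norm = np.sqrt(np.linalg.det(2 * np.pi * Sigma))       # sqrt(|2*pi*Sigma|)
    # solve(Sigma, d) computes Sigma^{-1} d without forming the inverse
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

mu = np.zeros(2)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.0, 0.0]), mu, Sigma))   # density at the mean
```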

Classical linear regression minimizes the squared-error loss (MSE):

$\mathcal{L}(w) = \sum_i (w^T x_i - y_i)^2 $

$P(w, y | X) = P(y | X, w) P(w)$

assume: $P(y | w, X) = \mathcal{N}(y | w^T X, \sigma^2 I)$, and $P(w) = \mathcal{N}(w | 0, \gamma^2 I)$

Our goal is to calculate $P(w | y, X) = \frac{P(y, w | X)}{ P(y | X)}$ and maximize it w.r.t. $w$ (the MAP estimate).

Since the denominator does not depend on $w$, this is equivalent to maximizing $P(w, y | X)$ w.r.t. $w$.

Taking the log does not change the maximizer:

$ \log P(w, y | X) $

$ = \log{P(y | X, w)} + \log{P(w)} $

$ = \log{\frac{1}{\sqrt{|2\pi \sigma^2 I|}} \exp[-\frac{1}{2} (y - w^T X)^T (\sigma^2 I)^{-1} (y - w^T X)]} + \log{\frac{1}{\sqrt{|2\pi \gamma^2 I|}} \exp[-\frac{1}{2} w^T (\gamma^2 I)^{-1} w]}$

Dropping the constant terms that do not depend on $w$:

$ = -\frac{1}{2 \sigma^2} (y-w^TX)^T(y-w^TX) - \frac{1}{2 \gamma^2} w^T w $

$ = -\frac{1}{2 \sigma^2} || y - w^T X ||_2^2 - \frac{1}{2 \gamma^2} ||w||_2^2$

We maximize this w.r.t. $w$.

$ \Longleftrightarrow \min_w || y - w^T X ||_2^2 + \lambda ||w||_2^2$ with $\lambda = \sigma^2 / \gamma^2$, which is exactly ridge (L2-regularized) regression.
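The equivalence can be checked numerically. A minimal sketch, assuming synthetic data and arbitrary noise/prior scales $\sigma, \gamma$; note the code uses the rows-as-data convention $y = Xw + \varepsilon$ (the notes write $w^T X$), under which the MAP/ridge solution has the closed form $(X^TX + \lambda I)^{-1}X^Ty$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 data points (rows), 3 features
w_true = np.array([1.0, -2.0, 0.5])   # assumed ground-truth weights
sigma, gamma = 0.1, 1.0               # assumed noise and prior scales
y = X @ w_true + sigma * rng.normal(size=50)

# MAP under the Gaussian likelihood and prior = ridge with lambda = sigma^2 / gamma^2
lam = sigma**2 / gamma**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)   # close to w_true, slightly shrunk toward zero by the prior
```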

Latent Variable Models

Probabilistic Clustering

In a GMM, the validation likelihood does not keep increasing as we add more and more clusters, so it can guide the choice of cluster count.

In K-means, the objective keeps improving as clusters are added, even on validation data.

Gaussian Mixture Model

In the example, suppose each point comes from one of three Gaussian distributions.

$p(x | \theta) = \pi_1 \mathcal{N}(x | \mu_1, \Sigma_1) + \pi_2 \mathcal{N}(x | \mu_2, \Sigma_2) + \pi_3 \mathcal{N}(x | \mu_3, \Sigma_3)$

In other words, $\theta = \{ \pi_1, \pi_2, \pi_3, \mu_1, \mu_2, \mu_3, \Sigma_1, \Sigma_2, \Sigma_3 \}$

Training GMM

For $N$ data points, we want to get

$\max_\theta p(X | \theta) $

$ = \prod_{i=1}^N p(x_i | \theta) $

$= \prod_{i=1}^N (\pi_1 \mathcal{N}(x_i | \mu_1, \Sigma_1) + \pi_2 \mathcal{N}(x_i | \mu_2, \Sigma_2) + \pi_3 \mathcal{N}(x_i | \mu_3, \Sigma_3))$


$ \pi_1 + \pi_2 + \pi_3 = 1 ; \pi_k \ge 0$

$ \Sigma_k \succ 0$ (positive definite), which is hard to guarantee with SGD.
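The objective above is easy to evaluate for fixed parameters. A minimal 1-D sketch (two components instead of three, with assumed data and parameters): the log-likelihood is $\sum_i \log \sum_k \pi_k \mathcal{N}(x_i | \mu_k, \sigma_k^2)$.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_log_likelihood(X, pis, mus, vars_):
    """log prod_i sum_k pi_k N(x_i | mu_k, var_k) for 1-D data."""
    # comp[i, k] = pi_k * N(x_i | mu_k, var_k)
    comp = np.stack([pi * gaussian_pdf(X, mu, v)
                     for pi, mu, v in zip(pis, mus, vars_)], axis=1)
    return np.log(comp.sum(axis=1)).sum()

# Assumed toy data and parameters: two clusters around -2 and +2.
X = np.array([-2.1, -1.9, 0.0, 1.8, 2.2])
pis = np.array([0.5, 0.5])
mus = np.array([-2.0, 2.0])
vars_ = np.array([1.0, 1.0])
print(gmm_log_likelihood(X, pis, mus, vars_))
```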

Training GMM

Introduce latent variable

$p(x | \theta) = \pi_1 \mathcal{N}(x | \mu_1, \Sigma_1) + \pi_2 \mathcal{N}(x | \mu_2, \Sigma_2) + \pi_3 \mathcal{N}(x | \mu_3, \Sigma_3)$

$t \to x$: the latent variable $t$ indicates which normal distribution each data point $x$ comes from.

Therefore we have

$p(t=c | \theta) = \pi_c$

$p(x | t=c, \theta) = \mathcal{N}(x | \mu_c, \Sigma_c)$

$p(x | \theta) = \sum_c p(x | t=c, \theta) p(t=c | \theta)$

This is exactly the same as the first formula.

Expectation Maximization

If the sources $t$ are known, i.e. we know which distribution each data point comes from, it is easy to estimate $\theta$.

Conversely, if we know the distributions, we can infer the sources $t$.

Given: $p(x | t=1, \theta) = \mathcal{N}(x | -2, 1)$

Find: $p(t=1 | x, \theta)$

And by Bayes' theorem, $p(t=1 | x, \theta) = \frac{p(x | t=1, \theta) p(t=1 | \theta)}{Z}$, where $Z = \sum_c p(x | t=c, \theta)\, p(t=c | \theta)$.
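This posterior over sources is the quantity the E-step computes. A minimal 1-D sketch with two assumed components (the first matching the $\mathcal{N}(-2, 1)$ example above): multiply prior by likelihood, then normalize.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Two assumed 1-D components; the first matches N(-2, 1) from the example.
pis = np.array([0.5, 0.5])
mus = np.array([-2.0, 2.0])
vars_ = np.array([1.0, 1.0])

def responsibilities(x):
    """p(t=c | x, theta) = pi_c * N(x | mu_c, var_c) / Z."""
    joint = pis * gaussian_pdf(x, mus, vars_)
    return joint / joint.sum()   # normalize by Z = sum over components

print(responsibilities(-2.0))    # a point near the first component
```

For a point at $x = -2$, almost all of the posterior mass lands on the first component, as expected.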

Chicken and egg problem :)

Expectation Maximization Algorithm