# Likelihood

Assume $\epsilon \sim \mathcal{N}(0, \sigma^2)$, that is, $p(\epsilon) = \frac{1}{\sqrt{2\pi}\sigma} \exp\big( -\frac{\epsilon^2}{2\sigma^2} \big)$. This implies that $p(y|x;\theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\big( -\frac{(y - \theta^T x)^2}{2\sigma^2} \big)$, which is the distribution of $y$ given $x$ and parameterized by $\theta$ (not conditioned on $\theta$, since $\theta$ is not a random variable here). Equivalently, $y \,|\, x; \theta \sim \mathcal{N}(\theta^T x, \sigma^2)$.

$L(\theta) = L(\theta; X,y) = p(y | X;\theta)$

Suppose we have $m$ independent data points. Instead of maximizing $L(\theta)$ directly, it is easier to maximize the log likelihood $\ell(\theta)$:

\begin{aligned} \ell(\theta) & = \log L(\theta)\\ & = \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp\big( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \big)\\ & = \sum_{i=1}^m \log \frac{1}{\sqrt{2\pi}\sigma} \exp\big( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \big)\\ & = m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - \theta^T x_i)^2 \end{aligned}
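Since the first term is constant in $\theta$, maximizing $\ell(\theta)$ is the same as minimizing the sum of squared residuals, regardless of $\sigma$. A minimal numpy sketch of this equivalence (all variable names and the synthetic data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 50, 3
X = rng.normal(size=(m, p))
theta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ theta_true + rng.normal(scale=sigma, size=m)

def log_likelihood(theta, X, y, sigma):
    """ell(theta) = m*log(1/(sqrt(2*pi)*sigma)) - sum((y_i - theta^T x_i)^2) / (2*sigma^2)."""
    m = len(y)
    resid = y - X @ theta
    return m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - resid @ resid / (2 * sigma**2)

# The MLE is the ordinary least-squares solution, independent of sigma.
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here `theta_mle` attains a log likelihood at least as high as any other parameter vector, including `theta_true`.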

# Regularization

\begin{aligned} p(\theta | S) & = \frac{p(S|\theta) \cdot p(\theta)}{p(S)}\\ & = \frac{p(S|\theta) \cdot p(\theta)}{\int_\theta p(S|\theta) p(\theta) d\theta}\\ & = \frac{\prod_{i=1}^m p(y_i | x_i, \theta) \cdot p(\theta)}{\int_\theta \big( \prod_{i=1}^m p(y_i | x_i, \theta) \cdot p(\theta) \big) d\theta} \end{aligned}

Since the denominator does not depend on $\theta$, the MAP estimate is

$\theta_{MAP} = \underset{\theta}{\arg\max} \,\, \prod_{i=1}^m p(y_i|x_i, \theta)\, p(\theta)$

## 1. Suppose $\theta \sim \mathcal{N}(0, \lambda I)$

Suppose $\theta \in \mathbb{R}^p$, and $p(\theta) = \frac{1}{(2\pi\lambda)^{p/2}} \exp\big( -\frac{\|\theta\|_2^2}{2\lambda} \big)$.

\begin{aligned} \theta_{MAP} & = \underset{\theta}{\arg\max} \,\, \prod_{i=1}^m p(y_i|x_i, \theta)\, p(\theta)\\ & = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp\big( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \frac{1}{(2\pi\lambda)^{p/2}} \exp\big( -\frac{\|\theta\|_2^2}{2\lambda} \big) \Big)\\ & = \underset{\theta}{\arg\max} \,\, m \log \frac{1}{\sqrt{2\pi}\sigma} + \log \frac{1}{(2\pi\lambda)^{p/2}} - \sum_{i=1}^m \frac{(y_i - \theta^T x_i)^2}{2\sigma^2} - \sum_{j=1}^p \frac{\theta_j^2}{2\lambda}\\ & = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{(y_i - \theta^T x_i)^2}{2\sigma^2} + \sum_{j=1}^p \frac{\theta_j^2}{2\lambda}\\ & = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{1}{2}(y_i - \theta^T x_i)^2 + \frac{\sigma^2}{2\lambda} \|\theta\|_2^2 \end{aligned}

Therefore, a Gaussian prior on $\theta$ introduces an L2 penalty in Bayesian (MAP) regression; this is exactly ridge regression.
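Because the L2-penalized objective is smooth, the MAP estimate has a closed form: setting the gradient of $\frac{1}{2}\|y - X\theta\|_2^2 + \frac{\sigma^2}{2\lambda}\|\theta\|_2^2$ to zero gives $\theta = (X^T X + \alpha I)^{-1} X^T y$ with $\alpha = \sigma^2/\lambda$. A minimal sketch, with illustrative values for $\sigma$ and $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 40, 5
X = rng.normal(size=(m, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=m)

sigma, lam = 0.1, 0.5           # noise std sigma and prior variance lambda (illustrative)
alpha = sigma**2 / lam          # effective L2 weight from the derivation

# MAP / ridge solution: minimize 0.5*||y - X theta||^2 + 0.5*alpha*||theta||^2,
# whose stationarity condition is (X^T X + alpha*I) theta = X^T y.
theta_map = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
```

As $\lambda \to \infty$ (a flat prior), $\alpha \to 0$ and the MAP estimate reduces to the MLE.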

## 2. Suppose $\theta \sim Lap(0, \lambda)$

Similarly, for a Laplacian prior, each component has density $p(\theta_j) = \frac{1}{2\lambda} \exp\big( -\frac{|\theta_j|}{\lambda} \big)$.

\begin{aligned} \theta_{MAP} & = \underset{\theta}{\arg\max} \,\, \prod_{i=1}^m p(y_i|x_i, \theta)\, p(\theta)\\ & = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp\big( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \prod_{j=1}^p \frac{1}{2\lambda} \exp\big( -\frac{|\theta_j|}{\lambda} \big) \Big)\\ & = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \exp\big( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \prod_{j=1}^p \exp\big( -\frac{|\theta_j|}{\lambda} \big) \Big)\\ & = \underset{\theta}{\arg\max} \,\, \sum_{i=1}^m -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} - \sum_{j=1}^p \frac{|\theta_j|}{\lambda}\\ & = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{(y_i - \theta^T x_i)^2}{2\sigma^2} + \sum_{j=1}^p \frac{|\theta_j|}{\lambda}\\ & = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{1}{2}(y_i - \theta^T x_i)^2 + \frac{\sigma^2}{\lambda} \|\theta\|_1 \end{aligned}

Therefore, a Laplacian prior introduces an L1 penalty in Bayesian (MAP) regression; this is exactly the lasso.
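Unlike the Gaussian case, the L1 penalty is not differentiable at zero, so there is no closed-form solution; one standard approach is proximal gradient descent (ISTA), where each step does a gradient update on the squared loss followed by soft-thresholding. A minimal sketch under an illustrative penalty weight `alpha` (playing the role of $\sigma^2/\lambda$) and synthetic sparse data:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=500):
    """Minimize 0.5*||y - X theta||^2 + alpha*||theta||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        theta = soft_threshold(theta - grad / L, alpha / L)
    return theta

rng = np.random.default_rng(2)
m, p = 60, 8
X = rng.normal(size=(m, p))
theta_true = np.zeros(p)
theta_true[:2] = [3.0, -2.0]               # sparse ground truth
y = X @ theta_true + 0.05 * rng.normal(size=m)

theta_map = lasso_ista(X, y, alpha=1.0)
```

The soft-thresholding step sets small coordinates exactly to zero, which is why the Laplacian prior yields sparse estimates while the Gaussian prior only shrinks them.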