Likelihood

Consider the linear regression problem $y = \theta^T x + \epsilon$.

Assume $\epsilon \sim \mathcal{N}(0, \sigma^2)$, that is, $p(\epsilon) = \frac{1}{\sqrt{2\pi} \sigma} \exp \big( - \frac{\epsilon^2}{2\sigma^2} \big)$. This implies that $p(y|x;\theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y - \theta^T x)^2}{2\sigma^2} \big)$, which is the distribution of $y$ given $x$ and parameterized by $\theta$ (not conditioned on $\theta$, since $\theta$ is not a random variable here). Equivalently, $y \mid x; \theta \sim \mathcal{N}(\theta^T x, \sigma^2)$.

Thus, given $\theta$ and $x$, we have a probability distribution over $y$. This quantity can be viewed as a function of $y$ for a fixed $\theta$. If we instead view it as a function of $\theta$, we call it the likelihood function:

\[L(\theta) = L(\theta; X,y) = p(y | X;\theta)\]

Suppose we have $m$ i.i.d. data points. Since the logarithm is monotonically increasing, instead of maximizing $L(\theta)$ we can maximize the log likelihood $\ell(\theta)$:

\[\begin{aligned} \ell(\theta) & = \log L (\theta)\\ & = \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big)\\ & = \sum_{i=1}^m \log \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big)\\ & = m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - \theta^T x_i)^2 \end{aligned}\]

Therefore, maximizing $\ell(\theta)$ is equivalent to minimizing $\frac{1}{2} \sum_{i=1}^m (y_i - \theta^T x_i)^2$.

Via MLE, we have turned the linear fitting problem into a least squares problem.
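To make this concrete, here is a minimal numerical sketch (all data sizes, noise levels, and names are made up for illustration): we minimize the negative log likelihood directly and check that it recovers the ordinary least squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, p, sigma = 200, 3, 0.5                       # illustrative sizes and noise level
X = rng.normal(size=(m, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=m)

# Negative log likelihood (constants dropped):
# NLL(theta) = (1 / (2 sigma^2)) * sum_i (y_i - theta^T x_i)^2
def nll(theta):
    r = y - X @ theta
    return (r @ r) / (2 * sigma**2)

theta_mle = minimize(nll, x0=np.zeros(p)).x     # maximize l(theta) = minimize NLL
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(theta_mle, theta_ols, atol=1e-5))  # True: same minimizer
```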

Regularization

Understanding regularization from a Bayesian (probabilistic) perspective.

Above, we treated $\theta$ as constant-valued but unknown, which is the frequentist way of thinking. From the Bayesian perspective, we instead treat $\theta$ as a random variable whose value is unknown. In that case, we place a prior distribution on $\theta$ and update it through training. (We will see shortly that this prior knowledge shows up in the optimization objective as regularization.)

Note 1: since we now treat $\theta$ as a random variable, we consider $p(y|x,\theta)$ rather than $p(y|x;\theta)$.

Note 2: the derivation below is not limited to linear regression; it applies to any $f$ or $p(y|x,\theta)$.

Given the whole dataset $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, we have

\[\begin{aligned} p(\theta | S) & = \frac{p(S|\theta) \cdot p(\theta)}{p(S)}\\ & = \frac{p(S|\theta) \cdot p(\theta)}{\int_\theta p(S|\theta) p(\theta) d\theta}\\ & = \frac{\prod_{i=1}^m p(y_i | x_i, \theta) \cdot p(\theta)}{\int_\theta \big( \prod_{i=1}^m p(y_i | x_i, \theta) \cdot p(\theta) \big) d\theta} \end{aligned}\]

Therefore, given a new data point $(x, y)$, we can use the posterior distribution of $\theta$ to compute $p(y|x,S) = \int_\theta p(y|x,\theta)\, p(\theta|S)\, d\theta$, where $p(\theta|S)$ is the posterior computed above.
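For intuition, here is a small sketch of this fully Bayesian predictive computation for a scalar $\theta$, using a grid approximation (all values are illustrative): the posterior $p(\theta|S)$ is tabulated on a grid, and $p(y|x,S)$ is the posterior-weighted average of $p(y|x,\theta)$. Even in one dimension this needs numerical integration, which hints at why the full posterior becomes intractable in higher dimensions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma, lam = 0.5, 1.0                        # noise std, prior variance (illustrative)
x_data = rng.normal(size=20)
y_data = 1.5 * x_data + rng.normal(scale=sigma, size=20)

# Grid approximation of p(theta | S) ∝ prod_i p(y_i | x_i, theta) * p(theta)
grid = np.linspace(-5, 5, 2001)
dtheta = grid[1] - grid[0]
log_post = norm.logpdf(grid, 0.0, np.sqrt(lam))        # log prior
for xi, yi in zip(x_data, y_data):
    log_post += norm.logpdf(yi, grid * xi, sigma)      # add log likelihood terms
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta                            # normalize: divides out p(S)

# Predictive p(y | x, S) = ∫ p(y | x, theta) p(theta | S) dtheta at a new point
x_new, y_new = 2.0, 3.0
pred = (norm.pdf(y_new, grid * x_new, sigma) * post).sum() * dtheta
print(pred)
```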

The above is the fully Bayesian procedure. The problem is that the posterior of $\theta$ is usually very hard to compute (the normalizing integral in the denominator is intractable), so in practice a single point estimate often replaces the full posterior distribution. Keeping only the numerator, we obtain the MAP (maximum a posteriori) estimate of $\theta$:

\[\theta_{MAP} = \underset{\theta}{\arg\max} \prod_{i=1}^m p(y_i|x_i, \theta)\, p(\theta)\]

Returning to the linear regression example $y = \theta^T x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, we consider two cases.

1. Suppose $\theta \sim \mathcal{N}(0, \lambda I)$

Suppose $\theta \in \mathbb{R}^p$ with i.i.d. components $\theta_j \sim \mathcal{N}(0, \lambda)$, so $p(\theta) = \frac{1}{(2\pi\lambda)^{p/2}} \exp\big( -\frac{\|\theta\|_2^2}{2\lambda} \big)$.

\[\begin{aligned} \theta_{MAP} & = \underset{\theta}{\arg\max} \,\, \prod_{i=1}^m p(y_i|x_i, \theta)\, p(\theta)\\ & = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \frac{1}{(2\pi\lambda)^{p/2}} \exp\big( -\frac{\|\theta\|_2^2}{2\lambda} \big) \Big)\\ & = \underset{\theta}{\arg\max} \,\, m \log \frac{1}{\sqrt{2\pi}\sigma} + \frac{p}{2} \log \frac{1}{2\pi\lambda} - \sum_{i=1}^m \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} - \sum_{j=1}^p \frac{\theta_j^2}{2\lambda}\\ & = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} + \sum_{j=1}^p \frac{\theta_j^2}{2\lambda}\\ & = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{1}{2} ( y_i - \theta^T x_i)^2 + \frac{\sigma^2}{2\lambda} \|\theta\|_2^2 \end{aligned}\]

Therefore, a Gaussian prior introduces an L2 penalty in Bayesian regression; this is exactly ridge regression.
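As a sanity check, a minimal sketch on synthetic data (with illustrative values for $\sigma$ and $\lambda$): the MAP objective above is ridge regression with penalty weight $\sigma^2/\lambda$, so its minimizer should match the ridge closed-form solution $(X^T X + \frac{\sigma^2}{\lambda} I)^{-1} X^T y$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m, p, sigma, lam = 100, 4, 0.5, 0.1
X = rng.normal(size=(m, p))
y = X @ rng.normal(size=p) + rng.normal(scale=sigma, size=m)
alpha = sigma**2 / lam          # L2 penalty weight from the derivation

# MAP objective: 0.5 * ||y - X theta||^2 + (alpha / 2) * ||theta||_2^2
def map_obj(theta):
    r = y - X @ theta
    return 0.5 * (r @ r) + 0.5 * alpha * (theta @ theta)

theta_map = minimize(map_obj, np.zeros(p)).x
theta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
print(np.allclose(theta_map, theta_ridge, atol=1e-5))  # True
```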

2. Suppose $\theta \sim \mathrm{Laplace}(0, \lambda)$

Similarly, for a Laplace prior with i.i.d. components, $p(\theta) = \prod_{j=1}^p \frac{1}{2\lambda} \exp \big( -\frac{|\theta_j|}{\lambda} \big)$.

\[\begin{aligned} \theta_{MAP} & = \underset{\theta}{\arg\max} \,\, \prod_{i=1}^m p(y_i|x_i, \theta)\, p(\theta)\\ & = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \prod_{j=1}^p \frac{1}{2\lambda} \exp \big( -\frac{|\theta_j|}{\lambda} \big) \Big)\\ & = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \prod_{j=1}^p \exp \big( -\frac{|\theta_j|}{\lambda} \big) \Big)\\ & = \underset{\theta}{\arg\max} \,\, - \sum_{i=1}^m \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} - \sum_{j=1}^p \frac{|\theta_j|}{\lambda}\\ & = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} + \sum_{j=1}^p \frac{|\theta_j|}{\lambda}\\ & = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{1}{2} ( y_i - \theta^T x_i)^2 + \frac{\sigma^2}{\lambda} \|\theta\|_1 \end{aligned}\]

Therefore, a Laplace prior introduces an L1 penalty in Bayesian regression; this is exactly the lasso.
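A minimal sketch of solving this L1-penalized objective on synthetic data (ISTA, i.e., proximal gradient with soft-thresholding, is one standard choice here, since the L1 term is non-differentiable; the data and constants are illustrative). The sparsity-inducing effect of the Laplace prior shows up as exact zeros in the estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
m, p, sigma, lam = 200, 10, 0.5, 0.05
X = rng.normal(size=(m, p))
theta_true = np.zeros(p)
theta_true[:3] = [2.0, -1.5, 1.0]       # sparse ground truth
y = X @ theta_true + rng.normal(scale=sigma, size=m)
alpha = sigma**2 / lam                  # L1 penalty weight from the derivation

# ISTA: gradient step on 0.5 * ||y - X theta||^2, then soft-threshold
# (the proximal operator of alpha * ||theta||_1)
step = 1.0 / np.linalg.norm(X, 2)**2    # 1 / Lipschitz constant of the gradient
theta = np.zeros(p)
for _ in range(5000):
    z = theta - step * (X.T @ (X @ theta - y))
    theta = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)

print(np.round(theta, 3))   # coefficients outside the true support shrink to exactly 0
```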
