Exponential Family

The exponential family is the class of distributions that can be written in the following form

\[p(y;\eta) = b(y) \exp (\eta^T T(y) - a(\eta)) = b(y) \frac{\exp( \eta^T T(y))}{\exp(a(\eta))}\]
  • $\eta$: natural parameter of the distribution
  • $T(y)$: sufficient statistic
  • $a(\eta)$: log partition function. $e^{-a(\eta)}$ acts as a normalization constant, ensuring that $\sum_y p(y;\eta) = 1$ or $\int_y p(y;\eta)\, dy = 1$

Fixing $T$, $a$, and $b$ defines a family (or set) of distributions that is parameterized by $\eta$.
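As a quick illustration, here is a minimal sketch (the function name `exp_family_pdf` and the callables `b`, `T`, `a` are hypothetical, not part of any library) of evaluating such a density directly from its components:

```python
import numpy as np

def exp_family_pdf(y, eta, b, T, a):
    """Evaluate p(y; eta) = b(y) * exp(eta . T(y) - a(eta)).

    b, T, a are callables that pick out one member of the family;
    eta may be a scalar or a vector of natural parameters.
    """
    eta = np.atleast_1d(np.asarray(eta, dtype=float))
    Ty = np.atleast_1d(np.asarray(T(y), dtype=float))
    return b(y) * np.exp(eta @ Ty - a(eta))
```

For the Bernoulli example below, one would pass `b = lambda y: 1.0`, `T = lambda y: y`, and `a = lambda eta: np.log(1 + np.exp(eta))`.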

Example: Bernoulli distribution

\[\begin{aligned} p(y;\phi) & = \phi^y (1-\phi)^{1-y}\\ & = \exp (y \log \phi + (1-y) \log(1-\phi))\\ & = \exp ((\log(\frac{\phi}{1-\phi})) y + \log(1-\phi)) \end{aligned}\]
  • $\eta = \log(\phi / (1-\phi))$, and we have $\phi = 1 / (1+e^{-\eta})$
  • $T(y) = y$
  • $a(\eta) = - \log (1-\phi) = \log(1+e^{\eta})$
  • $b(y) = 1$
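A small numeric check (NumPy, with an arbitrary choice of $\phi$) that this re-parameterization reproduces the original Bernoulli pmf:

```python
import numpy as np

# Bernoulli(phi) in exponential-family form:
#   eta = log(phi / (1 - phi)),  T(y) = y,  a(eta) = log(1 + e^eta),  b(y) = 1
phi = 0.3
eta = np.log(phi / (1 - phi))
a = np.log(1 + np.exp(eta))

for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)     # phi^y (1 - phi)^(1 - y)
    exp_fam = 1.0 * np.exp(eta * y - a)      # b(y) exp(eta T(y) - a(eta))
    assert np.isclose(direct, exp_fam)
```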

Example: Gaussian distribution

Recall that in the linear regression setting, the value of $\sigma^2$ has no effect on the final choice of $\theta$ and $h_\theta(x)$. Thus we set $\sigma = 1$ here for simplicity.

\[\begin{aligned} p(y;\mu) & = \frac{1}{\sqrt{2\pi}} \exp (-\frac{1}{2}(y-\mu)^2)\\ & = \frac{1}{\sqrt{2\pi}} \exp (-\frac{1}{2}y^2) \exp(\mu y - \frac{1}{2} \mu^2)\\ \end{aligned}\]
  • $\eta = \mu$
  • $T(y) = y$
  • $a(\eta) = \mu^2/2 = \eta^2 / 2$
  • $b(y) = \frac{1}{\sqrt{2\pi}} \exp (-\frac{1}{2}y^2)$
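Similarly, a small numeric check (arbitrary $\mu$) that the factored form matches the $\mathcal{N}(\mu, 1)$ density:

```python
import numpy as np

# N(mu, 1) in exponential-family form:
#   eta = mu,  T(y) = y,  a(eta) = eta^2 / 2,  b(y) = exp(-y^2 / 2) / sqrt(2*pi)
mu = 1.7
eta = mu

for y in np.linspace(-3, 3, 7):
    direct = np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2 * np.pi)
    b_y = np.exp(-0.5 * y ** 2) / np.sqrt(2 * np.pi)
    exp_fam = b_y * np.exp(eta * y - eta ** 2 / 2)
    assert np.isclose(direct, exp_fam)
```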

Generalized Linear Model (GLM)

We now construct GLMs from the exponential family. The target is to predict the value of a random variable $y$ as a function of $x$. To derive a GLM, we make the following three assumptions about the conditional distribution of $y$ given $x$:

  1. $y|x;\theta \sim \text{ExponentialFamily}(\eta)$. Given $x$ and $\theta$, the distribution of $y$ follows some exponential family distribution with natural parameter $\eta$.
  2. Given $x$, our goal is to predict the expected value of $T(y)$. In most examples we have $T(y) = y$, so this means we would like the prediction $h(x)$ output by our learned hypothesis $h$ to satisfy $h(x) = E[y|x]$.
  3. The natural parameter $\eta$ and the input $x$ are related linearly: $\eta = \theta^T x$.

Example: Least Squares

Consider the setting where the target variable $y$ is continuous, and we model the conditional distribution of $y$ given $x$ as a Gaussian $\mathcal{N}(\mu, \sigma^2)$. We let the exponential family be the Gaussian distribution. So we have $\mu = \eta$, and

\[\begin{aligned} h_\theta(x) & = E[y | x; \theta]\\ & = \mu\\ & = \eta\\ & = \theta^T x \end{aligned}\]
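In code the resulting hypothesis is simply the linear score, since the response function here is the identity; a tiny sketch with hypothetical values of $\theta$ and $x$:

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])   # hypothetical learned parameters
x = np.array([1.0, 3.0, 0.5])        # hypothetical input (first entry as intercept term)

h = theta @ x                        # h_theta(x) = E[y|x; theta] = mu = eta = theta^T x
print(h)                             # -1.5
```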

Example: Logistic Regression

Here we are interested in binary classification, so $y \in \{0, 1\}$. Since $y$ is binary, we can use the Bernoulli family of distributions to model the conditional distribution of $y$ given $x$. We have $\phi = \frac{1}{1+e^{-\eta}}$. Furthermore, note that if $y | x; \theta \sim \text{Bernoulli}(\phi)$, then $E[y |x;\theta] = \phi$. So we have

\[\begin{aligned} h_\theta(x) & = E[y | x;\theta]\\ & = \phi\\ & = \frac{1}{1+e^{-\eta}}\\ & = \frac{1}{1+e^{-\theta^T x}} \end{aligned}\]

So this gives us hypothesis functions of the form $h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$, which is the logistic function.
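A minimal sketch of this hypothesis (hypothetical $\theta$ and $x$):

```python
import numpy as np

def h_theta(x, theta):
    """Logistic-regression hypothesis h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

theta = np.array([0.5, -1.0])   # hypothetical learned parameters
x = np.array([2.0, 1.0])        # hypothetical input
print(h_theta(x, theta))        # estimated p(y = 1 | x; theta) = sigmoid(0.0) = 0.5
```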

Example: Softmax Regression

Let’s consider another case: $y$ can take on any one of $k$ values ($k$-ary classification), so $y \in \{1, 2, \ldots, k\}$. We will model it with a multinomial distribution.

To parameterize a multinomial over $k$ possible outcomes, we could use $k$ parameters $\phi_1, \phi_2, \ldots, \phi_k$ specifying the probability of each class. However, these parameters would be dependent, since $\sum_i \phi_i = 1$. So we will instead parameterize the multinomial with only $k-1$ parameters, $\phi_1, \ldots, \phi_{k-1}$. (There may be other dependencies as well; this is just the most obvious one.) The probability of the last class is $p(y=k;\phi) = 1 - \sum_{i=1}^{k-1} \phi_i$. For notational convenience, we let $\phi_k = p(y=k;\phi) = 1 - \sum_{i=1}^{k-1} \phi_i$.

To express the multinomial as an exponential family distribution, we define $T(y) \in \mathbb{R}^{k-1}$ as follows:

\[\begin{aligned} T(1) = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0\end{bmatrix} ,& T(2) = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0\end{bmatrix} ,& T(3) = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0\end{bmatrix} ,& \cdots ,& T(k-1) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1\end{bmatrix} ,& T(k) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0\end{bmatrix} ,& \end{aligned}\]

Unlike the previous examples, we do not have $T(y) = y$; instead, $T(y)$ is a $(k-1)$-dimensional vector, as sketched below.
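A small sketch of this encoding (the helper `T` is hypothetical), mapping each class label to its $(k-1)$-dimensional indicator vector:

```python
import numpy as np

def T(y, k):
    """Map y in {1, ..., k} to the (k-1)-dimensional vector T(y); T(k) is all zeros."""
    t = np.zeros(k - 1)
    if y < k:
        t[y - 1] = 1.0
    return t

k = 4
for y in range(1, k + 1):
    print(y, T(y, k))
# 1 [1. 0. 0.]   2 [0. 1. 0.]   3 [0. 0. 1.]   4 [0. 0. 0.]
```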

Next we introduce the indicator function $1\{\cdot\}$, which equals $1$ if its argument is true and $0$ otherwise. The multinomial can then be expressed as an exponential family distribution as follows:

\[\begin{aligned} p(y;\phi) & = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}}\\ & = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}}\\ & = \phi_1^{T(y)_1} \phi_2^{T(y)_2} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} T(y)_i}\\ & = \exp \big( T(y)_1 \log(\phi_1) + T(y)_2 \log(\phi_2) + \cdots + (1 - \sum_{i=1}^{k-1} T(y)_i) \log(\phi_k) \big)\\ & = \exp \big( T(y)_1 \log(\phi_1 / \phi_k) + T(y)_2 \log(\phi_2 / \phi_k) + \cdots + T(y)_{k-1} \log(\phi_{k-1} / \phi_k) + \log(\phi_k) \big)\\ & = b(y) \exp (\eta^T T(y) - a(\eta)) \end{aligned}\]

where

\[\begin{aligned} \eta & = \begin{bmatrix} \log(\phi_1 / \phi_k) \\ \log (\phi_2 / \phi_k) \\ \vdots \\ \log (\phi_{k-1} / \phi_k)\end{bmatrix}\\ a(\eta) & = - \log (\phi_k) \\ b(y) & = 1 \end{aligned}\]

The link function is given by (for $i = 1, 2, \cdots, k$)

\[\eta_i = \log \frac{\phi_i}{\phi_k}\]

To invert the link function and derive the response function, we have

\[\begin{aligned} e^{\eta_i} & = \frac{\phi_i}{\phi_k}\\ \phi_k e^{\eta_i} & = \phi_i \\ \phi_k \sum_{i=1}^k e^{\eta_i} & = \sum_{i=1}^k \phi_i = 1 \end{aligned}\]

This implies that $\phi_k = 1 / \sum_{i=1}^k e^{\eta_i}$. Substituting this back into $\phi_i = \phi_k e^{\eta_i}$, we have

\[\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^k e^{\eta_j}}\]

This function mapping from $\eta$ to $\phi$ is called the softmax function.
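A standard implementation of this mapping (subtracting the maximum before exponentiating for numerical stability, which leaves the result unchanged):

```python
import numpy as np

def softmax(eta):
    """Map natural parameters eta (length k) to probabilities phi."""
    z = eta - np.max(eta)   # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

eta = np.array([2.0, 0.5, -1.0, 0.0])   # hypothetical eta, with eta_k = 0
phi = softmax(eta)
print(phi, phi.sum())                    # probabilities summing to 1
```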

In a GLM, we assume $\eta$ and $x$ are linearly related, so $\eta_i = \theta_i^T x$ for $i=1,2,\cdots, k-1$. For notational convenience, we also define $\theta_k = 0$, so that $\eta_k = \theta_k^T x = 0$. Therefore we have

\[\begin{aligned} p(y=i|x;\theta) & = \phi_i\\ & = \frac{e^{\eta_i}}{\sum_{j=1}^k e^{\eta_j}}\\ & = \frac{e^{\theta_i^T x}}{\sum_{j=1}^k e^{\theta_j^T x}} \end{aligned}\]

This model is called softmax regression. It is a generalization of logistic regression.
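A minimal sketch of these class probabilities, with a hypothetical parameter matrix `Theta` whose rows are $\theta_1, \ldots, \theta_k$ (and $\theta_k$ fixed to zero):

```python
import numpy as np

Theta = np.array([[ 1.0, -0.5],      # theta_1 (hypothetical)
                  [ 0.2,  0.8],      # theta_2 (hypothetical)
                  [ 0.0,  0.0]])     # theta_k = 0 by convention
x = np.array([1.0, 2.0])             # hypothetical input

scores = Theta @ x                   # eta_i = theta_i^T x
e = np.exp(scores - scores.max())    # stabilized exponentials
phi = e / e.sum()                    # p(y = i | x; theta) for i = 1, ..., k
print(phi, phi.sum())
```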

Our hypothesis will output

\[\begin{aligned} h_\theta(x) & = E[T(y)|x;\theta] \\ & = E[\begin{bmatrix} 1 \{y=1\} \\ 1 \{y=2\}\\ \vdots\\ 1 \{y=k-1\} \end{bmatrix} | x;\theta]\\ & = \begin{bmatrix} \phi_1\\ \phi_2\\ \vdots\\ \phi_{k-1}\end{bmatrix}\\ & = \begin{bmatrix} \frac{e^{\theta_1^T x}}{\sum_{j=1}^k e^{\theta_j^T x}}\\ \frac{e^{\theta_2^T x}}{\sum_{j=1}^k e^{\theta_j^T x}}\\ \vdots\\ \frac{e^{\theta_{k-1}^T x}}{\sum_{j=1}^k e^{\theta_j^T x}}\end{bmatrix} \end{aligned}\]

So our hypothesis outputs the estimated probability $p(y=i|x;\theta)$ for every value of $i=1,2,\cdots,k$ (although $h_\theta(x)$ is only $(k-1)$-dimensional, $p(y=k|x;\theta)$ can be obtained as $1 - \sum_{i=1}^{k-1} \phi_i$). Therefore, the log-likelihood for $m$ data points is

\[\begin{aligned} \ell(\theta) & = \sum_{i=1}^m \log p(y_i|x_i;\theta)\\ & = \sum_{i=1}^m \log \prod_{l=1}^k \Big( \frac{e^{\theta_{l}^T x_i}}{\sum_{j=1}^k e^{\theta_j^T x_i}} \Big)^{1\{y_i = l\}} \end{aligned}\]

Then we can apply gradient ascent or Newton’s method.
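As a rough sketch of the gradient-ascent option (assumptions: labels re-indexed to $\{0, \ldots, k-1\}$, all $k$ parameter vectors left free rather than pinning $\theta_k = 0$, and hypothetical learning-rate/iteration settings), using the fact that the gradient of $\ell$ with respect to $\theta_l$ is $\sum_i (1\{y_i = l\} - \phi_l(x_i))\, x_i$:

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax(X, y, k, lr=0.1, n_iters=500):
    """Batch gradient ascent on the softmax-regression log-likelihood.

    X: (m, n) design matrix; y: (m,) integer labels in {0, ..., k-1}.
    Returns Theta of shape (k, n). The last row could be fixed to zero,
    but is left free here for simplicity (the model is over-parameterized).
    """
    m, n = X.shape
    Theta = np.zeros((k, n))
    Y = np.eye(k)[y]                        # one-hot labels, shape (m, k)
    for _ in range(n_iters):
        P = softmax_rows(X @ Theta.T)       # p(y = l | x_i; theta), shape (m, k)
        grad = (Y - P).T @ X                # d(ell)/d(Theta), shape (k, n)
        Theta += lr * grad / m              # ascend the average log-likelihood
    return Theta

# Hypothetical toy usage:
# X = np.random.randn(100, 3); y = np.random.randint(0, 3, size=100)
# Theta = fit_softmax(X, y, k=3)
```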

Appendix

Constructing a GLM can involve some heuristic choices.

Reference to pdf