Exponential Family

The exponential family can be written in the following form:

$$p(y;\eta) = b(y) \exp\left( \eta^T T(y) - a(\eta) \right)$$

  • $\eta$: natural parameter of the distribution
  • $T(y)$: sufficient statistic
  • $a(\eta)$: log partition function. $e^{-a(\eta)}$ serves as the normalization constant, ensuring that $\sum_y p(y;\eta) = 1$ or $\int_y p(y;\eta)\, dy = 1$
  • $b(y)$: base measure

Fixing $T$, $a$, and $b$ determines a family (or set) of distributions that is parameterized by $\eta$.
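As a quick illustration, here is a minimal Python sketch (not from the original notes; the helper name `exp_family_pmf` is illustrative) that evaluates the density directly from a choice of $b$, $T$, and $a$, using the Bernoulli instance derived in the example that follows:

```python
import math

def exp_family_pmf(y, eta, T, a, b):
    """Evaluate p(y; eta) = b(y) * exp(eta * T(y) - a(eta))."""
    return b(y) * math.exp(eta * T(y) - a(eta))

# Bernoulli as an exponential family: T(y) = y, b(y) = 1, a(eta) = log(1 + e^eta)
eta = 0.5  # natural parameter, eta = log(phi / (1 - phi))
p1 = exp_family_pmf(1, eta, T=lambda y: y, a=lambda e: math.log(1 + math.exp(e)), b=lambda y: 1)
p0 = exp_family_pmf(0, eta, T=lambda y: y, a=lambda e: math.log(1 + math.exp(e)), b=lambda y: 1)
print(p0 + p1)  # probabilities sum to 1, as guaranteed by e^{-a(eta)}
```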

Example: Bernoulli distribution

  • $\eta = \log(\phi / (1-\phi))$, and we have $\phi = 1 / (1+e^{-\eta})$
  • $T(y) = y$
  • $a(\eta) = - \log (1-\phi) = \log(1+e^{\eta})$
  • $b(y) = 1$
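These identifications come from rewriting the Bernoulli pmf in the exponential family form:

$$p(y;\phi) = \phi^y (1-\phi)^{1-y} = \exp\left( y \log\frac{\phi}{1-\phi} + \log(1-\phi) \right),$$

so reading this off against $b(y) \exp(\eta T(y) - a(\eta))$ gives the values above.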

Example: Gaussian distribution

Recall that in the linear regression setting, the value of $\sigma^2$ has no effect on the final choice of $\theta$ and $h_\theta(x)$, so here we set $\sigma = 1$ for simplicity.

  • $\eta = \mu$
  • $T(y) = y$
  • $a(\eta) = \mu^2/2 = \eta^2 / 2$
  • $b(y) = \frac{1}{\sqrt{2\pi}} \exp (-\frac{1}{2}y^2)$
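The identification follows from expanding the square in the Gaussian density (with $\sigma = 1$):

$$p(y;\mu) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}(y-\mu)^2 \right) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}y^2 \right) \exp\left( \mu y - \frac{1}{2}\mu^2 \right).$$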

Generalized Linear Model (GLM)

We construct GLMs from the exponential family. The goal is to predict the value of a random variable $y$ as a function of $x$. To derive a GLM, we will make the following three assumptions about the conditional distribution of $y$ given $x$:

  1. $y \mid x; \theta \sim \text{ExponentialFamily}(\eta)$. Given $x$ and $\theta$, the distribution of $y$ follows some exponential family distribution, with natural parameter $\eta$.
  2. Given $x$, our goal is to predict the expected value of $T(y)$. In most of our examples, we have $T(y) = y$, so this means we would like the prediction $h(x)$ output by our learned hypothesis $h$ to satisfy $h(x) = E[y|x]$.
  3. The natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta = \theta^T x$.

Example: Least Squares

Consider the setting where the target variable $y$ is continuous, and we model the conditional distribution of $y$ given $x$ as a Gaussian $\mathcal{N}(\mu, \sigma^2)$. We let the exponential family be the Gaussian distribution. So we have $\mu = \eta$, and

$$h_\theta(x) = E[y \mid x; \theta] = \mu = \eta = \theta^T x,$$

where the first equality follows from assumption 2, the third from the Gaussian identification $\mu = \eta$, and the last from assumption 3.

Example: Logistic Regression

Here we are interested in binary classification, so $y \in \{0, 1\}$. Since $y$ is binary, we can use the Bernoulli family of distributions to model the conditional distribution of $y$ given $x$, with $\phi = \frac{1}{1+e^{-\eta}}$. Furthermore, note that if $y \mid x; \theta \sim \text{Bernoulli}(\phi)$, then $E[y \mid x; \theta] = \phi$. So we have

$$h_\theta(x) = E[y \mid x; \theta] = \phi = \frac{1}{1+e^{-\eta}}.$$

Since $\eta = \theta^T x$, this gives us hypothesis functions of the form $h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$, which is the logistic function.
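A minimal sketch of this hypothesis in Python (the names here are illustrative, not from the notes):

```python
import numpy as np

def logistic_hypothesis(theta, x):
    """h_theta(x) = 1 / (1 + exp(-theta^T x)), the GLM response for a Bernoulli output."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

theta = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(logistic_hypothesis(theta, x))  # estimated p(y = 1 | x; theta)
```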

Example: Softmax Regression

Let’s consider another case: $y$ can take on any one of $k$ values ($k$-ary classification), so $y \in \{1, 2, \cdots, k\}$. We will model it with a multinomial distribution.

To parameterize a multinomial over $k$ possible outcomes, one could use $k$ parameters $\phi_1, \phi_2, \cdots, \phi_k$ specifying the probability of each class. However, these parameters would be dependent, since $\sum_i \phi_i = 1$. So we will instead parameterize the multinomial with only $k-1$ parameters, $\phi_1, \cdots, \phi_{k-1}$. (There may be other dependencies as well; the one used here is simply the most obvious.) The probability of the last class is $p(y=k;\phi) = 1 - \sum_{i=1}^{k-1} \phi_i$. For notational convenience, we let $\phi_k = p(y=k;\phi) = 1 - \sum_{i=1}^{k-1} \phi_i$, keeping in mind that $\phi_k$ is not a free parameter.

To express the multinomial as an exponential family distribution, we define $T(y) \in \mathbb{R}^{k-1}$ as follows:

$$T(1) = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad T(2) = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \cdots, \quad T(k-1) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}, \quad T(k) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

Unlike the previous examples, we do not have $T(y) = y$; instead, $T(y)$ is a $(k-1)$-dimensional vector. We will write $(T(y))_i$ to denote the $i$-th element of $T(y)$.

Next we introduce the indicator function $1\{\cdot\}$: $1\{\text{true}\} = 1$ and $1\{\text{false}\} = 0$. With this notation, $(T(y))_i = 1\{y = i\}$. The multinomial can then be expressed as an exponential family distribution as follows:

$$
\begin{aligned}
p(y;\phi) &= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}} \\
&= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} (T(y))_i} \\
&= \exp\left( (T(y))_1 \log(\phi_1/\phi_k) + \cdots + (T(y))_{k-1} \log(\phi_{k-1}/\phi_k) + \log \phi_k \right) \\
&= b(y) \exp\left( \eta^T T(y) - a(\eta) \right),
\end{aligned}
$$

where

$$\eta = \begin{bmatrix} \log(\phi_1/\phi_k) \\ \log(\phi_2/\phi_k) \\ \vdots \\ \log(\phi_{k-1}/\phi_k) \end{bmatrix}, \qquad a(\eta) = -\log(\phi_k), \qquad b(y) = 1.$$

The link function is given by (for $i = 1, 2, \cdots, k$)

$$\eta_i = \log \frac{\phi_i}{\phi_k}.$$

(For convenience, this also defines $\eta_k = \log(\phi_k/\phi_k) = 0$.)

To invert the link function and derive the response function, we have

$$e^{\eta_i} = \frac{\phi_i}{\phi_k}, \qquad \phi_k e^{\eta_i} = \phi_i, \qquad \phi_k \sum_{i=1}^{k} e^{\eta_i} = \sum_{i=1}^{k} \phi_i = 1.$$

This implies that $\phi_k = 1 / \sum_{i=1}^k e^{\eta_i}$. Substituting this back into $\phi_i = \phi_k e^{\eta_i}$, we have

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}.$$

This function mapping from $\eta$ to $\phi$ is called the softmax function.
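A small Python sketch of the softmax response function (the max subtraction is a standard numerical-stability trick, not part of the derivation):

```python
import numpy as np

def softmax(eta):
    """Map natural parameters eta to class probabilities phi_i = e^{eta_i} / sum_j e^{eta_j}."""
    z = eta - np.max(eta)   # shifting eta leaves the ratio unchanged but avoids overflow
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.0])))  # entries are positive and sum to 1
```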

In a GLM, we assume that $\eta$ and $x$ are linearly related, so $\eta_i = \theta_i^T x$ for $i = 1, 2, \cdots, k-1$. For notational convenience, we also define $\theta_k = 0$, so that $\eta_k = \theta_k^T x = 0$, as before. Therefore we have

$$p(y = i \mid x; \theta) = \phi_i = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}.$$

This model is called softmax regression. It is a generalization of logistic regression.

Our hypothesis will output

$$h_\theta(x) = E[T(y) \mid x; \theta] = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} = \begin{bmatrix} \dfrac{\exp(\theta_1^T x)}{\sum_{j=1}^k \exp(\theta_j^T x)} \\ \vdots \\ \dfrac{\exp(\theta_{k-1}^T x)}{\sum_{j=1}^k \exp(\theta_j^T x)} \end{bmatrix}.$$

So our hypothesis outputs the estimated probability $p(y=i \mid x;\theta)$ for every value of $i = 1, 2, \cdots, k$ (even though $h_\theta(x)$ is only $(k-1)$-dimensional, the estimate for $p(y=k \mid x;\theta)$ is obtained as $1 - \sum_{i=1}^{k-1} \phi_i$). Therefore, the log-likelihood for $m$ data points is

$$\ell(\theta) = \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^{m} \log \prod_{l=1}^{k} \left( \frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \right)^{1\{y^{(i)} = l\}}.$$

Then we can maximize $\ell(\theta)$ by applying gradient ascent or Newton's method.
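As an illustration of the gradient-ascent option, here is a self-contained sketch on a made-up tiny dataset. The update uses $\nabla_{\theta_l} \ell = \sum_i \left( 1\{y^{(i)}=l\} - \phi_l^{(i)} \right) x^{(i)}$, the standard gradient of this log-likelihood; for simplicity, the sketch learns all $k$ parameter vectors rather than fixing $\theta_k = 0$ as the notes do (the model is then overparameterized but the predictions are unchanged):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, k, lr=0.1, steps=1000):
    """Maximize the log-likelihood ell(theta) by gradient ascent.

    X: (m, n) design matrix; y: (m,) labels in {0, ..., k-1}
    (the notes use 1..k; zero-based here for indexing).
    Returns Theta of shape (k, n); row l plays the role of theta_l.
    """
    m, n = X.shape
    Theta = np.zeros((k, n))
    Y = np.eye(k)[y]                        # one-hot encoding, i.e. 1{y = l}
    for _ in range(steps):
        Phi = softmax_rows(X @ Theta.T)     # Phi[i, l] = p(y = l | x_i; Theta)
        Theta += lr * (Y - Phi).T @ X / m   # gradient of the average log-likelihood
    return Theta

# Tiny made-up example: 3 classes in 2 dimensions
X = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]])
y = np.array([0, 0, 1, 1, 2, 2])
Theta = fit_softmax_regression(X, y, k=3)
print(softmax_rows(X @ Theta.T).argmax(axis=1))  # should classify the training points correctly
```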

Appendix

Constructing a GLM can involve some heuristic choices, such as which exponential family distribution to use to model $y$.

Reference to pdf