2016-02-28

## Why we have the Expectation Maximization Algorithm

Sometimes we want to estimate the parameters of a model that contains hidden (latent) variables. A typical application is clustering: we measure a number of data points and we know that each of them was generated by one of $$K$$ different models. What we don't know, however, is which data point comes from which of the $$K$$ sources, and this is exactly what we want to infer. The prototypical example is a mixture model

\begin{equation*} p(x|\theta) = \sum_{k=1}^K p(x|z=k, \theta_k) \, p(z=k|\pi), \end{equation*}

where $$\theta=(\theta_1,\ldots, \theta_K, \pi)$$. Each $$\theta_k$$ denotes the parameters of model $$k$$, and $$\pi$$ are the mixture weights.
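
To make this concrete, one common instance is a Gaussian mixture, in which each component $$k$$ is a normal distribution with its own mean and covariance:

\begin{equation*} p(x|\theta) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x | \mu_k, \Sigma_k), \qquad \theta_k = (\mu_k, \Sigma_k). \end{equation*}

We will come back to this example below.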

In principle you could now calculate the gradient of the likelihood and try to find the optimal parameter $$\theta$$. This is, however, slightly annoying since the log-likelihood $$\ell(\theta)$$ contains a sum within the logarithm

\begin{equation*} \ell(\theta) = \log p(D|\theta) = \sum_i \log p(x_i | \theta) = \sum_i \log \sum_{k=1}^K p(x_i| z=k, \theta_k) \, p(z=k|\pi). \end{equation*}
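
For comparison: if the assignments $$z_i$$ were observed, the logarithm would act directly on the joint probability and the objective would decouple,

\begin{equation*} \sum_i \log p(x_i, z_i|\theta) = \sum_i \log p(x_i| z_i, \theta_{z_i}) + \sum_i \log p(z_i|\pi), \end{equation*}

which can be maximized component by component. It is the sum over $$z$$ inside the logarithm that destroys this structure.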

With the Expectation Maximization algorithm we "move the sum outside" the logarithm and iteratively maximize the log-likelihood towards a local optimum. More precisely, we find a sequence $$(\theta_0, \theta_1, \ldots)$$ such that

\begin{equation*} \ell(\theta_0) \leq \ell(\theta_1) \leq \ell (\theta_2) \leq \ldots \end{equation*}

## Construction of the sequence and proof of its monotonicity

At the heart of the algorithm is a function $$Q$$ defined as

\begin{align*} Q(\theta, \theta') :=& \sum_i E_{p(z| x_i, \theta')}[\log p(x_i, z|\theta)] + \sum_i H[p(z|x_i, \theta')] \\ =& \sum_i \sum_z p(z|x_i, \theta') \log p(x_i, z| \theta) - \sum_i \sum_z p(z| x_i, \theta') \log p(z |x_i, \theta') \end{align*}

where $$H$$ denotes the entropy. We also define the sequence

\begin{equation*} \theta_{t+1} := \arg\max_\theta Q(\theta, \theta_t) \end{equation*}
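
For the Gaussian-mixture example above this maximization has a closed form. Writing the responsibilities computed in the expectation step as $$r_{ik} := p(z=k|x_i, \theta_t)$$, the maximizing parameters are the responsibility-weighted maximum-likelihood estimates

\begin{equation*} \pi_k = \frac{\sum_i r_{ik}}{\sum_i \sum_{k'} r_{ik'}}, \qquad \mu_k = \frac{\sum_i r_{ik}\, x_i}{\sum_i r_{ik}}, \qquad \Sigma_k = \frac{\sum_i r_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_i r_{ik}}. \end{equation*}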

We'll show that $$Q$$ and $$(\theta_t)_t$$ have the following properties

- $$(a) \quad Q(\theta, \theta) = \ell(\theta)$$
- $$(b) \quad \ell(\theta) \geq Q(\theta, \theta') \quad \forall \theta'$$
- $$(c) \quad Q(\theta_{t+1}, \theta_t) \geq Q(\theta', \theta_t) \quad \forall \theta'$$

With the properties a, b and c we have

\begin{equation*} \ell(\theta_{t+1}) \geq Q(\theta_{t+1}, \theta_t) \geq Q(\theta_t, \theta_t) = \ell(\theta_t). \end{equation*}

The first inequality follows from (b), the second from (c) and the last equality from (a). This proves that the log-likelihood never decreases along the sequence.

### Proof of c

Property (c) is satisfied by definition of the sequence: $$\theta_{t+1}$$ is chosen as a maximizer of $$Q(\cdot, \theta_t)$$, so $$Q(\theta_{t+1}, \theta_t) \geq Q(\theta', \theta_t)$$ for every $$\theta'$$.

### Proof of b

Let's proceed to prove (b) using Jensen's inequality.
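
For reference, the form of Jensen's inequality we need: since the logarithm is concave, for any distribution $$q(z)$$ and positive values $$y_z$$ we have

\begin{equation*} \log \sum_z q(z)\, y_z \geq \sum_z q(z) \log y_z. \end{equation*}

Below this is applied with $$q(z) = p(z|x_i, \theta')$$ and $$y_z = p(x_i, z|\theta) / p(z|x_i, \theta')$$.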

\begin{equation*} \ell(\theta) = \sum_i \log \sum_z p(x_i, z | \theta) = \sum_i \log \sum_z \frac{p(x_i, z| \theta)}{p(z| x_i, \theta')} p(z|x_i, \theta') \geq \sum_i \sum_z p(z|x_i, \theta') \log \frac{p(x_i, z| \theta)}{p(z |x_i, \theta')} = Q(\theta, \theta'). \end{equation*}

This proves property (b).

### Proof of a

Lastly, let's prove (a). Writing $$p(x_i, z|\theta) = p(z|x_i, \theta)\, p(x_i|\theta)$$, the entropy term cancels; the step in which the sum over $$z$$ disappears uses $$\sum_z p(z|x_i, \theta) = 1$$.

\begin{equation*} Q(\theta, \theta) = \sum_i \sum_z p(z|x_i, \theta) \log \frac{p(z|x_i, \theta) p(x_i|\theta)}{p(z |x_i, \theta)} = \sum_i \sum_z p(z|x_i, \theta) \log p(x_i|\theta) = \sum_i \log p(x_i|\theta) = \ell(\theta) \end{equation*}

## Some remarks

The algorithm is called the Expectation-Maximization algorithm because we alternate between the expectation step, in which we compute the posteriors $$p(z|x_i, \theta_t)$$ and with them $$Q(\theta, \theta_t)$$, and the maximization step, in which we maximize $$Q(\theta, \theta_t)$$ over $$\theta$$ to arrive at $$\theta_{t+1}$$. The parameter vector $$\theta_0$$ has to be initialized at the start, and the local optimum that is found depends on this initial value.
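
To make the alternation concrete, here is a minimal sketch of the two steps for the one-dimensional Gaussian-mixture example, using NumPy and SciPy. It only illustrates the structure of the loop (the function name and initialization scheme are my own choices here), not a polished implementation.

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, K, n_iter=100, seed=0):
    """Minimal EM sketch for a 1-D Gaussian mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialize theta_0: mixture weights, means, standard deviations.
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z=k | x_i, theta_t).
        dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(K)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)

        # M-step: theta_{t+1} = argmax_theta Q(theta, theta_t), in closed form.
        nk = r.sum(axis=0)
        pi = nk / nk.sum()
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    return pi, mu, sigma

# Example: recover two well-separated clusters.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])
print(em_gmm(x, K=2))
```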

Also, check my next post for a Python implementation.