Expectation Maximization Theory - Unsupervised Learning

Why we have the Expectation Maximization Algorithm

Sometimes we want to estimate the parameters of a model with hidden (latent) variables. A typical example application is clustering: we measure a number of data points and know that each of them comes from one of \(K\) different sources. What we don't know, however, is which data point comes from which of the \(K\) sources, and this is exactly what we want to infer. The prototypical model thus looks like

\begin{equation*} p(x|\theta) = \sum_{k=1}^K p(x|z=k, \theta_k)\, p(z=k|\pi) \end{equation*}

where \(\theta=(\theta_1,\ldots, \theta_K, \pi)\). Each \(\theta_k\) denotes the parameters of model \(k\), and \(\pi\) collects the model weights, i.e. \(p(z=k|\pi) = \pi_k\).

In principle you could now compute the gradient of the likelihood and try to find the optimal parameter \(\theta\) directly. This is, however, awkward, since the log-likelihood \(\ell(\theta)\) contains a sum inside the logarithm:

\begin{equation*} \ell(\theta) = \log p(D|\theta) = \sum_i \log p(x_i | \theta) = \sum_i \log \sum_{k=1}^K p(x_i| z=k, \theta_k)\, p(z=k|\pi). \end{equation*}
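As a concrete sketch (assuming a one-dimensional Gaussian mixture as the model; the function name and parameterization are illustrative, not from a particular library), this log-likelihood can be evaluated stably with the log-sum-exp trick:

```python
import numpy as np

def log_likelihood(x, pis, mus, sigmas):
    """ell(theta) = sum_i log sum_k pi_k N(x_i | mu_k, sigma_k^2)."""
    # log of pi_k * N(x_i | mu_k, sigma_k^2) for every pair (i, k)
    log_p = (np.log(pis)
             - 0.5 * np.log(2 * np.pi * sigmas**2)
             - (x[:, None] - mus)**2 / (2 * sigmas**2))
    # log-sum-exp over k, for numerical stability
    m = log_p.max(axis=1, keepdims=True)
    return float((m.squeeze(1) + np.log(np.exp(log_p - m).sum(axis=1))).sum())
```

For a single standard-normal component and \(x = 0\) this returns \(-\tfrac{1}{2}\log 2\pi \approx -0.919\), as expected.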

With the Expectation Maximization algorithm we "move the sum outside" the logarithm and iteratively maximize the log-likelihood towards a local optimum. More precisely, we find a sequence \((\theta_0, \theta_1, \ldots)\) such that

\begin{equation*} \ell(\theta_0) \leq \ell(\theta_1) \leq \ell (\theta_2) \leq \ldots \end{equation*}

Construction of the sequence and proof of its monotonicity

At the heart of the algorithm is a function \(Q\) defined as

\begin{equation*} Q(\theta, \theta') := \sum_i E_{p(z| x_i, \theta')}[\log p(x_i, z|\theta)] + \sum_i H[p(z|x_i, \theta')] = \sum_i \sum_z p(z|x_i, \theta') \log p(x_i, z| \theta) - \sum_i \sum_z p(z| x_i, \theta') \log p(z |x_i, \theta') \end{equation*}

where \(H\) denotes the entropy. We also define the sequence

\begin{equation*} \theta_{t+1} := \arg\max_\theta Q(\theta, \theta_t) \end{equation*}
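To make the abstract update \(\theta_{t+1} = \arg\max_\theta Q(\theta, \theta_t)\) concrete: for a one-dimensional Gaussian mixture (an assumption for illustration; the algorithm itself is model-agnostic), the maximizer of \(Q\) has a closed form, so one EM step can be sketched as

```python
import numpy as np

def em_step(x, pis, mus, sigmas):
    """One EM update theta_{t+1} = argmax_theta Q(theta, theta_t) for a 1-D Gaussian mixture."""
    # E-step: responsibilities r[i, k] = p(z=k | x_i, theta_t)
    dens = (pis * np.exp(-(x[:, None] - mus)**2 / (2 * sigmas**2))
            / (np.sqrt(2 * np.pi) * sigmas))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of Q(theta, theta_t)
    nk = r.sum(axis=0)
    new_pis = nk / len(x)
    new_mus = (r * x[:, None]).sum(axis=0) / nk
    new_sigmas = np.sqrt((r * (x[:, None] - new_mus)**2).sum(axis=0) / nk)
    return new_pis, new_mus, new_sigmas
```

Starting from a rough guess, repeated calls pull the component means towards the clusters in the data.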

We'll show that \(Q\) and the sequence \((\theta_t)_t\) have the following properties:

  • \((a) \quad Q(\theta, \theta) = \ell(\theta)\)
  • \((b) \quad \ell(\theta) \geq Q(\theta, \theta') \quad \forall \theta'\)
  • \((c) \quad Q(\theta_{t+1}, \theta_t) \geq Q(\theta', \theta_t) \quad \forall \theta'\)

With properties a, b, and c we have

\begin{equation*} \ell(\theta_{t+1}) \geq Q(\theta_{t+1}, \theta_t) \geq Q(\theta_t, \theta_t) = \ell(\theta_t). \end{equation*}

The first inequality follows from b, the second from c, and the final equality from a. This shows that the sequence of log-likelihood values is monotonically non-decreasing.

Proof of c

Property c is satisfied directly: by definition of the sequence, \(\theta_{t+1}\) maximizes \(Q(\cdot\,, \theta_t)\) over all \(\theta\).

Proof of b

Let's proceed to prove b using Jensen's inequality.
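For reference, the form of Jensen's inequality used here: since \(\log\) is concave, for any distribution \(q\) over \(z\) (below, \(q(z) = p(z|x_i, \theta')\)) and positive values \(a_z\),

\begin{equation*} \log \sum_z q(z)\, a_z \geq \sum_z q(z) \log a_z. \end{equation*}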

\begin{equation*} \ell(\theta) = \sum_i \log \sum_z p(x_i, z | \theta) = \sum_i \log \sum_z \frac{p(x_i, z| \theta)}{p(z| x_i, \theta')} p(z|x_i, \theta') \geq \sum_i \sum_z p(z|x_i, \theta') \log \frac{p(x_i, z| \theta)}{p(z |x_i, \theta')} = Q(\theta, \theta'). \end{equation*}

This proves property b.

Proof of a

Lastly, let's prove a. Using the factorization \(p(x_i, z|\theta) = p(z|x_i, \theta)\, p(x_i|\theta)\) we get

\begin{equation*} Q(\theta, \theta) = \sum_i \sum_z p(z|x_i, \theta) \log \frac{p(z|x_i, \theta) p(x_i|\theta)}{p(z |x_i, \theta)} = \sum_i \sum_z p(z|x_i, \theta) \log p(x_i|\theta) = \sum_i \log p(x_i|\theta) = \ell(\theta) \end{equation*}

Some remarks

The algorithm is called the Expectation-Maximization algorithm because we alternate between an expectation step, in which we compute the posteriors \(p(z|x_i, \theta_t)\) needed to evaluate \(Q(\theta, \theta_t)\), and a maximization step, in which we maximize \(Q(\theta, \theta_t)\) over \(\theta\) to obtain \(\theta_{t+1}\). The parameter vector \(\theta_0\) has to be initialized at the beginning, and the local optimum that is found depends on this initial value.
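The monotonicity result can be sanity-checked numerically. The sketch below (synthetic data and a two-component 1-D Gaussian mixture, both assumptions made for illustration) runs the full E/M loop and records \(\ell(\theta_t)\) at every iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data from two well-separated Gaussians
x = np.concatenate([rng.normal(-4.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# initial guess theta_0
pis = np.array([0.5, 0.5])
mus = np.array([-1.0, 1.0])
sigmas = np.array([1.0, 1.0])

lls = []
for _ in range(50):
    # log of pi_k * N(x_i | mu_k, sigma_k^2) for every (i, k)
    lp = (np.log(pis) - 0.5 * np.log(2 * np.pi * sigmas**2)
          - (x[:, None] - mus)**2 / (2 * sigmas**2))
    m = lp.max(axis=1, keepdims=True)
    log_px = m + np.log(np.exp(lp - m).sum(axis=1, keepdims=True))  # log p(x_i | theta_t)
    lls.append(float(log_px.sum()))                                 # ell(theta_t)
    r = np.exp(lp - log_px)            # E-step: p(z | x_i, theta_t)
    nk = r.sum(axis=0)                 # M-step: closed-form argmax of Q
    pis = nk / len(x)
    mus = (r * x[:, None]).sum(axis=0) / nk
    sigmas = np.sqrt((r * (x[:, None] - mus)**2).sum(axis=0) / nk)

# ell(theta_0) <= ell(theta_1) <= ..., as proved above
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

The recorded log-likelihoods never decrease, and the estimated means end up near the true cluster centers.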

Also, check my next post for a Python implementation.