Expectation Maximization Theory - Unsupervised Learning
Why we have the Expectation Maximization Algorithm
Sometimes we want to estimate the parameters of a model with hidden (latent) variables. A typical example application is clustering. Let's say we measured a number of data points and we know that these data points come from one of \(K\) different sources. What we don't know, however, is which data point comes from which of the \(K\) different sources. And this is exactly what we want to infer. The prototypical model thus looks like
\[
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k),
\]
where \(\theta=(\theta_1,\ldots, \theta_K, \pi)\). Each \(\theta_k\) denotes the parameters of model \(k\) and \(\pi = (\pi_1, \ldots, \pi_K)\) are the model weights.
In principle you could now calculate the gradient of the likelihood and try to find the optimal parameters \(\theta\) directly. This is, however, slightly annoying, since the log-likelihood \(\ell(\theta) = \log p(x \mid \theta)\) of the observed data \(x = (x_1, \ldots, x_N)\) (assumed independent) contains a sum inside the logarithm:
\[
\ell(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, p(x_i \mid \theta_k).
\]
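To make this concrete, here is a minimal sketch of such a log-likelihood in Python, assuming a one-dimensional Gaussian mixture (the function names are my own, not from any particular library). The inner sum over components sits inside the logarithm, which is exactly what makes direct gradient-based optimization awkward:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, weights, mus, sigmas):
    """ell(theta) = sum_i log sum_k pi_k N(x_i | mu_k, sigma_k).
    Note the sum over k inside the logarithm."""
    total = 0.0
    for x in data:
        mixture = sum(w * gaussian_pdf(x, m, s)
                      for w, m, s in zip(weights, mus, sigmas))
        total += math.log(mixture)
    return total

# toy data, conceptually drawn from two sources centered at -2 and 2
data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
print(log_likelihood(data, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]))
```

Parameters placing the component means near the two apparent clusters yield a higher log-likelihood than, say, both means at zero.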
With the Expectation Maximization algorithm we "move the sum outside" the logarithm and iteratively increase the log-likelihood towards a local optimum. More precisely, we find a sequence of parameter estimates \((\theta_0, \theta_1, \ldots)\) such that
\[
\ell(\theta_{t+1}) \geq \ell(\theta_t) \quad \text{for all } t.
\]
Construction of the sequence and proof of its monotonicity
At the heart of the algorithm is a function \(Q\) defined as
\[
Q(\theta, \theta') = \mathbb{E}_{z \sim p(z \mid x, \theta')}\bigl[\log p(x, z \mid \theta)\bigr] + H\bigl(p(z \mid x, \theta')\bigr),
\]
where \(H\) denotes the entropy and \(z\) is the latent variable indicating which of the \(K\) sources generated \(x\). We also define the sequence
\[
\theta_{t+1} = \operatorname*{arg\,max}_{\theta'} Q(\theta', \theta_t).
\]
We'll show that \(Q\) and \((\theta_t)_t\) have the following properties
- \((a) \quad Q(\theta, \theta) = \ell(\theta)\)
- \((b) \quad \ell(\theta) \geq Q(\theta, \theta') \quad \forall \theta'\)
- \((c) \quad Q(\theta_{t+1}, \theta_t) \geq Q(\theta', \theta_t) \quad \forall \theta'\)
With the properties (a), (b) and (c) we have
\[
\ell(\theta_{t+1}) \geq Q(\theta_{t+1}, \theta_t) \geq Q(\theta_t, \theta_t) = \ell(\theta_t).
\]
The first inequality follows from (b), the second from (c) and the last equality from (a). This proves the monotonicity of the sequence \((\ell(\theta_t))_t\).
Proof of c
Property (c) is satisfied by construction: \(\theta_{t+1}\) is defined as a maximizer of \(Q(\cdot, \theta_t)\), so \(Q(\theta_{t+1}, \theta_t) \geq Q(\theta', \theta_t)\) holds for every \(\theta'\).
Proof of b
Let's proceed to prove (b) using Jensen's inequality. Since the logarithm is concave, Jensen's inequality gives \(\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]\), and hence
\[
\begin{aligned}
\ell(\theta) &= \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) \\
&= \log \sum_z p(z \mid x, \theta') \, \frac{p(x, z \mid \theta)}{p(z \mid x, \theta')} \\
&\geq \sum_z p(z \mid x, \theta') \log \frac{p(x, z \mid \theta)}{p(z \mid x, \theta')} \\
&= \mathbb{E}_{z \sim p(z \mid x, \theta')}\bigl[\log p(x, z \mid \theta)\bigr] + H\bigl(p(z \mid x, \theta')\bigr) = Q(\theta, \theta').
\end{aligned}
\]
This proves property (b).
Proof of a
Lastly, let's prove (a). Setting \(\theta' = \theta\) and using \(p(x, z \mid \theta) = p(z \mid x, \theta) \, p(x \mid \theta)\),
\[
\begin{aligned}
Q(\theta, \theta) &= \sum_z p(z \mid x, \theta) \log \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)} \\
&= \sum_z p(z \mid x, \theta) \log p(x \mid \theta) = \log p(x \mid \theta) = \ell(\theta),
\end{aligned}
\]
since the posterior \(p(z \mid x, \theta)\) sums to one. This proves property (a).
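Properties (a) and (b) can also be sanity-checked numerically. Below is a small sketch of my own construction, assuming a two-component one-dimensional Gaussian mixture with fixed unit variances; `loglik` evaluates \(\ell(\theta)\) and `Q` evaluates \(Q(\theta, \theta')\) by summing the expected complete-data log-likelihood plus the entropy over all data points:

```python
import math

def npdf(x, mu):
    # unit-variance Gaussian density (a simplifying assumption)
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def loglik(data, theta):
    pis, mus = theta
    return sum(math.log(sum(p * npdf(x, m) for p, m in zip(pis, mus)))
               for x in data)

def Q(theta, theta_prime, data):
    """Q(theta, theta') = E_{z ~ p(z|x, theta')}[log p(x, z | theta)] + H."""
    pis, mus = theta
    pis_p, mus_p = theta_prime
    total = 0.0
    for x in data:
        joint_p = [p * npdf(x, m) for p, m in zip(pis_p, mus_p)]
        norm = sum(joint_p)
        resp = [j / norm for j in joint_p]  # posterior p(z | x, theta')
        for k, r in enumerate(resp):
            # r * log p(x, z=k | theta)  plus the entropy term  -r * log r
            total += r * (math.log(pis[k] * npdf(x, mus[k])) - math.log(r))
    return total

data = [-2.1, -1.9, 1.8, 2.2]
theta = ([0.5, 0.5], [-2.0, 2.0])
theta_prime = ([0.3, 0.7], [0.0, 1.0])

assert abs(Q(theta, theta, data) - loglik(data, theta)) < 1e-9  # property (a)
assert loglik(data, theta) >= Q(theta, theta_prime, data)       # property (b)
```

The two assertions mirror (a) and (b): \(Q(\theta, \theta)\) matches \(\ell(\theta)\) up to floating-point error, and \(Q(\theta, \theta')\) never exceeds \(\ell(\theta)\).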
Some remarks
The algorithm is called the Expectation-Maximization algorithm because we alternate between an expectation step, in which we compute \(Q(\theta, \theta_t)\) as an expectation over the posterior \(p(z \mid x, \theta_t)\), and a maximization step, in which we maximize it to arrive at \(\theta_{t+1}\). The parameter vector \(\theta_0\) has to be initialized at the beginning, and which local optimum is found depends on this initial value.
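The two alternating steps can be sketched as follows. This is a minimal illustration of my own, assuming a one-dimensional Gaussian mixture (it is not the implementation from the next post): the E-step computes the responsibilities \(p(z = k \mid x_i, \theta_t)\), and the M-step plugs them into the closed-form maximizers of \(Q(\cdot, \theta_t)\):

```python
import math

def npdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(data, pis, mus, sigmas, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns updated (pis, mus, sigmas)."""
    for _ in range(n_iter):
        # E-step: responsibilities p(z = k | x_i, theta_t)
        resp = []
        for x in data:
            joint = [p * npdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
            norm = sum(joint)
            resp.append([j / norm for j in joint])
        # M-step: closed-form maximization of Q(theta, theta_t)
        K = len(pis)
        n_k = [sum(r[k] for r in resp) for k in range(K)]
        pis = [nk / len(data) for nk in n_k]
        mus = [sum(r[k] * x for r, x in zip(resp, data)) / n_k[k]
               for k in range(K)]
        sigmas = [math.sqrt(sum(r[k] * (x - mus[k]) ** 2
                                for r, x in zip(resp, data)) / n_k[k])
                  for k in range(K)]
    return pis, mus, sigmas

data = [-2.1, -1.9, -2.3, 1.8, 2.2, 2.1]
pis, mus, sigmas = em_gmm(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

On this toy data the estimated means settle near the two cluster centers. By the monotonicity result above, each iteration can only increase \(\ell(\theta)\).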
Also, check my next post for a Python implementation.