2016-03-13

## Be careful with scikit-learn's defaults

Playing around with the factor analysis model in scikit-learn, I discovered that you have to be a little bit careful using it. I guess this is generally true for any library you use, but this example here is quite nice. The inspiration for this example is taken from the book by Hastie et, al 2009, exercise 14.15.

## A latent linear model

Let's start to generate 2000 samples from a hidden Gaussian $$z \sim N(0, I_3)$$ distribution, where $$I_3$$ denotes the identity matrix in $$\mathbb{R}^3$$.

import scipy as sp
n_samples = 2000
z = sp.randn(n_samples, 3)


We consider a latent linear model of the form

\begin{equation*} x = W z + \mu + \epsilon \end{equation*}

where $$\epsilon \sim N(0, D)$$ is Gaussian noise. Importantly, the components of $$\epsilon$$ are uncorrelated. In the context of a factor analysis model, $$W$$ is called the "factor loading matrix".

W = sp.array([[1,    0,   0],
[1, .001,  0],
[0,     0, 3]])
x = z @ W.T


## Principal Component Analysis

Let' see if we can recover the hidden $$W$$. We first try a one dimensional embedding using principal component analysis (PCA).

from sklearn.decomposition import  PCA
pca = PCA(1)
pca.fit(x)
pca.components_

array([[-0.00638523, -0.00638588,  0.99995922]])


PCA picks up the third component as principal direction. This is the component with maximal variance.

## Factor Analysis

Let's try the same thing using the factor analysis (FA) model from scikit learn with one hidden component but otherwise default parameters.

from sklearn.decomposition import FactorAnalysis

fa_naive = FactorAnalysis(1)
fa_naive.fit(x)
fa_naive.components_, fa_naive.noise_variance_

(array([[-0.01843269, -0.01843453,  2.85628359]]),
array([ 0.96209937,  0.96212315,  1.00016197]))


So, it also picks up the last component. Also note that the noise is about the same for all components. Does that mean that PCA and FA just give the same result? I was a little surprised by this. Let's try FA again and set the tolerance to a lower level.

fa = FactorAnalysis(1, tol=1e-8, max_iter=1000000)
fa.fit(x)
fa.components_, fa.noise_variance_

(array([[-0.98103956, -0.98105172,  0.04708941]]),
array([  4.87664897e-07,   4.87676762e-07,   9.15638709e+00]))


Now, the result looks quite different. FA now picks up the direction of maximal correlation, as it is supposed to. The third component is recognized as noise. Note that the first two components have almost no observation noise associated.

## Lesson learned

Be careful to use some machine learning algorithm without looking behind the scenes. The FA model uses the EM algorithm behind the scenes. The tolerance parameter tells the algorithm when to stop. Apparently, setting this parameter to high can be disastrous.