Factor analysis, principal component analysis and scikit-learn
Be careful with scikit-learn's defaults
Playing around with the factor analysis model in scikit-learn, I discovered that you have to be a little careful when using it. I guess this is generally true for any library, but this example is quite instructive. It is inspired by exercise 14.15 in Hastie et al. (2009), The Elements of Statistical Learning.
A latent linear model
Let's start by generating 2000 samples from a hidden Gaussian distribution \(z \sim N(0, I_3)\), where \(I_3\) denotes the \(3 \times 3\) identity matrix.
import numpy as np

n_samples = 2000
z = np.random.randn(n_samples, 3)
We consider a latent linear model of the form

\[ x = W z + \epsilon, \]

where \(\epsilon \sim N(0, D)\) is Gaussian noise with diagonal covariance \(D\). Importantly, the components of \(\epsilon\) are uncorrelated. In the context of a factor analysis model, \(W\) is called the "factor loading matrix".
W = np.array([[1, 0,    0],
              [1, .001, 0],
              [0, 0,    3]])
x = z @ W.T  # note: no explicit noise term is added here, so D = 0
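Before running any decomposition, it is instructive to look at the empirical covariance of \(x\). Under the model above it should be close to \(W W^T + D\); with our \(W\) (and \(D = 0\)) the first two coordinates are almost perfectly correlated with unit variance, while the third is independent with variance 9. A quick check along these lines (not part of the original exercise):

np.cov(x, rowvar=False)
# expected to be approximately
# [[1, 1, 0],
#  [1, 1, 0],
#  [0, 0, 9]]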
Principal Component Analysis
Let's see if we can recover the hidden \(W\). We first try a one-dimensional embedding using principal component analysis (PCA).
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
pca.fit(x)
pca.components_
array([[-0.00638523, -0.00638588, 0.99995922]])
PCA picks up the third coordinate as the principal direction, because it has maximal variance: \(x_3 = 3 z_3\) has variance 9, while the first two coordinates each have variance of about 1.
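We can verify this directly from the per-coordinate sample variances and the variance captured by the principal component (a quick check, not part of the original example):

x.var(axis=0)            # roughly [1, 1, 9]: the third coordinate dominates
pca.explained_variance_  # roughly [9]: PCA captures exactly that coordinate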
Factor Analysis
Let's try the same thing using the factor analysis (FA) model from scikit-learn, with one hidden component but otherwise default parameters.
from sklearn.decomposition import FactorAnalysis

fa_naive = FactorAnalysis(n_components=1)
fa_naive.fit(x)
fa_naive.components_, fa_naive.noise_variance_
(array([[-0.01843269, -0.01843453, 2.85628359]]), array([ 0.96209937, 0.96212315, 1.00016197]))
So it also picks up the last component (a loading of about 2.86 on the third coordinate), and the estimated noise variance is about the same for all three coordinates. Does that mean that PCA and FA just give the same result? I was a little surprised by this. Let's try FA again, this time with a much lower tolerance and enough iterations to let it converge.
fa = FactorAnalysis(n_components=1, tol=1e-8, max_iter=1000000)
fa.fit(x)
fa.components_, fa.noise_variance_
(array([[-0.98103956, -0.98105172, 0.04708941]]), array([ 4.87664897e-07, 4.87676762e-07, 9.15638709e+00]))
Now the result looks quite different. FA picks up the direction of maximal correlation, as it is supposed to: the first two coordinates, which are almost perfectly correlated. The third coordinate is recognized as noise. Note that the first two coordinates have almost no observation noise associated with them.
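What went wrong in the first fit is simply that EM stopped too early. scikit-learn's FactorAnalysis exposes the number of EM iterations (n_iter_) and the log-likelihood trace (loglike_), so we can inspect this directly (a quick diagnostic, not part of the original example):

fa_naive.n_iter_  # presumably small: EM stopped early under the default tol=1e-2
fa.n_iter_        # much larger: EM ran until the tighter tolerance was met
fa.loglike_[-3:]  # log-likelihood over the last few EM iterations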
Lesson learned
Be careful when using a machine learning algorithm without looking behind the scenes. Scikit-learn's FA model is fitted with the EM algorithm, and the tolerance parameter tells EM when to stop. Apparently, setting this parameter too high can be disastrous.
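If in doubt, compare the fitted models on the data itself: FactorAnalysis inherits a score method that returns the average log-likelihood per sample. A sanity check along these lines (a sketch; exact numbers will vary) should show that the converged fit is the far better model:

fa_naive.score(x)  # average log-likelihood of the under-converged fit
fa.score(x)        # expected to be substantially higher after convergence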