2016-01-12

Leave-one-out cross-validation can be really off for model selection or hyperparameter optimization

Leave-one-out cross-validation

This is inspired by an exercise in Murphy's book "Machine Learning: A Probabilistic Perspective", which in turn took the exercise from Witten and Frank's "Data Mining: Practical Machine Learning Tools and Techniques".

Setting

Assume we have a labeled data set \(\mathcal{D} = \{(y_i,x_i)\}_{i=1}^{2N}\). The \(y_i\) denote the class labels, which we assume to be binary, \(y_i \in \{0, 1\}\). Moreover, we assume that the features \(x_i\) are completely random, that is, they tell us nothing about the class. Finally, we assume that we have \(N\) examples of class 0 and \(N\) examples of class 1.
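For concreteness, here is a minimal sketch of such a data set in Python/NumPy; the choice of \(N = 50\) and a five-dimensional Gaussian feature vector are arbitrary illustrations, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50

# Features are pure noise: i.i.d. Gaussians that carry no information about the class.
X = rng.normal(size=(2 * N, 5))

# Perfectly balanced binary labels: N examples of class 0 and N of class 1.
y = np.concatenate([np.zeros(N, dtype=int), np.ones(N, dtype=int)])
```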

The best classifier is a constant one

The best classifier \(c\) we can come up with is one that always predicts class 0 (or always class 1; it does not matter which). Hence, we have \(c(x)=0\) for all \(x\). What is its misclassification rate? Obviously it is 50%, since we assumed that we have as many examples of class 0 as of class 1.
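Continuing the sketch above, this 50% error rate can be checked directly; the constant classifier is written out explicitly here purely for illustration.

```python
# The constant classifier c(x) = 0 evaluated on the full data set from above.
predictions = np.zeros(len(y), dtype=int)  # c(x) = 0 for all x
error_rate = np.mean(predictions != y)
print(error_rate)  # 0.5: exactly the N examples of class 1 are misclassified
```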

Cross-validation is way too pessimistic

What about leave-one-out cross-validation? In leave-one-out cross-validation we train the classifier on all examples but one and then check its performance on the example we omitted from training. If we pick an example of class 0 to leave out, we are left with \(N-1\) examples of class 0 and \(N\) examples of class 1. Since the features tell us nothing about the class, the best classifier on this training set is the one that always predicts class 1. Testing this classifier on the left-out sample, which is of class 0, we misclassify it. The same thing happens for every other example: whichever class the held-out point belongs to is in the minority in the remaining training data, so the classifier trained on it always predicts the opposite class. So leave-one-out cross-validation estimates a misclassification rate of 100%. It is completely off!
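A short sketch of this failure mode, reusing X and y from above: scikit-learn's DummyClassifier with strategy="most_frequent" is used here as a stand-in for "the best classifier on the training set", since with uninformative features the best one can do is predict the training set's majority class.

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# In every fold the held-out class is the minority class of the training set,
# so the majority-class predictor gets every single held-out example wrong.
clf = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(scores.mean())  # 0.0 accuracy, i.e. an estimated misclassification rate of 100%
```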

Isn't that cute?