Every data scientist eventually confronts the “Curse of Dimensionality,” or trying to work with a large number of feature variables. Machine learning shines when analyzing data with many dimensions. Humans, on the other hand, are not good at finding patterns that may be spread out across a large number of dimensions, especially if those dimensions are connected in ways that are not intuitive. As more dimensions are added, constraints of processing power and data training inevitably arise. The visualization below shows how once peak performance (accuracy) of an algorithm is reached, increasing dimensionality by adding more features is counterproductive.
Once the optimal number of features is reached, increasing the dimensionality without increasing the number of training samples results in a decrease in classifier performance.
In this article, we’ll review the “curse of dimensionality” phenomenon and how it impacts the data science process, as well as suggest ways to address it.
The Nature of the Curse
You might believe that as you increase the number of features to fit a model, the quality of the fitted model will increase accordingly, but this is not necessarily the case. In general, adding more “signal” (i.e. more features) that are truly associated with the response variable, or what is to be predicted, will improve the fitted model by yielding a reduction to the test set error. On the other hand, adding “noise” (i.e. features that are not truly associated with the response variable) will lead to a weakening in the fitted model, and consequently an increase to the test set error. Noise features serve to increase the dimensionality of the problem space and increase the risk of overfitting. Since the noise features may be given nonzero coefficients as a result of chance associations with the response variable of the training set, there’s no upside potential in terms of improved test set error.
We keep hearing about the allure of “big data” that allows for the collection of data points for thousands or even millions of features. But this benefit is a double-edged sword. It can lead to improved predictive models if these features are in fact relevant to the problem at hand, but it can also lead to worse results if the features are not sufficiently relevant. Even if they are relevant, the variance incurred in fitting their coefficients may overshadow the decrease in bias that they bring.
Genesis of the Curse
It was applied mathematician Richard Bellman who originated the phrase, “the curse of dimensionality,” in his book “Adaptive Control Processes: A Guided Tour,” published in 1961. The specific quote, from page 97 of the book, is:
“In view of all that we have said in the foregoing sections, the many obstacles we appear to have surmounted, what casts the pall over our victory celebration? It is the curse of dimensionality, a malediction that has plagued the scientist from the earliest days.”
The issue referred to in Bellman’s quote is the difficulty of optimizing a function of many variables by a brute force search on a discrete multidimensional grid where the number of grids points increases exponentially with dimensionality (the number of variables). Over time, the “curse of dimensionality” phrase has come to denote any difficulty in data analysis, and now data science, that results from a large number of variables.
Drilling Down into the Curse
The performance of a machine learning algorithm decreases when the dimensionality of the problem space increases beyond a certain threshold. The question is: what’s the threshold, and how can overfitting be avoided? Unfortunately, there is no hard fast rule that defines how many features should be used for a given problem. In actuality, this determination depends on the amount of training data available, the complexity of the decision boundaries, and the type of algorithm used.
If we had an infinite number of training examples available, the curse of dimensionality wouldn’t apply and we could simply use an infinite number of features to obtain perfect accuracy for the model. But in reality, the smaller the size of the training set, the smaller the feature vector that should be used. Essentially, the number of training examples needed grows exponentially with the number of dimensions used.
Overfitting occurs when estimating relatively few parameters in higher dimensional space, and also when estimating many parameters in lower dimensional space. Additionally, the variance of a parameter estimate increases if the number of parameters to be estimated increases while the bias of the estimate and the size of the training set are kept constant. This suggests that the quality of the parameter estimates decreases as the dimensionality rises, due to the increase of variance. Here, an increase in model variance corresponds to overfitting.
There are methods that work to determine the optimal linear or non-linear combination of original features to reduce the dimensionality of the final problem space. These are called Feature Extraction methods. This is a fertile area of data science research, with different feature extraction techniques for specific problem spaces.
One familiar dimensionality reduction technique is called Principal Component Analysis (PCA). When confronted with a large set of correlated variables in feature space, PCA provides a way to summarize this set with a small number of representative variables that collectively explain most of the variability in the original set. In other words, PCA yields uncorrelated, linear combinations of the original feature vector and calculates a linear subspace of lower dimensionality, where the largest variance of the original data is retained. You can then use the lower dimensional representation of the data with machine learning algorithms instead of the original set.