A well-known issue in machine learning is the curse of dimensionality, and it is particularly severe for K-Nearest Neighbour (K-NN) and other distance-based algorithms. Nearest-neighbour algorithms work well when the number of features (dimensions) is small, but data with too many dimensions causes serious problems for most distance-based models: their efficiency and effectiveness deteriorate rapidly as the dimensionality of the data increases.

Overfitting

Overfitting happens when a machine learning model learns the training data so well that it memorises noise and irrelevant details instead of the underlying patterns. High-dimensional spaces increase the risk of overfitting because the data becomes sparse, giving the model many more opportunities to fit noise rather than genuine structure.

Why does nearest neighbour fail in high dimensions?

Distance concentration

Everything is far away, and similarly far away.

In high dimensions, the difference between the nearest and the farthest neighbour becomes negligible. In other words, the distances between points become almost the same.

In high-dimensional spaces, data points end up far from each other and far from the centre of the space. Consider the Manhattan distance, which sums the absolute differences of every coordinate: d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xd - yd|. As d (the number of dimensions) increases:

  • You keep adding more absolute-difference terms to the sum (the formula gets longer and longer)
  • Even if each individual difference is small, the total distance keeps growing (see the short sketch after this list)
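
A minimal sketch of this growth, assuming only NumPy is available (the dimensions tried below are arbitrary choices):

```python
# Manhattan distance between two random points in the unit cube [0, 1]^d.
# As d grows, the sum of absolute coordinate differences keeps growing.
import numpy as np

rng = np.random.default_rng(0)

for d in (1, 10, 100, 1000):
    x = rng.random(d)                    # random point in [0, 1]^d
    y = rng.random(d)                    # another random point
    manhattan = np.abs(x - y).sum()      # sum of |x_i - y_i| over all d coordinates
    print(f"d={d:5d}  Manhattan distance = {manhattan:.2f}")
```

On average each coordinate contributes about 1/3 to the Manhattan distance, so the total grows roughly in proportion to d.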

Suppose you randomly generate 100 data points inside a 100-dimensional cube, and every coordinate ranges from 0 to 1. Now you pick one point and compute its distance to all others. You will get a range of distances:

  • A minimum distance (to the nearest point)
  • A maximum distance (to the farthest point)
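
Here is a rough simulation of that setup, assuming NumPy; Euclidean distance is used here, and the exact numbers depend on the metric and the random seed:

```python
# 100 random points in the 100-dimensional unit cube; distances from one
# chosen point to all the others.
import numpy as np

rng = np.random.default_rng(42)
points = rng.random((100, 100))                          # 100 points, 100 dimensions

dists = np.linalg.norm(points[1:] - points[0], axis=1)   # Euclidean distances to point 0

print(f"nearest  point: {dists.min():.2f}")
print(f"farthest point: {dists.max():.2f}")
print(f"max - min     : {dists.max() - dists.min():.2f}")
```

Typically the minimum and maximum come out close to each other relative to their size, which is exactly the concentration effect described next.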

Now, as dimensionality increases:

  1. Minimum distance increases. Even the nearest point tends to get farther away, because in more dimensions there are more ways for two points to differ: every extra dimension is another chance for the distance to grow.

  2. Maximum distance also increases, but not as fast. The farthest neighbour gets farther too, but its rate of increase slows down compared to that of the minimum distance.

You will observe something like the following:

Distance type            Value (rough idea)
Nearest neighbour        ~9.5
Farthest neighbour       ~10
Difference (max - min)   only 0.5
Each dimension adds another component to the distance calculation. The total distance grows, but all distances grow together, so they start to concentrate: the variance between them shrinks. Even though the distances are larger, they are almost all the same, packed very close together. In addition, most randomly placed data points end up lying near the boundaries (edges) of the space.
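
A short sketch of this trend, assuming NumPy (the dimensions and point count are arbitrary): it measures the relative contrast (max - min) / min between the nearest and farthest neighbour as the dimensionality grows.

```python
# Relative contrast between nearest and farthest neighbour of one query point.
# If distances concentrate, (max - min) / min shrinks towards 0 as d grows.
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000, 10000):
    points = rng.random((100, d))                            # 100 random points in [0, 1]^d
    dists = np.linalg.norm(points[1:] - points[0], axis=1)   # Euclidean distances to point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:6d}  min={dists.min():7.2f}  max={dists.max():7.2f}  contrast={contrast:.3f}")
```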

Why this matters

For K-NN and similar algorithms, if all distances are about the same, the core idea of K-NN, that nearby examples give good predictions, breaks down. The method becomes unreliable: because every example is roughly equally far away, the "nearest" neighbours carry no special information, and nearest-neighbour classification cannot be trusted on high-dimensional data.
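
As an illustration (not from the original notes), the sketch below trains K-NN on a synthetic classification task in which only a handful of features carry signal and the rest are noise. It assumes scikit-learn is installed, and the dataset size, k, and feature counts are arbitrary choices; typically the test accuracy drops once the noise dimensions dominate the distance calculation.

```python
# K-NN accuracy as more and more irrelevant (noise) features are added.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

for n_features in (5, 50, 500):
    # Only 5 features are informative; the remaining ones are pure noise.
    X, y = make_classification(
        n_samples=500,
        n_features=n_features,
        n_informative=5,
        n_redundant=0,
        random_state=0,
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(f"{n_features:4d} features -> test accuracy {knn.score(X_test, y_test):.2f}")
```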


Back to parent page: Supervised Machine Learning

AI Machine_Learning COMP3308 Unsupervised_Learning Classification KNN K-Nearest_Neighbour Curse_of_Dimensionality