Dive into the “Curse of Dimensionality” concept and understand the math behind all the surprising phenomena that arise in high dimensions.
In the realm of machine learning, handling high-dimensional vectors is not just common; it is essential. This is illustrated by the architecture of popular models like Transformers. For instance, BERT uses 768-dimensional vectors to encode the tokens of the input sequences it processes and to better capture complex patterns in the data. Given that our brain struggles to visualize anything beyond 3 dimensions, the use of 768-dimensional vectors is quite mind-blowing!
While some Machine and Deep Learning models excel in these high-dimensional scenarios, they also present many challenges. In this article, we will explore the concept of the “curse of dimensionality”, explain some interesting phenomena associated with it, delve into the mathematics behind these phenomena, and discuss their general implications for your Machine Learning models.
Note that the detailed mathematical proofs related to this article are available on my website as a supplementary extension.
People often assume that geometric concepts familiar in three dimensions behave similarly in higher-dimensional spaces. This is not the case. As dimension increases, many interesting and counterintuitive phenomena arise. The “Curse of Dimensionality” is a term coined by Richard Bellman (a famous mathematician) that refers to all these surprising effects.
What is so special about high dimensions is how the “volume” of the space (we will explore that in more detail soon) grows exponentially. Take a graduated line (in one dimension) from 1 to 10. There are 10 integers on this line. Extend that to 2 dimensions: it is now a square with 10 × 10 = 100 points with integer coordinates. Now consider “only” 80 dimensions: you would already have 10⁸⁰ points, which is roughly the estimated number of atoms in the observable universe.
In other words, as the dimension increases, the volume of the space grows exponentially, causing the data to become increasingly sparse.
High-dimensional spaces are “empty”
Consider this other example. We want to calculate the farthest distance between two points in a unit hypercube (where each side has a length of 1):
- In 1 dimension (the hypercube is a line segment from 0 to 1), the maximum distance is simply 1.
- In 2 dimensions (the hypercube forms a square), the maximum distance is the distance between the opposite corners [0,0] and [1,1], which is √2, calculated using the Pythagorean theorem.
- Extending this concept to n dimensions, the distance between the points at [0,0,…,0] and [1,1,…,1] is √n, as shown below. This formula arises because each additional dimension adds a square of 1 to the sum under the square root (again by the Pythagorean theorem).
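In symbols, this is just the Pythagorean theorem applied repeatedly:

$$d\big([0,\dots,0],\,[1,\dots,1]\big) = \sqrt{\underbrace{1^2 + 1^2 + \cdots + 1^2}_{n \text{ terms}}} = \sqrt{n}$$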
Interestingly, as the number of dimensions n increases, the largest distance within the hypercube only grows at an O(√n) rate. This phenomenon illustrates a diminishing returns effect, where increases in dimensional space lead to proportionally smaller gains in spatial distance. More details on this effect and its implications will be explored in the following sections of this article.
Let's dive deeper into the notion of distances we started exploring in the previous section.
We had our first glimpse of how high-dimensional spaces render the notion of distance almost meaningless. But what does this really mean, and can we mathematically visualize this phenomenon?
Let's consider an experiment, using the same n-dimensional unit hypercube we defined before. First, we generate a dataset by randomly sampling many points in this cube: we effectively simulate a multivariate uniform distribution. Then, we sample another point (a “query” point) from that distribution and observe the distance from its nearest and farthest neighbor in our dataset.
Here is the corresponding Python code.
```python
import numpy as np

def generate_data(dimension, num_points):
    """Generate random data points within [0, 1] for each coordinate in the given dimension."""
    data = np.random.rand(num_points, dimension)
    return data

def neighbors(data, query_point):
    """Return the distances to the nearest and farthest point in data from query_point."""
    nearest_distance = float('inf')
    farthest_distance = 0
    for point in data:
        distance = np.linalg.norm(point - query_point)
        if distance < nearest_distance:
            nearest_distance = distance
        if distance > farthest_distance:
            farthest_distance = distance
    return nearest_distance, farthest_distance
```
We can also plot these distances:
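The original plot is not reproduced here, but a minimal sketch of the experiment, assuming matplotlib and the two functions above (the dimension grid and sample count are placeholder choices), could look like this:

```python
import matplotlib.pyplot as plt

dimensions = [2, 5, 10, 50, 100, 500, 1000]
ratios = []
for d in dimensions:
    data = generate_data(dimension=d, num_points=1000)
    query_point = np.random.rand(d)
    nearest, farthest = neighbors(data, query_point)
    ratios.append(nearest / farthest)

plt.plot(dimensions, ratios, marker='o')
plt.xscale('log')  # log scale on the dimension axis
plt.xlabel('Dimension')
plt.ylabel('Nearest / farthest distance ratio')
plt.show()
```

As the dimension grows, this ratio climbs toward 1: the nearest neighbor ends up almost as far away as the farthest one.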
Using a log scale, we observe that the relative difference between the distance to the nearest and farthest neighbor tends to decrease as the dimension increases.
This is a very unintuitive behavior: as explained in the previous section, points are very sparse from one another because of the exponentially growing volume of the space, but at the same time, the relative distances between points become smaller.
The notion of nearest neighbors vanishes
This means that the very concept of distance becomes less relevant and less discriminative as the dimension of the space increases. As you can imagine, this poses problems for Machine Learning algorithms that solely rely on distances, such as kNN.
We will now discuss some other interesting phenomena. For this, we will need the n-ball. An n-ball is the generalization of a ball in n dimensions. The n-ball of radius R is the collection of points at a distance of at most R from the center of the space, 0.
Let's consider a radius of 1. The 1-ball is the segment [-1, 1]. The 2-ball is the disk delimited by the unit circle, whose equation is x² + y² ≤ 1. The 3-ball (what we normally call a “ball”) has the equation x² + y² + z² ≤ 1. As you understand, we can extend that definition to any dimension:
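$$B_n(R) = \left\{ (x_1, \dots, x_n) \in \mathbb{R}^n \;:\; x_1^2 + x_2^2 + \cdots + x_n^2 \le R^2 \right\}$$

where B_n(R) denotes the n-ball of radius R.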
The question now is: what is the volume of this ball? This is not an easy question, and it requires quite a bit of math, which I won't detail here. However, you can find all the details on my website, in my post about the volume of the n-ball.
After a lot of fun (integral calculus), you can prove that the volume of the n-ball can be expressed as follows, where Γ denotes the Gamma function:
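$$V_n(R) = \frac{\pi^{\,n/2}}{\Gamma\!\left(\frac{n}{2} + 1\right)} \, R^n$$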
For example, with R = 1 and n = 2, the volume is πR², because Γ(2) = 1. This is indeed the “volume” of the 2-ball (also called the “area” of a disk in this case).
However, beyond being an interesting mathematical challenge, the volume of the n-ball also has some very surprising properties.
As the dimension n increases, the volume of the n-ball converges to 0.
This is true for every radius, but let's visualize this phenomenon with a few values of R.
As you can see, it does not just converge to 0: it starts by increasing and then decreases to 0. For R = 1, the ball with the largest volume is the 5-ball, and the value of n that reaches the maximum shifts to the right as R increases.
Here are the first values of the volume of the unit n-ball, up to n = 10.
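Since the table itself is not reproduced here, a quick sketch to recompute these values from the closed-form formula above (assuming SciPy for the Gamma function):

```python
from math import pi
from scipy.special import gamma

# Volume of the unit n-ball: V_n = pi^(n/2) / Gamma(n/2 + 1)
for n in range(1, 11):
    volume = pi ** (n / 2) / gamma(n / 2 + 1)
    print(f"n = {n:2d}  ->  V_n ≈ {volume:.5f}")
```

Running this confirms the peak at n = 5 (V₅ ≈ 5.26379), after which the volume decreases toward 0.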
The volume of a high-dimensional unit ball is concentrated near its surface.
For small dimensions, the volume of a ball looks quite “homogeneous”: this is not the case in high dimensions.
Let's consider an n-ball with radius R and another with radius R − dR, where dR is very small. The portion of the n-ball between these 2 balls is called a “shell” and corresponds to the portion of the ball near its surface (see the visualization above in 3D). We can compute the ratio of the volume of the inner ball to the volume of the entire ball, i.e., the fraction of the volume that is not in the thin shell.
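Thanks to the Rⁿ factor in the volume formula above, this ratio has a simple closed form:

$$\frac{V_n(R - dR)}{V_n(R)} = \left(\frac{R - dR}{R}\right)^{n} = \left(1 - \frac{dR}{R}\right)^{n} \xrightarrow[n \to \infty]{} 0$$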
As we can see, this ratio converges very quickly to 0: almost all of the volume is near the surface in high-dimensional spaces. For instance, with R = 1, dR = 0.05, and n = 50, about 92.3% of the volume is concentrated in the thin shell. This shows that in higher dimensions, almost all of the volume sits close to the boundary. This is again related to the distortion of the concept of distance we saw earlier.
Note that the hypercube [−1, 1]ⁿ that circumscribes the unit ball has volume 2ⁿ. The unit ball is basically “empty” in very high dimensions, while the circumscribing hypercube, in contrast, has exponentially growing volume. Again, this shows how the idea of a “nearest neighbor” of a point loses its effectiveness: there is almost no point within a distance R of a query point q when n is large.
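Concretely, the fraction of the circumscribing cube that the unit ball occupies vanishes:

$$\frac{V_n(1)}{2^n} = \frac{\pi^{\,n/2}}{2^n \, \Gamma\!\left(\frac{n}{2} + 1\right)} \xrightarrow[n \to \infty]{} 0$$

Already at n = 10, the unit ball fills only about 0.25% of the cube [−1, 1]¹⁰.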
The curse of dimensionality is closely related to the overfitting principle. Because of the exponential growth of the volume of the space with the dimension, we need very large datasets to adequately capture and model high-dimensional patterns. Even worse: the number of samples we need grows exponentially with the dimension to overcome this limitation. This scenario, characterized by many features but relatively few data points, is particularly prone to overfitting.
Occam's Razor suggests that simpler models are generally better than complex ones because they are less likely to overfit. This principle is particularly relevant in high-dimensional contexts (where the curse of dimensionality plays a role) because it encourages the reduction of model complexity.
Applying Occam's Razor principle in high-dimensional scenarios can mean reducing the dimensionality of the problem itself (via methods like PCA, feature selection, etc.), thereby mitigating some effects of the curse of dimensionality. Simplifying the model's structure or the feature space helps in managing the sparse data distribution and in making distance metrics meaningful again. For instance, dimensionality reduction is a very common preliminary step before applying the kNN algorithm. More recent methods, such as ANNs (Approximate Nearest Neighbors), also emerge as a way to deal with high-dimensional scenarios.
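To illustrate this PCA-before-kNN pattern, here is a minimal sketch using scikit-learn; the dataset and the parameter values are placeholder choices, not results from this article:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)  # 64-dimensional inputs

# Project to a lower-dimensional space before running kNN,
# so that distances become more discriminative again
pipeline = make_pipeline(PCA(n_components=16), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean accuracy with PCA + kNN: {scores.mean():.3f}")
```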
While we have outlined the challenges of high-dimensional settings in machine learning, there are also some advantages!
- High dimensions can enhance linear separability, making techniques like kernel methods more effective.
- Additionally, deep learning architectures are particularly adept at navigating and extracting complex patterns from high-dimensional spaces.
As always with Machine Learning, this is a trade-off: leveraging these advantages involves balancing the increased computational demands with potential gains in model performance.
Hopefully, this gives you an idea of how “weird” geometry can be in high dimensions and of the many challenges it poses for machine learning model development. We saw how, in high-dimensional spaces, data is very sparse but also tends to be concentrated in the corners, and how distances lose their usefulness. For a deeper dive into the n-ball and the mathematical proofs, I encourage you to read the extended version of this article on my website.
While the “curse of dimensionality” outlines significant limitations in high-dimensional spaces, it is exciting to see how modern deep learning models are increasingly adept at navigating these complexities. Consider the embedding models or the latest LLMs, for example, which utilize very high-dimensional vectors to more effectively discern and model textual patterns.