When working with data in high-dimensional spaces [^fn-highDimSpace] our intuition is often misleading and can lead to dangerous fallacies. An often-quoted example concerns the ratio of the volume of a $d$-dimensional ball to the volume of its enclosing $d$-dimensional cube. As the dimension increases this ratio quickly approaches zero, meaning that most of the volume of the cube lies outside the ball! It is therefore important to gain intuition about the data using visual tools and approximations. One such technique is t-distributed stochastic neighbor embedding (t-SNE), which produces a low-dimensional, clustered representation of the original data. In this post we look in more detail at t-SNE and its foundation, the Kullback-Leibler divergence. The code to recreate the numerical experiments and generate all the plots encountered here can be found in my github repository. Also, this blog has been turned into a talk.
Let's try to break this down a little bit more. In one dimension the unit "ball" and the unit "cube" are in fact identical: both are just the interval from $-1$ to $1$ on the real line. This starts to change in two dimensions: the "volume" of the two-dimensional cube is given by the area of a square with side length $2$, which in this case is $2^2 = 4$. Correspondingly, the volume of the 2D unit ball is just the area of a circle, $\pi \cdot 1^2 = \pi$. The ratio between these two areas is $\pi/4 \approx 0.79$, which we can interpret as saying that about $21\%$ of the volume of the enclosing square lies outside the unit circle. Doing the same calculation in three dimensions we find $\frac{4\pi/3}{2^3} = \pi/6 \approx 0.52$, so already about $48\%$ of the volume lies outside of the unit ball!
In $d$ dimensions the ratio between the $d$-dimensional unit ball and its enclosing cube becomes

$$\frac{V_\mathrm{ball}(d)}{V_\mathrm{cube}(d)} = \frac{\pi^{d/2}}{2^d\,\Gamma\!\left(\tfrac{d}{2}+1\right)} \xrightarrow{\;d\to\infty\;} 0,$$

where $\Gamma$ is the famous Gamma function. We can interpret this result as saying that with increasing dimensionality more and more of the volume of the unit cube lies outside the unit ball.
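To make this concrete, here is a minimal sketch (assuming numpy and scipy are available) that evaluates this ratio for a few dimensions; already at $d = 20$ the ball occupies only a vanishingly small fraction of its enclosing cube.

```python
import numpy as np
from scipy.special import gamma

def ball_to_cube_ratio(d):
    """Volume of the d-dimensional unit ball divided by that of its enclosing cube (side 2)."""
    return np.pi ** (d / 2) / (2 ** d * gamma(d / 2 + 1))

for d in [1, 2, 3, 5, 10, 20]:
    print(f"d={d:2d}: ratio = {ball_to_cube_ratio(d):.2e}")
```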
Alternatively, we can also think about it by drawing a unit ball around a point in $d$-dimensional space. As the relative volume of this ball becomes smaller and smaller, the likelihood of finding another point within that volume vanishes as well. This is typically referred to as the Curse of Dimensionality: not only does our intuition leave us, but the data also becomes so sparse that finding patterns becomes hard.
Nevertheless, or maybe precisely because intuition is not reliable, we need tools to visualize the potential patterns that arise in higher-dimensional data. One such tool is based on the idea of matching an effective probability distribution function -- we will discuss below what that means -- of the actual high-dimensional data with a corresponding probability distribution in fewer (two or three) dimensions, which allows for better visualization. In this way we can spot pockets, clusters, randomness, etc. and gain insight into what we should look out for in our modeling efforts.
Enter t-SNE, or t-distributed Stochastic Neighbor Embedding: this technique introduces a probability function of the squared Euclidean distances [^fn-tsne-metric], according to a softmax function:

$$p_{ij} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\lVert x_k - x_l\rVert^2 / 2\sigma^2\right)}$$

The $x_i$ are vectors in the original space, and $p_{ij}$ gives the probability of finding the two points $x_i$ and $x_j$ at a squared distance $\lVert x_i - x_j\rVert^2$, with $\sigma$ setting the length scale of what counts as "close".
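As a quick illustration, here is a minimal numpy sketch of these affinities. A single global bandwidth sigma is assumed for simplicity; the full t-SNE algorithm instead tunes a per-point bandwidth via its perplexity parameter.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    # squared Euclidean distances ||x_i - x_j||^2 between all pairs of rows of X
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)   # a point is not paired with itself
    return P / P.sum()         # softmax over all pairs: the p_ij sum to 1

X = np.random.default_rng(0).normal(size=(5, 3))
print(gaussian_affinities(X).round(3))
```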
To get some intuition for this, we play with a toy example, where we generate Gaussian-distributed data clusters centered at the corners of a 3-dimensional cube with side length $L$. We expect the distribution of squared distances to be peaked around $0$, $L^2$, $2L^2$ and $3L^2$, corresponding to intra-cluster, side, face-diagonal and cube-diagonal separations (see figure to the left). This can be understood as a probability distribution, and the idea of t-SNE is to find a probability distribution that is a function of fewer coordinates (e.g. two instead of three) and produces a similar distribution function [^fn-complication]. To approximate the density function in the lower-dimensional mapping space, t-SNE employs a Student-t distribution [^fn-Cauchy] to match the corresponding distribution in the original space:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}$$
The goal is to learn the coordinates $y_i$ of the low-dimensional space such that the cluster distribution is preserved in the low-dimensional embedding space.
Using the scikit-learn t-SNE implementation we can quickly compute the t-SNE points for a 2-dimensional mapping space and visualize them in a scatter plot showing clearly distinguishable clusters! For completeness we should also look at the underlying distance distribution: t-SNE does a great job at distinguishing the clusters even in this more reduced view, by just looking at the distances.
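A sketch of this experiment might look as follows, assuming numpy, scipy, matplotlib and scikit-learn are available; the side length, cluster spread, points per cluster and the perplexity are illustrative choices rather than the exact values used for the plots in this post.

```python
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
from scipy.spatial.distance import pdist
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
L, n_per_cluster, spread = 5.0, 50, 0.3

# Gaussian clusters centred on the 8 corners of a cube with side length L
corners = np.array(list(product([0.0, L], repeat=3)))
X = np.vstack([c + spread * rng.normal(size=(n_per_cluster, 3)) for c in corners])
labels = np.repeat(np.arange(len(corners)), n_per_cluster)

# 2-dimensional t-SNE embedding of the 3-dimensional data
Y = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(Y[:, 0], Y[:, 1], c=labels, cmap="tab10", s=10)
ax1.set_title("t-SNE embedding")
ax2.hist(pdist(Y) ** 2, bins=100)
ax2.set_title("squared distances in the embedding")
plt.show()
```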
We can clearly identify 7 different distances (meaning we have one degenerate distance in the mapping space), allowing us to easily identify structure in the data. For a different parameter set this might not be true and we could end up with overlapping distributions again, but looking at the t-SNE scatter plot should quickly resolve these ambiguities and tell us whether we actually have clusters in the data at hand.
Warning: This section is math heavy; feel free to skip it, since it is not required to appreciate the power of t-SNE.
Up to now the notion of "matching two distributions" is fairly abstract and we should clarify its meaning. An often used measure for the similarity of two distributions is the Kullback-Leibler (KL) divergence, which gives a numerical representation of the deviation of a model distribution $Q$ from the true distribution $P$ and is defined as:

$$\begin{aligned} D_{KL}(P \,\|\, Q) &= \sum_i P(i) \log\frac{P(i)}{Q(i)} \\ &= -\sum_i P(i)\log Q(i) + \sum_i P(i)\log P(i) \\ &= H(P, Q) - H(P) \end{aligned}$$
The last line gives an information-theoretic interpretation of the KL divergence: it is the difference between the cross-entropy $H(P, Q)$ and the entropy $H(P)$ of the true distribution $P$. The cross-entropy can be understood as a measure of how much information the distribution $Q$ contains about the distribution $P$. With this, the KL divergence tells us how much information we lose by using the approximation $Q$ instead of the true distribution $P$.
The KL divergence has several important properties: it is non-negative, it vanishes if and only if $P = Q$, and it is not symmetric under exchange of $P$ and $Q$ (so it is not a proper distance metric).
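A quick numerical sanity check of these properties, assuming scipy is available (when given two distributions, scipy.stats.entropy computes exactly the discrete KL divergence from the definition above; the probability vectors below are arbitrary examples):

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])

print(entropy(p, q))   # D_KL(P || Q) > 0
print(entropy(q, p))   # D_KL(Q || P) differs: the divergence is not symmetric
print(entropy(p, p))   # exactly 0 when both distributions coincide
```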
This is all well and good, but how does this look in practice? Is it a useful measure at all, and how can we compute it? To this end let's study one of my favorite stats theorems, the central limit theorem: given $N$ independent, identically distributed Bernoulli variables $x_i$, their (suitably normalized) sum approaches a Normal distribution as $N$ grows. This can be seen in the figure to the right hand side, where we used a Bernoulli distribution with a fixed success probability $p$ and a fixed number of terms in the sum. Drawing many samples and making a histogram looks fairly normal (see the black dashed fit), except for some outliers that are due to the binning. Mathematically, the central limit theorem for this example reads:

$$\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left(x_i - p\right) \xrightarrow{\;N \to \infty\;} \mathcal{N}\!\left(0,\, p(1-p)\right)$$
A question we might want to ask is how fast the distribution converges to the true Normal distribution. With the KL divergence this is straightforward to answer in just a few lines of code: we draw various numbers of samples using the same Bernoulli parameters from above and plot the Kullback-Leibler divergence against the sample size.
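A sketch of this convergence experiment could look like the following (numpy, scipy and matplotlib assumed): for each sample size we histogram the standardized Bernoulli sums and measure the discrete KL divergence of the binned density to the matching Normal density. The success probability, bin count and sample sizes are illustrative choices.

```python
import numpy as np
from scipy.stats import norm, entropy
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, n_repeats, bins = 0.3, 5000, 40

def kl_to_normal(n_terms):
    # n_repeats realizations of the standardized sum of n_terms Bernoulli(p) variables
    sums = rng.binomial(n_terms, p, size=n_repeats)
    z = (sums - n_terms * p) / np.sqrt(n_terms * p * (1 - p))
    hist, edges = np.histogram(z, bins=bins, range=(-4, 4), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    q = norm.pdf(centers)
    mask = (hist > 0) & (q > 0)          # keep only bins where both densities are positive
    return entropy(hist[mask], q[mask])  # discrete KL divergence of the binned densities

sizes = np.logspace(1, 4, 10, dtype=int)
plt.loglog(sizes, [kl_to_normal(n) for n in sizes], "o-")
plt.xlabel("number of Bernoulli terms in the sum")
plt.ylabel("KL divergence to Normal")
plt.show()
```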
Note that both axes are logarithmic. Beyond a certain sample size the convergence appears linear in this plot, meaning that we converge towards the true Normal distribution with a power law in the sample size. Which is really fast!
To put this in connection with t-SNE: with the KL divergence as a measure for the similarity of two distributions, we can now optimize the target distribution $Q$ by minimizing its KL divergence to the true high-dimensional distribution $P$. Note, however, that we are not generating new samples, but rather optimizing the coordinates $y_i$ that parameterize the target distribution. t-SNE generally does that using gradient-descent methods, which are often (but not exclusively) used in Neural Networks.
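For reference, the gradient of this KL objective with respect to the embedding coordinates takes a remarkably simple form, as derived in the original t-SNE paper by van der Maaten and Hinton:

$$\frac{\partial D_{KL}(P \,\|\, Q)}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}$$

Each embedding point is thus pulled towards or pushed away from every other point, depending on whether the low-dimensional affinity $q_{ij}$ under- or overshoots the high-dimensional affinity $p_{ij}$.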
The curse of dimensionality forces us to rethink our approach to high-dimensional data. One tool to gain insights into the distribution of the data is the t-distributed stochastic neighbor embedding (t-SNE), which tries to maximize the similarity between two parameterized probability distributions: one in the original high-dimensional space and one in the low-dimensional target (embedding) space. The similarity between the distributions is measured using the Kullback-Leibler divergence. All code can be found on github.