A solid laptop computer in 2018 has about 1TB (1000GB) of disk space, and the capability to store about 16GB of memory in RAM. In comparison, internet users in the United States generate about 3000TB of data every minute 1. An enormous amount of this data takes the form of high dimensional sparse relationships that are difficult to efficiently model.
In order to effectively harness and interpret this data, researchers and organizations have relied heavily on algorithms like neural networks that work best on low dimensional dense data. In this article I will describe one technique for reducing large high dimensional sparse data to small low dimensional dense data that is compatible with machine learning algorithms like neural networks.
Say we have a set of entities (words, people, movies, etc) and we want to represent some relation between them. For simplicity’s sake, let’s say that we are interested in a binary relation, such as “these two people are friends”, or “these two movies have a shared co-star.” We can represent these entities and this relation in an unweighted graph in which we represent each entity with a node and the relation with the edges between the nodes.
One common way of storing is an adjacency matrix - i.e. an matrix , where is the number of nodes and entry is if an edge exists between nodes and otherwise. However, since this matrix is typically extremely large and sparse, it can be prohibitive to store and manipulate for very large graphs. One way to avoid this is to instead use a low-rank symmetric matrix factorization algorithm to form the matrix , where is an approximation of . Intuitively, we are assiging to each node in a dimensional vector such that for any two nodes in , the dot product is indicative of the likelihood that an edge exists between and . This can allow us to represent our entities and their relationships in much less space. Furthermore, the low dimensional entity representations are often easier to use in Machine Learning tasks than the sparse high dimensional graph representations.
Now consider the case where the relationship between entities is no longer binary, for example, the relationship between two movies might be their number of shared actors. In this case:
- becomes a graph with weighted edges.
- is if an edge with weight exists between nodes and otherwise.
- The dot product is as close as possible to .
In some cases, the continuous graph might be quite dense. Consider the case where our entities are the words in the English language and the relationship between words is the frequency with which they appear together in a sentence, perhaps normalized by their individual frequencies (PMI). In this case is a large, dense matrix that’s likely impossible to store in memory and factorize directly. We could instead apply an algorithm like the following to symmetrically factorize it:
- Randomly initialize the matrices
- For each tuple where is the number of times that and appear in a sentence, take a gradient descent step towards minimizing the loss .
This will produce “word embeddings” for each word in our vocabulary such that the dot product between any two word’s embeddings is indicative of the frequency with which the words appear in a sentence. Note that if we use an additional matrix to represent word context and modify the loss function a bit, we get the skigram algorithm from Mikolov’s word2vec. In the image below, we demonstrate how similar words appear closer to each other in the word vector space:
Representing Bi-Partite Graphs
Now consider the case where we instead want to represent the affinities between a group of entities of type 1 and a group of entities of type 2. For example, we may want to represent the rating a user would give to a movie or the likelihood that a user would click on an advertisement.
We can represent these relationships with a bi-partite graph where the two disjoint sets of nodes correspond to the two types of entities. You can see an example for users and movies below:
We can use a modified adjacency matrix to represent . Say we have entities of type and entities of type . Then define to be the matrix such that entry of is equal to the weight of the connection between entity of type and entity of type . We can use a technique similar to those described above to represent with two matrices of low dimensional entity vectors.
For example, if we take the reduced SVD of with dimension , we get the orthogonal matrix , the orthogonal matrix and the diagonal matrix such that is minimized. We can “absorb” into and to form and such that is still minimized.
Note that each row of corresponds to entity of type and each row of corresponds to entity of type such that the affinity between entity and entity is correlated to . Note that this is equivalent to constructing a linear regression model for each entity of type and projecting each entity into a space such that the outputs of each ’s linear regression model on ’s projection is an estimate of the affinity between entity and entity . We refer to this as a “co-embedding” between the entity types and .
We can represent most entity relationships with graphs. However, large sparse graphs are difficult to store and query. Furthermore, node-edge relationships are difficult to utilize as features for machine learning algorithms. Matrix factorization approaches can efficiently model and represent the nodes of these graphs with low dimensional vectors.