When I’m building models, I frequently run into situations where I’ve trained multiple models over a few datasets or tasks and I’m curious about how they compare. For instance, it’s clear that if I train two word vector models on random subsets of Wikipedia, the trained models will be “similar” to each other. In contrast, if I train word vectors over a Twitter dataset, the new vectors will be “different” from the Wikipedia-trained vectors.

To make it easier to quantify this difference, I wrote repcomp, a Python package for comparing embeddings. repcomp supports the following embedding comparison approaches:

• Nearest Neighbors: Fetch the nearest neighbor set of each entity according to embedding distances, and compare model A’s neighbor sets to model B’s neighbor sets.
• Canonical Correlation: Treat embedding components as observations of random variables and compute the canonical correlations between model A and model B.
• Unit Match: Form a unit-to-unit matching between model A’s embedding components and model B’s embedding components and measure the correlations of the matched units.
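To make the nearest-neighbors idea concrete, here is a small self-contained sketch (plain numpy, not repcomp’s actual implementation) that scores two embedding matrices by the average Jaccard overlap between each entity’s k nearest neighbors under model A and under model B:

```python
import numpy as np

def neighbor_overlap(emb_a, emb_b, k=5):
    """Average Jaccard overlap between each entity's k nearest
    neighbors under embedding A and under embedding B."""
    def knn_sets(emb):
        # Pairwise Euclidean distances; mask the diagonal so an
        # entity is never its own neighbor.
        d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        return [set(np.argsort(row)[:k]) for row in d]

    sets_a, sets_b = knn_sets(emb_a), knn_sets(emb_b)
    jaccards = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return float(np.mean(jaccards))

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 10))
print(neighbor_overlap(emb, emb))  # identical embeddings -> 1.0
print(neighbor_overlap(emb, rng.normal(size=(50, 10))))  # unrelated -> low, near chance level k/n
```

Identical models score 1.0, and completely unrelated models score near the chance level, so the metric gives an interpretable 0-to-1 similarity.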

You can install repcomp from PyPI:

pip install repcomp

We can use repcomp to easily compare two word embedding models that have been pre-trained on Twitter and Wikipedia data:

import gensim.downloader as api
import numpy as np
from repcomp.comparison import NeighborsComparison

# Load pre-trained word vectors from gensim's downloader
glove_wiki_50 = api.load("glove-wiki-gigaword-50")
glove_twitter_50 = api.load("glove-twitter-50")

# Build the embedding matrices over the shared vocabulary
shared_vocab = list(set(glove_wiki_50.vocab.keys()).intersection(
    set(glove_twitter_50.vocab.keys())))
glove_wiki_50_vectors = np.vstack([glove_wiki_50.get_vector(word) for word in shared_vocab])
glove_twitter_50_vectors = np.vstack([glove_twitter_50.get_vector(word) for word in shared_vocab])

# Run the comparison
comparison = NeighborsComparison()
print(comparison.run_comparison(glove_wiki_50_vectors, glove_twitter_50_vectors))
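For intuition on the canonical-correlation comparison, here is a short sketch (plain numpy, independent of repcomp) that computes the mean canonical correlation between two embedding matrices via QR whitening. A useful property of this measure is that it is invariant to invertible linear re-mixings of either embedding’s components:

```python
import numpy as np

def mean_canonical_correlation(emb_a, emb_b):
    """Mean canonical correlation between two embedding matrices
    whose rows correspond to the same entities."""
    # Center each embedding, then orthonormalize its columns (reduced QR).
    qa, _ = np.linalg.qr(emb_a - emb_a.mean(axis=0))
    qb, _ = np.linalg.qr(emb_b - emb_b.mean(axis=0))
    # The singular values of qa.T @ qb are the canonical correlations.
    return float(np.linalg.svd(qa.T @ qb, compute_uv=False).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 8))
rotated = base @ rng.normal(size=(8, 8))  # invertible re-mixing of components
print(mean_canonical_correlation(base, rotated))  # ≈ 1.0: same information, different basis
```

Because the re-mixed embedding carries exactly the same information as the original, the canonical correlations are all 1, whereas an unrelated embedding of the same entities scores much lower.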

We can also use repcomp to compare embeddings beyond word vectors: any two models that embed the same set of entities into matrices can be compared through the same interface.

If you’re interested in contributing to repcomp, please feel free to open up a Pull Request here!