Last week I was at KDD 2018 in London. This was my first time at KDD, and I had the opportunity to present our paper on embeddings at the Common Model Infrastructure workshop. I was really impressed by both the workshops and the main program, and I thought I’d share my thoughts on a few of my favorite presentations from the week.
Yee Whye Teh’s talk was an excellent overview of how meta-learning and “learning-to-learn” algorithms can improve performance in domains with limited labeled data. He covered a smattering of recent research in this space:
This cool strategy for sharing knowledge between multiple reinforcement learning tasks by simultaneously learning a default policy and several task-specific policies. The task-specific policies are guided by a KL-divergence loss to align as closely as possible to the default policy.
A nice algorithm for avoiding catastrophic forgetting (the phenomenon where a model erases the knowledge it learned from one task when trying to complete another) by enforcing a penalty for forgetting weights that were important for earlier tasks.
A new approach for building probabilistic neural models as alternatives to Gaussian Processes that can perform faster inference and represent highly expressive priors.
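The forgetting penalty in the second item can be sketched as a quadratic regularizer in the spirit of elastic weight consolidation. This is a minimal illustration, not the exact algorithm from the talk; the variable names and numbers below are made up for demonstration.

```python
import numpy as np

def forgetting_penalty(weights, old_weights, importance, lam=1.0):
    # Quadratic cost for moving each weight away from its value after
    # the earlier task, scaled by how much that weight mattered
    # (importance is typically a per-weight Fisher information estimate).
    return lam / 2.0 * float(np.sum(importance * (weights - old_weights) ** 2))

old = np.array([1.0, -0.5])   # weights after training on task A
imp = np.array([10.0, 0.1])   # weight 0 was important for task A

# Moving the important weight is heavily penalized; moving the
# unimportant weight is nearly free.
move_important = forgetting_penalty(np.array([0.0, -0.5]), old, imp)
move_unimportant = forgetting_penalty(np.array([1.0, 0.5]), old, imp)
```

In practice this penalty is added to the loss of the new task, so gradient descent trades off new-task performance against preserving old-task knowledge.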
When I was listening to Yee Whye Teh’s talk, it struck me that one of the reasons why humans are such effective decision makers is our ability to learn new and complex concepts from extremely little data. For example, if I show a child a single picture of an elephant, they will almost certainly have perfect accuracy in classifying pictures of elephants from then on.
Part of the reason why we can do this is that our perceptual systems are extremely good at feature engineering. We know that the characteristic features of the elephant are the trunk, tusks and size, because we have knowledge of other animals that we’ve formed over our lives. Our understanding of the world isn’t siloed: we can learn how to accomplish new tasks because we already have a good understanding of the sort of strategies that have worked in the past on similar tasks.
Comparatively, if I tried to train a simple CNN to recognize elephants in images, I would need to supply hundreds or thousands of examples. This sort of problem-by-problem approach to understanding different kinds of data and solving different kinds of problems without sharing knowledge or insights is suboptimal and data inefficient. The strategies that Yee Whye Teh discussed in his talk can help us reduce this inefficiency.
Chris Ré described how building models for sparsely labeled domains is one of the most challenging parts of modern ML. He also introduced his team’s tool for solving this problem, Snorkel.
There were about 100 talks at KDD about some kind of new “automatic machine learning system” that accepts labeled data and generates a trained model or prediction service. I recall this being the case back when I went to NIPS 2016, and it’s probably been the case at every ML conference since then. To me, this is an indication that the problem of ML algorithm development has been mostly reduced to the problems of constructing labeled datasets and measuring performance. Unfortunately these are pretty tough problems, and probably the single largest differentiating factor between domains where ML is successful and those where it is unsuccessful is the amount of labeled data. For natural image recognition and online product recommendation, labeled data is pretty easy to come by. For other problems like motion sensor classification, insurance pricing or disease prediction, labeled data is a natural bottleneck.
In some of these cases, we need to turn to weak supervision to build our dataset. Often this will involve utilizing noisy signals that we suspect are correlated with the true labels. The approach that Chris Ré describes involves constructing and aggregating many weak labeling functions to augment and manufacture training datasets, building models for the noise in these labeling functions, and then using these models to train a classifier. This seems like a promising approach, and the results he demonstrated in the talk and on the Snorkel page were certainly impressive.
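To make the labeling-function idea concrete, here is a toy sketch for a hypothetical spam-detection task. Snorkel's actual approach learns a generative model over the functions' accuracies and correlations; the naive majority vote below is just the simplest possible aggregator, and all the functions are invented for illustration.

```python
import re
from collections import Counter

ABSTAIN = None

# Hypothetical labeling functions: each votes 1 (spam), 0 (not spam),
# or abstains when it has no opinion.
def lf_contains_link(text):
    return 1 if "http" in text else ABSTAIN

def lf_all_caps(text):
    return 1 if text.isupper() else ABSTAIN

def lf_short_greeting(text):
    return 0 if re.match(r"^(hi|hello)\b", text.lower()) else ABSTAIN

LFS = [lf_contains_link, lf_all_caps, lf_short_greeting]

def weak_label(text):
    # Aggregate votes by simple majority, ignoring abstentions.
    votes = [lf(text) for lf in LFS if lf(text) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```

The weak labels produced this way are noisy, which is why modeling each function's accuracy (rather than weighting them equally) is the interesting part of the method.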
The folks from Airbnb gave a great talk on their embedding-based approach for search ranking. I had a few major takeaways:
They used a negative sampling approach to train their embeddings that was reminiscent of Facebook’s StarSpace, but they describe a few negative mining schemes that seemed pretty interesting. For example, they explicitly select some negatives from the same market as the positive samples. This seems like a pretty simple way to use domain knowledge to get the benefits of traditional hard negative mining in a faster and more stable way.
One of the largest challenges in building a system that regularly regenerates embeddings is managing re-training for the downstream components that depend on those embeddings. The authors avoid this issue by constructing those models to use cosine distance as a feature rather than the embeddings themselves. This is common practice, but I also feel like it limits the system a bit, since the embeddings themselves likely contain additional information that consuming models could benefit from exploiting.
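The appeal of this design is that the scalar feature is stable under changes to the embedding space. A minimal sketch of the feature computation (my illustration, not the paper's code):

```python
import numpy as np

def cosine_similarity(a, b):
    # A ranking model consumes this single scalar instead of the raw
    # embedding vectors, so retraining the embeddings (which may rotate
    # or rescale the space) does not force retraining the ranker.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The trade-off noted above is that collapsing two vectors into one similarity score discards whatever other structure the embeddings encode.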
They implemented a “Similarity Exploration Tool” for visualizing which listings are close to each other in the embedding space. In my experience, these kinds of tools are most useful when we use them to compare different embeddings side-by-side. This can help us understand how a new method might improve performance or how embedding spaces change over time. They didn’t show us any examples of this sort of exploration in the paper or talk though.