I recently attended ICLR 2019 in New Orleans, and I was lucky to have the opportunity to show off our paper on a novel attention module and image understanding dataset. I really enjoyed the entire conference, and I thought I’d share brief overviews of two of my favorite presentations from the workshops and main program.

Jure Leskovec’s Keynote on “Deep Graph Generative Models” at the Representation Learning on Graphs and Manifolds Workshop.

Generative graph modeling problems take a number of forms. One of the most common is simply: given a distribution of graphs defined by some dataset, generate a new graph from that distribution. Another is: generate a graph that is as close as possible to some graph distribution or target graph while also optimizing a given objective, subject to constraints. For example, we might use a method like this to build a new molecule that has a desired level of solubility or other special properties while maintaining its reactive properties.

There are a number of significant challenges in this domain. For example, the output space (nodes and edges) is both large and highly variable, which makes it hard to apply the kinds of techniques that have been successful in vision or short-text processing tasks. Even worse, graph representations are non-unique: the same graph can be represented in many different ways, and there is no known polynomial-time algorithm for deciding whether two representations describe the same graph (graph isomorphism). This makes it challenging to track convergence.

Jure introduced two example modeling paradigms for solving these problems:

GraphRNN

Idea: If we generate graphs based on a sequence that defines the addition of new nodes and edges, we can represent the node permutations and different graph representations as different sequence orderings.

  • GraphRNN generates graphs using an RNN trained to mimic a graph distribution. The graphs are represented as sequences of nodes and edges, added one at a time.
  • This process operates based on two RNNs that reference and modify an adjacency matrix:
    • One RNN generates new columns in the matrix (adding vertices)
    • The other populates that column with 1/0s (adding edges)
  • To improve tractability, we can assume we build based on neighborhoods: when we add a new node, we only add edges to the last N nodes. This lets us generate bigger graphs much more easily, and lets each generation step involve only a fixed-size vector (see the sketch below).
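To make the two-RNN generation loop concrete, here is a minimal sketch. It is not the paper's implementation: TinyGraphRNN, its layer sizes, and the single linear edge head (standing in for the paper's edge-level RNN) are all illustrative assumptions. Each step consumes the previous node's edge vector, updates the graph-level state, and samples connections to the last N nodes.

```python
import torch
import torch.nn as nn

class TinyGraphRNN(nn.Module):
    def __init__(self, n_prev=8, hidden=64):
        super().__init__()
        self.n_prev = n_prev  # only connect each new node to the last N nodes
        # Graph-level RNN: consumes the previous node's edge vector and keeps
        # a hidden state summarizing the graph built so far.
        self.node_rnn = nn.GRU(input_size=n_prev, hidden_size=hidden, batch_first=True)
        # Edge head: predicts connection probabilities to the last N nodes.
        self.edge_head = nn.Linear(hidden, n_prev)

    def sample(self, max_nodes=20):
        edges, h = [], None
        prev = torch.zeros(1, 1, self.n_prev)  # start token: "no previous edges"
        for _ in range(max_nodes):
            out, h = self.node_rnn(prev, h)                    # add a vertex
            probs = torch.sigmoid(self.edge_head(out[:, -1]))
            new_edges = torch.bernoulli(probs)                 # 1/0s: edges to the last N nodes
            edges.append(new_edges.squeeze(0))
            prev = new_edges.unsqueeze(1)                      # feed back as the next input
        return torch.stack(edges)  # (max_nodes, n_prev) band of the adjacency matrix

model = TinyGraphRNN()
print(model.sample(max_nodes=10).shape)  # torch.Size([10, 8])
```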

Graph Convolutional Policy Network

Idea: Use an RL policy to sequentially modify a graph to optimize some sort of objective, subject to some set of constraints.

  • This requires graph modifications to optimize for both immediate and long-term rewards, and combines graph representation learning with reinforcement learning
    • Intermediate rewards are given for keeping the molecule valid, whereas the larger final reward is given at the end of graph construction if the molecule satisfies the constraints. This means a short-term “invalidation” of the graph can still pay off later if the molecule returns to validity, which makes the construction and editing process more flexible than a rigid constraint (see the sketch after this list).
  • We can also add an additional loss based on a discriminator that attempts to distinguish the generated molecule from a real molecule (similar to a GAN).
  • Some example special cases include “maximize this property”, “get this property in a certain range”, “optimize this property within a fixed number of steps so that the overall structure is maintained”.
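As a rough illustration of the reward structure described above, here is a hedged sketch. The helpers is_valid, property_score, and discriminator_score are hypothetical placeholders (stubbed out here), not GCPN's actual API.

```python
def is_valid(mol):             # placeholder: a real check might use chemical sanitization
    return True

def property_score(mol):       # placeholder: the property being optimized, e.g. solubility
    return 0.0

def discriminator_score(mol):  # placeholder: GAN-style discriminator output for "realness"
    return 0.0

def step_reward(molecule_graph):
    # Small intermediate reward for keeping the molecule chemically valid; a
    # temporary excursion through an invalid state only costs a small penalty
    # now, and can be made up later if subsequent edits restore validity.
    return 0.1 if is_valid(molecule_graph) else -0.1

def final_reward(molecule_graph, target_range):
    # Larger terminal reward if the optimized property lands in the target
    # range, plus a GAN-style term from a discriminator trained to tell
    # generated molecules from real ones.
    score = property_score(molecule_graph)
    in_range = target_range[0] <= score <= target_range[1]
    return (1.0 if in_range else -1.0) + discriminator_score(molecule_graph)
```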

Jonathan Frankle’s Talk on the Lottery Ticket Hypothesis

Pruning is a common technique for finding a small “subnetwork” within a trained larger network that performs nearly as well as the larger network. The technique involves training the network, then deleting the weights that contribute least. Although this tends to work well, if we prune a big network and then try to train the pruned subnetwork from scratch, we see significantly reduced performance. The authors demonstrate that, in certain cases, if we keep the initialization of the subnetwork exactly the same as it was during the large-network training and then train it from scratch, the subnetwork ends up with the same performance as the full network.

This leads to the lottery ticket hypothesis: A randomly-initialized, dense neural network contains a subnetwork that is initialized such that when trained in isolation it can match the test accuracy of the original network after training for at most the same number of iterations.

This is a really exciting result, because it suggests that a smarter initialization strategy may allow us to achieve significantly better performance with much smaller networks. For the moment the only way to find this subnetwork is to train and prune, but that will hopefully change in the future.
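For concreteness, here is a hedged sketch of that train-prune-rewind loop. It assumes a generic PyTorch model and a caller-supplied train(model, masks) function that zeroes out masked weights during training; pruning every parameter tensor by magnitude is a simplification of the paper's setup.

```python
import copy
import torch

def find_lottery_ticket(model, train, prune_fraction=0.2, rounds=5):
    init_state = copy.deepcopy(model.state_dict())  # remember the original initialization
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)  # train the (masked) network to convergence
        with torch.no_grad():
            for name, param in model.named_parameters():
                # Prune the smallest-magnitude weights that are still alive.
                alive = param[masks[name].bool()].abs()
                if alive.numel() == 0:
                    continue
                threshold = alive.quantile(prune_fraction)
                masks[name] = masks[name] * (param.abs() > threshold).float()
        # Rewind the surviving weights to their original initial values.
        model.load_state_dict(init_state)
    return masks  # train(model, masks) once more to evaluate the "winning ticket"
```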