As researchers apply Machine Learning to increasingly complex tasks, there is mounting interest in strategies for combining multiple simple models into more powerful algorithms. In this post we will explore some of these techniques. We will use a little bit of language from Category Theory, but not much.
Throughout this post we will use the following notation and terminology: a Machine Learning model is a function of the form \(D \rightarrow (X \rightarrow Y)\), where \(D\) is a dataset and \((X \rightarrow Y)\) is a function that maps samples in the space \(X\) to samples in the space \(Y\). In other words, a model takes a dataset and returns a trained prediction function. The dataset \(D\) may contain pairs of samples \((x,y) \in X \times Y\) (supervised learning), just samples \(x \in X\) (unsupervised learning), or anything else. This is of course a very limited perspective on Machine Learning models. Although this post will focus mainly on supervised and unsupervised learning, there are many more examples of composition in reinforcement learning and beyond.
The most general way to combine Machine Learning models is just to place them “side-by-side”. There are a few ways to do this:
Given models of the forms:\[T_1: D_1 \rightarrow (X_1 \rightarrow Y_1), T_2: D_2 \rightarrow (X_2 \rightarrow Y_2)\]
We can attach them in parallel to get a model:\[h: D_1 \times D_2 \rightarrow (X_1 \times X_2 \rightarrow Y_1 \times Y_2)\]
At both training and inference time the composite model independently executes the component models. We can think of this sort of composition as zooming out our perspective to see the two separate and noninteracting models as part of the same whole. In Backprop as Functor the authors define this sort of composition to be the monoidal product in their category \(Learn\).
For example, say we have a software system that contains two modules: one for training a linear regression on driving records to predict insurance premiums and one for training a decision tree on credit history to predict mortgage approvals. We can think of this system as containing a single module that trains a linear regression \(\times\) decision tree on pairs of driving records and credit history to predict pairs of insurance premiums and mortgage approvals.
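A minimal sketch of this parallel composition in Python follows. The names `parallel`, `fit_slope`, and `fit_threshold` are hypothetical, and the two toy trainers merely stand in for the linear regression and decision tree in the example above.

```python
from typing import Callable, List, Tuple


def fit_slope(pairs: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Toy trainer T_1: least-squares line through the origin."""
    slope = sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)
    return lambda x: slope * x


def fit_threshold(pairs: List[Tuple[float, bool]]) -> Callable[[float], bool]:
    """Toy trainer T_2: threshold at the midpoint of the two class means."""
    pos = [x for x, label in pairs if label]
    neg = [x for x, label in pairs if not label]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x >= cut


def parallel(t1, t2):
    """Build h: D_1 x D_2 -> (X_1 x X_2 -> Y_1 x Y_2) from T_1 and T_2."""
    def trainer(datasets):
        d1, d2 = datasets
        f1, f2 = t1(d1), t2(d2)  # the components are trained independently

        def predict(xs):
            x1, x2 = xs
            return f1(x1), f2(x2)  # and they are run independently too
        return predict
    return trainer


d1 = [(1.0, 2.0), (2.0, 4.0)]      # made-up (x, y) pairs for T_1
d2 = [(0.0, False), (10.0, True)]  # made-up (x, label) pairs for T_2
h = parallel(fit_slope, fit_threshold)((d1, d2))
print(h((3.0, 7.0)))
```

Neither component ever sees the other's data or output, which is what makes this the least interactive form of composition.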
Given a set of Machine Learning models that accept the same input, there are a number of side-by-side composition strategies, called ensemble methods, that involve running each model on the same input and then applying some kind of aggregation function to their output. For example, if the models in our set all produce outputs in the same space, we could simply train them independently and average their outputs. The models in an ensemble are generally trained in concert, perhaps on different slices of the same dataset.
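As a rough illustration, here is an averaging ensemble in which each member is trained on a different slice of the same dataset. The helper names (`fit_slope`, `averaging_ensemble`) and the data are made up for this sketch, and averaging is only one of many possible aggregation functions.

```python
from typing import Callable, List, Sequence, Tuple

Pairs = List[Tuple[float, float]]


def fit_slope(pairs: Pairs) -> Callable[[float], float]:
    """Toy trainer: least-squares line through the origin."""
    slope = sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)
    return lambda x: slope * x


def averaging_ensemble(trainers: Sequence[Callable]):
    """Train one model per data slice, then average their predictions."""
    def trainer(slices: Sequence[Pairs]):
        models = [t(s) for t, s in zip(trainers, slices)]

        def predict(x: float) -> float:
            outputs = [m(x) for m in models]   # every model sees the same input
            return sum(outputs) / len(outputs)  # aggregation function: the mean
        return predict
    return trainer


data = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0), (2.0, 6.0)]
slices = [data[:2], data[2:]]  # two slices of the same dataset
ensemble = averaging_ensemble([fit_slope, fit_slope])(slices)
print(ensemble(2.0))
```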
Another way to combine Machine Learning models is to use the output of one model as the input to another. That is, say we have two models:\[T_1: D_1 \rightarrow (X \rightarrow Y), T_2: D_2 \rightarrow (Y \rightarrow Z)\]
that we combine into a model \(h: D_3 \rightarrow (X \rightarrow Z)\), where the form of the dataset \(D_3\) depends on how we train the components. At inference time, \(h\) operates on some \(x \in X\) by first running the trained version of \(T_1\) to get a \(y \in Y\) and then running the trained version of \(T_2\) on \(y\) to get the output \(z \in Z\). Within this framework, there are a number of ways that we can train \(T_1\) and \(T_2\):
Unsupervised Feature Transformations
The most straightforward form of input-output composition is the class of unsupervised learned feature transformations. In this case \(D_1\) is a dataset of samples from \(X\) and \(T_1: D_1 \rightarrow (X \rightarrow Y)\) is an unsupervised Machine Learning algorithm. The learning processes of \(T_1\) and \(T_2\) proceed sequentially: \(T_2\) is trained on the output of \(T_1\), and this training does not begin until \(T_1\) is fully trained. At that point we use the trained \(T_1\), together with \(D_1\) and the corresponding labels in \(Z\), to create the dataset \(D_2\) of samples in \(Y \times Z\) that we use to train \(T_2\).
Some examples of this include:
- PCA: \(T_1\) learns a linear projection from \(X\) to a subspace \(Y\).
- Standardization: \(T_1\) learns the mean and variance of each component of \(X\) and transforms samples from \(X\) by rescaling them to have zero mean and unit variance.
- GMM: \(T_1\) learns a mapping from \(X\) to the space \(Y\) of vectors of posterior probabilities for each mixture component.
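The sequential training scheme can be sketched with standardization as \(T_1\) and a one-dimensional least-squares line as \(T_2\). All names here (`fit_standardizer`, `fit_line`, `compose_sequentially`) are hypothetical, and the sketch assumes that the labels in \(Z\) are available alongside the samples of \(X\) when \(D_2\) is built.

```python
import math
from typing import Callable, List, Tuple


def fit_standardizer(xs: List[float]) -> Callable[[float], float]:
    """T_1 (unsupervised): learn the mean and standard deviation of X."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs)) or 1.0
    return lambda x: (x - mean) / std


def fit_line(pairs: List[Tuple[float, float]]) -> Callable[[float], float]:
    """T_2 (supervised): least-squares line with an intercept."""
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_z = sum(z for _, z in pairs) / n
    slope = (sum((y - mean_y) * (z - mean_z) for y, z in pairs)
             / sum((y - mean_y) ** 2 for y, _ in pairs))
    return lambda y: slope * (y - mean_y) + mean_z


def compose_sequentially(t1, t2):
    """Train T_1 fully, build D_2 from its outputs, then train T_2."""
    def trainer(labeled: List[Tuple[float, float]]):
        transform = t1([x for x, _ in labeled])          # T_1 never sees Z
        d2 = [(transform(x), z) for x, z in labeled]     # samples in Y x Z
        predict = t2(d2)                                 # T_2 starts only now
        return lambda x: predict(transform(x))
    return trainer


labeled = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]  # made-up (x, z) pairs
h = compose_sequentially(fit_standardizer, fit_line)(labeled)
print(h(4.0))
```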
Supervised Feature Transformations
A similar but slightly more complex form of input-output composition is the class of supervised learned feature transformations. In this case \(D_1\) is a dataset of samples from \(X \times Z\) and \(T_1: D_1 \rightarrow (X \rightarrow Y)\) is a Machine Learning algorithm that transforms samples from \(X\) into a form \(Y\) that may be easier for a downstream model to use when generating predictions in \(Z\). Just like in unsupervised feature transformations, the learning processes of \(T_1\) and \(T_2\) proceed sequentially, and we use the trained version of \(T_1\) and the dataset \(D_1\) to create the dataset \(D_2\) of samples in \(Y \times Z\) that we use to train \(T_2\).
Some simple examples of this include:
- Feature Selection: \(T_1\) transforms \(X\) by removing features that are not useful for predicting \(Z\).
- Supervised Discretization: \(T_1\) learns to represent the samples from \(X\) as vectors of one-hot encoded bins, where the bins are chosen based on the relationship between the distributions of the components of \(X\) and \(Z\).
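To make the feature selection case concrete, here is a small sketch in which \(T_1\) keeps the features most correlated with the target. Using absolute Pearson correlation as the score is an assumption made for illustration, as are the names `correlation` and `fit_selector` and the toy data.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, ...]


def correlation(us: Sequence[float], vs: Sequence[float]) -> float:
    """Pearson correlation, defined as 0 when either input is constant."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = math.sqrt(sum((u - mu) ** 2 for u in us))
    sv = math.sqrt(sum((v - mv) ** 2 for v in vs))
    return 0.0 if su == 0 or sv == 0 else cov / (su * sv)


def fit_selector(pairs: List[Tuple[Sample, float]],
                 k: int = 1) -> Callable[[Sample], Sample]:
    """T_1 (supervised): keep the k features most correlated with Z."""
    xs = [x for x, _ in pairs]
    zs = [z for _, z in pairs]
    scores = [abs(correlation([x[j] for x in xs], zs))
              for j in range(len(xs[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])  # preserve the original feature order
    return lambda x: tuple(x[j] for j in keep)


# Made-up data: the second feature is constant, so it carries no signal.
d1 = [((1.0, 5.0), 2.0), ((2.0, 5.0), 4.0), ((3.0, 5.0), 6.0)]
select = fit_selector(d1, k=1)
d2 = [(select(x), z) for x, z in d1]  # the dataset used to train T_2
print(d2)
```

Unlike the unsupervised case, `fit_selector` reads the targets in \(Z\) while it is being trained.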
A more complex example of a supervised feature transformation is the vertical composition of decision trees. If we have two sets of decision rules from which we can build decision trees, we can combine them to form a composite decision tree that first applies all of the rules in the first group and then applies all of the rules in the second group.
End-to-End training is probably both the most complex and most studied form of input-output composition of Machine Learning models. This paper and this paper and this paper all build categories on top of this kind of composition.
In both unsupervised and supervised feature transformations, the training process for \(T_2\) does not begin until \(T_1\) is fully trained. In contrast, in end-to-end training, we train \(T_1\) and \(T_2\) at the same time from a set of samples in \(X \times Z\). We never explicitly construct the datasets \(D_1\) or \(D_2\). In general, we need our Machine Learning models to have a special structure in order to employ this strategy. For example, the Backprop as Functor paper defines the notions of request and update functions to characterize this structure. Because of the chain rule, we can define these functions and employ end-to-end training whenever our models are parametric and differentiable.
The clearest example of end-to-end training is the composition of layers in a neural network, which we train with Backpropagation.
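The following toy sketch trains two one-parameter "layers" jointly with plain gradient descent. The function name, learning rate, epoch count, and data are illustrative assumptions, and the gradient passed from \(T_2\) back to \(T_1\) is only loosely analogous to the request function mentioned above.

```python
from typing import Callable, List, Tuple


def train_end_to_end(pairs: List[Tuple[float, float]],
                     lr: float = 0.01,
                     epochs: int = 200) -> Callable[[float], float]:
    """Jointly train T_1 (y = a * x) and T_2 (z = b * y) on samples in X x Z."""
    a, b = 1.0, 1.0  # parameters of T_1 and T_2
    for _ in range(epochs):
        for x, target in pairs:
            # Forward pass through both components.
            y = a * x
            z = b * y
            # Backward pass for the loss 0.5 * (z - target) ** 2.
            grad_z = z - target
            grad_b = grad_z * y  # T_2's own parameter gradient
            grad_y = grad_z * b  # signal that T_2 passes back to T_1
            grad_a = grad_y * x  # chain rule: T_1's parameter gradient
            # Both components are updated in the same step.
            a -= lr * grad_a
            b -= lr * grad_b
    return lambda x: b * (a * x)


pairs = [(1.0, 6.0), (2.0, 12.0)]  # made-up samples of z = 6 * x
h = train_end_to_end(pairs)
print(h(3.0))
```

No intermediate dataset in \(Y\) is ever built: the values of `y` exist only inside each training step, and only the product `a * b` is constrained by the data.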
In meta-learning, or learning to learn, the training or “update” function for one Machine Learning model is defined by another Machine Learning model. In certain cases, like those described in this paper, we can define a notion of composition where \(T_1 \circ T_2\) is a model with an inference function equivalent to that of \(T_1\) and a training function defined based on \(T_2\)’s inference and training functions. This is described in more detail for the parametric and differentiable case here.
This is just a small sample of techniques for building complex models from simple components. Machine Learning is growing rapidly, and there are many more strategies for model composition than are addressed here.