At the heart of Machine Learning is data. In all Machine Learning problems, we use data generated by some process in order to make inferences about that process. In the most general case, we know little to nothing about the data-generating process, and the data itself is just a string of numbers or values, devoid of inherent structure or meaning. However, once we make assumptions about this process or what the data represents, we can build on our data to develop models with impressive predictive power.

One of the most common assumptions about a set of data is that we can model its generating process with some unobserved probability distribution. This assumption is shared across much of classical statistical learning theory and practice, from Valiant’s PAC Learning to Murphy’s Machine Learning: A Probabilistic Perspective. Both the Bayesian and Frequentist perspectives on statistical learning ascribe to this assumption: they disagree only over which other components of the problem should also be modeled as having been drawn from such a distribution (Bayesians assume that the predictor variables and model structure are also drawn from an unobserved probability distribution, whereas Frequentists assume that these are fixed quantities).

One major benefit of this assumption is that it provides a clear path towards constructing a predictive model for our data by studying the generating distribution. For example, both classical Computational Learning Theory and modern Statistical Learning Theory study how our assumptions about an unseen distribution affect our ability to use samples from that distribution to build an accurate model. In general, simpler and less noisy distributions are easier to model with fewer samples, and we can build more effective and data-efficient algorithms when we make stronger assumptions about the structure of the generating distributions.

Another major benefit of this assumption is that it allows us to represent a dirty and unstructured piece of data with a clean mathematical object (the generating distribution). This shift in focus allows us to define complex optimization problems in terms of the distribution’s behavior. For example, in both the Maximum Likelihood (Frequentist) and Maximum a Posteriori (Bayesian) procedures we define an optimization problem in terms of how likely the observed data would be, given some assumptions about the unobserved distribution.

Another common assumption is that there exists some metric function \(d: X \times X \rightarrow \mathbb{R}\) such that the value \(d(x_1, x_2)\) of this function at the pair of data points \(x_1, x_2\) is a meaningful representation of the relationship between \(x_1\) and \(x_2\). Many unsupervised learning algorithms rely only on this assumption. For example, while both K-Means and PCA (Principal Components Analysis) can be derived from a probabilistic perspective, the optimization objectives for both algorithms are meaningful even in the absence of any probabilistic assumptions.

By reasoning about the inferences we would like to make and making assumptions about the data generating process, we can rigorously define a problem statement. Some problem statements, such as the problem statement for maximal linkage overlapping clustering (place all points with pairwise distances less than \(\delta\) into the same cluster) are explicit, whereas others, such as the problem statement for Maximum Likelihood, require more advanced techniques to effectively solve.

The combination of the assumptions we make, problem statements we derive, and optimization techniques we use is extremely complicated. One strategy for developing an understanding of such a complex system is to study it in the simplest setting. By removing everything other than the minimal assumptions and structures, we can derive insights into the system’s fundamental structure. We propose that two of the most important kinds of structure to focus on are Functoriality and Compositionality.


A compositional system is one in which larger or more complex pieces are made up of smaller or simpler ones. When we study a compositional system we focus on the ways in which we can effectively combine smaller pieces of the same kind and understand their compositional hierarchy. Most modern Machine Learning applications, both at the system-level and the model-level, are inherently compositional. By developing a framework to study the composition of these systems, models and algorithms we may be able to better understand, constrain and guide their behaviors.

For example, one of the most important objects of study in Machine Learning is the class of algorithms for finding the minima or maxima of functions. A common algorithm is backpropagation, which is a particular form of gradient descent optimized for neural network models. Fong et al derive a framework for studying how the composition of the steps that the backpropagation algorithm takes over a simple function mirrors the steps that the algorithm takes over a composite function. This framework then provides a blueprint for the ways that we can compose and recombine functions without breaking the optimization procedure.

Today, the construction of most non-trivial Machine Learning systems is largely guided by heuristics about what works well in practice. While the individual components of complex models are generally well-developed mathematically, their composition and combinations tend to be poorly understood. For example, we do not have a good mathematical language to describe how feature preprocessing strategies like MDL (minimum description length) interact with hand-designed optimizers like Adam. By studying the compositional behavior of statistical models and other system components, we may be able to predict the results of these composites and define more intelligent rules around their combinations.


A Functor is a “map between categories that preserves identity morphisms and morphism composition”. Underlying this technical definition is a powerful concept: the functoriality of a transformation is a blueprint for its structure, expressed in terms of the invariants it preserves. If a given transformation is functorial over some pair of categories, then the transformation preserves the structure represented in those categories. One way to develop a deeper understanding of a particular transformation or class of transformations is to identify the most interesting settings under which it is functorial. Once we have done this we can derive extensions or alternate versions of this transformation that preserve this functoriality, as well as modifications that break it. We summarize this with the following recipe:

  • Identify a transformation of interest.
  • Find the most interesting categories over which this transformation is functorial.
  • Figure out what interesting extensions we can make to the transformation such that it remains functorial.

To give a concrete example, Carlsson and Memoli investigate the categories over which certain clustering algorithms are functorial. They demonstrate how certain algorithms are more brittle than others and can lose their functoriality when we change the set of morphisms in the source category.


In order to build a rigorous framework for reasoning about the relationships between the components of a Machine Learning system, we will need powerful tools for understanding the key transformations in the system. By studying the underlying compositional and functorial structure in Machine Learning systems we may be able to improve upon them.