Optimization-Based Meta-Learning


This blog post was written by Richard Dargan, Johanna Karras, and Eric Moreno.


Based on [Learning to Learn by Gradient Descent by Gradient Descent]

1. Introduction and Background

1.1. Introduction

In many machine learning tasks, models are developed specifically to tackle a given problem. An example of this is a classifier that is trained on images to tell whether or not they contain a dog. We can train many different models, each tailored to a problem that we have formulated for it. It is the power of machine learning that we can feed a model data and it can discover trends and patterns that humans have failed to find analytically. We have seen models outperform humans and human-designed algorithms in various tasks. An example of this is how an AI team was able to beat a professional human team at the game of Dota 2.

With such potential in machine learning techniques, one starts to wonder whether we can build a model that learns to build models. These models could easily adapt to new environments that they were not trained in. For example, you could train a game bot that learns how to train a model for any given game. It could be taught on a small sample of games, and when it encounters a new game it would learn to play it on its own. This process is known as learning to learn.

We can formalize this description as follows. A typical model is given some error function which it is trying to minimize. For a function $F(a)$, it will find $a \in A$ to minimize $F(a)$. This can be accomplished by different methods, but a typical approach is gradient descent. If the function $F$ is differentiable, we follow the gradient to try to find a minimum of the function. Over a number of steps, we adjust the parameter $a$ by the update rule

$a_{t+1} = a_t - \gamma \nabla F(a_t)$

When using such an update rule, the creator of this model must explicitly set the learning rate $\gamma$ by some method of their choosing. There has been much research into defining the ideal update rules for different models and situations. In a learning to learn method, we instead let a model learn the best way to teach the underlying model. We define an optimizer function $G$ with parameters $b$ and replace the hand-designed gradient step with the rule

$a_{t+1} = a_t + G(\nabla F(a_t), b)$
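
As a toy illustration of the difference between the two update rules, consider the sketch below; the objective $F(a) = a^2$ and the placeholder optimizer $G$ with parameter $b$ are hypothetical choices made only for this example:

```python
def grad_F(a):
    # gradient of F(a) = a**2, a stand-in for any differentiable objective
    return 2 * a

a = 5.0
gamma = 0.1                      # hand-chosen learning rate
a = a - gamma * grad_F(a)        # classic gradient descent update

def G(grad, b):
    # placeholder learned optimizer: in the paper G is an RNN whose
    # parameters b are themselves learned rather than hand-set
    return -b * grad

b = 0.1                          # optimizer parameters
a = a + G(grad_F(a), b)          # learned update rule
```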

Throughout this blog we will explore a specific implementation of this using recurrent neural networks (RNNs) from [Learning to Learn by Gradient Descent by Gradient Descent]. This network design can maintain its own state and update that state with every iteration. In particular, we discuss the use of a variant of RNNs known as the Long Short-Term Memory (LSTM) network.

We will look at some more examples of learning to learn, discuss the theory behind it (Section 2), and discuss the results and possible future work (Sections 3 and 4).

1.2. Generalization

The idea of creating a model that learns to learn on any problem it is given sounds appealing. Before we continue, we must carefully consider the constraints and be careful in how we define the problem itself. The standard approach is to look at a problem analytically and then design an algorithm by hand from the insights gained. Now, we make algorithm design itself the learning problem. The model will then learn how to perform well over a class of related optimization problems.

In a usual problem we are given some training points and expect a well performing model to generalize accurately to points not in the training set. With the learning to learn approach, we make instances of problems the example training points. Then, we expect the model to transfer well to problems outside the training set.

This creates a model with effectively two layers. We have the lower layer which is learning a given problem and an upper layer which is learning how to improve the lower layer on a set of training problems. The second layer should be effective in generalizing the lower layer to any set of related problems. One can further imagine that if this is effective, we could keep adding further layers. The next layer in this structure could learn how to select methods for different classes of problems.

The area of learning to learn, also known as meta-learning, has been under investigation for decades. Early work by [Schmidhuber 1993] involved building networks that can modify their own weights. This allows the network and learning algorithm to be trained by gradient descent, but has problems with defining the update rules. More recent work includes [Younger et al. 2001], which creates a system where the backpropagation of one network can feed into another learning network, with both trained at the same time. The work we explore in this post, from [Learning to Learn by Gradient Descent by Gradient Descent], builds upon this work in particular.

2. Theory

2.1. Defining the Problem

A huge advantage of meta-learning is its ability to leverage past problems and solutions to learn the solution to a new problem. The input to the meta-learning model is a set of data sets $D$ = {$D_1, D_2, …, D_k$} where each data set $D_i$ contains a set of input to output pairs {$(x^i_1, y^i_1), (x^i_2, y^i_2), …, (x^i_n, y^i_n)$}. The goal of the model is to learn the functions, $f_i$, that produced each data set $D_i$. In other words, to learn $f_i$ such that $f_i(x^i_j) = y^i_j$.

We do so by learning not only the “optimizee” $f_i$, but also the optimizer that learns each $f_i$. In this way, we are able to take advantage of previous “knowledge” to learn new functions.
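
To make this data structure concrete, here is a small sketch in which each $D_i$ is generated by a random sinusoid; the sinusoid family is purely an illustrative assumption, not a task from the paper:

```python
import numpy as np

def make_dataset(n_points=10):
    # each dataset D_i is produced by an unknown function f_i -- here a
    # random sinusoid -- and contains (x, y) pairs sampled from it
    amplitude = np.random.uniform(0.5, 5.0)
    phase = np.random.uniform(0.0, np.pi)
    f_i = lambda x: amplitude * np.sin(x + phase)
    xs = np.random.uniform(-5.0, 5.0, n_points)
    return [(x, f_i(x)) for x in xs]

D = [make_dataset() for _ in range(100)]   # D = {D_1, ..., D_k}
```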

For example, consider the task of learning to classify characters in different alphabets. Each alphabet is unique, but the process of learning to classify characters within an alphabet is the same.

picture Image from [Lake et al. Science 2015].

2.2 Training and Testing

The set $D$ of datasets is split into training and testing datasets.

K-shot meta-learning is defined as having K training examples per class. Below is an example of 4-shot meta-learning, where each data set contains 4 examples of each image class.

picture [Image Source]
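
As a minimal sketch, a K-shot episode could be sampled as follows, where `examples_by_class` is a hypothetical mapping from each class label to its available examples:

```python
import random

def sample_k_shot(examples_by_class, k=4):
    # draw exactly k labelled examples per class, as in the 4-shot example above
    episode = []
    for label, examples in examples_by_class.items():
        for x in random.sample(examples, k):
            episode.append((x, label))
    return episode
```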

2.3. Objective Function

Let $\theta$ be the parameters of the optimizee and let $\phi$ be the parameters of the optimizer. The objective is to maximize the probability of obtaining the true $y$ given data $x$ or, equivalently, to minimize the loss between the learned function's output and the true $y$ values:

$\theta^* = \arg\max_{\theta} E_f [P(y \mid x)]$
$\theta^* = \arg\min_{\theta} E_f [L(\theta)]$

for $(x^i_j, y^i_j) \in D_i$ and for $D_i \in D$.

There are various ways of modeling the objective function. We will explore some examples below, including non-parametric, optimization-based, and RNN optimizer methods.

2.4. Non-Parametric Approach

In this approach, we train a Siamese neural network to classify images. This was implemented by [Koch, Zemel, and Salakhutdinov, 2015] for one-shot image classification. Essentially, the neural network is trained on pairs of images to distinguish similar and different objects, then generalized to classify images in a set.

The structure of the network is shown below; notice that the “twin networks” on the top and bottom share the same weights. picture [Koch, Zemel, and Salakhutdinov, 2015]

The loss function is defined below: picture [Koch, Zemel, and Salakhutdinov, 2015]
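
A minimal sketch of the Siamese idea is shown below; the input size and embedding dimensions are illustrative assumptions, but the key ingredients - shared encoder weights, a learned weighting over the component-wise L1 distance, and a sigmoid similarity score - follow [Koch, Zemel, and Salakhutdinov, 2015]:

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        # the same encoder (shared weights) is applied to both inputs
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        # learned weights over the component-wise L1 distance
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, x1, x2):
        h1, h2 = self.encoder(x1), self.encoder(x2)
        return torch.sigmoid(self.head(torch.abs(h1 - h2)))  # P(same class)
```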

2.5. Optimization Based Approach

In an optimization-based approach, we optimize a neural network to model $P(y|x)$ by alternately optimizing the parameters of the optimizer and optimizee. The general framework is below:

  1. For each $D_i$ in our training set, compute adapted optimizee parameters with a gradient step: $\phi_i \xleftarrow{} \theta - \alpha \nabla_{\theta}L(\theta, D_i)$
  2. For each $D_j$ in our testing set, update the optimizer parameters using the loss of the adapted parameters: $\theta \xleftarrow{} \theta - \beta \nabla_{\theta}L(\phi_j, D_j)$

A well-known optimization approach, MAML (Model Agnostic Meta Learning), can be visualized graphically below:

picture [Finn et al., 2017]

Its implementation closely resembles the general framework described above.

picture [Finn et al., 2017]
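
A hedged sketch of this MAML-style inner/outer loop is below; `model_loss(theta, batch)` is a hypothetical helper returning the optimizee loss for a flat parameter tensor `theta` on a data batch, and the learning rates are illustrative:

```python
import torch

def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    # theta: flat tensor of model parameters with requires_grad=True
    # tasks: iterable of (train_batch, test_batch) pairs, one per dataset D_i
    # model_loss(theta, batch) is assumed to be defined elsewhere
    meta_grad = torch.zeros_like(theta)
    for train_batch, test_batch in tasks:
        # inner loop: adapt the shared parameters to this task
        loss = model_loss(theta, train_batch)
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        phi = theta - inner_lr * grad
        # outer loop contribution: evaluate the adapted parameters on held-out data
        test_loss = model_loss(phi, test_batch)
        meta_grad += torch.autograd.grad(test_loss, theta)[0]
    # meta-update of the shared initialization
    return (theta - outer_lr * meta_grad / len(tasks)).detach().requires_grad_()
```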

One interesting adaptation of the optimization approach, from [Ravi & Larochelle, ICLR 2017], replaces the hand-designed gradient-descent update rule with a learned neural network. This implementation is further developed and explored in greater detail in the following RNN Optimizer Approach section.

2.6. RNN Optimizer Approach

This model-based approach, implemented in [Learning to Learn by Gradient Descent by Gradient Descent], uses a recurrent neural network to take advantage of memory in updating the optimizee and optimizer. Our objective is:

$L(\phi) = E_f[f(\theta^*(f, \phi))]$

Note that the loss depends on the optimizer parameters $\phi$ through the final optimizee parameters $\theta^*(f, \phi)$. Taking time-steps $t = 1, \dots, T$, we rewrite the objective to take intermediate optimization steps into account:

$L(\phi) = E_f \left[ \sum_{t=1}^T w_t f(\theta_t) \right]$

Graphical Representation

Let $m$ denote the optimizer and let $\theta_t$ denote the optimizee's parameters at time $t$. Under standard gradient descent, the optimizee would be updated as

$\theta_{t+1} = \theta_t - \alpha_t \frac{\partial L_{t+1}}{\partial \theta_t}$

where $\alpha_t$ is the learning rate. In the learned setting, this hand-designed update is replaced by the output $g_t$ of the optimizer $m$, so that $\theta_{t+1} = \theta_t + g_t$.

picture [Image Source]

The dashed lines are where we do not perform back-propagation when training the optimizer $m$.
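
A sketch of training against this unrolled objective, with the weights $w_t$ and without back-propagating through the optimizee gradients (the dashed lines), might look like the following. Here `optimizer_net` stands for a learned optimizer module such as the coordinate-wise LSTM described below, and `sample_task` is a hypothetical function returning a loss $f$ and initial parameters $\theta_0$:

```python
import torch

def meta_loss(optimizer_net, sample_task, T=20, weights=None):
    f, theta = sample_task()                  # theta: tensor with requires_grad=True
    weights = weights or [1.0] * T            # the w_t in the objective above
    state, total = None, 0.0
    for t in range(T):
        loss = f(theta)
        total = total + weights[t] * loss
        # compute the optimizee gradient; retain the graph so `total` can still be
        # back-propagated, but take no second derivatives through this gradient
        # (this mirrors the dashed lines in the figure)
        grad, = torch.autograd.grad(loss, theta, retain_graph=True)
        update, state = optimizer_net(grad.detach().unsqueeze(-1), state)
        theta = theta + update.squeeze(-1)    # optimizee step proposed by the optimizer
    return total                              # minimized w.r.t. the optimizer's parameters
```

In training, one would repeatedly sample a task, compute `meta_loss`, call `.backward()`, and take an ADAM step on `optimizer_net`'s own parameters.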

LSTM Meta-Learning

An explicit model for the optimizer RNN, implemented in [Ravi and Larochelle (2017)], is an LSTM. An LSTM is useful because it allows the model to incorporate information from past gradient steps, like momentum, and the cell-state update for an LSTM has a similar form to the usual gradient descent update rule.

The paper builds on this approach by implementing a “coordinate-wise LSTM optimizer” that operates on each parameter using a two-layer LSTM.

picture [Image Source]

The benefit of operating on each parameter $\theta^i$ separately is to allow for a large number of parameters without building a massive fully connected RNN. The coordinate-wise LSTM optimizer keeps separate hidden states for each coordinate, but shares the LSTM parameters across all coordinates.
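
A minimal sketch of such a coordinate-wise optimizer is given below. The two layers and 20 hidden units follow the paper's setup; treating the raw gradient as the only input feature is a simplifying assumption (the paper additionally rescales and preprocesses gradients):

```python
import torch
import torch.nn as nn

class CoordinatewiseLSTMOptimizer(nn.Module):
    def __init__(self, hidden_size=20):
        super().__init__()
        self.hidden_size = hidden_size
        # every coordinate shares these LSTM weights but keeps its own state
        self.lstm1 = nn.LSTMCell(1, hidden_size)
        self.lstm2 = nn.LSTMCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)   # proposed update g_t per coordinate

    def forward(self, grad, state=None):
        # grad: (num_params, 1) -- one row per optimizee coordinate
        if state is None:
            z = grad.new_zeros(grad.shape[0], self.hidden_size)
            state = ((z, z.clone()), (z.clone(), z.clone()))
        (h1, c1), (h2, c2) = state
        h1, c1 = self.lstm1(grad, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        return self.out(h2), ((h1, c1), (h2, c2))
```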

3. Experimental Results

3.1. Experimental Setup

There are various types of optimization-based meta-learners, including the previously mentioned LSTM Meta-Learner [Ravi & Larochelle (2017)], Model Agnostic Meta-Learning [Finn et al., 2017], and Reptile [Nichol, Achiam & Schulman, 2018]. For the specific LSTM implementation in [Learning to Learn by Gradient Descent by Gradient Descent], the model was tested on a variety of training instances: quadratic functions, MNIST image classifiers, CIFAR-10 image classifiers, and Neural Art.

The optimizer for these training instances was parameterized as a generic two-layer LSTM with 20 hidden units per layer and trained individually for each application. The optimizer used Backpropagation Through Time (BPTT) to minimize the objective specified in Section 2.6 using ADAM, with a learning rate chosen to best fit each problem. This generic design for the optimizer leaves an opportunity for improvement in further work, with alternative optimizer architectures or even higher-level optimizer RNNs for the optimizer itself. After each training epoch the optimizer is evaluated on a validation dataset, and after early stopping the best optimizer is chosen. As a reminder, this optimizer simply updates the parameters of the optimizee as training proceeds.

The final trainings of the optimizee are compared with fixed optimization schemes commonly used in deep learning: SGD, RMSprop, ADAM, and Nesterov's accelerated gradient (NAG). These fixed optimization schemes are tuned by hand, adjusting parameters like the learning rate, to give the best possible final error.

3.2. Quadratic Functions

Attempting to minimize quadratic functions is a commonly used test in meta-learning because it is a simple yet applicable minimization problem. In many scientific domains it corresponds to minimizing an energy or cost function, with or without constraints. In low-dimensional spaces, traditional algorithms can easily find global minima, but higher-dimensional problems (like the synthetic 10-dimensional quadratic functions used in this experiment) make a good primitive test for machine learning and optimization. In this application, functions of the following form are minimized:

$f(\theta) = \|W\theta - y\|_2^2$

for 10x10 matrices $W$ and 10-dimensional vectors $y$ drawn from a Gaussian distribution. In this setup, the optimizers were trained following a standard procedure: optimizing random quadratic functions from a training set and testing on newly sampled functions from the same distribution.
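
A sketch of this synthetic setup might look like the following (the exact sampling details beyond "drawn from a Gaussian" are assumptions):

```python
import torch

def sample_quadratic(dim=10):
    # one random quadratic problem f(theta) = ||W theta - y||^2
    W = torch.randn(dim, dim)
    y = torch.randn(dim)
    theta = torch.randn(dim, requires_grad=True)
    f = lambda theta: ((W @ theta - y) ** 2).sum()
    return f, theta

# each training/testing instance for the learned optimizer is a fresh draw
f, theta = sample_quadratic()
```

This matches the `sample_task` interface assumed in the `meta_loss` sketch of Section 2.6, so the learned optimizer could be trained directly on freshly sampled quadratics.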

The results for this experiment, shown in the plot below, demonstrate the power of learned optimization. They compare the performance of the LSTM-optimized minimizer to commonly used ML optimizers on randomly sampled 10-dimensional quadratic functions.

performance for quadratic optimization

Figure 1: Performance of LSTM-optimizer model versus commonly used machine learning optimizers on a 10-dimensional quadratic minimization problem.

Impressively, the LSTM-optimized model consistently achieves a lower loss than all of the baseline algorithms across 100 minimization steps. As this minimization problem simulates the high-dimensional stochastic landscapes found inside machine learning problems, the performance of the LSTM-optimizer model suggests that “learning to learn” models are indeed learning to navigate such landscapes more efficiently. This minimization problem alone indicates that the LSTM-optimizer is ready to approach a variety of minimization-based problem settings, and its applications can go far beyond this basic experiment.

3.3. MNIST Image Recognition & Generalization

MNIST is the most basic image recognition dataset, consisting of handwritten white digits on a black background. It is a good baseline test for simple architectures, but is also relatively easy to perform well on. To stretch this optimizer to its limits, the authors tested it on small MLPs. Additionally, the optimizers were trained on base networks, and then the network architectures and training procedures were modified at test time to explore how well the optimizers generalize to models beyond those they were trained on.

Beginning with the performance on MNIST, the optimizer was trained to optimize a base MLP with one hidden layer of 20 units and a sigmoid activation function for 100 steps. Figure 2 shows the performance of this LSTM-optimizer versus the same commonly used algorithms as in the quadratic-function case. It also shows the generalizability of the optimizer out to 200 steps (100 steps past the original training horizon), testing whether this approach generalizes beyond the trained optimization time.

performance for MNIST optimization

Figure 2: Performance of the LSTM-optimizer model versus commonly used machine learning optimizers on MNIST image classification, over both the 100 steps for which it was trained (right) and the 100 steps following the allotted training period (left).
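
For concreteness, the base optimizee can be sketched as below; the single hidden layer of 20 sigmoid units follows the description above, while the flattened 28x28 input and 10-way output are assumptions about details not spelled out here:

```python
import torch.nn as nn

# MLP with one hidden layer of 20 sigmoid units, as described above
mnist_optimizee = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 20),
    nn.Sigmoid(),
    nn.Linear(20, 10),
)
```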

Additionally, the optimizer was tested on modified network architectures to evaluate whether its optimization strategy is generally applicable to a wide range of models. The following figures show both the strengths and weaknesses of the LSTM-optimizer, which could be valuable for further applications.

performance for MNIST generalizability optimization

Figure 3: Performance of LSTM-optimizer after switching to alternative optimizee architectures model versus commonly-used machine learning optimizers on MNIST image classification. Left: Increasing the number of hidden units from 20 to 40 inside the MLP architecture. Center: Increasing the number of hidden layers from 1 to 2 inside the MLP architecture. Right: Training the MLP using ReLU activation functions instead of sigmoid.

These tests of generalizability demonstrate important limitations of the LSTM-optimizer. The optimizer is able to generalize to optimizee architectures with more parameters, both in terms of hidden units and hidden layers (meaning it will likely also generalize to optimizees with fewer parameters if necessary). However, when the activation function is switched (Figure 3, right), the performance falls apart, demonstrating a limitation of the architecture. This is not unexpected, however.

Unlike switching the internal size of the optimizee architecture, which likely does not change how the optimizee navigates its high-dimensional loss landscape, changing the activation function of the optimizee can completely change how the optimizer interprets the space it is in. This is because activation functions are tailored to problems for their specific properties. For example, ReLU is known for being more computationally efficient, having a non-vanishing gradient, and tending to converge better in image classification problems [Krizhevsky et al. (2012)], while the sigmoid keeps activations bounded. These key differences between activation functions change how the optimizer interacts with its optimizee, in turn affecting performance.

These results contribute to the sense that the optimizer has not completely learned to generalize to alternative architectures. One solution could be to implement multiple side-by-side LSTMs, each with a specific purpose, such as handling the number of hidden units, the number of hidden layers, or the activation functions. Alternatively, higher-level optimizers for the optimizer could be implemented, layering the optimization decisions and allowing for further generalization to completely different architectures, something that a single LSTM cannot generalize to. These ideas are developed further in Section 4: Future Work.

3.4. CIFAR-10 and Multiple LSTMs

CIFAR-10 is a more difficult image classification problem that usually requires convolutional neural networks (CNNs) to solve. The dataset can be used with all 10 image labels (CIFAR-10), 5 held-out image labels (CIFAR-5), or 2 held-out image labels (CIFAR-2). The optimizee model used for these experiments includes three convolutional layers with max pooling followed by a fully connected layer with 32 hidden units, ReLU activations, and batch normalization. The optimizer model remains the same as in the previous experiments, a two-layer LSTM with 20 hidden units, but with an important twist. It was found in testing that a single LSTM was not enough to optimize both the convolutional and fully connected layers of the optimizee at the same time, suggesting that one LSTM-optimizer struggles to handle very different kinds of layers at once. To solve this, two LSTMs were introduced into the optimizer instead of one: one LSTM proposed parameter updates for the convolutional layers while the other proposed parameter updates for the fully connected layers.
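
A sketch of this setup is below. The channel counts are assumptions, but the overall structure - three convolution + max-pooling blocks with batch norm and ReLU, a 32-unit fully connected layer, and two separate parameter groups, each handled by its own coordinate-wise LSTM optimizer - follows the description above:

```python
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        def block(c_in, c_out):
            return [nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out),
                    nn.ReLU(), nn.MaxPool2d(2)]
        self.conv = nn.Sequential(*block(3, 16), *block(16, 16), *block(16, 16))
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(16 * 4 * 4, 32),
                                nn.ReLU(), nn.Linear(32, num_classes))

    def forward(self, x):                     # x: (batch, 3, 32, 32) CIFAR images
        return self.fc(self.conv(x))

model = SmallConvNet()
# one coordinate-wise LSTM optimizer per parameter group
conv_params = list(model.conv.parameters())
fc_params = list(model.fc.parameters())
```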

Plotted in Figure 4 is the performance of the CNN + fully connected optimizee with the double-LSTM optimizer versus the commonly used optimizers in machine learning. As a further proof of concept for generalizability, when training on CIFAR-5 and CIFAR-2, the double-LSTM model was trained both on only the held-out dataset (LSTM) and on the full CIFAR-10 dataset, evaluated only on the CIFAR-5 and CIFAR-2 held-out test sets (LSTM-sub). If LSTM-sub can train on the CIFAR-10 dataset but still perform well on the CIFAR-5 and CIFAR-2 datasets, then this is further evidence that the optimizer model can generalize from higher-dimensional problems to lower-dimensional problems.

performance for CIFAR generalizability optimization

Figure 4: Performance of double LSTM-optimizer on CIFAR image classification. Left: Standard CIFAR-10 image classification. Center: Standard CIFAR-5 image classification but with an additional model whose optimizer was trained for CIFAR-10(LSTM-sub). Right: Standard CIFAR-2 image classification but with an additional model whose optimizer was trained for CIFAR-10 (LSTM-sub).

Promisingly, the results show that multiple LSTM-optimizers working together can optimize different ML architectures that are tied together. This is encouraging for future work, as it shows that optimized models need not be limited to a single architecture; it allows the optimizer to generalize to an even larger space of models. Additionally, the optimizer trained on CIFAR-10 but evaluated on the CIFAR-5 and CIFAR-2 held-out test sets shows that even this double-LSTM optimizer can generalize to a lower-dimensional, disjoint problem space. Thus, within the bounds of these experiments, a trained LSTM optimizer was shown to generalize to both higher-dimensional (Section 3.3) and lower-dimensional (Section 3.4) problem spaces.

3.5. Comparisons with different ML-Optimizers

LSTM-optimizer experiments like these were furthered a year later in [Ravi & Larochelle (2017)] by comparing performance on small training samples of ImageNet. This mini-ImageNet set is proposed as a benchmark that captures the complexity of ImageNet images without requiring the resources needed to run on the full ImageNet dataset - again a test of whether LSTM-optimizers can work with the small datasets on which deep neural networks struggle. However, instead of simply comparing the LSTM-optimizer with traditional optimization techniques, it was compared against other meta-learning techniques such as Matching Networks [Vinyals et al. (2016)], nearest-neighbor baselines, and a fine-tuned meta-learner model. These algorithms were tested in the few-shot domain, where traditional neural networks generalize poorly.

performance for ImageNet few-shot domain

Figure 5: Average performance of LSTM-optimizer on few-shot ImageNet classification compared to other common meta-learning techniques. Best results are marked in bold within 95% confidence intervals.

The results once again suggest that the LSTM-optimizer should be the optimizer of choice in the meta-learning space. On both the 1-shot and 5-shot classes it was able to provide the best classification accuracies, suggesting that the LSTM-optimizer should be preferred over other optimizers when datasets are small.

4. Future Work

4.1. RNN experimentation

RNNs were used as the optimizer agents in these models because of their ability to maintain their current state and a knowledge of previous states, dynamically updating themselves as they iterate. In this sense the RNN has a notion of memory which allows it to exhibit temporal dynamic behavior, precisely what is needed for this problem. The specific type of RNN chosen for this problem was the long short-term memory (LSTM) network. However, there is a wide variety of RNNs beyond LSTMs, each with its own strengths and weaknesses.

LSTM vs GRU

Figure 6: Graphical illustration of (a) LSTM and (b) gated recurrent units. (a) i, f and o are the input, forget and output gates, respectively. c and $\tilde{c}$ denote the memory cell and the new memory cell content. (b) r and z are the reset and update gates, and h and $\tilde{h}$ are the activation and the candidate activation.

An interesting, newer alternative choice of RNN would be the gated recurrent unit (GRU) [Chung et al. (2014)]. LSTMs add new memory content to the cell (input gates), partially forget existing memory (forget gates), and modulate the amount of memory content that is exposed (output gates). This allows LSTMs to decide whether to keep existing memory, so if an important feature is detected early in training, they can easily carry that information forward. GRUs, on the other hand, adaptively capture dependencies without using separate memory cells. The GRU has an update gate that decides how much the unit updates its content, but it has no mechanism to control how much of its state it exposes; instead it exposes the whole state each time. In this sense, the LSTM controls the exposure of its memory content while the GRU exposes its full content without any control. Less importantly, the LSTM controls the new memory content being added to memory independently of the forget gate, while the GRU does not independently control the amount of activation being added through the update gate.

It is hard to judge theoretically which of these RNNs will perform better at test time; while they usually perform similarly, GRUs stand out in some cases, such as basic sequence prediction and sequential variational autoencoders. Even when the models perform similarly, GRUs are more computationally efficient due to their less complex structure. In the original paper, [Learning to Learn by Gradient Descent by Gradient Descent], the authors modified their LSTM to decrease its complexity. This computational cost will surely be compounded when adding multiple LSTMs or even multiple layers of LSTMs (as will be discussed later). A solution could be to replace the LSTM with a similarly performing architecture that has better computational efficiency.
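
If the coordinate-wise LSTM optimizer sketched in Section 2.6 were rebuilt around GRUs, only the recurrent cells would change; the sketch below keeps the same interface (the hidden size and layer count are carried over as assumptions):

```python
import torch.nn as nn

class CoordinatewiseGRUOptimizer(nn.Module):
    def __init__(self, hidden_size=20):
        super().__init__()
        self.hidden_size = hidden_size
        # a GRU keeps one hidden state per coordinate instead of a (hidden, cell) pair
        self.gru1 = nn.GRUCell(1, hidden_size)
        self.gru2 = nn.GRUCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, grad, state=None):
        if state is None:
            z = grad.new_zeros(grad.shape[0], self.hidden_size)
            state = (z, z.clone())
        h1, h2 = state
        h1 = self.gru1(grad, h1)
        h2 = self.gru2(h1, h2)
        return self.out(h2), (h1, h2)
```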

Another recently published RNN architecture that could be applied to this problem is the Legendre Memory Unit (LMU) [Voelker et al. (2019)], which maintains information over long periods of time. The LMU is mathematically designed to orthogonalize its continuous-time history by solving ordinary differential equations (ODEs) using Legendre polynomials. Results from Voelker et al. show that the LMU outperforms equivalently sized LSTMs and improves memory capacity by two orders of magnitude, leading to a reduction in training and inference time. This reduction in computational complexity - thanks to its efficiency in handling temporal dependencies spanning 100,000 time steps, its few internal state variables, and its ability to learn task-relevant time scales - lets us imagine a much more efficient training procedure for the optimizer, one that converges more quickly on good minima.

SIDENOTE: Unlike the other RNN architectures, LMUs can be implemented on spiking neural networks (SNNs) [Vreeken (2004)]. SNNs very closely mimic the behavior of biological neurons and can be implemented on neuromorphic chips - hardware that also closely mimics the behavior of neurons by sending signals as action potentials, spikes, or pulses, instead of the standard binary signals sent by modern computer hardware. In this sense, SNNs attempt to mimic the behavior of the human brain, something that artificial neural networks (ANNs) promised to do but never delivered on. If the final goal of optimization is to develop a model that has the mathematical skills of a computer but the intelligence of a human, this could be a step towards that goal.

4.2. Layered RNN Learners

As was alluded to several times in this blog post, there could be a benefit to layering even higher-level RNNs in the learning to learn algorithm. If the paper is titled learning to learn by gradient descent by gradient descent, then this approach would be characterized as learning to learn to learn by gradient descent by gradient descent by gradient descent. As we saw, there are various limitations to the learning to learn LSTM-optimizer. First, a fixed ADAM optimizer is still used to train the LSTM-optimizer itself, even though the paper is actively attempting to move away from fixed optimizers like ADAM. However, we could treat the LSTM-optimizer itself as another optimizee for a higher-level LSTM (or RNN) optimizer.

Higher-level LSTM

Figure 7: Proposed modification to the LSTM-optimizer: an additional higher-level LSTM (right) proposes parameter updates to the LSTM-optimizer, which now plays the role of an optimizee.

As we have already seen, the LSTM-optimizer provides a boost in performance over traditional optimizers, so there is reason to believe that adding a higher-level LSTM could provide better performance than ADAM, thus yielding better performance for the final optimizee model, or at least allowing it to converge in fewer steps, as shown in Section 3.2. Additionally, since the LSTM-optimizer can generalize well to higher-dimensional (Section 3.3) or lower-dimensional (Section 3.4) problem spaces, it is not limited by the parameters chosen for the LSTM-optimizee during training. This higher-level optimizer could also address the problem encountered in Section 3.3, where the optimizer was not able to generalize to a different kind of activation function.

In the same sense that the LSTM-optimizer was able to abstract the problem space for the optimizee and provide it with options for training variability, so could the higher-level LSTM present the LSTM-optimizer with a larger problem space to work with, allowing it to learn alternative activation functions beyond those it was trained with. Importantly, as shown in Section 3.5, these LSTM optimizers can train with small datasets, outperforming other meta-learning techniques. Adding another LSTM layer shouldn't increase the problem complexity an overwhelming amount, but this remains to be tested.

5. Conclusion & Final Thoughts

Optimization-based meta-learning provides a new frontier in the problem of learning to learn. By placing dynamically-updating, memory-wielding RNN models as optimizers, one can move away from hand-designed features to learned features in optimizee models. Against state-of-the-art optimization methods used across machine learning, LSTM-based optimizers outperformed in a variety of different problem settings and proved their versatility.

With quadratic functions, LSTM-optimizers show that they can navigate the minimization landscape. With MNIST, they show that they can generalize to more complex problem spaces without retraining. With CIFAR-10, they show that they can work in groups across different optimizee architectures. Finally, with ImageNet, they show that they can outperform other optimization-based meta-learning approaches.

In a sense, this model replaces the ML practitioner's optimization pathway (their brain) with a much more mathematically gifted, but all-around less intelligent, neural network. The question becomes: how far can this analogy be pushed? Can multiple different RNN layers more effectively simulate the intelligence of the human brain? Can different RNN optimizers, like LMUs which mimic biological neurons and retain memory, carry us further than traditional artificial neural networks? What is the limit to the number of layers of RNN-optimizers that can be stacked, and how many models will such a complex structure be able to generalize to? These are all interesting questions to explore.