Online Learning Techniques
An overview of online learning techniques, focusing on those that are most effective for the practitioner.
Online learning, a popular research area within the deep learning community, has wide applications in industrial settings. Scenarios in which data becomes available to a learner sequentially are very common: dynamic e-commerce recommendations, on-device learning, and federated learning are all examples where the full dataset might not be available at the same time. We will tackle the topic of online learning from the viewpoint of a practitioner, answering questions such as:
- What are problems faced by online learning models?
- What are go-to solutions for training a model online?
- What level of performance should I expect from a model that is trained online?
Our objective is to make practitioners aware of the options that exist within the online learning space, providing a viable solution to the scenario in which a model must learn from new data that is constantly becoming available. Not only does such a training setup eliminate latency-ridden, offline re-training procedures (i.e., the model is updated in real time), but it more closely reflects how intelligent systems learn in the world around us — when a human learns a skill, they do not require several hours of GPU training to leverage their new knowledge!
What is online learning?
Let’s define online learning as a training scenario in which the full dataset is never available to the model at the same time. Rather, the model is exposed to portions of the dataset sequentially and is expected to learn the full training task through such partial exposures. Typically, after being exposed to a certain portion of the dataset, the model is not allowed to re-visit this data later. Otherwise, the model could simply loop over the dataset and perform a normal training procedure.
Does the setup matter?
Given that so many different experimental scenarios exist for the study of online learning techniques, the choice of experimental setup is important. For example, models trained for incremental learning scenarios (where data arrives in batches) rarely perform well in the streaming setting (where no such batch boundaries exist), so it’s important to be specific about the exact learning scenario. Fortunately, many of the training techniques used for all types of online learning are very similar and only require slight modifications to make them more impactful in a given setting.
Why is online learning difficult?
A Naive Approach
One of the simplest approaches to online learning is to maintain a single model that is fine-tuned on new data as it arrives. In this case, data becomes sequentially available to the model, and previous data cannot be re-visited. Therefore, the model is updated/fine-tuned in real time as new data arrives and slowly learns the task of interest over time. If the incoming stream of data were independent and identically distributed (i.i.d.), this fine-tuning approach would work quite well: the stream of data would be sampled evenly from the distribution of training data (e.g., an even sampling of all the different classes within a classification task), and the model would get exposed to all data in a balanced manner. Over time, if data continues to become available, the model will begin to perform quite well.
Unfortunately, in many practical applications, the incoming data stream is non-i.i.d., meaning that it does not sample from the underlying dataset in an unbiased manner. For example, all data examples exposed to the model could be from a single class within a classification task. More practically, consider a deep learning model being used on a fashion-based e-commerce site. On a normal day, the model learns from customer activity on the site, sampling from the same group of products. However, one day a new line of products may be added to the e-commerce site, catalyzing a lot of customer activity around a single product group. In this case, the model is exposed to a massive amount of data related to a single topic/product, leading to an imbalanced exposure between the new and existing products on the site. Obviously, this will complicate the online learning process, but how?
In the online learning community, the main issue that has been identified for training models in an online fashion over non-i.i.d. data streams is catastrophic forgetting, in which the model forgets how to classify previous data as it is exposed to new data. For example, consider a dataset with 10 classes, and assume that the online learning model has already been trained on classes one and two. Then, assume that the model receives new data that is only sampled from classes three and four. If the model is fine-tuned on this new data without access to any of the previously learned data, it will begin to perform well on classes three and four but will most likely deteriorate in performance on classes one and two. In other words, it will suffer from catastrophic forgetting!
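To make the effect concrete, here is a deliberately tiny sketch: a two-parameter linear regressor rather than a deep network. The two "tasks" (OLD and NEW), the learning rate, and every function name are illustrative assumptions. Both tasks can be fit at once by w = [0, 1], so any forgetting comes from the order of the stream, not from a conflict in the data.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_step(w, x, y, lr=0.1):
    """One online squared-error update for the linear model f(x) = w . x."""
    err = predict(w, x) - y
    return [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]

def sq_loss(w, x, y):
    return (predict(w, x) - y) ** 2

# Hypothetical tasks, each a single repeated (input, target) pair.
OLD = ([1.0, 1.0], 1.0)
NEW = ([1.0, 0.0], 0.0)

def train(stream):
    """Naive online fine-tuning: one pass, no revisiting of earlier data."""
    w = [0.0, 0.0]
    for x, y in stream:
        w = sgd_step(w, x, y)
    return w

n = 200
w_sequential = train([OLD] * n + [NEW] * n)  # non-i.i.d.: all old data, then all new
w_interleaved = train([OLD, NEW] * n)        # balanced: old and new data mixed

print("sequential  loss on OLD:", sq_loss(w_sequential, *OLD))   # about 0.25
print("interleaved loss on OLD:", sq_loss(w_interleaved, *OLD))  # close to 0
```

Both runs see exactly the same examples the same number of times. Only the sequential stream ends with a model that no longer fits the old task, which is the pattern catastrophic forgetting describes.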
The goal of online learning is to figure out how to eliminate catastrophic forgetting.
What Approaches Exist?
Many approaches have been proposed for reducing catastrophic forgetting in the online learning domain. Let’s broadly partition these methods into the following categories: architectural modification, regularization, distillation, replay, rebalancing, and other.
Overview. The idea behind architectural modification is simple: as you receive new data, add more parameters to your model to increase its capacity. Such parameters can be added in a structured (e.g., adding entirely new neurons or filters to the architecture) or unstructured (e.g., adding new connections between existing neurons) manner. Moreover, updating the augmented model (i.e., the model with added parameters) after new data is received can be done in two ways: (1) simply update the model without restrictions or (2) use masking/selective plasticity strategies to ensure only non-important neurons (i.e., those that don’t impact performance on previous data) are updated. In both cases, the goal is to allow the model to perform well on both new and old data by ensuring it is never kept from expanding its knowledge of the underlying learning problem due to restricted capacity. By always adding more parameters, we ensure the model can continue to learn from new data.
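As a rough illustration of structured growth, the sketch below appends hidden neurons to a one-hidden-layer network written in plain Python. The dictionary layout, the function names, and the choice to start new outgoing weights at zero (so that existing predictions are unchanged at the moment of growth) are all assumptions made for this example, not a description of any specific published method.

```python
import random

def make_net(n_in, n_hidden, n_out, rng):
    return {
        "w1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)],
        "b2": [0.0] * n_out,
    }

def forward(net, x):
    """One hidden ReLU layer followed by a linear output layer."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(net["w2"], net["b2"])]

def add_hidden_units(net, k, rng):
    """Structured growth: append k hidden neurons in place.

    Incoming weights are random; outgoing weights start at zero
    (an assumption of this sketch) so current outputs do not change.
    """
    n_in = len(net["w1"][0])
    for _ in range(k):
        net["w1"].append([rng.uniform(-1, 1) for _ in range(n_in)])
        net["b1"].append(0.0)
        for row in net["w2"]:
            row.append(0.0)
    return net

def trainable_mask(n_old, n_new):
    """Selective plasticity: 0 freezes an old hidden unit, 1 lets a new one train."""
    return [0] * n_old + [1] * n_new

rng = random.Random(0)
net = make_net(n_in=3, n_hidden=4, n_out=2, rng=rng)
x = [0.5, -1.0, 2.0]
before = forward(net, x)
add_hidden_units(net, 2, rng)
after = forward(net, x)
mask = trainable_mask(n_old=4, n_new=2)
```

An update rule would multiply each hidden unit's gradient by its mask entry to realize option (2) above; leaving the mask out corresponds to option (1).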
Discussion. Although architectural modification techniques have seen success in small-scale online learning problems, they have two major properties that limit their potential. First, because the architecture is constantly being expanded, the memory requirements of these techniques are generally large or unbounded, which may become intractable in the case of large-scale models or datasets. Ideally, the memory usage of an online learning algorithm would not depend on the amount of data being received. Second, the majority of such methodologies depend on the existence of “task boundaries” (i.e., pre-determined break points in the incoming data stream, such as the batches present in incremental learning). Such task boundaries provide obvious points in the online training process at which parameters/modules can be added to the network. Some architectural modification methodologies are completely dependent on the existence of such task boundaries, but these boundaries are not always present (e.g., during streaming learning). As such, reliance upon task boundaries limits the applicability of such techniques in some scenarios.
Overview. Regularization techniques for online learning typically try to (i) identify parameters that are “important” and (ii) add regularization terms during training that prevent such parameters from being changed too much. Typically, important parameters are defined as those that will deteriorate network performance when updated/perturbed. Numerous different heuristics for importance have been proposed, but they all share the common goal of characterizing whether modifying a given parameter will harm the network’s performance with respect to old data that is no longer present. By ensuring that important parameters are not modified during online training, the performance of the network on old data is preserved, as parameter updates for new data are confined to regions that are not relevant to the network’s behavior on old data.
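The two steps can be sketched with a quadratic penalty that anchors each parameter to its earlier value in proportion to an importance score. The mean-squared-gradient score used here is only one possible heuristic (a diagonal, Fisher-style estimate), and the function names, `strength`, and the toy numbers are assumptions for illustration.

```python
def estimate_importance(per_example_grads):
    """Step (i): score each parameter by its mean squared gradient on old data."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]

def importance_penalty(params, anchor, importance, strength=1.0):
    """Step (ii): 0.5 * strength * sum_i importance_i * (param_i - anchor_i)^2."""
    return 0.5 * strength * sum(
        imp * (p - a) ** 2 for p, a, imp in zip(params, anchor, importance))

def penalty_grad(params, anchor, importance, strength=1.0):
    return [strength * imp * (p - a) for p, a, imp in zip(params, anchor, importance)]

def regularized_step(params, task_grad, anchor, importance, lr=0.05, strength=1.0):
    """Gradient step on (new-data loss + importance penalty)."""
    reg = penalty_grad(params, anchor, importance, strength)
    return [p - lr * (g + r) for p, g, r in zip(params, task_grad, reg)]

# Toy setting: parameter 0 mattered for old data, parameter 1 did not.
anchor = [0.0, 0.0]
importance = estimate_importance([[1.0, 0.0], [3.0, 0.0]])  # [5.0, 0.0]

params = list(anchor)
for _ in range(200):
    # Hypothetical new-data gradient that pushes both parameters upward equally.
    params = regularized_step(params, [-1.0, -1.0], anchor, importance)

print(params)  # parameter 0 stays near its anchor, parameter 1 drifts freely
```

In this toy run the important parameter settles close to its anchor while the unimportant one keeps moving, which is the behavior the penalty is meant to produce.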
Discussion. Similar to architectural modification approaches, regularization-based online learning methodologies show promise at smaller scales. However, when used in large-scale experiments, such approaches tend not to be highly effective, and computing parameter importances becomes extremely expensive for large deep learning models. For this reason, regularization-based approaches are typically not considered useful for large-scale online learning applications.
Overview. Distillation methods for online learning are inspired by the concept of knowledge distillation within deep learning. Originally, knowledge distillation was proposed to “distill” the knowledge of a large “teacher” network into the parameters of a smaller “student” network, by training the student to match the output of the teacher over a dataset. Somewhat differently, online learning methodologies adopt distillation so that the knowledge of previous models (i.e., those trained on older data) can be distilled into the current network being learned to ensure historical knowledge is not lost. The methodology is quite similar to that of normal knowledge distillation. The main difference is that the teacher and student networks are typically of the same size/architecture but taken from different points in the online training phase.
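A minimal sketch of such a loss follows, assuming the teacher is a frozen snapshot of the model taken before the current update. The temperature, the `alpha` weighting, and the choice to distill only over the outputs the teacher already has are assumptions of this example; published methods differ in these details.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened outputs.

    Only the outputs the teacher knows about are compared, so the student
    may have extra outputs for newly added classes.
    """
    student_old = student_logits[:len(teacher_logits)]
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_old, temperature)
    return sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s) if t > 0.0)

def combined_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of cross-entropy on the new label and the distillation term."""
    ce = -math.log(softmax(student_logits)[label])
    kd = distillation_loss(student_logits, teacher_logits, temperature)
    return (1.0 - alpha) * ce + alpha * kd

teacher = [1.0, 2.0, 3.0]            # snapshot outputs for three old classes
student_same = [1.0, 2.0, 3.0, 0.5]  # agrees on old classes, has one new class
student_drift = [3.0, 2.0, 1.0, 0.5] # has drifted on the old classes

print(distillation_loss(student_same, teacher))   # zero: old behavior preserved
print(distillation_loss(student_drift, teacher))  # positive: drift is penalized
```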
Discussion. Distillation is a commonly used approach within the online learning community that has worked well even at larger scales. However, follow-up work showed that distillation is less effective when previous data is cached for use during fine-tuning. In fact, some work even argued that adding distillation to the loss is unnecessary, and possibly even harmful, when an explicit memory of previous data is maintained for use in online updates. Distillation methodologies therefore remain popular but are questionably effective when memory of previous data examples is allowed. As a result, methods that store previous data examples for use during online updates, collectively referred to as replay (or rehearsal) techniques, have become a go-to approach.
Overview. The term “replay” broadly refers to online learning methodologies that store exemplars from previous portions of the dataset. Then, when new data arrives, these stored exemplars can be incorporated into the online learning process to prevent catastrophic forgetting. For example, these previous data examples could be added into a distillation loss to keep network outputs from deviating too much from their previous values. More commonly, previous data exemplars are simply sampled (i.e., as in a mini-batch) and combined with new data during the online learning process. In the batch incremental setting, previous examples would be mixed with the batch of new data during fine-tuning to ensure old knowledge is not lost. Similarly, streaming approaches would incorporate randomly sampled exemplars from previous classes into online updates, thus ensuring knowledge is maintained.
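The streaming variant of this idea might look like the sketch below: a fixed-size buffer of past examples, plus a helper that pairs each new example with a few replayed ones. Filling the buffer by reservoir sampling is an assumption of this example (it keeps a uniform sample of everything seen so far in bounded memory); the text above does not prescribe a particular selection rule, and the class and function names are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-size store of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)

def replay_batches(stream, buffer, replay_size):
    """Yield each new example mixed with replayed old ones, then store it."""
    for example in stream:
        yield [example] + buffer.sample(replay_size)
        buffer.add(example)

# Hypothetical class-ordered stream: 25 examples each of classes 0, 1, 2, 3.
stream = [(i, i // 25) for i in range(100)]
buffer = ReplayBuffer(capacity=10)
batches = list(replay_batches(stream, buffer, replay_size=4))
print(len(batches[0]), len(batches[-1]))  # first batch has no replay yet
```

Each yielded batch would be passed to the usual gradient update in place of the single new example, so old classes keep appearing even when the stream has moved on.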
Discussion. Replay mechanisms are now a core component of most online learning methodologies due to their scale-agnostic success in various applications. Although storing previous data examples can be memory intensive, performing replay provides drastic benefits to online learning performance. In fact, replay has been shown to eliminate catastrophic forgetting if sufficient data exemplars are maintained in the buffer. Due to its simplicity and practical effectiveness, replay has become extremely popular in the online learning community.
Overview. Several recent works in batch-incremental learning have noticed that models learned in an online fashion tend to be biased towards the most recently observed data (i.e., those in the most recent batch). As such, several techniques, referred to here as rebalancing techniques, were proposed to eliminate such imbalance. The core idea behind such approaches is to ensure predictions are not biased towards newer data (e.g., in the classification setting, a biased model would predict nearly all data as one of the classes within the most recently observed batch of training data). Rather, the magnitude of predictions should be balanced between all classes or types of data, regardless of when such data was encountered during the training process.
Discussion. Prediction bias is a known, measurable issue within incremental learning (i.e., both task and batch/class incremental). Furthermore, adding rebalancing has been shown to drastically improve incremental learning performance, even on large-scale datasets such as ImageNet. As such, rebalancing methods are worth employing in this domain. In general, it is probably best to utilize methods that do not require a validation set to perform rebalancing, since this avoids the trouble of creating one. Beyond the incremental learning setting (e.g., in streaming learning), it is not clear whether classification bias follows the same patterns as in incremental learning. However, adding rebalancing is unlikely to damage the performance of the online learning model.
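One validation-free heuristic of this kind rescales the classifier weights of the newest classes so that their average norm matches that of the older classes. The sketch below assumes a bias-free final linear layer stored as one weight row per class; the function names and toy weights are hypothetical, and this is only one of several possible rebalancing schemes.

```python
import math

def l2_norm(vector):
    return math.sqrt(sum(v * v for v in vector))

def align_new_class_weights(weights, new_classes):
    """Rescale rows of new classes so their mean norm equals the old classes' mean norm.

    weights: one row of final-layer weights per class.
    new_classes: indices of the classes from the most recent batch.
    Returns (rescaled weights, scaling factor gamma).
    """
    new = set(new_classes)
    old_norms = [l2_norm(w) for c, w in enumerate(weights) if c not in new]
    new_norms = [l2_norm(w) for c, w in enumerate(weights) if c in new]
    gamma = (sum(old_norms) / len(old_norms)) / (sum(new_norms) / len(new_norms))
    aligned = [[gamma * v for v in w] if c in new else list(w)
               for c, w in enumerate(weights)]
    return aligned, gamma

# Toy classifier: classes 2 and 3 were learned last and have inflated weights.
weights = [[3.0, 4.0], [0.0, 5.0], [6.0, 8.0], [0.0, 10.0]]
aligned, gamma = align_new_class_weights(weights, new_classes=[2, 3])
print(gamma)    # 0.5
print(aligned)  # new-class rows are shrunk to the same scale as old-class rows
```

Because the correction only uses the weights themselves, it needs no held-out data, which fits the preference stated above for validation-free methods.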
Somewhat similarly to replay, several works have proposed the use of generative models to “hallucinate” examples from previous classes. Another somewhat popular area of study is dual memory techniques for online learning. Such methodologies are inspired by the brain and try to mimic the biological process of memory consolidation. Some other less studied but notable approaches to online learning include sparse-coding methodologies, ensemble-based methods, and methodologies that modify the activation function within the neural network to avoid catastrophic forgetting. Although these methodologies are less popular and have fallen out of favor in comparison to common approaches like replay and distillation, it is still useful to have such techniques in mind to gain a holistic understanding of the area and (hopefully) provide ideas for future innovation.
So...What should I use?
The utility of existing methods can be summarized simply as follows:
- Architectural modification and regularization are less used because they suffer certain drawbacks and tend to not perform as well at scale.
- Distillation is very popular but is questionably effective when replay is allowed.
- Replay is widely considered to be the go-to approach for mitigating catastrophic forgetting and has been shown to work extremely well in large-scale online learning experiments.
- Rebalancing is important in the incremental learning setting, as it eliminates biases that form towards recently observed data.
Therefore, the best “bang for your buck” in the online learning domain would be using replay-based online learning methods. Replay-based approaches have been shown to perform well in large-scale deep learning scenarios while remaining memory efficient, and other work demonstrates that simply maintaining a buffer of previous data for use during online updates is an extremely powerful tool. With this in mind, performing replay seems to be a good choice in almost all online learning scenarios.
In addition to replay, using distillation may improve performance in some cases, though some work has found that distillation is not useful when combined with replay. Still, a combination of distillation and replay seems to perform very well even at larger scales, showing that distillation can positively impact online learning performance in certain scenarios. Furthermore, if one is training a model using incremental learning, it is important to utilize rebalancing, as bias within the classification layer can significantly deteriorate performance.
This paper is an abstract of a post by Alegion Research Scientist Cameron Wolfe.