Anthony Alford
Article originally posted on InfoQ.
Researchers from Meta AI and Stanford University have developed a metric for pruning AI training datasets that improves the scaling of model performance with dataset size from a power law to an exponential decay. The metric uses self-supervised learning and performs comparably to existing metrics that require more compute.
The metric and experiments were described in a paper published on arXiv. Although the performance of large AI models has been shown to scale with dataset size, this scaling follows a power-law relationship, so that improving performance by a few percentage points may require an order of magnitude more data samples. Using statistical mechanics, the researchers showed that by properly pruning the dataset, performance can instead scale with an exponential-decay relationship. Because existing pruning techniques either perform poorly or are compute-intensive, the team developed a new technique that uses a self-supervised AI model to estimate a pruning metric with less computation. According to senior researcher Surya Ganguli:
Overall this work suggests that our current ML practice of collecting large amounts of random data is highly inefficient, leading to huge redundancy in the data, which we show mathematically is the origin of very slow, unsustainable power law scaling of error with dataset size.
In 2020, OpenAI published research on trends in the performance of neural language models trained with different architectures, numbers of parameters, amounts of compute, and dataset sizes. The key finding was that model performance, as measured by the loss value, exhibits a power-law relationship with each of three scale factors: the number of model parameters, the size of the training dataset, and the amount of compute used in training. This power-law relationship means that, holding compute and parameters constant, improvements in performance require large amounts of additional data; for example, according to the Meta team, “an additional 2 billion pre-training data points (starting from 1 billion) leads to an accuracy gain on ImageNet of a few percentage points.”
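In the OpenAI formulation, the dataset-size term of the scaling law takes the power-law form sketched below; the exponent is the value reported by Kaplan et al., and the loss-halving calculation is an illustrative consequence of that form rather than a figure from either paper.

```latex
% Loss as a function of dataset size D, with model size and compute not limiting:
\[
  L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},
  \qquad \alpha_D \approx 0.095 .
\]
% Under this law, halving the loss requires growing the dataset by a factor of
\[
  \frac{D'}{D} = 2^{1/\alpha_D} \approx 2^{10.5} \approx 1.5 \times 10^{3},
\]
% i.e. roughly three orders of magnitude more data.
```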
An exponential-decay relationship, by contrast, would require far less additional data to achieve a similar performance improvement. The Meta team began by developing a theoretical model of performance improvement from data pruning, using statistical mechanics. The key operation is to determine the margin of each training example, which indicates whether the example is “easy” (large margin) or “hard” (small margin). The researchers found that the best pruning strategy depends on the initial dataset size: for small datasets, it was best to keep the easy examples, whereas for larger datasets, keeping the harder examples was better. The team also found that as the initial dataset size increases, a larger fraction of examples must be pruned to achieve the exponential-decay scaling.
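The following is a minimal sketch of such a margin-based pruning rule, assuming a linear classifier has already been trained on the data; the function name and the 50% keep-fraction are illustrative choices, not details from the paper.

```python
import numpy as np

def margin_prune(X, w, keep_frac=0.5, keep_easy=True):
    """Rank examples by distance to a linear decision boundary and keep a fraction.

    X: (n_samples, n_features) training examples
    w: (n_features,) weight vector of a trained linear classifier
    keep_easy: True keeps large-margin ("easy") examples, the regime the theory
               recommends for small datasets; False keeps small-margin ("hard")
               examples, recommended for larger datasets.
    """
    margins = np.abs(X @ w) / np.linalg.norm(w)   # distance to the boundary
    order = np.argsort(margins)                   # ascending: hard -> easy
    n_keep = int(keep_frac * len(X))
    kept = order[-n_keep:] if keep_easy else order[:n_keep]
    return X[kept], kept

# Example: prune a small synthetic dataset, keeping the easiest half.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
w = rng.normal(size=20)
X_pruned, idx = margin_prune(X, w, keep_frac=0.5, keep_easy=True)
```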
However, the best existing metrics for dataset pruning require substantial compute as well as labeled data, making them impractical for pruning the unlabeled datasets used to train large foundation models. To solve this, the Meta researchers developed a “simple, scalable, self-supervised pruning metric.” To compute the metric, the team applied k-means clustering to an embedding space produced by a pre-trained model; the pruning metric for each dataset example is its distance from the nearest cluster centroid. Using this metric, the researchers could train a model on only 75% of the ImageNet dataset “without accuracy loss.”
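A sketch of that self-supervised metric, assuming embeddings from a pre-trained model are already available; the cluster count, keep-fraction, and stand-in embedding array below are placeholder values, not settings from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def self_supervised_prune(embeddings, n_clusters=100, keep_frac=0.75,
                          keep_hard=True):
    """Cluster embeddings with k-means and score each example by its distance
    to the nearest cluster centroid.

    Examples close to a centroid are prototypical ("easy"); examples far from
    every centroid are "hard". Per the theory above, keeping harder examples
    works better for large datasets such as ImageNet.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(embeddings)
    # transform() returns each example's distance to every centroid;
    # the pruning metric is the distance to the closest one.
    dist_to_nearest = km.transform(embeddings).min(axis=1)
    order = np.argsort(dist_to_nearest)           # ascending: easy -> hard
    n_keep = int(keep_frac * len(embeddings))
    kept = order[-n_keep:] if keep_hard else order[:n_keep]
    return kept

# Example with random stand-in "embeddings"; in practice these would come from
# a self-supervised pre-trained model applied to the training images.
emb = np.random.default_rng(0).normal(size=(5000, 128))
kept_indices = self_supervised_prune(emb, n_clusters=50, keep_frac=0.75)
```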
Ari Morcos, a member of the Meta research team, joined a Reddit discussion about the work, answering several users’ questions. In response to a question about the need to train a separate model to compute the pruning metric, Morcos replied:
One of the ideas we propose at the end of the paper is the idea of “foundation datasets”, which would be assembled over many trainings, and then could be used to broadly pre-train. In this case, the cost of finding the foundation data would be amortized over all the future pre-trainings.
The investigation of scaling laws for AI training is an active research topic. In 2020, InfoQ covered OpenAI’s original paper which outlined scaling laws for language models. More recently, researchers from Google and DeepMind investigated the scaling properties of different model architectures, finding that the “scaling coefficient differs greatly” depending on the architecture, and that some model architectures do not scale at all.