MMS • Anthony Alford
Google AI researchers published two new metrics for measuring the quality of audio and video generated by deep-learning networks, the Fréchet Audio Distance (FAD) and Fréchet Video Distance (FVD). The metrics have been shown to have a high correlation with human evaluations of quality.
In a recent blog post, software engineers Kevin Kilgour and Thomas Unterthiner described the work done by their teams, which builds on previous research in measuring the quality of images generated by neural networks. The teams showed that their new metrics can detect noise added to sound or video, respectively, and that the metrics track human evaluations of sound or video quality. FAD was evaluated by ranking a series of pairs of distorted audio samples, and its choices had a correlation of 0.39 with those of human judges. FVD was evaluated similarly by ranking pairs of videos generated by deep-learning models; it agreed with human rankers between 60% and 80% of the time, depending on the generation criteria used.
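The article does not give the exact evaluation protocol, but the idea of "agreeing with human rankers" on pairs can be sketched as follows. This is a minimal illustration, assuming that a lower metric value means better quality (as with a distance) and that each human judgment is recorded as whether the first item of the pair was preferred. The function name and all numbers are hypothetical toy values, not results from the research.

```python
from typing import Sequence, Tuple


def pairwise_agreement(
    metric_pairs: Sequence[Tuple[float, float]],
    human_prefers_first: Sequence[bool],
) -> float:
    """Fraction of pairs where the metric and the human pick the same item.

    Assumes a distance-like metric: the item with the LOWER score is the one
    the metric prefers.
    """
    if not metric_pairs or len(metric_pairs) != len(human_prefers_first):
        raise ValueError("need equally sized, non-empty inputs")
    matches = sum(
        (first < second) == human_choice
        for (first, second), human_choice in zip(metric_pairs, human_prefers_first)
    )
    return matches / len(metric_pairs)


# Hypothetical scores for five (sample A, sample B) pairs.
toy_scores = [(1.2, 3.4), (5.0, 2.1), (0.8, 0.9), (4.4, 1.0), (2.0, 2.5)]
# Hypothetical human choices: True means the judge preferred sample A.
toy_human = [True, False, False, False, True]

agreement = pairwise_agreement(toy_scores, toy_human)
print(f"metric agrees with the human choice on {agreement:.0%} of pairs")
```

Note that this agreement rate is a different statistic from the correlation reported for FAD; it only illustrates the pair-ranking style of comparison.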
The success of deep-learning models has partly been driven by the availability of large, high-quality datasets such as ImageNet. These datasets also provide a “ground truth” against which models can be evaluated. The recent popularity of using deep learning to generate new images posed a new problem: how can the quality of the output be measured? Common metrics such as signal-to-noise ratio or mean squared error cannot be applied, since there is no “ground truth” answer for images or other data generated by these networks.
Since the goal is to create output that looks or sounds realistic to humans, the data can be scored by human judges, but that is neither scalable nor necessarily objective. The initial metric proposed by the inventors of generative adversarial networks (GANs) was the Inception score (IS). This metric is calculated by applying a pre-trained Inception image classifier to the generated images and computing statistics on the classifier's output. The metric is “closely related to the objective used for training generative models” and was shown to correlate strongly with human judgment of quality.
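The article does not spell out which statistics are computed. A minimal sketch, assuming the commonly cited definition of the score as the exponential of the average KL divergence between each image's predicted class distribution and the marginal class distribution over all images, might look like this. The function name and the toy probability arrays are hypothetical; in practice each row would be the output of a pre-trained Inception classifier for one generated image.

```python
import numpy as np


def inception_score(class_probs: np.ndarray, eps: float = 1e-12) -> float:
    """Score a batch of generated images from their class probabilities.

    class_probs has shape (n_images, n_classes); each row is p(y | x).
    eps avoids log(0) and is an implementation detail of this sketch.
    """
    probs = np.asarray(class_probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y) over the whole batch
    # KL(p(y|x) || p(y)) for every image, then averaged and exponentiated.
    kl_per_image = np.sum(
        probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1
    )
    return float(np.exp(kl_per_image.mean()))


# Toy case 1: every image is classified confidently, and the batch covers
# all four classes evenly. This is the best case for a 4-class problem.
confident_and_diverse = np.eye(4)

# Toy case 2: the classifier is maximally unsure about every image.
uninformative = np.full((4, 4), 0.25)

print(inception_score(confident_and_diverse))
print(inception_score(uninformative))
```

Under this definition the score ranges from 1 (predictions carry no information) up to the number of classes (confident predictions spread evenly across classes), so higher is better.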
However, the Inception score does have some shortcomings; in particular, it is sensitive to changes in the underlying Inception model used. Unterthiner and others at the LIT AI Laboratory at Johannes Kepler University in Austria developed the Fréchet Inception Distance (FID). Instead of using the classification output of an Inception model, FID uses a hidden layer of the model to compute embeddings for input images. The embeddings are calculated for a set of generated images and a set of real-world (or baseline) images. Each of the two resulting sets of embeddings is treated as a sample from a multivariate Gaussian distribution, and the two distributions are compared using the Fréchet distance. One advantage of FID over IS is that FID consistently increases as noise is added to an image, whereas IS can remain flat or even decrease.
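The comparison step can be sketched in a few lines. The closed form used below for the Fréchet distance between two Gaussians, the squared distance between the means plus Tr(C1 + C2 - 2(C1 C2)^(1/2)), is the standard one; everything else is an assumption made for illustration. In particular, the "embeddings" here are random vectors standing in for the output of a real embedding network, and the function name is hypothetical.

```python
import numpy as np


def frechet_distance(baseline: np.ndarray, generated: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Both inputs have shape (n_samples, n_dims) with n_dims >= 2.
    """
    mu_b, mu_g = baseline.mean(axis=0), generated.mean(axis=0)
    cov_b = np.cov(baseline, rowvar=False)
    cov_g = np.cov(generated, rowvar=False)
    # Tr((cov_b cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_b @ cov_g. Those eigenvalues are real and
    # non-negative in theory; clip small numerical errors.
    eigenvalues = np.linalg.eigvals(cov_b @ cov_g).real
    trace_sqrt = np.sqrt(np.clip(eigenvalues, 0.0, None)).sum()
    mean_diff = mu_b - mu_g
    return float(
        mean_diff @ mean_diff + np.trace(cov_b) + np.trace(cov_g) - 2.0 * trace_sqrt
    )


# Stand-in embeddings (hypothetical): 2000 samples, 8 dimensions.
rng = np.random.default_rng(0)
baseline_embeddings = rng.normal(size=(2000, 8))

# Add increasing amounts of Gaussian noise and record the distance each time.
noise_levels = [0.0, 0.5, 1.0, 2.0]
scores = []
for sigma in noise_levels:
    noisy = baseline_embeddings + sigma * rng.normal(size=baseline_embeddings.shape)
    scores.append(frechet_distance(baseline_embeddings, noisy))

for sigma, score in zip(noise_levels, scores):
    print(f"noise sigma={sigma}: distance={score:.4f}")
```

In this toy setup the distance starts at roughly zero for the unmodified data and grows with the noise level, which is the behavior the article describes. A real FID, FAD, or FVD computation would add noise to the images, audio, or video and then pass them through the embedding network, rather than perturbing the embeddings directly.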
Google’s new metrics extend this idea of calculating embeddings for generated data and comparing their statistics with those of baseline data. For FAD, the team used the VGGish audio classification model to calculate embeddings, and for FVD, an Inflated 3D Convnet. To validate the usefulness of their metrics, the researchers calculated the metric value for datasets created by adding noise to their baselines; the expectation was that the scores would increase as noise was added, which did indeed happen. The teams also compared their metric results to human evaluations, finding that each metric correlated with human judgment and agreed with human judges more consistently than other commonly used metrics did.