Do better ImageNet classifiers assess perceptual similarity better?

Manoj Kumar; Neil Houlsby; Nal Kalchbrenner; Ekin D. Cubuk

Perceptual distances between images, as measured in the space of pre-trained deep features, have outperformed prior low-level, pixel-based metrics on assessing image similarity. While the ability of older and less accurate models such as AlexNet and VGG to capture perceptual similarity is well known, modern and more accurate models are less studied. In this paper, we present a large-scale empirical study to assess how well ImageNet classifiers perform on perceptual similarity. First, we observe an inverse correlation between ImageNet accuracy and Perceptual Scores of modern networks such as ResNets, EfficientNets, and Vision Transformers: that is, better classifiers achieve worse Perceptual Scores. Then, we examine the ImageNet accuracy/Perceptual Score relationship while varying the depth, width, number of training steps, weight decay, label smoothing, and dropout. Higher accuracy improves Perceptual Score up to a certain point, but we uncover a Pareto frontier between accuracy and Perceptual Score in the mid-to-high accuracy regime. We explore this relationship further using a number of plausible hypotheses such as distortion invariance, spatial frequency sensitivity, and alternative perceptual functions. Interestingly, we discover shallow ResNets and ResNets trained for fewer than 5 epochs, both trained only on ImageNet, whose emergent Perceptual Score matches that of the prior best networks trained directly on supervised human perceptual judgements.
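
To make "perceptual distance measured in the space of pre-trained deep features" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's exact procedure: the abstract does not specify the distance function, so the sketch assumes one common choice (unit-normalize each spatial position's channel vector, take squared differences, average over space, sum over layers). The names `normalize_channels` and `perceptual_distance` are hypothetical, and random arrays stand in for activations that would in practice come from a pre-trained ImageNet classifier.

```python
import numpy as np

def normalize_channels(feats, eps=1e-10):
    """Scale the channel vector at each spatial position to unit L2 norm.

    feats: array of shape (H, W, C) holding one layer's activations.
    """
    norm = np.sqrt(np.sum(feats ** 2, axis=-1, keepdims=True))
    return feats / (norm + eps)

def perceptual_distance(feats_a, feats_b):
    """Assumed distance: sum over layers of the spatially averaged
    squared difference between channel-normalized activations.

    feats_a, feats_b: lists of (H, W, C) arrays, one entry per layer.
    """
    total = 0.0
    for fa, fb in zip(feats_a, feats_b):
        diff = normalize_channels(fa) - normalize_channels(fb)
        total += float(np.mean(np.sum(diff ** 2, axis=-1)))
    return total

# Stand-in activations (assumption: two layers with made-up shapes).
rng = np.random.default_rng(0)
ref = [rng.normal(size=(8, 8, 16)), rng.normal(size=(4, 4, 32))]
near = [f + 0.01 * rng.normal(size=f.shape) for f in ref]  # slight perturbation
far = [rng.normal(size=f.shape) for f in ref]              # unrelated features

d_near = perceptual_distance(ref, near)
d_far = perceptual_distance(ref, far)
print(d_near < d_far)
```

With real networks, the same function would be applied to activations extracted from each classifier under study, so that distances produced by different architectures can be compared against human similarity judgements.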

updated: Thu Sep 08 2022 15:15:28 GMT+0000 (UTC)

published: Wed Mar 09 2022 18:45:41 GMT+0000 (UTC)

arXiv
