Grounding inductive biases in natural images: invariance stems from variations in data

Diane Bouchacourt; Mark Ibrahim; Ari S. Morcos

To perform well on unseen and potentially out-of-distribution samples, it is desirable for machine learning models to have a predictable response with respect to transformations affecting the factors of variation of the input. Invariance is commonly achieved through hand-engineered data augmentation, but do standard data augmentations address transformations that explain variations in real data? While prior work has focused on synthetic data, we attempt here to characterize the factors of variation in a real dataset, ImageNet, and study the invariance of both standard residual networks and the recently proposed vision transformer with respect to changes in these factors. We show that standard augmentation relies on a precise combination of translation and scale, with translation recapturing most of the performance improvement, despite the (approximate) translation invariance built into convolutional architectures such as residual networks. In fact, we found that invariance to scale and translation was similar across residual networks and vision transformer models despite their markedly different inductive biases. We show that the training data itself is the main source of invariance, and that data augmentation only further increases the learned invariances. Interestingly, the invariances brought from the training process align with the ImageNet factors of variation we found. Finally, we find that the main factors of variation in ImageNet mostly relate to appearance and are specific to each class.
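As a rough illustration of what "invariance with respect to translation" can mean operationally, the sketch below counts how often a classifier's predicted label stays the same when an image is shifted. It is not the measure used in the paper, which the abstract does not specify. The function names (`translate`, `prediction_consistency`), the two toy classifiers, and the use of circular shifts on random 8x8 arrays are all assumptions made for illustration.

```python
import numpy as np


def translate(image, dx, dy):
    """Circularly shift an (H, W) image by dx columns and dy rows.

    Circular shifting is an assumption made to keep the toy example exact;
    real images would normally be padded or cropped instead.
    """
    return np.roll(image, shift=(dy, dx), axis=(0, 1))


def prediction_consistency(predict, images, shifts):
    """Fraction of (image, shift) pairs whose predicted label is unchanged."""
    unchanged = 0
    total = 0
    for image in images:
        base = predict(image)
        for dx, dy in shifts:
            unchanged += int(predict(translate(image, dx, dy)) == base)
            total += 1
    return unchanged / total


def pooled_predict(image):
    # Hypothetical classifier: thresholds the global maximum, which a
    # circular shift cannot change, so it is exactly shift-invariant.
    return int(image.max() > 0.99)


def corner_predict(image):
    # Hypothetical classifier: looks only at the top-left pixel, so a
    # shift can move a different pixel into view and flip the label.
    return int(image[0, 0] > 0.5)


rng = np.random.default_rng(0)
images = [rng.random((8, 8)) for _ in range(20)]
shifts = [(1, 0), (0, 1), (2, 3), (-1, -2)]

pooled_score = prediction_consistency(pooled_predict, images, shifts)
corner_score = prediction_consistency(corner_predict, images, shifts)
print(f"pooled: {pooled_score:.2f}, corner: {corner_score:.2f}")
```

A score of 1.0 means the label never changed under the tested shifts. The same loop could, in principle, wrap a trained network's prediction function and a scaling transform, but doing so is outside what the abstract describes.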

updated: Wed Jun 09 2021 14:58:57 GMT+0000 (UTC)

published: Wed Jun 09 2021 14:58:57 GMT+0000 (UTC)

arXiv
