ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

Robert Geirhos; Patricia Rubisch; Claudio Michaelis; Matthias Bethge; Felix A. Wichmann; Wieland Brendel

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on "Stylized-ImageNet", a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.
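
The cue-conflict test described in the abstract can be illustrated with a small sketch. Each cue-conflict image carries one category given by its shape and a different one given by its texture, and a classifier's decision is scored by which cue it follows. The abstract does not spell out a formula, so this sketch assumes that shape bias is the fraction of shape decisions among all trials where the prediction matched either the shape or the texture category. The function name `shape_bias` and the `(shape_label, texture_label, predicted_label)` trial layout are hypothetical, chosen for illustration only.

```python
from typing import Iterable, Tuple

Trial = Tuple[str, str, str]  # (shape_label, texture_label, predicted_label); hypothetical layout


def shape_bias(trials: Iterable[Trial]) -> float:
    """Fraction of shape decisions among trials decided by shape or texture.

    Assumed definition: shape_hits / (shape_hits + texture_hits).
    Predictions matching neither cue are ignored.
    """
    shape_hits = 0
    texture_hits = 0
    for shape_label, texture_label, predicted in trials:
        if shape_label == texture_label:
            # Not a cue conflict: both cues agree, so the trial is uninformative.
            continue
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no trial was decided by either the shape or the texture cue")
    return shape_hits / decided


# Made-up example trials, not data from the paper.
example_trials = [
    ("cat", "elephant", "elephant"),  # follows texture
    ("cat", "elephant", "cat"),       # follows shape
    ("car", "clock", "clock"),        # follows texture
    ("dog", "bottle", "chair"),       # follows neither cue, ignored
]
bias = shape_bias(example_trials)
```

Under this assumed definition, a value near 1 means decisions mostly follow shape and a value near 0 means they mostly follow texture.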

updated: Wed Nov 09 2022 23:15:15 GMT+0000 (UTC)

published: Thu Nov 29 2018 15:04:05 GMT+0000 (UTC)

arXiv
