Ultrafast Image Categorization in Biology and Neural Models

Jean-Nicolas Jérémie; Laurent U Perrinet

Humans categorize images very efficiently and, in particular, can detect the presence of an animal very quickly. Recently, deep learning algorithms based on convolutional neural networks (CNNs) have achieved higher-than-human accuracy on a wide range of visual categorization tasks. However, the tasks on which these artificial networks are typically trained and evaluated tend to be highly specialized, and the networks do not generalize well; for example, accuracy drops after image rotation. In this respect, biological visual systems are more flexible and efficient than artificial systems for more general tasks, such as recognizing an animal. To further the comparison between biological and artificial neural networks, we re-trained the standard VGG 16 CNN on two independent tasks that are ecologically relevant to humans: detecting the presence of an animal or an artifact. We show that re-training the network achieves a human-like level of performance, comparable to that reported in psychophysical tasks. In addition, we show that the categorization is better when the outputs of the models are combined. Indeed, animals (e.g., lions) tend to appear less often in photographs that contain artifacts (e.g., buildings). Furthermore, these re-trained models were able to reproduce some unexpected behavioral observations from human psychophysics, such as robustness to rotation (e.g., an upside-down or tilted image) or to a grayscale transformation. Finally, we quantified the number of CNN layers required to achieve such performance and showed that good accuracy for ultrafast image categorization can be achieved with only a few layers, challenging the belief that image recognition requires deep sequential analysis of visual objects.
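The abstract does not say how the outputs of the two models are combined, so the following is only a minimal sketch under stated assumptions. It supposes each re-trained network returns a probability (here called `p_animal` and `p_artifact`, both hypothetical names) and that the animal evidence is lowered when artifact evidence is high, consistent with the observation that animals appear less often in photographs containing artifacts. The logit-difference rule and the `weight` parameter are illustrative choices, not the authors' method.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def combine_animal_score(p_animal, p_artifact, weight=1.0):
    """Hypothetical combination rule (an assumption, not from the paper).

    Subtracts the weighted artifact log-odds from the animal log-odds,
    so strong artifact evidence lowers the combined animal score.
    With weight=0 the animal model's output is returned unchanged.
    """
    return sigmoid(logit(p_animal) - weight * logit(p_artifact))


if __name__ == "__main__":
    # Same animal score, three different artifact scores.
    for p_artifact in (0.1, 0.5, 0.9):
        score = combine_animal_score(0.6, p_artifact)
        print(f"p_artifact={p_artifact:.1f} -> combined animal score {score:.3f}")
```

In this sketch an uninformative artifact output (0.5) leaves the animal score untouched, while a confident artifact detection pulls it down. In practice, `weight` would presumably be fit on held-out data rather than fixed by hand.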

updated: Wed May 31 2023 05:30:51 GMT+0000 (UTC)

published: Sat May 07 2022 11:19:40 GMT+0000 (UTC)

arXiv
