Guiding Visual Attention in Deep Convolutional Neural Networks Based on Human Eye Movements

Leonard E. van Dyck; Sebastian J. Denzler; Walter R. Gruber

Deep Convolutional Neural Networks (DCNNs) were originally inspired by principles of biological vision, have evolved into the best current computational models of object recognition, and consequently show strong architectural and functional parallels with the ventral visual pathway in comparisons with neuroimaging and neural time series data. As recent advances in deep learning seem to decrease this similarity, computational neuroscience is challenged to reverse-engineer biological plausibility to obtain useful models. While previous studies have shown that biologically inspired architectures can amplify the human-likeness of the models, in this study we investigate a purely data-driven approach. We use human eye tracking data to directly modify training examples and thereby guide the models' visual attention during object recognition in natural images either towards or away from the focus of human fixations. We compare and validate different manipulation types (i.e., standard, human-like, and non-human-like attention) through GradCAM saliency maps against human participant eye tracking data. Our results demonstrate that the proposed guided focus manipulation works as intended in the negative direction, and non-human-like models focus on significantly dissimilar image parts compared to humans. The observed effects were highly category-specific, enhanced by animacy and face presence, developed only after feedforward processing was completed, and indicated a strong influence on face detection. With this approach, however, no significantly increased human-likeness was found. Possible applications of overt visual attention in DCNNs and further implications for theories of face detection are discussed.
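The abstract does not say how the training examples were modified, so the following is only an illustrative sketch of one way a fixation-guided manipulation could look, not the authors' procedure. It assumes fixations are given as (row, column) pixel coordinates, builds a fixation density map from Gaussians, and blends each image with a flat gray background so that either the fixated regions (human-like) or everything else (non-human-like) stays visible. The function names `fixation_map`, `guide_focus`, and `map_similarity`, the Gaussian width `sigma`, the gray background, and the use of Pearson correlation as the comparison metric are all assumptions made for this example.

```python
import numpy as np


def fixation_map(fixations, shape, sigma=10.0):
    """Sum of isotropic Gaussians centred on (row, col) fixations, scaled to [0, 1]."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    density = np.zeros(shape, dtype=float)
    for r, c in fixations:
        density += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
    peak = density.max()
    return density / peak if peak > 0 else density


def guide_focus(image, fmap, mode="human"):
    """Blend an H x W x C image with a mean-gray background, weighted by fmap.

    mode="human" keeps fixated regions visible (hypothetical human-like variant);
    mode="non_human" keeps the non-fixated regions instead.
    """
    if mode not in ("human", "non_human"):
        raise ValueError("mode must be 'human' or 'non_human'")
    weight = fmap if mode == "human" else 1.0 - fmap
    weight = weight[..., None]  # add a channel axis so the map broadcasts over C
    background = np.full_like(image, image.mean(), dtype=float)
    return weight * image + (1.0 - weight) * background


def map_similarity(map_a, map_b):
    """Pearson correlation between two maps of equal shape (assumed metric)."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])


# Toy data: a random "image" and two made-up fixations.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
fmap = fixation_map([(20, 20), (40, 45)], (64, 64))

human_like = guide_focus(image, fmap, mode="human")
non_human_like = guide_focus(image, fmap, mode="non_human")
```

In a full pipeline, a model's GradCAM saliency map (resized to the image resolution) could be passed to `map_similarity` together with `fmap` to quantify how closely the model's focus matches human fixations; that step is omitted here because it depends on the specific network and framework.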

updated: Tue Jun 21 2022 17:59:23 GMT+0000 (UTC)

published: Tue Jun 21 2022 17:59:23 GMT+0000 (UTC)

arXiv
