TDAN: Top-Down Attention Networks for Enhanced Feature Selectivity in CNNs

Shantanu Jaiswal; Basura Fernando; Cheston Tan

Attention modules for Convolutional Neural Networks (CNNs) are an effective way to improve network performance on multiple computer-vision tasks. While many works focus on building more effective modules through appropriate modelling of channel-, spatial- and self-attention, these modules primarily operate in a feedforward manner. Consequently, the attention mechanism depends strongly on the representational capacity of a single input feature activation, and can benefit from incorporating semantically richer higher-level activations that specify "what and where to look" through top-down information flow. Such feedback connections are also prevalent in the primate visual cortex and are recognized by neuroscientists as a key component of primate visual attention. Accordingly, in this work, we propose a lightweight top-down (TD) attention module that iteratively generates a "visual searchlight" to perform top-down channel and spatial modulation of its inputs, and consequently outputs more selective feature activations at each computation step. Our experiments indicate that integrating TD into CNNs enhances their performance on ImageNet-1k classification and outperforms prominent attention modules while being more parameter- and memory-efficient. Further, our models are more robust to changes in input resolution during inference and learn to "shift attention" by localizing individual objects or features at each computation step without any explicit supervision. This capability results in a 5% improvement for ResNet50 on weakly-supervised object localization, in addition to improvements in fine-grained and multi-label classification.
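The abstract describes the mechanism only at a high level, so the following is a rough illustrative sketch of iterative top-down channel and spatial modulation, not the authors' implementation. The function name `td_attention_sketch`, the projection matrices `w_high` and `w_channel`, the use of global average pooling, and the sigmoid gating are all assumptions introduced here for illustration.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def td_attention_sketch(x, w_high, w_channel, steps=2):
    """Hypothetical top-down modulation loop (assumed design, not from the paper).

    x         : input feature map of shape (C, H, W)
    w_high    : assumed projection (D, C) producing a higher-level summary
    w_channel : assumed projection (C, D) mapping the summary back to channels
    steps     : number of computation steps
    Returns a list with one modulated feature map per step.
    """
    out = x
    outputs = []
    for _ in range(steps):
        # Bottom-up: summarize the currently attended features.
        pooled = out.mean(axis=(1, 2))               # (C,)
        high = np.maximum(w_high @ pooled, 0.0)      # (D,), ReLU

        # Top-down: project back to a "searchlight" vector in channel space.
        searchlight = w_channel @ high               # (C,)

        # Channel modulation of the original input.
        channel_gate = sigmoid(searchlight)          # (C,)
        modulated = x * channel_gate[:, None, None]  # (C, H, W)

        # Spatial modulation: score each location against the searchlight.
        spatial_logits = np.einsum("c,chw->hw", searchlight, modulated)
        spatial_gate = sigmoid(spatial_logits)       # (H, W)

        out = modulated * spatial_gate[None, :, :]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((8, 5, 5))
    outs = td_attention_sketch(
        features,
        rng.standard_normal((4, 8)) * 0.1,
        rng.standard_normal((8, 4)) * 0.1,
        steps=3,
    )
    print(len(outs), outs[0].shape)
```

Because each step recomputes the searchlight from the previous step's output, the gates can change from one step to the next, which is one simple way to picture the "attention shifting" behaviour the abstract mentions.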

updated: Fri Nov 26 2021 12:35:17 GMT+0000 (UTC)

published: Fri Nov 26 2021 12:35:17 GMT+0000 (UTC)

arXiv
