Contextual Transformer Networks for Visual Recognition

Yehao Li; Ting Yao; Yingwei Pan; Tao Mei

Transformers with self-attention have revolutionized natural language processing and have recently inspired Transformer-style architecture designs that achieve competitive results in numerous computer vision tasks. Nevertheless, most existing designs apply self-attention directly over a 2D feature map, obtaining the attention matrix from pairs of isolated queries and keys at each spatial location, and leave the rich context among neighboring keys under-exploited. In this work, we design a novel Transformer-style module for visual recognition, the Contextual Transformer (CoT) block. This design fully capitalizes on the contextual information among input keys to guide the learning of a dynamic attention matrix, and thus strengthens the capacity of the visual representation. Technically, the CoT block first contextually encodes the input keys via a 3×3 convolution, yielding a static contextual representation of the inputs. We then concatenate the encoded keys with the input queries and learn the dynamic multi-head attention matrix through two consecutive 1×1 convolutions. The learnt attention matrix is multiplied by the input values to obtain the dynamic contextual representation of the inputs. The fusion of the static and dynamic contextual representations is finally taken as the output. The CoT block is appealing in that it can readily replace each 3×3 convolution in ResNet architectures, yielding a Transformer-style backbone named Contextual Transformer Networks (CoTNet). Through extensive experiments over a wide range of applications (e.g., image recognition, object detection, and instance segmentation), we validate the superiority of CoTNet as a stronger backbone. Source code is available at https://github.com/JDAI-CV/CoTNet.
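The data flow described in the abstract can be illustrated with a small, dependency-free sketch. This is not the code from the CoTNet repository: the function and parameter names (`cot_block`, `w_key`, `w_value`, `w_theta`, `w_delta`) are hypothetical, the 1×1 value embedding and the ReLU between the two 1×1 convolutions are assumptions, the attention is applied over a 3×3 neighborhood without normalization, and the fusion is assumed to be a plain element-wise sum. Feature maps are nested lists of shape [channels][height][width].

```python
import random

def rand(*shape):
    """Nested list of uniform random floats with the given shape."""
    if len(shape) == 1:
        return [random.uniform(-1, 1) for _ in range(shape[0])]
    return [rand(*shape[1:]) for _ in range(shape[0])]

def neighbors(plane, i, j):
    """Zero-padded 3x3 window around (i, j), flattened row-major (9 values)."""
    H, W = len(plane), len(plane[0])
    return [plane[i + di][j + dj] if 0 <= i + di < H and 0 <= j + dj < W else 0.0
            for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def conv3x3(x, w):
    """x: [C_in][H][W], w: [C_out][C_in][9] -> [C_out][H][W]."""
    H, W = len(x[0]), len(x[0][0])
    return [[[sum(wk * v for c in range(len(x))
                  for wk, v in zip(w[o][c], neighbors(x[c], i, j)))
              for j in range(W)] for i in range(H)] for o in range(len(w))]

def conv1x1(x, w):
    """x: [C_in][H][W], w: [C_out][C_in] -> [C_out][H][W]."""
    H, W = len(x[0]), len(x[0][0])
    return [[[sum(w[o][c] * x[c][i][j] for c in range(len(x)))
              for j in range(W)] for i in range(H)] for o in range(len(w))]

def relu(x):
    return [[[max(0.0, v) for v in row] for row in plane] for plane in x]

def cot_block(x, w_key, w_value, w_theta, w_delta, heads):
    """Simplified CoT block; keys and queries are both taken to be x (assumption)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k_static = conv3x3(x, w_key)                   # static context from the keys
    values = conv1x1(x, w_value)                   # assumed 1x1 value embedding
    hidden = relu(conv1x1(k_static + x, w_theta))  # list "+" concatenates channels
    attn = conv1x1(hidden, w_delta)                # heads * 9 weights per position
    per_head = C // heads                          # channels sharing one head
    out = []
    for c in range(C):
        h = c // per_head
        plane = []
        for i in range(H):
            row = []
            for j in range(W):
                window = neighbors(values[c], i, j)
                # dynamic context: attention weights times neighboring values
                dyn = sum(attn[h * 9 + n][i][j] * window[n] for n in range(9))
                row.append(k_static[c][i][j] + dyn)  # assumed fusion: plain sum
            plane.append(row)
        out.append(plane)
    return out

random.seed(0)
C, H, W, heads, C_mid = 4, 5, 5, 2, 3
x = rand(C, H, W)
w_key, w_value = rand(C, C, 9), rand(C, C)
w_theta, w_delta = rand(C_mid, 2 * C), rand(heads * 9, C_mid)
y = cot_block(x, w_key, w_value, w_theta, w_delta, heads)
print(len(y), len(y[0]), len(y[0][0]))
```

Because the output has the same shape as the input, a block of this form can stand in for a 3×3 convolution, which is the drop-in property the abstract attributes to the CoT block. The explicit loops are chosen for readability rather than speed.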

updated: Mon Jul 26 2021 16:00:21 GMT+0000 (UTC)

published: Mon Jul 26 2021 16:00:21 GMT+0000 (UTC)

arXiv
