Transformer in Transformer

Kai Han; An Xiao; Enhua Wu; Jianyuan Guo; Chunjing Xu; Yunhe Wang

The Transformer is a type of self-attention-based neural network originally applied to NLP tasks. Recently, pure transformer-based models have been proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches and ignore the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model that models both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level features are projected into the patch embedding space by a linear transformation layer and then added to the patch embeddings. By stacking TNT blocks, we build the TNT model for image recognition. Experiments on the ImageNet benchmark and downstream tasks demonstrate the superiority and efficiency of the proposed TNT architecture. For example, our TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5% higher than that of DeiT at a similar computational cost. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/TNT.
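
The data flow of one TNT block described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses single-head attention in NumPy and omits layer normalization, MLP sublayers, position encodings, and the class token. All names (`self_attention`, `tnt_block`, `init_params`), the tensor sizes, and the choice to flatten each patch's pixel features before the linear projection are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention; x has shape (..., n, d).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(m, c, d, seed=0):
    # Hypothetical parameter container: m pixels per patch,
    # c channels per pixel embedding, d channels per patch embedding.
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return rng.normal(0.0, 0.02, size=(rows, cols))

    return {
        "inner": [w(c, c) for _ in range(3)],  # q, k, v for pixel tokens
        "outer": [w(d, d) for _ in range(3)],  # q, k, v for patch tokens
        "proj": w(m * c, d),                   # pixel space -> patch space
    }


def tnt_block(pixel_emb, patch_emb, params):
    # pixel_emb: (num_patches, m, c); patch_emb: (num_patches, d)
    # Inner step: pixels attend to other pixels of the same patch.
    pixel_emb = pixel_emb + self_attention(pixel_emb, *params["inner"])
    # Linear projection of pixel-level features, added to the patch embedding
    # (flattening each patch's pixels first is an assumption of this sketch).
    n, m, c = pixel_emb.shape
    patch_emb = patch_emb + pixel_emb.reshape(n, m * c) @ params["proj"]
    # Outer step: patches attend to each other.
    patch_emb = patch_emb + self_attention(patch_emb, *params["outer"])
    return pixel_emb, patch_emb


# Stack two blocks on random inputs: 4 patches, 16 pixels each.
rng = np.random.default_rng(1)
pixels = rng.normal(size=(4, 16, 8))
patches = rng.normal(size=(4, 32))
for seed in range(2):
    block_params = init_params(m=16, c=8, d=32, seed=seed)
    pixels, patches = tnt_block(pixels, patches, block_params)
print(pixels.shape, patches.shape)
```

Both embeddings keep their shapes through a block, which is what allows blocks to be stacked; each block here gets its own randomly initialized weights purely for demonstration.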

updated: Sat Feb 27 2021 03:12:16 GMT+0000 (UTC)

published: Sat Feb 27 2021 03:12:16 GMT+0000 (UTC)

arXiv
