On the Connection between Local Attention and Dynamic Depth-wise Convolution

Qi Han; Zejia Fan; Qi Dai; Lei Sun; Ming-Ming Cheng; Jiaying Liu; Jingdong Wang

Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and its variant, Local Vision Transformer, improves on it further. The major component of Local Vision Transformer, local attention, performs attention separately over small local windows. We rephrase local attention as a channel-wise locally-connected layer and analyze it in terms of two forms of network regularization, sparse connectivity and weight sharing, as well as weight computation. Sparse connectivity: there is no connection across channels, and each position is connected only to the positions within a small local window. Weight sharing: the connection weights for one position are shared across channels or within each group of channels. Dynamic weight: the connection weights are dynamically predicted according to each image instance. We point out that local attention resembles depth-wise convolution and its dynamic version in sparse connectivity. The main difference lies in weight sharing: depth-wise convolution shares connection weights (kernel weights) across spatial positions. We empirically observe that models based on depth-wise convolution and its dynamic variant, with lower computational complexity, perform on par with or sometimes slightly better than Swin Transformer, an instance of Local Vision Transformer, for ImageNet classification, COCO object detection, and ADE semantic segmentation. These observations suggest that Local Vision Transformer takes advantage of two regularization forms and dynamic weight to increase the network capacity. Code is available at https://github.com/Atten4Vis/DemystifyLocalViT.
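The contrast in weight sharing described in the abstract can be illustrated with a small 1D sketch. This is not the authors' implementation: the function names `depthwise_conv1d` and `local_attention1d` are hypothetical, the attention uses a sliding window centered on each position rather than Swin Transformer's window partitioning, and the query/key/value projections are assumed to be the identity so that the example stays short.

```python
import math


def depthwise_conv1d(x, kernel):
    """Depth-wise convolution over a 1D sequence (illustrative sketch).

    x: list of N positions, each a list of C channel values.
    kernel: list of C per-channel kernels, each of odd length K.
    The weights are static, shared across positions, and distinct per channel.
    """
    n, c = len(x), len(x[0])
    r = len(kernel[0]) // 2
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for ch in range(c):  # no connection across channels
            for k, w in enumerate(kernel[ch]):
                j = i + k - r
                if 0 <= j < n:  # zero padding at the borders
                    out[i][ch] += w * x[j][ch]
    return out


def local_attention1d(x, radius):
    """Local attention over a 1D sequence (illustrative sketch).

    Assumption: identity query/key/value projections and a sliding window.
    The weights are computed from the input (dynamic), differ per position,
    and are shared across all channels of that position.
    """
    n, c = len(x), len(x[0])
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in range(i - radius, i + radius + 1) if 0 <= j < n]
        # Scaled dot-product scores between position i and its neighbours.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(c) for j in nbrs
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        for j, e in zip(nbrs, exps):
            w = e / z  # one weight per neighbour, reused for every channel
            for ch in range(c):
                out[i][ch] += w * x[j][ch]
    return out
```

Both functions connect each output position only to a small neighbourhood of the same channel, which is the sparse connectivity they have in common. They differ in where the weights come from: `depthwise_conv1d` reads them from a fixed `kernel` that is reused at every position, while `local_attention1d` recomputes them from the input at every position and reuses them across channels.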

updated: Wed Apr 06 2022 08:37:12 GMT+0000 (UTC)

published: Tue Jun 08 2021 11:47:44 GMT+0000 (UTC)

arXiv
