Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

Ryota Hashiguchi; Toru Tamaki

Feature shifts have been shown to be useful for action recognition with CNN-based models since the Temporal Shift Module (TSM) was proposed. TSM is based on frame-wise feature extraction with late fusion, and layer features are shifted along the time direction for temporal interaction. TokenShift, a recent model based on Vision Transformer (ViT), also uses a temporal feature shift mechanism, but it does not fully exploit the structure of Multi-head Self-Attention (MSA) in ViT. In this paper, we propose Multi-head Self/Cross-Attention (MSCA), which fully utilizes the attention structure. TokenShift is based on a frame-wise ViT in which features are temporally shifted between successive frames (at times t+1 and t-1). In contrast, the proposed MSCA replaces MSA in the frame-wise ViT, and some of its heads attend to successive frames instead of the current frame. The computation cost is the same as that of the frame-wise ViT and TokenShift, because MSCA simply changes the target of the attention. There is a choice as to which of the key, query, and value are taken from the successive frames, and we experimentally compare these variants on Kinetics400. We also investigate other variants in which the proposed MSCA is applied along the patch dimension of ViT instead of the head dimension. Experimental results show that one variant, MSCA-KV, performs best, outperforming TokenShift by 0.1% and ViT by 1.2%.
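
The head-wise idea described in the abstract can be illustrated with a minimal NumPy sketch of a KV-style variant, in which some heads take their keys and values from the previous or next frame while the queries stay on the current frame. This is not the authors' implementation: the function names (shift_time, msca_kv), the tensor layout (frames, heads, patches, head dimension), the n_shift parameter, the choice of which heads are shifted, and the zero-padding at the first and last frames are all assumptions made for illustration.

```python
import numpy as np


def shift_time(x, step):
    """Shift x of shape (T, ...) along the time axis by `step` frames.

    step > 0 gives out[t] = x[t - step]; step < 0 gives out[t] = x[t + |step|].
    Frames with no source are zero-filled (an assumption, not from the paper).
    """
    out = np.zeros_like(x)
    if step > 0:
        out[step:] = x[:-step]
    elif step < 0:
        out[:step] = x[-step:]
    else:
        out[...] = x
    return out


def softmax(a, axis=-1):
    """Numerically stable softmax along `axis`."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def msca_kv(q, k, v, n_shift):
    """Frame-wise multi-head attention where some heads look at neighboring frames.

    q, k, v: arrays of shape (T, H, N, D) = (frames, heads, patches, head dim).
    Heads [0, n_shift) use keys/values from frame t-1, heads
    [n_shift, 2*n_shift) use keys/values from frame t+1, and the remaining
    heads attend within the current frame as in ordinary MSA.
    """
    k, v = k.copy(), v.copy()  # leave the caller's arrays untouched
    back = slice(0, n_shift)
    fwd = slice(n_shift, 2 * n_shift)
    k[:, back] = shift_time(k[:, back], 1)
    v[:, back] = shift_time(v[:, back], 1)
    k[:, fwd] = shift_time(k[:, fwd], -1)
    v[:, fwd] = shift_time(v[:, fwd], -1)
    # Same matrix products as plain per-frame attention; only the source
    # frame of K and V differs for the shifted heads.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, H, N, D = 4, 6, 5, 8  # hypothetical sizes
    q, k, v = (rng.standard_normal((T, H, N, D)) for _ in range(3))
    out = msca_kv(q, k, v, n_shift=1)
    print(out.shape)
```

With n_shift=0 the sketch reduces to ordinary frame-wise self-attention. The shifted heads perform the same matrix products as the unshifted ones, which is consistent with the abstract's statement that the computation cost is unchanged.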

updated: Mon Nov 14 2022 01:41:09 GMT+0000 (UTC)

published: Fri Apr 01 2022 14:06:19 GMT+0000 (UTC)

arXiv
