Long-term Leap Attention, Short-term Periodic Shift for Video Classification

Hao Zhang; Lechao Cheng; Yanbin Hao; Chong-Wah Ngo

A video transformer naturally incurs a heavier computational burden than a static vision transformer, because it processes a sequence T times longer under standard attention of quadratic complexity (T^2N^2). Existing works treat the temporal axis as a simple extension of the spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing, without exploiting temporal redundancy. However, videos naturally contain redundant information between neighboring frames, so attention on visually similar frames could potentially be suppressed in a dilated manner. Based on this hypothesis, we propose LAPS, a long-term "Leap Attention" (LA), short-term "Periodic Shift" (P-Shift) module for video transformers, with (2TN^2) complexity. Specifically, LA groups long-term frames into pairs, then refactors each discrete pair via attention. P-Shift exchanges features between temporal neighbors to counter the loss of short-term dynamics. By replacing vanilla 2D attention with LAPS, a static transformer can be adapted into a video transformer with zero extra parameters and negligible computation overhead (∼2.6%). Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS transformer can achieve competitive performance in terms of accuracy, FLOPs, and parameter count among state-of-the-art CNNs and transformers. The project is open-sourced on GitHub: https://github.com/VideoNetworks/LAPS-transformer
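
To make the two complexity figures in the abstract concrete, the sketch below counts query-key pairs for joint spatio-temporal attention (T^2N^2) and for attention restricted to two-frame groups (2TN^2). It is only a counting illustration, not the authors' implementation. The function names and the example values T = 8 and N = 196 are hypothetical, and reading 2TN^2 as "each frame's N tokens attend to the 2N tokens of its pair" is an assumption drawn from the abstract's description of LA.

```python
def joint_attention_pairs(T: int, N: int) -> int:
    """Query-key pairs when all T*N tokens attend to each other: (TN)^2."""
    return (T * N) ** 2


def paired_frame_attention_pairs(T: int, N: int) -> int:
    """Query-key pairs when each frame's N tokens attend only to the 2N
    tokens of a two-frame group (assumed reading of the 2TN^2 figure)."""
    return T * N * (2 * N)


if __name__ == "__main__":
    # Hypothetical clip: 8 frames, 14 x 14 = 196 patch tokens per frame.
    T, N = 8, 196
    joint = joint_attention_pairs(T, N)
    paired = paired_frame_attention_pairs(T, N)
    print(f"joint:  {joint}")
    print(f"paired: {paired}")
    print(f"ratio:  {joint / paired}")  # equals T / 2 for any N
```

Under these assumptions the saving grows linearly with clip length: the ratio between the two counts is T/2, independent of the number of tokens per frame.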

updated: Tue Jul 12 2022 13:30:15 GMT+0000 (UTC)

published: Tue Jul 12 2022 13:30:15 GMT+0000 (UTC)

arXiv
