SSAN: Separable Self-Attention Network for Video Representation Learning

Xudong Guo; Xun Guo; Yan Lu

Self-attention has been successfully applied to video representation learning because it models long-range dependencies effectively. Existing approaches build these dependencies merely by computing pairwise correlations along the spatial and temporal dimensions simultaneously. However, spatial and temporal correlations carry different contextual information: the former describe scenes, while the latter support temporal reasoning. Intuitively, learning spatial contextual information first will benefit temporal modeling. In this paper, we propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially, so that spatial contexts can be used efficiently in temporal modeling. By adding the SSA module to a 2D CNN, we build an SSA network (SSAN) for video representation learning. On the task of video action recognition, our approach outperforms state-of-the-art methods on the Something-Something and Kinetics-400 datasets. With a shallower network and fewer modalities, our models often outperform their counterparts. We further verify the semantic learning ability of our method on the visual-language task of video retrieval, which showcases the homogeneity of video representations and text embeddings. On the MSR-VTT and Youcook2 datasets, video representations learnt by SSA significantly improve on state-of-the-art performance.
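
The sequential idea in the abstract (spatial correlations first, then temporal ones) can be illustrated with a minimal NumPy sketch. This is not the authors' SSA module: the abstract gives no architectural details, so the sketch assumes plain dot-product attention with no learned projections, a residual connection after each stage, and temporal attention applied independently at each spatial position. The names `self_attention` and `separable_self_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Dot-product self-attention over the second-to-last axis.

    x has shape (..., N, C): N tokens with C channels each.
    Assumption: queries, keys and values are the input itself
    (no learned projections), to keep the sketch short.
    """
    c = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(c)  # (..., N, N)
    return softmax(scores, axis=-1) @ x               # (..., N, C)


def separable_self_attention(video):
    """Spatial attention followed by temporal attention.

    video has shape (T, H, W, C). Hypothetical illustration only.
    """
    t, h, w, c = video.shape

    # Stage 1 (spatial): within each frame, every position attends
    # to all H*W positions of the same frame.
    x = video.reshape(t, h * w, c)
    x = x + self_attention(x)          # residual connection (assumed)

    # Stage 2 (temporal): at each spatial position, every frame attends
    # to all T frames, using the spatially contextualised features.
    x = x.transpose(1, 0, 2)           # (H*W, T, C)
    x = x + self_attention(x)          # residual connection (assumed)

    return x.transpose(1, 0, 2).reshape(t, h, w, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((4, 3, 3, 8))   # T=4, H=W=3, C=8
    print(separable_self_attention(clip).shape)
```

In this sketch the two stages build T attention matrices of size (HW) x (HW) and HW matrices of size T x T, whereas computing spatial and temporal correlations simultaneously would require a single (THW) x (THW) matrix.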

updated: Thu May 27 2021 10:02:04 GMT+0000 (UTC)

published: Thu May 27 2021 10:02:04 GMT+0000 (UTC)

arXiv
