TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning

Yang Liu; Keze Wang; Lingbo Liu; Haoyuan Lan; Liang Lin

Video self-supervised learning is a challenging task that requires significant expressive power from the model to leverage rich spatial-temporal knowledge and generate effective supervisory signals from large amounts of unlabeled videos. However, existing methods fail to increase the temporal diversity of unlabeled videos and do not explicitly model multi-scale temporal dependencies. To overcome these limitations, we take advantage of the multi-scale temporal dependencies within videos and propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL), which jointly models the inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy. Specifically, a Spatial-Temporal Knowledge Discovering (STKD) module is first introduced to extract motion-enhanced spatial-temporal representations from videos based on frequency-domain analysis with the discrete cosine transform. To explicitly model multi-scale temporal dependencies of unlabeled videos, TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG). Then, specific contrastive learning modules are designed to maximize the agreement between nodes in different graph views. To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module, which leverages the relational knowledge among video snippets to learn the global context representation and adaptively recalibrate the channel-wise features. Experimental results demonstrate the superiority of TCGL over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks. The code is publicly available at https://github.com/YangLiu9208/TCGL.
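
The abstract does not spell out how snippet orders become supervisory signals, so the following is only a minimal sketch of a generic snippet order prediction pretext task, not the authors' ASOP module. It assumes consecutive, non-overlapping snippets and treats each possible ordering as one class; the function name `make_order_prediction_sample` and the parameters `num_snippets` and `snippet_len` are hypothetical and chosen for illustration.

```python
import itertools
import random


def make_order_prediction_sample(num_frames, num_snippets=3, snippet_len=4, rng=None):
    """Build one training sample for a generic snippet order prediction task.

    Returns (shuffled_snippets, label), where each snippet is a list of frame
    indices and label is the index of the permutation that was applied.
    All names and defaults here are illustrative assumptions.
    """
    rng = rng or random.Random()
    needed = num_snippets * snippet_len
    if num_frames < needed:
        raise ValueError("video too short for the requested snippets")

    # Pick a random start so that all snippets fit inside the video,
    # then cut consecutive, non-overlapping snippets of frame indices.
    start = rng.randrange(num_frames - needed + 1)
    snippets = [
        list(range(start + i * snippet_len, start + (i + 1) * snippet_len))
        for i in range(num_snippets)
    ]

    # Each ordering of the snippets is one class: num_snippets! classes in total.
    permutations = list(itertools.permutations(range(num_snippets)))
    label = rng.randrange(len(permutations))
    order = permutations[label]

    # Apply the chosen permutation; a model would be trained to recover `label`
    # from the features of the shuffled snippets, with no human annotation.
    shuffled = [snippets[i] for i in order]
    return shuffled, label


if __name__ == "__main__":
    shuffled, label = make_order_prediction_sample(64, rng=random.Random(0))
    print(shuffled, label)
```

With three snippets there are 3! = 6 orderings, so the pretext task reduces to 6-way classification. The label comes for free from the shuffling itself, which is what makes the signal self-supervised.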

updated: Wed Jan 05 2022 03:44:26 GMT+0000 (UTC)

published: Tue Dec 07 2021 09:27:56 GMT+0000 (UTC)

arXiv
