Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

Rui Qian; Yuxi Li; Huabin Liu; John See; Shuangrui Ding; Xian Liu; Dian Li; Weiyao Lin

The crux of self-supervised video representation learning is to build general features from unlabeled videos. However, most recent works focus on high-level semantics and neglect lower-level representations and their temporal relationships, both of which are crucial for general video understanding. To address these challenges, this paper proposes a multi-level feature optimization framework that improves the generalization and temporal modeling ability of learned video representations. Concretely, high-level features obtained from naive and prototypical contrastive learning are used to build distribution graphs, which guide low-level and mid-level feature learning. We also devise a simple temporal modeling module over the multi-level features to enhance motion pattern learning. Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve representation ability in video understanding.
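The abstract does not give the exact form of the graph constraint, so the following is only a rough sketch of one way such a constraint could look, not the authors' implementation. It assumes the "distribution graph" is a row-wise softmax over pairwise cosine similarities within a batch, and that lower-level features are pulled toward the high-level graph with a KL divergence. The function names, the temperature value, and the choice of KL are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero probability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_distribution(feats, temperature=0.1):
    # Hypothetical "distribution graph": each row is a probability
    # distribution over the other samples in the batch.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    return softmax(sim, axis=1)

def graph_constraint_loss(high, low, temperature=0.1, eps=1e-12):
    # Assumed constraint: KL(P_high || P_low), averaged over samples.
    # The high-level graph acts as the target; in a real training loop
    # it would be treated as a constant (no gradient).
    p = similarity_distribution(high, temperature)
    q = similarity_distribution(low, temperature)
    off_diag = ~np.eye(p.shape[0], dtype=bool)
    kl = np.sum(p[off_diag] * (np.log(p[off_diag] + eps) - np.log(q[off_diag] + eps)))
    return kl / p.shape[0]

rng = np.random.default_rng(0)
high_feats = rng.normal(size=(8, 16))  # stand-in for pooled high-level features
low_feats = rng.normal(size=(8, 32))   # stand-in for pooled low-level features
loss_value = graph_constraint_loss(high_feats, low_feats)
```

Because only the pairwise similarity structure is compared, the low-level and high-level features may have different dimensionality, which is convenient when the features come from different network depths.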

updated: Wed Aug 04 2021 17:16:18 GMT+0000 (UTC)

published: Wed Aug 04 2021 17:16:18 GMT+0000 (UTC)

arXiv
