RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

Peihao Chen; Deng Huang; Dongliang He; Xiang Long; Runhao Zeng; Shilei Wen; Mingkui Tan; Chuang Gan

We study unsupervised video representation learning, which seeks to learn both motion and appearance features from unlabeled video only, so that the features can be reused for downstream tasks such as action recognition. This task, however, is extremely challenging due to 1) the highly complex spatial-temporal information in videos and 2) the lack of labeled data for training. Unlike representation learning for static images, it is difficult to construct a suitable self-supervised task that models both motion and appearance features well. More recently, several attempts have been made to learn video representations through video playback speed prediction. However, it is non-trivial to obtain precise speed labels for videos. More critically, the learned models may tend to focus on motion patterns and thus may not learn appearance features well. In this paper, we observe that the relative playback speed is more consistent with the motion pattern and thus provides more effective and stable supervision for representation learning. Therefore, we propose a new way to perceive the playback speed that exploits the relative speed between two video clips as labels. In this way, we are able to perceive speed well and learn better motion features. Moreover, to ensure the learning of appearance features, we further propose an appearance-focused task, in which we force the model to perceive the appearance difference between two video clips. We show that jointly optimizing the two tasks consistently improves performance on two downstream tasks, namely action recognition and video retrieval. Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy without using labeled data for pre-training, which outperforms the ImageNet supervised pre-trained model. Code and pre-trained models can be found at https://github.com/PeihaoChen/RSPNet.
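The idea of using the relative speed between two clips as a label can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes that a playback speed of s is simulated by keeping every s-th frame, and that a pair of clips is labeled only by whether the first is the same speed, faster, or slower than the second. The function names `sample_clip`, `relative_speed_label`, and `make_pair`, the default clip length, and the speed set are all hypothetical choices for illustration.

```python
import random


def sample_clip(num_frames, clip_len, speed, rng):
    """Return frame indices for a clip of `clip_len` frames played at `speed`x.

    Assumption: a speed of s is simulated by keeping every s-th frame.
    """
    span = (clip_len - 1) * speed + 1  # number of source frames the clip covers
    if span > num_frames:
        raise ValueError("video too short for this clip length and speed")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * speed for i in range(clip_len)]


def relative_speed_label(speed_a, speed_b):
    """Hypothetical labeling: 0 = same speed, 1 = A faster, 2 = A slower."""
    if speed_a == speed_b:
        return 0
    return 1 if speed_a > speed_b else 2


def make_pair(num_frames, clip_len=16, speeds=(1, 2), rng=None):
    """Sample two clips from one video and label them by relative speed only."""
    rng = rng or random.Random()
    speed_a, speed_b = rng.choice(speeds), rng.choice(speeds)
    clip_a = sample_clip(num_frames, clip_len, speed_a, rng)
    clip_b = sample_clip(num_frames, clip_len, speed_b, rng)
    return clip_a, clip_b, relative_speed_label(speed_a, speed_b)
```

The label here depends only on how the two clips compare, not on the absolute speed of either one, which is the property the abstract argues gives more stable supervision. How the paper turns such pairs into a training loss is not specified in this abstract and is left out of the sketch.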

updated: Mon Mar 15 2021 10:52:53 GMT+0000 (UTC)

published: Tue Oct 27 2020 16:42:50 GMT+0000 (UTC)

arXiv
