Long Short View Feature Decomposition via Contrastive Video Representation Learning

Nadine Behrmann; Mohsen Fayyaz; Juergen Gall; Mehdi Noroozi

Self-supervised video representation methods typically focus on the representation of temporal attributes in videos. However, the role of stationary versus non-stationary attributes is less explored: Stationary features, which remain similar throughout the video, enable the prediction of video-level action classes. Non-stationary features, which represent temporally varying attributes, are more beneficial for downstream tasks involving more fine-grained temporal understanding, such as action segmentation. We argue that a single representation to capture both types of features is sub-optimal, and propose to decompose the representation space into stationary and non-stationary features via contrastive learning from long and short views, i.e. long video sequences and their shorter sub-sequences. Stationary features are shared between the short and long views, while non-stationary features aggregate the short views to match the corresponding long view. To empirically verify our approach, we demonstrate that our stationary features work particularly well on an action recognition downstream task, while our non-stationary features perform better on action segmentation. Furthermore, we analyse the learned representations and find that stationary features capture more temporally stable, static attributes, while non-stationary features encompass more temporally varying ones.
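
The decomposition described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not state the loss, the aggregation function, or how the representation is split. The sketch assumes an InfoNCE-style contrastive loss, a sum over short views as the aggregation, and a simple split of the embedding into a first stationary part and a remaining non-stationary part. The names `info_nce`, `decomposition_loss`, and `d_stationary` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss where queries[i] and keys[i] form the positive pair.

    All other rows of `keys` serve as negatives for queries[i].
    """
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def decomposition_loss(long_feats, short_feats, d_stationary):
    """Return (stationary_loss, non_stationary_loss).

    long_feats:  (B, D) one embedding per long view.
    short_feats: (B, N, D) embeddings of N short sub-sequences per long view.
    """
    # Assumption: the first d_stationary dimensions hold stationary features,
    # the remaining dimensions hold non-stationary features.
    long_s, long_n = long_feats[:, :d_stationary], long_feats[:, d_stationary:]
    short_s, short_n = short_feats[..., :d_stationary], short_feats[..., d_stationary:]

    # Stationary: every short view alone should match its long view.
    num_short = short_s.shape[1]
    stationary = np.mean([info_nce(short_s[:, i], long_s) for i in range(num_short)])

    # Non-stationary: the aggregated short views should match the long view.
    # Assumption: aggregation is a plain sum over the short views.
    aggregated = short_n.sum(axis=1)
    non_stationary = info_nce(aggregated, long_n)
    return float(stationary), non_stationary


# Usage with random stand-in embeddings (no real video encoder involved).
rng = np.random.default_rng(0)
long_feats = rng.normal(size=(4, 16))
short_feats = rng.normal(size=(4, 3, 16))
loss_s, loss_n = decomposition_loss(long_feats, short_feats, d_stationary=8)
```

In this sketch the stationary term pulls each short view towards its long view, so those dimensions can only encode what stays constant across the video. The non-stationary term constrains only the aggregate, which leaves the individual short views free to differ from one another.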

updated: Thu Sep 23 2021 18:54:34 GMT+0000 (UTC)

published: Thu Sep 23 2021 18:54:34 GMT+0000 (UTC)

arXiv
