Spatio-Temporal Crop Aggregation for Video Representation Learning

Sepehr Sameni; Simon Jenni; Paolo Favaro

We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that scales well at both training and inference time. Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone. To train the model, we propose a self-supervised objective consisting of masked clip feature prediction. We apply sparsity both to the input, by extracting a random set of video clips, and to the loss function, by reconstructing only the sparse inputs. Moreover, we reduce dimensionality by working in the latent space of a pre-trained backbone applied to single video clips. These techniques make our method not only extremely efficient to train but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and KNN probing on common action classification and video understanding datasets.
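The masked clip feature prediction objective described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the mask ratio, and the mean-of-visible-features predictor are all assumptions made for illustration. The sketch assumes clip features were already extracted by a frozen pre-trained backbone, and that the loss is computed only on the sampled clips that were masked, which is one reading of the abstract.

```python
import random

def sample_clip_indices(num_clips_in_video, num_sampled, rng):
    """Pick a sparse random subset of clip positions from a long video."""
    return sorted(rng.sample(range(num_clips_in_video), num_sampled))

def mask_clips(num_sampled, mask_ratio, rng):
    """Choose which of the sampled clips are hidden from the predictor.

    At least one clip is masked and at least one stays visible.
    """
    num_masked = int(round(num_sampled * mask_ratio))
    num_masked = max(1, min(num_sampled - 1, num_masked))
    return set(rng.sample(range(num_sampled), num_masked))

def predict_masked(features, masked):
    """Hypothetical stand-in predictor: the mean of the visible clip features.

    A real model would be a learned network over the set of clip features.
    """
    visible = [f for i, f in enumerate(features) if i not in masked]
    dim = len(features[0])
    mean = [sum(f[d] for f in visible) / len(visible) for d in range(dim)]
    return {i: mean for i in masked}

def masked_mse(features, predictions):
    """Mean squared error over the masked clips only (sparse loss)."""
    total, count = 0.0, 0
    for i, pred in predictions.items():
        for p, t in zip(pred, features[i]):
            total += (p - t) ** 2
            count += 1
    return total / count

if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed setup: 100 clips in the video, 8 sampled, 4-dim backbone features.
    positions = sample_clip_indices(100, 8, rng)
    features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in positions]
    masked = mask_clips(len(positions), 0.5, rng)
    loss = masked_mse(features, predict_masked(features, masked))
    print(positions, sorted(masked), loss)
```

Because only a handful of clips are sampled and only the masked ones contribute to the loss, the cost per training step in this sketch depends on the number of sampled clips rather than on the full video length.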

updated: Mon Mar 13 2023 10:31:11 GMT+0000 (UTC)

published: Wed Nov 30 2022 14:43:35 GMT+0000 (UTC)

arXiv
