Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations

Minghao Chen; Renbo Tu; Chenxi Huang; Yuqi Lin; Boxi Wu; Deng Cai

Previous work on action representation learning has focused on global representations of short video clips. In contrast, many practical applications, such as video alignment, require dense, frame-wise representations of long videos. In this paper, we introduce contrastive action representation learning (CARL), a framework for learning frame-wise action representations in a self-supervised or weakly supervised manner, especially for long videos. Specifically, we introduce a simple but effective video encoder that captures both spatial and temporal context by combining convolution with a Transformer. Inspired by recent progress in self-supervised learning, we propose a new sequence contrastive loss (SCL), applied to two correlated views obtained through a series of spatio-temporal data augmentations. The loss comes in two versions. The self-supervised version optimizes the embedding space by minimizing the KL divergence between the sequence similarity of the two augmented views and a prior Gaussian distribution over timestamp distance. The weakly supervised version uses video-level labels and dynamic time warping (DTW) to build additional sample pairs across videos. Experiments on the FineGym, PennAction, and Pouring datasets show that our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification, with faster inference. Surprisingly, although it is not trained on paired videos as previous methods are, our self-supervised version also performs strongly on video alignment and fine-grained frame retrieval.
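
To make the self-supervised objective more concrete, the following is a minimal sketch of a KL-divergence loss between a softmax over frame similarities and a Gaussian prior over timestamp distance. It is an illustration under stated assumptions, not the authors' implementation: the function name `sequence_contrastive_loss`, the use of cosine similarity, the temperature `tau`, the prior width `sigma`, and the direction of the KL term are all assumptions made here. It uses only the Python standard library for readability; a practical version would use a tensor library.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def sequence_contrastive_loss(z1, z2, t1, t2, sigma=1.0, tau=0.1):
    """Mean KL(prior || similarity) over the frames of view 1.

    z1, z2: per-frame embeddings of the two augmented views (lists of vectors).
    t1, t2: timestamps of those frames in the original video.
    sigma, tau: assumed hyperparameters (prior width, softmax temperature).
    """
    z1 = [_normalize(v) for v in z1]
    z2 = [_normalize(v) for v in z2]
    total = 0.0
    for u, tu in zip(z1, t1):
        # Predicted distribution: softmax of cosine similarity to view-2 frames.
        sims = [sum(a * b for a, b in zip(u, v)) / tau for v in z2]
        log_q = _log_softmax(sims)
        # Target distribution: normalized Gaussian in timestamp distance.
        prior = [-((tu - tv) ** 2) / (2.0 * sigma ** 2) for tv in t2]
        log_p = _log_softmax(prior)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(z1)


if __name__ == "__main__":
    # Toy embeddings that encode time as an angle on the unit circle.
    times = list(range(8))
    frames = [[math.cos(0.4 * t), math.sin(0.4 * t)] for t in times]
    print(sequence_contrastive_loss(frames, frames, times, times))
    print(sequence_contrastive_loss(frames, frames[::-1], times, times))
```

In this toy setup the second call pairs each frame with a temporally reversed view, so the similarity peaks no longer coincide with the timestamp prior and the loss should come out larger than in the first call.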

updated: Thu Mar 02 2023 04:44:43 GMT+0000 (UTC)

published: Tue Dec 06 2022 16:42:22 GMT+0000 (UTC)

arXiv
