Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

Jiangliu Wang; Jianbo Jiao; Linchao Bao; Shengfeng He; Wei Liu; Yun-hui Liu

This paper proposes a novel pretext task for self-supervised video representation learning. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to predict these statistical summaries from the video frames. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field and needs only an impression of rough spatial locations to understand that content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks: C3D, 3D-ResNet, R(2+1)D, and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
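As a rough illustration of how a motion statistic might be turned into a coarse training label, the sketch below takes a dense motion field for one frame pair and returns a grid-cell index together with a quantized direction. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `motion_label`, the uniform 4x4 grid, the 8 direction bins, and the use of the single largest motion vector are all hypothetical choices made for illustration. The abstract does not specify the partitioning patterns or how the dominant direction is computed.

```python
import math


def motion_label(flow, grid=4, bins=8):
    """Return (block_index, direction_bin) for the largest motion in `flow`.

    `flow` is an H x W nested list of (dx, dy) motion vectors.
    The grid size and number of direction bins are illustrative
    assumptions, not values taken from the paper.
    """
    h, w = len(flow), len(flow[0])

    # Find the pixel whose motion vector has the largest magnitude.
    best, by, bx = -1.0, 0, 0
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            mag = math.hypot(dx, dy)
            if mag > best:
                best, by, bx = mag, y, x

    # Rough spatial location: index of the grid cell containing that pixel,
    # numbered row by row from the top-left cell.
    row = min(by * grid // h, grid - 1)
    col = min(bx * grid // w, grid - 1)
    block = row * grid + col

    # Direction: quantize the angle of that vector into `bins` equal sectors.
    dx, dy = flow[by][bx]
    angle = math.atan2(dy, dx) % (2 * math.pi)
    direction = int(angle / (2 * math.pi / bins)) % bins
    return block, direction
```

Predicting a cell index and a direction bin turns both targets into classification problems, which is one way to read the abstract's point about encoding rough locations rather than exact coordinates.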

updated: Fri Jan 29 2021 02:41:22 GMT+0000 (UTC)

published: Mon Aug 31 2020 08:31:56 GMT+0000 (UTC)

arXiv
