Motion-Augmented Self-Training for Video Recognition at Smaller Scale

Kirill Gavrilyuk; Mihir Jain; Ilia Karmanov; Cees G. M. Snoek

The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than from appearance, we train our network using optical flow but avoid computing it during inference. We propose the first motion-augmented self-training regime, which we call MotionFit. We start with supervised training of a motion model on a small, labeled video collection. With the motion model we generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by training an appearance model to predict these pseudo-labels. Moreover, we introduce a multi-clip loss as a simple yet efficient way to improve the quality of the pseudo-labeling, even without additional auxiliary tasks. We also take into consideration the temporal granularity of videos during self-training of the appearance model, which previous works overlooked. As a result, we obtain a strong motion-augmented representation model suited for video downstream tasks such as action recognition and clip retrieval. On small-scale video datasets, MotionFit outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7%, and semi-supervised learning by 9%-18% using the same amount of class labels.
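The pseudo-labeling flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the multi-clip loss or how pseudo-labels are derived, so the sketch assumes that a pseudo-label is the argmax of the motion model's clip-averaged class probabilities, and that the multi-clip loss is the mean cross-entropy of the appearance model over several clips of the same video against that shared pseudo-label. The function names `pseudo_label` and `multi_clip_loss` and the toy scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(motion_scores_per_clip):
    """ASSUMPTION: average the motion model's class probabilities over
    the clips of one video and take the most likely class."""
    probs = [softmax(s) for s in motion_scores_per_clip]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])


def multi_clip_loss(appearance_scores_per_clip, label):
    """ASSUMPTION: mean cross-entropy of the appearance model over
    several clips of one video, all sharing the same pseudo-label."""
    losses = [-math.log(softmax(s)[label]) for s in appearance_scores_per_clip]
    return sum(losses) / len(losses)


# Toy example: two clips of one unlabeled video, three classes.
motion_scores = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]]      # from optical flow
appearance_scores = [[0.3, 0.1, 0.0], [0.2, 0.4, 0.1]]  # from RGB frames

label = pseudo_label(motion_scores)
loss = multi_clip_loss(appearance_scores, label)
```

In this sketch only the appearance model would receive gradients from `loss`, so optical flow is needed to produce the pseudo-labels during training but not at inference time, which matches the motivation stated in the abstract.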

updated: Tue May 04 2021 17:43:19 GMT+0000 (UTC)

published: Tue May 04 2021 17:43:19 GMT+0000 (UTC)

arXiv
