Unsupervised Visual Representation Learning by Tracking Patches in Video

Guangting Wang; Yizhou Zhou; Chong Luo; Wenxuan Xie; Wenjun Zeng; Zhiwei Xiong

Inspired by the fact that human eyes continue to develop tracking ability in early and middle childhood, we propose to use tracking as a proxy task for a computer vision system to learn visual representations. Modelled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations that help with video-related tasks. In the proposed pretraining framework, we cut an image patch from a given video and let it scale and move according to a pre-set trajectory. The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame. We discover that using multiple image patches simultaneously brings clear benefits. We further increase the difficulty of the game by randomly making patches invisible. Extensive experiments on mainstream benchmarks demonstrate the superior performance of CtP against other video pretraining methods. In addition, CtP-pretrained features are less sensitive to domain gaps than those trained by a supervised action recognition task. When both are trained on Kinetics-400, we are pleasantly surprised to find that the CtP-pretrained representation achieves much higher action classification accuracy than its fully supervised counterpart on the Something-Something dataset. Code is available online: github.com/microsoft/CtP.
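To make the proxy task described in the abstract concrete, the following is a minimal NumPy sketch of how one training sample could be synthesized: a patch is cut out, pasted into every frame along a trajectory while being rescaled, and the per-frame boxes become the regression targets. It is an illustration only and is not taken from the released CtP code. The function names, the (x, y, w, h) box convention, the choice of a linear trajectory, the nearest-neighbour resizing, and cutting the patch from the first frame are all assumptions made here; the multi-patch and random-invisibility variants mentioned in the abstract are left out.

```python
import numpy as np

def linear_trajectory(start_box, end_box, num_frames):
    """Interpolate (x, y, w, h) boxes linearly from start_box to end_box.

    A linear path is an assumption; the abstract only says "pre-set trajectory".
    """
    start = np.asarray(start_box, dtype=np.float64)
    end = np.asarray(end_box, dtype=np.float64)
    t = np.linspace(0.0, 1.0, num_frames)[:, None]  # shape (T, 1)
    return (1.0 - t) * start + t * end               # shape (T, 4)

def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (h, w, C) patch using NumPy indexing."""
    h, w = patch.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows[:, None], cols[None, :]]

def catch_the_patch_sample(video, start_box, end_box):
    """Paste a patch cut from frame 0 along a trajectory (hypothetical helper).

    video: (T, H, W, C) array. Boxes are (x, y, w, h) with a top-left origin
    and are assumed to stay inside the frame.
    Returns the modified video and the per-frame target boxes, shape (T, 4).
    """
    num_frames = video.shape[0]
    boxes = np.rint(linear_trajectory(start_box, end_box, num_frames)).astype(int)
    x0, y0, w0, h0 = boxes[0]
    patch = video[0, y0:y0 + h0, x0:x0 + w0].copy()
    out = video.copy()
    for t in range(num_frames):
        x, y, w, h = boxes[t]
        # Overwrite the box region with the patch rescaled to the box size.
        out[t, y:y + h, x:x + w] = resize_nearest(patch, h, w)
    return out, boxes

# Usage: a random 5-frame clip stands in for real video data.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(5, 32, 32, 3), dtype=np.uint8)
clip, targets = catch_the_patch_sample(video, (2, 3, 4, 4), (10, 8, 8, 6))
```

In a setup like this, a model would receive `clip` together with `targets[0]` and be trained to regress the remaining rows of `targets`, which mirrors the abstract's statement that only the first-frame bounding box is given.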

updated: Thu May 06 2021 09:46:42 GMT+0000 (UTC)

published: Thu May 06 2021 09:46:42 GMT+0000 (UTC)

arXiv
