Self-supervised learning of class embeddings from video

Olivia Wiles; A. Sophia Koepke; Andrew Zisserman

This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information. At train time, two frames are extracted from the same video of an object class (e.g. human upper body) and each is encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked with transforming one frame into the other. To successfully perform long-range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in another), we introduce a hierarchical probabilistic network decoder model. Once trained, the embedding can be used for a variety of downstream tasks and domains. We demonstrate our approach quantitatively on three distinct deformable object classes (human full bodies, upper bodies and faces) and show experimentally that the learned embeddings do indeed generalise. They achieve state-of-the-art performance in comparison to other self-supervised methods trained on the same datasets, and approach the performance of fully supervised methods.
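The training setup described in the abstract (two frames from one video, a shared encoder, a decoder conditioned on both embeddings) can be illustrated with a minimal NumPy sketch. This is not the authors' model: the hierarchical probabilistic decoder is not reproduced here, and a plain per-pixel offset field stands in for it as an assumption. All names (`sample_frame_pair`, `encode`, `decode_flow`, `warp`, `reconstruction_loss`), the linear layers, and the L1 loss are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frame_pair(video, rng):
    """Pick two distinct frames from one video of shape [T, H, W]."""
    i, j = rng.choice(len(video), size=2, replace=False)
    return video[i], video[j]

def encode(frame, W_enc):
    """Toy linear encoder (assumption): flatten the frame and project it."""
    return np.tanh(W_enc @ frame.ravel())

def decode_flow(z_src, z_tgt, W_dec, shape):
    """Toy decoder (assumption): map both embeddings to a (dy, dx) offset per pixel."""
    h, w = shape
    z = np.concatenate([z_src, z_tgt])
    return (W_dec @ z).reshape(2, h, w)

def warp(src, flow):
    """Nearest-neighbour warp: out[y, x] = src[y + dy, x + dx], clipped to the image."""
    h, w = src.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.rint(ys + flow[0]).astype(int), 0, h - 1)
    sx = np.clip(np.rint(xs + flow[1]).astype(int), 0, w - 1)
    return src[sy, sx]

def reconstruction_loss(video, W_enc, W_dec, rng):
    """L1 error between the warped source frame and the target frame."""
    src, tgt = sample_frame_pair(video, rng)
    z_src, z_tgt = encode(src, W_enc), encode(tgt, W_enc)
    flow = decode_flow(z_src, z_tgt, W_dec, src.shape)
    return float(np.abs(warp(src, flow) - tgt).mean())
```

In this sketch the only supervision is the second frame of the same video, which is what makes the setup self-supervised; after training, `encode` alone would be kept as the embedding for downstream tasks.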

updated: Mon Oct 28 2019 14:18:17 GMT+0000 (UTC)

published: Mon Oct 28 2019 14:18:17 GMT+0000 (UTC)

arXiv

