UniVIP: A Unified Framework for Self-Supervised Visual Pre-training

Zhaowen Li; Yousong Zhu; Fan Yang; Wei Li; Chaoyang Zhao; Yingying Chen; Zhiyang Chen; Jiahao Xie; Liwei Wu; Rui Zhao; Ming Tang; Jinqiao Wang

Self-supervised learning (SSL) holds promise for leveraging large amounts of unlabeled data. However, the success of popular SSL methods has been limited to single-centric-object images such as those in ImageNet, and these methods ignore the correlation between the scene and its instances, as well as the semantic differences among instances in the scene. To address these problems, we propose Unified Self-supervised Visual Pre-training (UniVIP), a novel self-supervised framework for learning versatile visual representations on either single-centric-object or non-iconic datasets. The framework considers representation learning at three levels: 1) scene-scene similarity, 2) scene-instance correlation, and 3) instance-instance discrimination. During learning, we adopt an optimal transport algorithm to automatically measure the discrimination of instances. Extensive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance on a variety of downstream tasks, such as image classification, semi-supervised learning, object detection, and segmentation. Furthermore, our method can also exploit single-centric-object datasets such as ImageNet: it outperforms BYOL by 2.5% in linear probing with the same number of pre-training epochs and surpasses current self-supervised object detection methods on the COCO dataset, demonstrating its universality and potential.
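As a rough illustration of how optimal transport can compare two sets of instance features, the sketch below runs Sinkhorn iterations on a cosine-distance cost matrix. It is not the authors' implementation: the abstract does not say which solver, cost, or marginals UniVIP uses, so the entropy-regularised Sinkhorn solver, the uniform marginals, the `eps` value, and all function names (`cosine`, `sinkhorn`, `ot_instance_similarity`) are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport via Sinkhorn iterations.

    Assumption: both marginals are uniform, i.e. every instance carries
    the same mass. Returns the transport plan as a list of rows.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # row marginal (instances from view A)
    b = [1.0 / m] * m  # column marginal (instances from view B)
    # Gibbs kernel: low cost -> large kernel value
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_instance_similarity(feats_a, feats_b, eps=0.1):
    """Match two sets of instance features and score their agreement.

    The cost is 1 - cosine similarity; the score is the plan-weighted
    sum of similarities (hypothetical scoring, chosen for illustration).
    """
    sim = [[cosine(x, y) for y in feats_b] for x in feats_a]
    cost = [[1.0 - s for s in row] for row in sim]
    plan = sinkhorn(cost, eps)
    score = sum(
        plan[i][j] * sim[i][j]
        for i in range(len(sim))
        for j in range(len(sim[0]))
    )
    return plan, score


# Toy features: two instances per view, roughly aligned in order
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.9, 0.1], [0.1, 0.9]]
plan, score = ot_instance_similarity(feats_a, feats_b)
```

In this toy case the plan places most of its mass on the pairs of instances with similar features, so matching instances contribute most of the score while dissimilar pairs are largely ignored.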

updated: Mon Mar 14 2022 10:04:04 GMT+0000 (UTC)

published: Mon Mar 14 2022 10:04:04 GMT+0000 (UTC)

arXiv
