Saliency Guided Contrastive Learning on Scene Images

Meilin Chen; Yizhou Wang; Shixiang Tang; Feng Zhu; Haiyang Yang; Lei Bai; Rui Zhao; Donglian Qi; Wanli Ouyang



Self-supervised learning holds promise for leveraging large amounts of unlabeled data. However, its success relies heavily on highly curated datasets such as ImageNet, which still require human cleaning. Learning representations directly from less-curated scene images is essential for pushing self-supervised learning to a higher level. Unlike curated images, which carry simple and clear semantic information, scene images are more complex and mosaic-like because they often contain complex scenes and multiple objects. Although learning from such images is feasible, recent work has largely overlooked discovering the most discriminative regions for contrastive learning of object representations in scene images. In this work, we leverage the saliency map derived from the model's output during learning to highlight these discriminative regions and guide the whole contrastive learning process. Specifically, the saliency map first guides the method to crop discriminative regions as positive pairs, and then reweights the contrastive losses among different crops by their saliency scores. Our method significantly improves the performance of self-supervised learning on scene images, by +1.1, +4.3, and +2.2 Top-1 accuracy on ImageNet linear evaluation and on semi-supervised learning with 1% and 10% ImageNet labels, respectively. We hope our insights on saliency maps can motivate future research on more general-purpose unsupervised representation learning from scene data.
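The abstract names two steps: saliency-guided cropping and saliency-based reweighting of the contrastive loss. It does not give their exact form, so the following is only a minimal NumPy sketch of one way these steps could look, not the authors' implementation. The function names (`sample_salient_crop`, `weighted_info_nce`), the window-sum sampling rule, the mean-saliency crop score, the InfoNCE loss form, and the temperature value are all assumptions made for illustration. The saliency map is taken as a given non-negative array; how it is derived from the model's output is not shown.

```python
import numpy as np


def sample_salient_crop(saliency, crop_size, rng):
    """Sample a crop window with probability proportional to the saliency it covers.

    Assumed rule for illustration; the source only says saliency guides cropping.
    Returns ((top, left, height, width), mean saliency inside the crop).
    """
    ch, cw = crop_size
    # Integral image: lets us read the saliency sum of every candidate window.
    integral = np.pad(saliency, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    sums = (integral[ch:, cw:] - integral[:-ch, cw:]
            - integral[ch:, :-cw] + integral[:-ch, :-cw])
    sums = np.clip(sums, 0.0, None)  # guard against tiny negative rounding errors
    total = sums.sum()
    if total > 0:
        probs = sums.ravel() / total
    else:
        # Flat (all-zero) saliency: fall back to uniform sampling.
        probs = np.full(sums.size, 1.0 / sums.size)
    idx = int(rng.choice(probs.size, p=probs))
    top, left = divmod(idx, sums.shape[1])
    score = float(sums[top, left] / (ch * cw))
    return (top, left, ch, cw), score


def weighted_info_nce(z_a, z_b, weights, temperature=0.2):
    """InfoNCE over positive pairs (z_a[i], z_b[i]), averaged with saliency weights.

    Assumed loss form for illustration; weights are normalised to sum to 1.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pair = -np.diag(log_prob)  # loss of each positive pair
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float((w * per_pair).sum())
```

In this sketch, each crop's mean saliency would be passed as its entry in `weights`, so that pairs cropped from more salient regions contribute more to the total loss.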

updated: Wed Feb 22 2023 15:54:07 GMT+0000 (UTC)

published: Wed Feb 22 2023 15:54:07 GMT+0000 (UTC)

arXiv

