Exploring Localization for Self-supervised Fine-grained Contrastive Learning

Di Wu; Siyuan Li; Zelin Zang; Stan Z. Li

Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success in various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios has not been fully explored. We point out that current contrastive methods are prone to memorizing background/foreground texture and are therefore limited in their ability to localize the foreground object. Our analysis suggests that learning to extract discriminative texture information and learning to localize are equally crucial for fine-grained self-supervised pre-training. Based on these findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps saliency regions of images as a novel form of view generation and then guides the model to localize foreground objects via a cross-view alignment loss. Extensive experiments on both small- and large-scale fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
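
To make the crop-and-swap idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a precomputed saliency map in [0, 1] with the same height and width as the image, a hypothetical fixed threshold of 0.5, and two images of identical shape; the function names saliency_bbox and crop_and_swap are illustrative only. The abstract does not specify the form of the cross-view alignment loss, so it is not sketched here.

```python
import numpy as np


def saliency_bbox(saliency, threshold=0.5):
    """Return (top, left, bottom, right) bounding the pixels with saliency >= threshold.

    The threshold value is an assumption made for this sketch.
    """
    mask = saliency >= threshold
    if not mask.any():
        # Nothing is salient: fall back to the full image.
        return 0, 0, saliency.shape[0], saliency.shape[1]
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1


def crop_and_swap(foreground_img, saliency, background_img, threshold=0.5):
    """Paste the salient crop of foreground_img onto a copy of background_img.

    Both images are assumed to share the same shape (H, W, C). The crop is
    pasted at its original location; the returned box records where the
    foreground object sits in the new view.
    """
    top, left, bottom, right = saliency_bbox(saliency, threshold)
    view = background_img.copy()
    view[top:bottom, left:right] = foreground_img[top:bottom, left:right]
    return view, (top, left, bottom, right)


# Toy usage: a salient 3x3 patch from an all-ones image moved onto an all-zeros image.
img_a = np.ones((8, 8, 3))
img_b = np.zeros((8, 8, 3))
sal = np.zeros((8, 8))
sal[2:5, 3:6] = 1.0
view, box = crop_and_swap(img_a, sal, img_b)
```

In this toy setup the new view keeps the foreground patch from the first image and the background from the second, so texture from the background alone no longer identifies the image; the recorded box is the kind of signal a localization-oriented loss could use.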

updated: Tue Oct 11 2022 06:31:41 GMT+0000 (UTC)

published: Wed Jun 30 2021 02:56:26 GMT+0000 (UTC)

arXiv
