Object-aware Contrastive Learning for Debiased Scene Representation

Sangwoo Mo; Hyunwoo Kang; Kihyuk Sohn; Chun-Liang Li; Jinwoo Shin

Contrastive self-supervised learning has shown impressive results in learning visual representations from unlabeled images by enforcing invariance against different data augmentations. However, the learned representations are often contextually biased toward spurious scene correlations, either between different objects or between an object and its background, which may harm their generalization on downstream tasks. To tackle this issue, we develop a novel object-aware contrastive learning framework that first (a) localizes objects in a self-supervised manner and then (b) debiases scene correlations via appropriate data augmentations that take the inferred object locations into account. For (a), we propose the contrastive class activation map (ContraCAM), which uses contrastively trained models to find the regions of an image (e.g., objects) that are most discriminative relative to other images. We further improve ContraCAM to detect multiple objects and entire shapes via an iterative refinement procedure. For (b), we introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases, respectively, during contrastive self-supervised learning. Our experiments demonstrate the effectiveness of our representation learning framework, particularly when it is trained on multi-object images or evaluated on background-shifted (and distribution-shifted) images.
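
The two augmentations named in (b) can be illustrated with a minimal sketch. The abstract does not give their exact formulas, so everything below is an assumption made for illustration rather than the authors' implementation: the object mask is taken to be a 2D grid of values in [0, 1] (standing in for a ContraCAM output), the crop is sampled inside the mask's bounding box, and the mixup blends only the non-object pixels with a second image. The function names, `threshold`, `min_scale`, and `lam` are all hypothetical. The code is kept dependency-free for readability; a real pipeline would more likely use array operations.

```python
import random


def object_bbox(mask, threshold=0.5):
    """Bounding box (top, left, bottom, right) of mask values above threshold.

    bottom and right are exclusive. Falls back to the full image if no value
    exceeds the threshold (assumed behavior, not from the paper).
    """
    rows = [i for i, row in enumerate(mask) if any(v > threshold for v in row)]
    cols = [j for j in range(len(mask[0])) if any(row[j] > threshold for row in mask)]
    if not rows:
        return 0, 0, len(mask), len(mask[0])
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def object_aware_random_crop(mask, rng, min_scale=0.5, threshold=0.5):
    """Sample a random crop box that lies inside the object's bounding box."""
    top, left, bottom, right = object_bbox(mask, threshold)
    box_h, box_w = bottom - top, right - left
    # Crop size is a random fraction of the box size, at least one pixel.
    crop_h = max(1, round(box_h * rng.uniform(min_scale, 1.0)))
    crop_w = max(1, round(box_w * rng.uniform(min_scale, 1.0)))
    # randint is inclusive on both ends, so the crop never leaves the box.
    y = rng.randint(top, bottom - crop_h)
    x = rng.randint(left, right - crop_w)
    return y, x, y + crop_h, x + crop_w


def background_mixup(image, background, mask, lam=0.5):
    """Keep object pixels; blend the remaining pixels with another image.

    Per pixel: m * p + (1 - m) * (lam * p + (1 - lam) * b), where m is the
    mask value, p the image pixel, and b the background pixel (assumed form).
    """
    return [
        [m * p + (1.0 - m) * (lam * p + (1.0 - lam) * b)
         for p, b, m in zip(img_row, bg_row, mask_row)]
        for img_row, bg_row, mask_row in zip(image, background, mask)
    ]
```

With a mask whose object occupies the central 2×2 block of a 4×4 grid, `object_bbox` returns `(1, 1, 3, 3)`, every sampled crop stays within that box, and `background_mixup` leaves the masked pixels untouched while blending the rest. Restricting the crop to one object's box is meant to keep co-occurring objects out of a positive pair, and replacing only the background is meant to weaken object-background correlations; both readings follow the abstract's description but the details here are guesses.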

updated: Fri Jul 30 2021 19:24:07 GMT+0000 (UTC)

published: Fri Jul 30 2021 19:24:07 GMT+0000 (UTC)

arXiv
