Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization

Shakeeb Murtaza; Soufiane Belharbi; Marco Pedersoli; Aydin Sarraf; Eric Granger

Drones are employed in a growing number of visual recognition applications. A recent development in cell tower inspection is drone-based asset surveillance, where the autonomous flight of a drone is guided by localizing objects of interest in successive aerial images. In this paper, we propose a method to train deep weakly-supervised object localization (WSOL) models based only on image-class labels to locate objects with high confidence. To train our localizer, pseudo-labels are efficiently harvested from self-supervised vision transformers (SSTs). However, since SSTs decompose the scene into multiple maps containing various object parts, and do not rely on any explicit supervisory signal, they cannot distinguish between the object of interest and other objects, as required for WSOL. To address this issue, we propose leveraging the multiple maps generated by the different transformer heads to acquire pseudo-labels for training a deep WSOL model. In particular, a new Discriminative Proposals Sampling (DiPS) method is introduced that relies on a CNN classifier to identify discriminative regions. Then, foreground and background pixels are sampled from these regions in order to train a WSOL model for generating activation maps that can accurately localize objects belonging to a specific class. Empirical results on the challenging TelDrone dataset indicate that our proposed approach can outperform state-of-the-art methods across a wide range of threshold values applied to the produced maps. We also computed results on the CUB dataset, showing that our method can be adapted for other tasks.
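The abstract describes the sampling idea only at a high level, so the following is a minimal sketch of how such a pipeline might look, not the authors' implementation. It assumes each transformer head yields one 2D map, that a proposal is obtained by thresholding a min-max normalized map, and that the classifier is available as a callable returning a score for the target class. The function names, the `class_score` callable, the threshold of 0.5, and the uniform random sampling are all hypothetical choices made for illustration.

```python
import numpy as np


def select_discriminative_proposal(image, head_maps, class_score, target_class,
                                   threshold=0.5):
    """Return the binarized head map that the classifier scores highest.

    image: (H, W, C) array. head_maps: iterable of (H, W) arrays.
    class_score: hypothetical callable (masked_image, target_class) -> float.
    """
    best_mask, best_score = None, -np.inf
    for head_map in head_maps:
        # Min-max normalize the map, then threshold it into a binary proposal.
        lo, hi = head_map.min(), head_map.max()
        mask = (head_map - lo) / (hi - lo + 1e-8) >= threshold
        if not mask.any():
            continue
        # Score the image with everything outside the proposal zeroed out.
        score = class_score(image * mask[..., None], target_class)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score


def _pick(coords, n, rng):
    """Choose up to n rows of coords uniformly at random, without replacement."""
    n = min(n, len(coords))
    if n == 0:
        return coords[:0]
    return coords[rng.choice(len(coords), size=n, replace=False)]


def sample_pseudo_labels(mask, n_fg, n_bg, rng):
    """Sample foreground pixels inside the mask and background pixels outside."""
    fg = _pick(np.argwhere(mask), n_fg, rng)
    bg = _pick(np.argwhere(~mask), n_bg, rng)
    return fg, bg
```

In this sketch the returned `(row, col)` coordinates would serve as pixel-level pseudo-labels for training the localizer. How the paper actually ranks proposals and draws pixels may differ from the uniform sampling assumed here.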

updated: Sun Nov 20 2022 02:55:22 GMT+0000 (UTC)

published: Fri Sep 09 2022 18:33:23 GMT+0000 (UTC)

arXiv
