Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut

Yangtao Wang; Xi Shen; Shell Hu; Yuan Yuan; James Crowley; Dominique Vaufreydaz

Transformers trained with self-supervised learning using a self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses self-supervised transformer features to discover an object in an image. Visual tokens are viewed as nodes in a weighted graph, with edges representing a connectivity score based on the similarity of tokens. Foreground objects can then be segmented using a normalized graph cut that groups self-similar regions. We solve the graph-cut problem using spectral clustering with generalized eigen-decomposition and show that the eigenvector associated with the second smallest eigenvalue provides a cutting solution, since its absolute value indicates the likelihood that a token belongs to a foreground object. Despite its simplicity, this approach significantly boosts the performance of unsupervised object discovery: we improve over the recent state-of-the-art method LOST by margins of 6.9%, 8.1%, and 8.1% on VOC07, VOC12, and COCO20K, respectively. The performance can be further improved by adding a second-stage class-agnostic detector (CAD). Our proposed method can be easily extended to unsupervised saliency detection and weakly supervised object detection. For unsupervised saliency detection, we improve IoU by 4.9%, 5.2%, and 12.9% on ECSSD, DUTS, and DUT-OMRON, respectively, compared to the previous state of the art. For weakly supervised object detection, we achieve competitive performance on CUB and ImageNet.
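The graph construction and cut described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `ncut_foreground`, the binary edge rule with threshold `tau` and floor weight `eps`, their default values, and the split at the mean of the eigenvector are all assumptions made for illustration; the abstract only specifies similarity-based edges, a generalized eigen-decomposition, and the use of the second smallest eigenvector. The generalized problem (D - W) y = λ D y is solved here by the standard substitution y = D^{-1/2} z, which turns it into an ordinary symmetric eigenproblem.

```python
import numpy as np


def ncut_foreground(features, tau=0.2, eps=1e-5):
    """Bipartition tokens with a normalized cut.

    features: (N, C) array of token features (hypothetical input).
    Returns a boolean array of length N, True for tokens assigned to the foreground.
    """
    # Cosine similarity between L2-normalized token features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T

    # Assumed edge rule: weight 1 for similar tokens, a tiny weight eps
    # otherwise so that the graph stays fully connected.
    W = np.where(sim > tau, 1.0, eps)
    d = W.sum(axis=1)  # node degrees (diagonal of D)

    # Solve (D - W) y = lambda * D y through the symmetric form
    # D^{-1/2} (D - W) D^{-1/2} z = lambda * z, with y = D^{-1/2} z.
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    y = d_inv_sqrt * eigvecs[:, 1]      # second smallest eigenvector

    # Assumed split: cut at the mean, then call "foreground" the side
    # that contains the token with the largest absolute value of y.
    part = y > y.mean()
    seed = np.argmax(np.abs(y))
    return part if part[seed] else ~part
```

For a ViT that produces an H x W grid of patch tokens, the returned mask could be reshaped to `(H, W)` and the bounding box of its connected foreground region taken as the discovered object; that post-processing step is likewise an assumption, not something stated in the abstract.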

updated: Wed Feb 23 2022 14:27:36 GMT+0000 (UTC)

published: Wed Feb 23 2022 14:27:36 GMT+0000 (UTC)

arXiv
