Emergent Correspondence from Image Diffusion

Luming Tang; Menglin Jia; Qianqian Wang; Cheng Perng Phoo; Bharath Hariharan

Finding correspondences between images is a fundamental problem in computer vision. In this paper, we show that correspondence emerges in image diffusion models without any explicit supervision. We propose a simple strategy to extract this implicit knowledge out of diffusion networks as image features, namely DIffusion FeaTures (DIFT), and use them to establish correspondences between real images. Without any additional fine-tuning or supervision on task-specific data or annotations, DIFT is able to outperform both weakly supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences. Particularly for semantic correspondence, DIFT from Stable Diffusion is able to outperform DINO and OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k benchmark. It even outperforms the state-of-the-art supervised methods on 9 out of 18 categories while remaining on par in overall performance. Project page: https://diffusionfeatures.github.io
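The abstract describes extracting features from a diffusion network and then using them to establish correspondences between real images. The sketch below illustrates only the second step, under stated assumptions: dense feature maps of shape (C, H, W) are taken to be already extracted (the extraction from the diffusion network is not shown), and a source point is matched to the target location with the highest cosine similarity. The function name `match_points` is hypothetical, and nearest-neighbour search under cosine similarity is an assumed, common choice rather than a confirmed description of the paper's exact procedure.

```python
import numpy as np


def match_points(feat_src, feat_tgt, points_src):
    """Hypothetical helper: nearest-neighbour matching under cosine similarity.

    feat_src, feat_tgt: dense feature maps of shape (C, H, W), assumed to be
    already extracted (e.g. from an intermediate diffusion network layer).
    points_src: list of integer (row, col) pixel coordinates in the source map.
    Returns the matched (row, col) coordinates in the target map.
    """
    C, H, W = feat_tgt.shape
    # Flatten target features to (H*W, C) and L2-normalise each pixel's vector.
    tgt = feat_tgt.reshape(C, -1).T
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)

    matches = []
    for r, c in points_src:
        # L2-normalise the query feature at the source point.
        q = feat_src[:, r, c]
        q = q / (np.linalg.norm(q) + 1e-8)
        sims = tgt @ q              # cosine similarity to every target pixel
        idx = int(np.argmax(sims))  # flat index of the best-scoring location
        matches.append(divmod(idx, W))  # row-major flat index -> (row, col)
    return matches
```

As a toy usage check, if the target features are the source features shifted by one row and two columns (for example with `np.roll`), each source point should be matched to the correspondingly shifted location. Real diffusion features are far noisier than this synthetic case, so the sketch only demonstrates the matching logic, not the reported accuracy.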

updated: Tue Jun 06 2023 17:33:19 GMT+0000 (UTC)

published: Tue Jun 06 2023 17:33:19 GMT+0000 (UTC)

arXiv
