Correlational Image Modeling for Self-Supervised Visual Pre-Training

Wei Li; Jiahao Xie; Chen Change Loy

We introduce Correlational Image Modeling (CIM), a novel and surprisingly effective approach to self-supervised visual pre-training. Our CIM performs a simple pretext task: we randomly crop image regions (exemplars) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we consider cropping image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target encoders. During pre-training, the former takes exemplars as inputs while the latter converts the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars provide the keys and values. We show that CIM performs on par with or better than the current state of the art on self-supervised and transfer benchmarks.
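The pretext task described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation. It assumes that the target correlation map is a binary mask marking where the exemplar was cropped, that random feature vectors stand in for the outputs of the online and target encoders, and that a single-head attention block followed by a random linear head produces the prediction. The names `crop_exemplar`, `cross_attention`, and `head` are hypothetical, and only axis-aligned crops are shown (no rotations or other transformations).

```python
import math
import random


def crop_exemplar(height, width, rng):
    """Pick a random axis-aligned box and build its binary target map."""
    h = rng.randint(1, height)
    w = rng.randint(1, width)
    top = rng.randint(0, height - h)
    left = rng.randint(0, width - w)
    # Assumption: the target is 1 inside the cropped region and 0 elsewhere.
    target = [[1 if top <= r < top + h and left <= c < left + w else 0
               for c in range(width)] for r in range(height)]
    return (top, left, h, w), target


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends to all keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(wt * v[d] for wt, v in zip(weights, values))
                    for d in range(dim)])
    return out


rng = random.Random(0)
H, W, D = 4, 4, 8  # toy grid of 4x4 tokens with 8-dimensional features

# Stand-in for encoded context tokens (one feature vector per grid cell).
context = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H * W)]

# Crop an exemplar; its tokens are reused from the context purely for illustration.
(top, left, h, w), target = crop_exemplar(H, W, rng)
exemplar = [context[r * W + c]
            for r in range(top, top + h) for c in range(left, left + w)]

# Context tokens act as queries; exemplar tokens supply keys and values.
attended = cross_attention(context, exemplar, exemplar)

# Hypothetical linear head plus sigmoid: one correlation score per context token.
head = [rng.gauss(0, 1) for _ in range(D)]
pred = [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(vec, head))))
        for vec in attended]
```

Here `pred` holds one score in (0, 1) per context token and can be compared against the flattened `target` with a loss such as binary cross-entropy. In an actual training setup the exemplar and the context would pass through separate encoders rather than sharing raw features.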

updated: Thu Mar 23 2023 05:41:37 GMT+0000 (UTC)

published: Wed Mar 22 2023 15:48:23 GMT+0000 (UTC)

arXiv
