Siamese Image Modeling for Self-Supervised Vision Representation Learning

Chenxin Tao; Xizhou Zhu; Gao Huang; Yu Qiao; Xiaogang Wang; Jifeng Dai

Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two mainstream SSL frameworks have been proposed: Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together the representations of different views of the same image while avoiding feature collapse. It does well on linear probing but is inferior in detection performance. MIM, on the other hand, reconstructs the original content given a masked image. It excels at dense prediction but fails to perform well on linear probing. These differences arise because each framework neglects one of two representation requirements: semantic alignment or spatial sensitivity. Specifically, we observe that (1) semantic alignment demands that semantically similar views be projected into nearby representations, which can be achieved by contrasting different views with strong augmentations; (2) spatial sensitivity requires modeling the local structure within an image. Predicting dense representations from a masked image is therefore beneficial because it models the conditional distribution of image content. Driven by these analyses, we propose Siamese Image Modeling (SIM), which predicts the dense representations of an augmented view based on another masked view of the same image with different augmentations. Our method uses a Siamese network with two branches. The online branch encodes the first view and predicts the second view's representation according to the relative positions between the two views. The target branch produces the target by encoding the second view. In this way, we achieve linear probing and dense prediction performance comparable to ID and MIM, respectively. We also demonstrate that decent linear probing results can be obtained without a global loss. Code will be released at https://github.com/fundamentalvision/Siamese-Image-Modeling.
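The two-branch idea described in the abstract can be illustrated with a small NumPy sketch. It is not the released implementation. It only shows three ingredients the abstract names: masking patches of the first view, expressing the second view's patch positions relative to the first view, and comparing dense predictions against dense targets. The function names, the crop-box format `(x, y, w, h)`, the masking ratio, and the cosine-based loss are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def patch_centers(box, grid):
    """Centers of a grid x grid patch layout inside box=(x, y, w, h), in image coordinates."""
    x, y, w, h = box
    steps = (np.arange(grid) + 0.5) / grid
    cx = x + steps * w
    cy = y + steps * h
    # Row-major order: x varies fastest, one (x, y) pair per patch.
    return np.stack(np.meshgrid(cx, cy, indexing="xy"), axis=-1).reshape(-1, 2)

def relative_positions(box_a, box_b, grid):
    """Patch centers of view B in view A's normalized frame (0..1 lies inside A)."""
    xa, ya, wa, ha = box_a
    centers_b = patch_centers(box_b, grid)
    return (centers_b - np.array([xa, ya])) / np.array([wa, ha])

def random_mask(num_patches, ratio, rng):
    """Boolean mask, True = patch hidden from the online branch (ratio is hypothetical)."""
    hidden = rng.permutation(num_patches)[: int(num_patches * ratio)]
    mask = np.zeros(num_patches, dtype=bool)
    mask[hidden] = True
    return mask

def dense_cosine_loss(pred, target):
    """Mean (1 - cosine similarity) over patches; a stand-in, not the paper's loss."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))
```

In a full pipeline under these assumptions, an online encoder would see only the unmasked patches of the first view, a predictor would be queried at the positions returned by `relative_positions`, and `dense_cosine_loss` would compare its output with the target branch's per-patch encoding of the second view.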

updated: Tue Jul 05 2022 09:31:25 GMT+0000 (UTC)

published: Thu Jun 02 2022 17:59:58 GMT+0000 (UTC)

arXiv
