Siamese Image Modeling for Self-Supervised Vision Representation Learning

Chenxin Tao; Xizhou Zhu; Weijie Su; Gao Huang; Bin Li; Jie Zhou; Yu Qiao; Xiaogang Wang; Jifeng Dai

Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two mainstream SSL frameworks have been proposed: Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together representations from different views of the same image while avoiding feature collapse. However, it lacks spatial sensitivity, which requires modeling the local structure within each image. MIM, on the other hand, reconstructs the original content given a masked image. However, it lacks good semantic alignment, which requires projecting semantically similar views into nearby representations. To address this dilemma, we observe that (1) semantic alignment can be achieved by matching different image views with strong augmentations; (2) spatial sensitivity can benefit from predicting dense representations with masked images. Driven by these analyses, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view based on another masked view of the same image with different augmentations. SiameseIM uses a Siamese network with two branches. The online branch encodes the first view and predicts the second view's representation according to the relative positions between the two views. The target branch produces the target by encoding the second view. SiameseIM can surpass both ID and MIM on a wide range of downstream tasks, including ImageNet finetuning and linear probing, COCO and LVIS detection, and ADE20k semantic segmentation. The improvement is more significant in few-shot, long-tail and robustness-concerned scenarios. Code will be released at https://github.com/fundamentalvision/Siamese-Image-Modeling.
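The abstract says the online branch predicts the second view's representation "according to the relative positions between these two views" from a masked first view. As a rough illustration of what that geometry could look like, the sketch below expresses each patch center of the second view in the first view's normalized coordinate frame and picks a random subset of visible patches. Everything here is an assumption made for illustration: the names `Crop`, `patch_centers`, `relative_positions` and `random_mask` are hypothetical, the views are plain axis-aligned crops (flips and other augmentations are ignored), and the abstract does not specify how SiameseIM actually encodes positions or chooses the mask.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Crop:
    """Axis-aligned crop in source-image pixel coordinates (hypothetical helper)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def patch_centers(crop, grid):
    """Centers of a grid x grid patch layout, in source-image coordinates."""
    return [
        (crop.x + (j + 0.5) * crop.w / grid, crop.y + (i + 0.5) * crop.h / grid)
        for i in range(grid)
        for j in range(grid)
    ]


def relative_positions(view_a, view_b, grid):
    """Express each patch center of view_b in view_a's normalized frame.

    (0, 0) is view_a's top-left corner and (1, 1) its bottom-right corner.
    Values outside [0, 1] mean the patch lies outside view_a.
    """
    return [
        ((cx - view_a.x) / view_a.w, (cy - view_a.y) / view_a.h)
        for cx, cy in patch_centers(view_b, grid)
    ]


def random_mask(num_patches, mask_ratio, rng):
    """Sorted indices of patches left visible after random masking (assumed scheme)."""
    num_keep = max(1, round(num_patches * (1 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))


# Two overlapping 100x100 crops of the same image, each split into 2x2 patches.
view_a = Crop(x=0, y=0, w=100, h=100)      # first view: masked, fed to the online branch
view_b = Crop(x=50, y=50, w=100, h=100)    # second view: fed to the target branch

# Where the online branch would be asked to predict, in view_a's frame.
queries = relative_positions(view_a, view_b, grid=2)

# Which of view_a's four patches the online encoder would see.
visible = random_mask(num_patches=4, mask_ratio=0.5, rng=random.Random(0))
```

With these two crops, `queries` is `[(0.75, 0.75), (1.25, 0.75), (0.75, 1.25), (1.25, 1.25)]`: only the first patch of the second view falls inside the first view, so the other three would have to be predicted for regions the online branch never observed. In a full model, such positions would condition a decoder whose dense output is compared against the target branch's features for the second view.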

updated: Wed Nov 16 2022 14:45:30 GMT+0000 (UTC)

published: Thu Jun 02 2022 17:59:58 GMT+0000 (UTC)

arXiv
