Self-Supervised Learning from Non-Object Centric Images with a Geometric Transformation Sensitive Architecture

Taeho Kim; Jong-Min Lee

Most invariance-based self-supervised methods pretrain on single object-centric images (e.g., ImageNet images) and learn representations that are invariant to geometric transformations. However, when images are not object-centric, cropping can significantly alter the semantics of the image. Furthermore, as the model becomes insensitive to geometric transformations, it may struggle to capture location information. For this reason, we propose a Geometric Transformation Sensitive Architecture designed to learn features that are sensitive to geometric transformations, specifically four-fold rotation, random crop, and multi-crop. Our method encourages the student to be sensitive to these transformations by giving it targets that are themselves sensitive to them, obtained by pooling and rotating the teacher feature map, and by having it predict rotation. Additionally, because training to be insensitive to multi-crop encourages local-to-global correspondence, the model can capture long-term dependencies. Instead of enforcing correspondence between views of the image, we use a patch correspondence loss to encourage correspondence between patches with similar features. This approach allows us to capture long-term dependencies in a more appropriate way. When non-object-centric images are used as pretraining data, our approach improves on methods that learn geometric transformation-insensitive representations. We surpass the DINO baseline on image classification, semantic segmentation, detection, and instance segmentation, with improvements of 4.9 Top-1 Acc, 3.3 mIoU, 3.4 AP^b, and 2.7 AP^m, respectively. Code and pretrained models are publicly available at: https://github.com/bok3948/GTSA
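
The idea of building a transformation-sensitive target by pooling and rotating the teacher feature map can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function name `make_sensitive_target`, the box format, the use of average pooling, and the array layout are all assumptions made for illustration: the target for a cropped, rotated student view is taken from the matching region of the teacher's feature map and then rotated by the same multiple of 90 degrees.

```python
import numpy as np

def make_sensitive_target(teacher_map, box, k, out_size):
    """Hypothetical helper: build a crop- and rotation-sensitive target.

    teacher_map: array of shape (H, W, C), features of the full image.
    box: (top, left, height, width) of the student's crop, in feature-map cells.
    k: number of 90-degree rotations applied to the student's view.
    out_size: side length of the target grid (assumed to divide height and width).
    """
    top, left, h, w = box
    # Select the part of the teacher map that the student's crop covers.
    region = teacher_map[top:top + h, left:left + w]
    # Average-pool the region down to an out_size x out_size grid.
    ph, pw = h // out_size, w // out_size
    pooled = region[:ph * out_size, :pw * out_size]
    pooled = pooled.reshape(out_size, ph, out_size, pw, -1).mean(axis=(1, 3))
    # Rotate the spatial grid so the target follows the student's rotation.
    return np.rot90(pooled, k=k, axes=(0, 1))

# Toy usage: an 8x8 feature map with 2 channels, a 4x4 crop, one 90-degree turn.
teacher_map = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
target = make_sensitive_target(teacher_map, box=(0, 0, 4, 4), k=1, out_size=2)
print(target.shape)  # (2, 2, 2)
```

Because the target changes whenever the crop box or the rotation changes, a student trained to match it cannot discard position or orientation information, which is the contrast with invariance-based targets that the abstract draws.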

updated: Mon May 08 2023 12:54:16 GMT+0000 (UTC)

published: Mon Apr 17 2023 06:32:37 GMT+0000 (UTC)

arXiv
