Self-Supervised Learning from Non-Object Centric Images with a Geometric Transformation Sensitive Architecture

Taeho Kim; Jong-Min Lee

Most invariance-based self-supervised methods rely on single object-centric images (e.g., ImageNet images) for pretraining and learn features that are invariant to geometric transformations. However, when images are not object-centric, cropping can significantly alter the semantics of the image. Furthermore, as the model becomes insensitive to geometric transformations, it may struggle to capture location information. For this reason, we propose a Geometric Transformation Sensitive Architecture designed to be sensitive to geometric transformations, specifically four-fold rotation, random crop, and multi-crop. Our method encourages the student to be sensitive in two ways: it predicts the rotation, and it uses targets that vary with those transformations, obtained by pooling and rotating the teacher feature map. Additionally, we use a patch correspondence loss to encourage correspondence between patches with similar features. This captures long-term dependencies in a more appropriate way than encouraging local-to-global correspondence, which is what happens when a model learns to be insensitive to multi-crop. Our approach demonstrates improved performance when using non-object-centric images as pretraining data compared to other methods that train the model to be insensitive to geometric transformations. We surpass the DINO [Caron et al., 2021b] baseline on image classification, semantic segmentation, detection, and instance segmentation, with improvements of 4.9 Top-1 Acc, 3.3 mIoU, 3.4 AP^b, and 2.7 AP^m, respectively. Code and pretrained models are publicly available at: https://github.com/bok3948/GTSA
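
To make the idea of a transformation-dependent target concrete, here is a minimal NumPy sketch of pooling and rotating a teacher feature map so that the target changes with the student's crop and rotation. It is an illustration under assumptions, not the authors' implementation: the function name `make_target`, the `box`, `k`, and `out_size` parameters, and the choice of average pooling are all hypothetical, and the crop is assumed to be at least `out_size` cells on each side.

```python
import numpy as np

def make_target(teacher_map, box, k, out_size):
    """Build a crop- and rotation-dependent target (hypothetical helper).

    teacher_map: array of shape (H, W, C), the teacher's feature grid.
    box: (top, left, height, width) of the student's crop, in feature cells.
    k: number of 90-degree counter-clockwise rotations applied to the student view.
    out_size: side length of the pooled target grid (assumed <= height and width).
    """
    top, left, h, w = box
    region = teacher_map[top:top + h, left:left + w]

    # Average-pool the cropped region onto an out_size x out_size grid.
    rows = np.array_split(np.arange(h), out_size)
    cols = np.array_split(np.arange(w), out_size)
    pooled = np.stack([
        np.stack([region[r][:, c].mean(axis=(0, 1)) for c in cols])
        for r in rows
    ])

    # Rotate the pooled grid the same way the student view was rotated,
    # so the target varies with the transformation instead of ignoring it.
    return np.rot90(pooled, k=k, axes=(0, 1))


# Usage: a 4x4 single-channel teacher map, full-image crop, one 90-degree rotation.
teacher = np.arange(16, dtype=float).reshape(4, 4, 1)
target = make_target(teacher, box=(0, 0, 4, 4), k=1, out_size=2)
rotation_label = 1  # the value of k, which a rotation-prediction head would classify
```

In this sketch the same `k` serves both as the rotation applied to the target and as the class label for the rotation-prediction task, which is one simple way to tie the two signals described in the abstract together.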

updated: Thu May 11 2023 11:02:47 GMT+0000 (UTC)

published: Mon Apr 17 2023 06:32:37 GMT+0000 (UTC)

arXiv
