MIMIC: Masked Image Modeling with Image Correspondences

Kalyani Marathe; Mahtab Bigverdi; Nishat Khan; Tuhin Kundu; Aniruddha Kembhavi; Linda G. Shapiro; Ranjay Krishna

Many pixelwise dense prediction tasks in computer vision today, such as depth estimation and semantic segmentation, rely on pretrained image representations. Therefore, curating effective pretraining datasets is vital. Unfortunately, effective pretraining datasets are those with multi-view scenes, and so far they have only been curated using annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a dataset-curation mechanism that does not require any annotations. We mine two datasets from open-sourced video datasets and synthetic 3D environments: MIMIC-1M with 1.3M multi-view image pairs and MIMIC-3M with 3.1M multi-view image pairs. We train multiple self-supervised models with different masked image modeling objectives to showcase the following findings. Representations trained on MIMIC-3M outperform those trained on datasets mined using annotations on multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also outperform when representations are frozen and when downstream training data is limited to few-shot. The larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can scale arbitrarily to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.

updated: Tue Jun 27 2023 00:40:12 GMT+0000 (UTC)

published: Tue Jun 27 2023 00:40:12 GMT+0000 (UTC)

arXiv
