Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations

Vibashan VS; Ning Yu; Chen Xing; Can Qin; Mingfei Gao; Juan Carlos Niebles; Vishal M. Patel; Ran Xu

Existing instance segmentation models learn task-specific information using manual mask annotations from base (training) categories. These mask annotations require tremendous human effort, which limits how readily novel (new) categories can be annotated. To alleviate this problem, Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories. In summary, an OV method learns task-specific information using strong supervision from base annotations and novel category information using weak supervision from image-caption pairs. This difference between strong and weak supervision leads to overfitting on base categories, resulting in poor generalization to novel categories. In this work, we overcome this issue by learning both base and novel categories in a weakly supervised manner from pseudo-mask annotations generated by a vision-language model, using our proposed Mask-free OVIS pipeline. Our method automatically generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs. The generated pseudo-mask annotations are then used to supervise an instance segmentation model, freeing the entire pipeline from labor-intensive instance-level annotations and from overfitting. Our extensive experiments show that our method, trained with just pseudo-masks, significantly improves the mAP scores on the MS-COCO and OpenImages datasets compared to recent state-of-the-art methods trained with manual masks. Code and models are available at https://vibashan.github.io/ovis-web/.
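
The abstract does not spell out how a vision-language model's localization signal becomes a pseudo-mask, so the following is only a minimal sketch of one simple way such a step could look, not the authors' pipeline. It assumes a hypothetical 2-D activation map for one caption word (for example, a cross-attention or Grad-CAM-style score grid) and turns it into a binary mask and a tight box by min-max normalization and thresholding. The function name, the `threshold` parameter, and the toy values are all illustrative assumptions.

```python
def pseudo_mask_from_activation(activation, threshold=0.5):
    """Threshold a per-word activation map into a binary pseudo-mask and box.

    `activation` is a hypothetical 2-D grid (list of rows) of scores for one
    caption word; it is an assumption of this sketch, not an output format
    taken from the paper. `threshold` is expected to lie in (0, 1].
    Returns (mask, box), where box is (x_min, y_min, x_max, y_max) or None
    when the map is constant and nothing can be localized.
    """
    rows = [[float(v) for v in row] for row in activation]
    flat = [v for row in rows for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat map carries no localization signal.
        return [[False] * len(row) for row in rows], None

    # Min-max normalize to [0, 1], then keep cells at or above the threshold.
    mask = [[(v - lo) / (hi - lo) >= threshold for v in row] for row in rows]

    # Tight bounding box around the active cells.
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, on in enumerate(row) if on]
    box = (min(xs), min(ys), max(xs), max(ys))
    return mask, box


# Toy activation map with a bright 2x2 region (illustrative values only).
toy_activation = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.9, 1.0],
    [0.0, 0.1, 0.8, 0.7],
    [0.0, 0.0, 0.0, 0.0],
]
toy_mask, toy_box = pseudo_mask_from_activation(toy_activation)
print(toy_box)
```

In a pipeline like the one the abstract describes, masks produced this way for every object word in a caption would serve as the training targets for the instance segmentation model in place of manual annotations; a real system would likely refine the thresholded region further.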

updated: Wed Mar 29 2023 17:58:39 GMT+0000 (UTC)

published: Wed Mar 29 2023 17:58:39 GMT+0000 (UTC)

arXiv
