GroupViT: Semantic Segmentation Emerges from Text Supervision

Jiarui Xu; Shalini De Mello; Sifei Liu; Wonmin Byeon; Thomas Breuel; Jan Kautz; Xiaolong Wang

Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring the grouping mechanism back into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 dataset and 22.4% mIoU on the PASCAL Context dataset, and performs competitively with state-of-the-art transfer-learning methods that require greater levels of supervision. We open-source our code at https://github.com/NVlabs/GroupViT.
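The abstract states only that GroupViT and the text encoder are trained jointly "via contrastive losses" and does not give the formulation. As a rough illustration of what an image-text contrastive objective can look like, the following is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired embeddings. The function names, the temperature value of 0.07, and the use of plain Python lists are all assumptions made for illustration; this is not taken from the GroupViT code base.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical symmetric image-text contrastive loss.

    image_embs[k] and text_embs[k] are assumed to come from the same
    image-caption pair; every other pairing in the batch is a negative.
    The temperature default is an assumed value, not one from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image: the same, taken over the columns of the logit matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy batch: four mutually orthogonal "embeddings".
    embs = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print("matched pairs:   ", symmetric_contrastive_loss(embs, embs))
    print("mismatched pairs:", symmetric_contrastive_loss(embs, embs[1:] + embs[:1]))
```

In this toy setting the loss is close to zero when each image embedding lines up with its own caption embedding, and large when the captions are shifted by one position. That is the signal a contrastive objective of this kind uses to pull matching image-text pairs together.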

updated: Thu May 19 2022 00:43:22 GMT+0000 (UTC)

published: Tue Feb 22 2022 18:56:04 GMT+0000 (UTC)

arXiv
