Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Qihang Yu; Ju He; Xueqing Deng; Xiaohui Shen; Liang-Chieh Chen

Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also yields a remarkably better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes, respectively. Additionally, training and testing with FC-CLIP are 7.5x and 6.6x faster, respectively, than with the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code is available at https://github.com/bytedance/fc-clip
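
The abstract does not spell out how a predicted mask is turned into a category, so the following is only a minimal sketch of one plausible reading of the single-stage idea: features are extracted once from a shared backbone, averaged inside each predicted mask, and compared against text embeddings by cosine similarity. The mask-pooling step, the toy feature values, and the names mask_pool, cosine, and classify_mask are assumptions made for illustration; they are not taken from the FC-CLIP code.

```python
import math


def mask_pool(features, mask):
    """Average the per-pixel feature vectors inside a binary mask.

    features: H x W x C nested lists (stand-in for frozen backbone output).
    mask:     H x W nested lists of 0/1 (stand-in for a predicted mask).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask is empty")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_mask(features, mask, text_embeddings):
    """Return the category whose text embedding best matches the pooled mask feature."""
    pooled = mask_pool(features, mask)
    scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get)


# Toy 2 x 2 feature map with 2 channels (hypothetical values, not real CLIP features).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
mask_top = [[1, 1], [0, 0]]
mask_bottom = [[0, 0], [1, 1]]

# Hypothetical text embeddings; the category list can change at test time
# without touching the image features.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

top_label = classify_mask(features, mask_top, text_embeddings)
bottom_label = classify_mask(features, mask_bottom, text_embeddings)
```

In this sketch the same feature map serves both masks, which mirrors the point about avoiding repeated feature extraction: only the cheap pooling and similarity steps run once per mask.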

updated: Tue Nov 14 2023 19:10:49 GMT+0000 (UTC)

published: Fri Aug 04 2023 17:59:01 GMT+0000 (UTC)

arXiv
