Global Knowledge Calibration for Fast Open-Vocabulary Segmentation

Kunyang Han; Yong Liu; Jun Hao Liew; Henghui Ding; Yunchao Wei; Jiajun Liu; Yitong Wang; Yansong Tang; Yujiu Yang; Jiashi Feng; Yao Zhao

Recent advances in pre-trained vision-language models such as CLIP have enabled the segmentation of arbitrary concepts from textual inputs alone, a task commonly referred to as open-vocabulary semantic segmentation (OVS). However, existing OVS techniques face a fundamental challenge: the trained classifier tends to overfit to the base classes observed during training, resulting in suboptimal generalization to unseen classes. To mitigate this issue, recent studies have proposed using an additional frozen pre-trained CLIP for classification. This approach, however, incurs heavy computational overhead because the CLIP vision encoder must run a separate forward pass for each mask, which makes it impractical for real-world applications. Our objective is therefore to develop a fast OVS model that performs comparably or better without the extra computational burden of the CLIP image encoder during inference. Our core idea is to preserve the generalizable representation when fine-tuning on known classes. Specifically, we introduce a text diversification strategy that generates a set of synonyms for each training category, which prevents the learned representation from collapsing onto specific known category names. Additionally, we employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP. Extensive experiments demonstrate that the proposed model achieves robust generalization performance across various datasets. Furthermore, we perform a preliminary exploration of open-vocabulary video segmentation and present a benchmark that can facilitate future open-vocabulary research in the video domain.
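
The two training-time ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the synonym source or the exact distillation loss, so the synonym table, the function names (`diversify`, `distillation_loss`), and the choice of matching pairwise cosine similarities are all assumptions made for illustration. The sketch uses plain Python lists in place of real CLIP embeddings.

```python
import math
import random

# Hypothetical synonym table. The abstract says a set of synonyms is generated
# for each training category; these particular entries are invented here.
SYNONYMS = {
    "cat": ["cat", "kitten", "feline"],
    "car": ["car", "automobile", "sedan"],
}

def diversify(category, rng):
    """Pick one name from the category's synonym set (or the name itself)."""
    return rng.choice(SYNONYMS.get(category, [category]))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pairwise_cosine(vectors):
    """Matrix of cosine similarities between every pair of vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

def distillation_loss(student, teacher_text):
    """Assumed loss form: mean squared gap between the pairwise-similarity
    matrix of the student embeddings and that of frozen text embeddings."""
    s = pairwise_cosine(student)
    t = pairwise_cosine(teacher_text)
    n = len(student)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

rng = random.Random(0)
# Category names are re-sampled from their synonym sets; unknown names pass through.
names = [diversify(c, rng) for c in ["cat", "car", "tree"]]

teacher = [[1.0, 0.0], [0.0, 1.0]]    # stand-in for frozen CLIP text embeddings
aligned = [[2.0, 0.0], [0.0, 3.0]]    # student keeps the teacher's similarity structure
collapsed = [[1.0, 1.0], [1.0, 1.0]]  # student maps both classes to the same direction

loss_aligned = distillation_loss(aligned, teacher)
loss_collapsed = distillation_loss(collapsed, teacher)
```

Under these assumptions, a student whose embeddings keep the teacher's similarity structure incurs no loss, while a student that collapses distinct classes onto one direction is penalized. Sampling a different synonym at each training step is one simple way to keep the classifier from tying itself to a single category name.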

updated: Sat Jul 15 2023 05:10:22 GMT+0000 (UTC)

published: Thu Mar 16 2023 09:51:41 GMT+0000 (UTC)

arXiv
