Zero-guidance Segmentation Using Zero Segment Labels

Pitchaporn Rewatbowornwong; Nattanat Chatthee; Ekapol Chuangsuwanich; Supasorn Suwajanakorn

CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/.
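The last step of the pipeline described above, merging semantically similar segments, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_similar_segments`, the cosine-similarity threshold, and the union-find grouping are all assumptions made for illustration, and the toy 2-D vectors stand in for the segment embeddings that would come from CLIP in practice.

```python
import numpy as np


def merge_similar_segments(embeddings, threshold=0.9):
    """Group segments whose embeddings have cosine similarity >= threshold.

    Hypothetical helper (not from the paper). Returns one group id per
    segment; segments connected by a chain of similar pairs share an id.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so that the dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T

    n = len(emb)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # Relabel roots as consecutive ids in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Toy stand-ins for segment embeddings: two near-duplicate pairs.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0]]
print(merge_similar_segments(toy_embeddings))  # [0, 0, 1, 1]
```

In a real setting the threshold would have to be tuned, and the similarity could just as well be computed between the generated text labels as between the visual embeddings; the abstract does not specify which, so this sketch leaves that choice open.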

updated: Thu Mar 23 2023 16:15:07 GMT+0000 (UTC)

published: Thu Mar 23 2023 16:15:07 GMT+0000 (UTC)

arXiv
