VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding; Chang Liu; Suchen Wang; Xudong Jiang

We propose a Vision-Language Transformer (VLT) framework for referring segmentation that facilitates deep interactions among multi-modal information and enhances the holistic understanding of vision-language features. A language expression can be understood in different ways, with its emphasis shifting dynamically, especially when it interacts with the image. However, the learned queries in existing transformer-based works are fixed after training and therefore cannot cope with the randomness and large diversity of language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent diverse comprehensions of the language expression. To find the best among these comprehensions and thereby generate a better mask, we propose a Query Balance Module that selectively fuses the corresponding responses of the set of queries. Furthermore, to enhance the model's ability to deal with diverse language expressions, we consider inter-sample learning to explicitly endow the model with knowledge of how different language expressions can refer to the same object. We introduce masked contrastive learning to pull together the features of different expressions for the same target object while distinguishing the features of different objects. The proposed approach is lightweight and consistently achieves new state-of-the-art referring segmentation results on five datasets.
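The interplay between query generation and query balancing can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`generate_queries`, `balance_responses`), the tensor shapes, the use of one projection matrix per "comprehension", and the sigmoid confidence score are all assumptions chosen for illustration. The sketch only shows the general idea that each query is a different image-conditioned weighting of the word features, and that the responses to those queries are rescaled by a per-query confidence before fusion.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def generate_queries(word_feats, image_feats, proj_mats):
    """Build N input-specific queries (hypothetical formulation).

    word_feats:  (T, C) features of the T words in the expression
    image_feats: (HW, C) flattened image features
    proj_mats:   (N, C, C) one assumed projection per "comprehension"
    returns:     (N, C) one query per comprehension
    """
    channels = word_feats.shape[1]
    queries = []
    for proj in proj_mats:
        # Each projection yields a different view of the image.
        view = image_feats @ proj                              # (HW, C)
        # Score every word by its average similarity to that view.
        scores = (word_feats @ view.T).mean(axis=1) / np.sqrt(channels)
        attn = softmax(scores)                                 # (T,), sums to 1
        # The query is the attention-weighted sum of word features.
        queries.append(attn @ word_feats)                      # (C,)
    return np.stack(queries)


def balance_responses(queries, responses, weight, bias=0.0):
    """Scale each query's response by an assumed sigmoid confidence.

    queries, responses: (N, C); weight: (C,)
    returns: (scaled responses of shape (N, C), confidences of shape (N,))
    """
    confidence = 1.0 / (1.0 + np.exp(-(queries @ weight + bias)))
    return confidence[:, None] * responses, confidence
```

In a full model the responses would come from a transformer decoder attending over the image features, and the projections and confidence weights would be learned; here they are plain arrays so that the data flow stays visible.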

updated: Fri Oct 28 2022 03:36:07 GMT+0000 (UTC)

published: Fri Oct 28 2022 03:36:07 GMT+0000 (UTC)

arXiv
