Differentiable Soft-Masked Attention

Ali Athar; Jonathon Luiten; Alexander Hermans; Deva Ramanan; Bastian Leibe

Transformers have become prevalent in computer vision due to their performance and flexibility in modelling complex operations. Of particular significance is the "cross-attention" operation, which allows a vector representation (e.g. of an object in an image) to be learned by attending to an arbitrarily sized set of input features. Recently, "Masked Attention" was proposed, in which a given object representation attends only to those image pixel features for which the segmentation mask of that object is active. This specialization of attention proved beneficial for various image and video segmentation tasks. In this paper, we propose another specialization of attention which enables attending over "soft-masks" (those with continuous mask probabilities instead of binary values) and is also differentiable through these mask probabilities, thus allowing the mask used for attention to be learned within the network without requiring direct loss supervision. This can be useful for several applications. Specifically, we employ our "Differentiable Soft-Masked Attention" for the task of Weakly-Supervised Video Object Segmentation (VOS), where we develop a transformer-based network for VOS which requires only a single annotated image frame for training, but can also benefit from cycle consistency training on a video with just one annotated frame. Although there is no loss for masks in unlabeled frames, the network is still able to segment objects in those frames due to our novel attention formulation.
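To make the idea of attending over a soft mask more concrete, the following is a minimal sketch in plain Python. It is not the formulation from the paper: the abstract does not give the equation, so the sketch assumes that the mask probabilities enter as an additive log-probability bias on the scaled dot-product logits. The function name `soft_masked_attention` and the `eps` parameter are hypothetical, introduced only for illustration.

```python
import math


def soft_masked_attention(query, keys, values, mask_probs, eps=1e-6):
    """Single-query cross-attention biased by soft mask probabilities.

    Assumption (not taken from the abstract): the soft mask is applied by
    adding log(p + eps) to each attention logit, so a probability near 0
    suppresses a pixel feature and a probability of 1 leaves it almost
    unchanged.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + math.log(p + eps)
        for key, p in zip(keys, mask_probs)
    ]
    # Numerically stable softmax over all pixel features.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy example: three pixel features, the third lies outside the object.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask_probs = [0.9, 0.1, 0.0]
output, weights = soft_masked_attention(query, keys, values, mask_probs)
```

Under this assumed form, every operation is smooth in `mask_probs`, so in an automatic-differentiation framework a loss on the attention output could in principle send gradients back into the mask. That is the property the abstract describes as learning the mask without direct loss supervision. A binary mask corresponds to the limiting case where each probability is exactly 0 or 1.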

updated: Wed Jun 01 2022 02:05:13 GMT+0000 (UTC)

published: Wed Jun 01 2022 02:05:13 GMT+0000 (UTC)

arXiv
