Differentiable Soft-Masked Attention

Ali Athar; Jonathon Luiten; Alexander Hermans; Deva Ramanan; Bastian Leibe

Transformers have become prevalent in computer vision due to their performance and flexibility in modelling complex operations. Of particular significance is the 'cross-attention' operation, which allows a vector representation (e.g. of an object in an image) to be learned by attending to an arbitrarily sized set of input features. Recently, 'Masked Attention' was proposed, in which a given object representation attends only to those image pixel features for which the segmentation mask of that object is active. This specialization of attention proved beneficial for various image and video segmentation tasks. In this paper, we propose another specialization of attention which enables attending over 'soft-masks' (those with continuous mask probabilities instead of binary values) and is also differentiable through these mask probabilities, thus allowing the mask used for attention to be learned within the network without requiring direct loss supervision. This can be useful for several applications. Specifically, we employ our 'Differentiable Soft-Masked Attention' for the task of Weakly-Supervised Video Object Segmentation (VOS), where we develop a transformer-based network for VOS which requires only a single annotated image frame for training, but can also benefit from cycle consistency training on a video with just one annotated frame. Although there is no loss for masks in unlabeled frames, the network is still able to segment objects in those frames due to our novel attention formulation. Code: https://github.com/Ali2500/HODOR/blob/main/hodor/modelling/encoder/soft_masked_attention.py
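The abstract does not state the attention formula, so the following is only a minimal sketch of one way cross-attention over a soft mask could be made differentiable in the mask. It assumes the mask enters as an additive log-bias on the attention logits, which may differ from the authors' formulation in the linked code. The function names, the `eps` constant, and the use of NumPy with finite differences (instead of an autograd framework) are all illustrative choices, not taken from the paper.

```python
import numpy as np

def soft_masked_attention(q, k, v, mask, eps=1e-6):
    """Cross-attention biased by a soft mask (hypothetical formulation).

    q: (N, d) object queries; k, v: (P, d) pixel features;
    mask: (N, P) mask probabilities in [0, 1].
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)            # scaled dot-product scores
    logits = logits + np.log(mask + eps)     # assumed: soft mask as additive log-bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def numerical_mask_grad(q, k, v, mask, h=1e-5):
    """Central finite-difference gradient of sum(output) w.r.t. each mask entry."""
    grad = np.zeros_like(mask)
    for idx in np.ndindex(mask.shape):
        up, down = mask.copy(), mask.copy()
        up[idx] += h
        down[idx] -= h
        f_up = soft_masked_attention(q, k, v, up)[0].sum()
        f_down = soft_masked_attention(q, k, v, down)[0].sum()
        grad[idx] = (f_up - f_down) / (2 * h)
    return grad

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8))   # 2 object queries
k = rng.normal(size=(5, 8))   # 5 pixel features
v = rng.normal(size=(5, 8))

# Soft mask: continuous probabilities, kept away from 0 so the
# finite-difference perturbation stays inside the valid range.
soft_mask = rng.uniform(0.05, 0.95, size=(2, 5))
out, weights = soft_masked_attention(q, k, v, soft_mask)
grad = numerical_mask_grad(q, k, v, soft_mask)

# Binary mask: behaves like hard masked attention, since pixels with
# mask 0 receive a large negative bias and almost no attention weight.
hard_mask = np.array([[1, 1, 0, 0, 0],
                      [0, 0, 1, 1, 1]], dtype=float)
_, hard_weights = soft_masked_attention(q, k, v, hard_mask)
```

Under this assumed form, the output changes smoothly as the mask probabilities change, so a loss on the attention output can send gradient back into whatever predicts the mask. With a binary mask, the same function reduces approximately to hard masked attention.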

updated: Fri Aug 05 2022 14:09:12 GMT+0000 (UTC)

published: Wed Jun 01 2022 02:05:13 GMT+0000 (UTC)

arXiv
