Pixel-Level Bijective Matching for Video Object Segmentation

Suhwan Cho; Heansung Lee; Minjung Kim; Sungjun Jang; Sangyoun Lee

Semi-supervised video object segmentation (VOS) aims to track the designated objects present in the initial frame of a video at the pixel level. To fully exploit the appearance information of an object, pixel-level feature matching is widely used in VOS. Conventional feature matching runs in a surjective manner, i.e., only the best matches from the query frame to the reference frame are considered. Each location in the query frame refers to the optimal location in the reference frame regardless of how often each reference frame location is referenced. This works well in most cases and is robust against rapid appearance variations, but may cause critical errors when the query frame contains background distractors that look similar to the target object. To mitigate this concern, we introduce a bijective matching mechanism to find the best matches from the query frame to the reference frame and vice versa. Before finding the best matches for the query frame pixels, the optimal matches for the reference frame pixels are first considered to prevent each reference frame pixel from being overly referenced. As this mechanism operates in a strict manner, i.e., pixels are connected if and only if they are the sure matches for each other, it can effectively eliminate background distractors. In addition, we propose a mask embedding module to improve the existing mask propagation method. By embedding multiple historic masks with coordinate information, it can effectively capture the position information of a target object.
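
The difference between the two matching schemes described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a precomputed similarity matrix `sim` of shape (reference pixels, query pixels) with scores in [0, 1], and it assumes that "preventing a reference pixel from being overly referenced" is realized by keeping only each reference pixel's `k` best query pixels. The function names, the parameter `k`, and the `fill` value are hypothetical choices made for illustration.

```python
import numpy as np


def surjective_match(sim):
    """Conventional matching: every query pixel takes its best reference score.

    sim has shape (R, Q): similarity of R reference pixels to Q query pixels.
    """
    return sim.max(axis=0)


def bijective_match(sim, k, fill=0.0):
    """Sketch of bijective matching (assumed top-k formulation).

    Step 1: for each reference pixel, keep only its k highest-scoring
            query pixels and overwrite the remaining entries with `fill`.
    Step 2: each query pixel takes its best score among surviving links.
    """
    num_query = sim.shape[1]
    k = min(k, num_query)
    # Column indices of the k largest entries in every row (unordered).
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    keep = np.zeros(sim.shape, dtype=bool)
    np.put_along_axis(keep, topk, True, axis=1)
    masked = np.where(keep, sim, fill)
    return masked.max(axis=0)


# Toy case: 2 reference pixels on the target, 3 query pixels
# (column 0 = true object, column 1 = look-alike distractor, column 2 = background).
sim = np.array([[0.90, 0.80, 0.10],
                [0.95, 0.85, 0.20]])
print(surjective_match(sim))      # distractor keeps a high score
print(bijective_match(sim, k=1))  # distractor is cut off
```

In this toy case the distractor column scores 0.85 under surjective matching because nothing limits how often a reference pixel is referenced. With `k=1`, both reference pixels choose the true object, so the distractor has no surviving link and falls to the fill value.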

updated: Fri Nov 12 2021 07:24:15 GMT+0000 (UTC)

published: Mon Oct 04 2021 18:15:45 GMT+0000 (UTC)

arXiv
