InstMove: Instance Motion for Object-centric Video Segmentation

Qihao Liu; Junfeng Wu; Yi Jiang; Xiang Bai; Alan Yuille; Song Bai

Despite significant efforts, state-of-the-art video segmentation methods remain sensitive to occlusion and rapid movement because they rely on object appearance in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but optical flow essentially considers only pixel-level motion, which still relies on appearance similarity and is therefore often inaccurate under occlusion and fast movement. In this work, we study instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. Compared with pixel-wise motion, InstMove relies mainly on instance-level motion information that is free from image feature embeddings and has a physical interpretation, making it more accurate and robust to occlusion and fast-moving objects. To better fit video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns a dynamic model through a memory network to predict the object's position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve on prior art by 1.5 AP on the OVIS dataset, which features heavy occlusions, and by 4.9 AP on the YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serves as a powerful solution in complex scenarios for object-centric video segmentation.
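
As an illustration only, the idea of using a predicted next-frame mask as a motion prior during instance matching can be sketched as follows. This is not the InstMove implementation: the learned memory network is replaced here by a plain constant-velocity shift of the mask, masks are represented as sets of (row, column) pixels, and every function name and the blending weight `alpha` are hypothetical choices made for this sketch.

```python
# Minimal sketch (assumptions throughout, not the authors' code):
# a constant-velocity shift stands in for the learned motion model.

def centroid(mask):
    """Mean (row, col) of a non-empty mask given as a set of pixels."""
    n = len(mask)
    return (sum(r for r, _ in mask) / n, sum(c for _, c in mask) / n)


def predict_next_mask(prev_mask, curr_mask):
    """Shift the current mask by the centroid displacement seen so far."""
    (r0, c0), (r1, c1) = centroid(prev_mask), centroid(curr_mask)
    dy, dx = round(r1 - r0), round(c1 - c0)
    return {(r + dy, c + dx) for r, c in curr_mask}


def mask_iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pick_candidate(predicted, candidates, appearance_sims, alpha=0.5):
    """Blend appearance similarity with motion IoU; return the best index."""
    scores = [
        (1 - alpha) * sim + alpha * mask_iou(predicted, cand)
        for cand, sim in zip(candidates, appearance_sims)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def block(top, left, size=2):
    """Helper: a size x size square mask with its top-left corner given."""
    return {(top + i, left + j) for i in range(size) for j in range(size)}


# An object moving 3 pixels to the right per frame.
prev_mask, curr_mask = block(0, 0), block(0, 3)
predicted = predict_next_mask(prev_mask, curr_mask)

# Candidate 0 is the true object with degraded appearance (e.g. occluded);
# candidate 1 is a distractor that looks more similar but is elsewhere.
candidates = [block(0, 6), block(5, 0)]
appearance_sims = [0.4, 0.6]
best = pick_candidate(predicted, candidates, appearance_sims)
```

In this toy setup, appearance similarity alone would select the distractor, while the blended score selects the candidate that agrees with the predicted position. How InstMove actually combines its motion prior with each host method is not specified in the abstract.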

updated: Tue Mar 14 2023 17:58:44 GMT+0000 (UTC)

published: Tue Mar 14 2023 17:58:44 GMT+0000 (UTC)

arXiv
