Video Instance Segmentation by Instance Flow Assembly

Xiang Li; Jinglu Wang; Xiao Li; Yan Lu

Instance segmentation is a challenging task that aims to classify and segment all object instances of specific classes. While two-stage box-based methods achieve top performance in the image domain, they cannot easily extend their superiority to the video domain. This is because they usually deal with features or images cropped from the detected bounding boxes without alignment, and so fail to capture pixel-level temporal consistency. We build on the observation that bottom-up methods operating on box-free features can offer accurate spatial correlations across frames, which can be fully exploited for object-level and pixel-level tracking. We first propose a bottom-up framework equipped with a temporal context fusion module to better encode inter-frame correlations. Intra-frame cues for semantic segmentation and object localization are simultaneously extracted and reconstructed by corresponding decoders after a shared backbone. For efficient and robust tracking among instances, we introduce an instance-level correspondence across adjacent frames, represented by a center-to-center flow termed instance flow, which assembles messy dense temporal correspondences. Experiments demonstrate that the proposed method outperforms state-of-the-art online methods (those taking image-level input) on the challenging YouTube-VIS dataset.
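The idea of assembling dense temporal correspondences into a single center-to-center flow per instance can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the assembly is done, so the sketch assumes a simple average of a dense flow field inside each instance mask, followed by greedy nearest-center matching. The function names (`instance_centers`, `assemble_instance_flow`, `match_instances`) and the `max_dist` threshold are hypothetical and chosen only for illustration.

```python
import numpy as np


def instance_centers(masks):
    """Return the (row, col) centroid of each non-empty boolean mask in (N, H, W)."""
    centers = []
    for mask in masks:
        rows, cols = np.nonzero(mask)
        centers.append((rows.mean(), cols.mean()))
    return np.array(centers, dtype=float).reshape(-1, 2)


def assemble_instance_flow(masks, dense_flow):
    """Assumed assembly step: average the dense (H, W, 2) flow inside each mask.

    Returns one (d_row, d_col) vector per instance, shape (N, 2).
    """
    flows = [dense_flow[mask].mean(axis=0) for mask in masks]
    return np.array(flows, dtype=float).reshape(-1, 2)


def match_instances(masks_t, dense_flow, centers_next, max_dist=5.0):
    """Match each instance in frame t to the nearest center in frame t+1.

    The predicted center is the current center shifted by the instance flow.
    Matching is greedy (not one-to-one); -1 means no center within max_dist.
    """
    predicted = instance_centers(masks_t) + assemble_instance_flow(masks_t, dense_flow)
    centers_next = np.asarray(centers_next, dtype=float).reshape(-1, 2)
    matches = []
    for point in predicted:
        if len(centers_next) == 0:
            matches.append(-1)
            continue
        dists = np.linalg.norm(centers_next - point, axis=1)
        best = int(dists.argmin())
        matches.append(best if dists[best] <= max_dist else -1)
    return matches


# Toy example: two square instances on an 8x8 frame, each with its own motion.
masks = np.zeros((2, 8, 8), dtype=bool)
masks[0, 1:3, 1:3] = True          # instance 0, centered at (1.5, 1.5)
masks[1, 5:7, 4:6] = True          # instance 1, centered at (5.5, 4.5)

flow = np.zeros((8, 8, 2))
flow[masks[0]] = (0.0, 3.0)        # instance 0 moves 3 pixels to the right
flow[masks[1]] = (-1.0, 0.0)       # instance 1 moves 1 pixel up

next_centers = np.array([[4.5, 4.5], [1.5, 4.5]])
links = match_instances(masks, flow, next_centers)
```

In this toy setup, `links` holds, for each instance in frame t, the index of the matched center in frame t+1. A one-to-one assignment (for example, Hungarian matching) could replace the greedy step if two instances are likely to compete for the same center.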

updated: Wed Oct 20 2021 14:49:28 GMT+0000 (UTC)

published: Wed Oct 20 2021 14:49:28 GMT+0000 (UTC)

arXiv
