Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation

Jyoti Kini; Mubarak Shah

Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence. Most existing methods take a multi-stage top-down approach: separate networks detect and segment objects in each frame, and a learned tracking head then associates these detections across consecutive frames. In this work, we instead introduce a simple, end-to-end trainable bottom-up approach that predicts instance masks at pixel-level granularity rather than relying on region proposals. Unlike contemporary frame-based models, our network pipeline processes an input video clip as a single 3D volume to incorporate temporal information. The central idea of our formulation is to solve video instance segmentation as a tag assignment problem, such that generating distinct tag values separates individual object instances across the video sequence (here each tag can be any value between 0 and 1). To this end, we propose a novel spatio-temporal tagging loss that allows for sufficient separation of different objects as well as the necessary identification of different instances of the same object. Furthermore, we present a tag-based attention module that improves instance tags while concurrently learning instance propagation within a video. Evaluations demonstrate that our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time compared to other state-of-the-art methods.
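The abstract does not spell out the form of the spatio-temporal tagging loss, so the following is only a minimal sketch of the general idea of tag assignment over a clip treated as one 3D volume. It assumes a generic pull/push formulation in the style of associative embeddings, which may differ from the authors' loss: the pull term draws the tags of one instance toward that instance's mean tag over all frames, and the push term keeps the mean tags of different instances at least a margin apart. The function name `tagging_loss`, the `margin` value, and the `[t][y][x]` nested-list layout are all hypothetical choices made for illustration.

```python
from itertools import combinations


def tagging_loss(tags, ids, margin=0.5, background=0):
    """Illustrative pull/push loss over a video clip (NOT the paper's exact loss).

    tags: nested lists indexed [t][y][x] with predicted tag values in [0, 1].
    ids:  nested lists of the same shape with ground-truth instance ids;
          pixels labelled `background` are ignored.
    Returns a (pull, push) tuple of floats.
    """
    # Collect the tags of each instance across ALL frames, so one instance
    # is encouraged to keep a single tag value over the whole clip.
    per_instance = {}
    for tag_frame, id_frame in zip(tags, ids):
        for tag_row, id_row in zip(tag_frame, id_frame):
            for tag, inst in zip(tag_row, id_row):
                if inst != background:
                    per_instance.setdefault(inst, []).append(tag)

    means = {inst: sum(v) / len(v) for inst, v in per_instance.items()}

    # Pull: mean squared deviation of pixel tags from their instance mean,
    # averaged over instances.
    pull = 0.0
    for inst, values in per_instance.items():
        pull += sum((v - means[inst]) ** 2 for v in values) / len(values)
    if per_instance:
        pull /= len(per_instance)

    # Push: hinge penalty when two instance means are closer than `margin`,
    # averaged over instance pairs.
    pairs = list(combinations(means.values(), 2))
    push = sum(max(0.0, margin - abs(a - b)) for a, b in pairs)
    if pairs:
        push /= len(pairs)

    return pull, push


# Two frames of 1x2 pixels; instance 1 on the left, instance 2 on the right.
ids = [[[1, 2]], [[1, 2]]]
well_separated = [[[0.2, 0.8]], [[0.2, 0.8]]]
collapsed = [[[0.5, 0.5]], [[0.5, 0.5]]]
print(tagging_loss(well_separated, ids))
print(tagging_loss(collapsed, ids))
```

In this toy case, consistent and well-separated tags incur no penalty, while identical tags for the two instances are penalized only by the push term. A trained model would compute such a loss on dense network outputs with a tensor library; plain lists are used here only to keep the sketch self-contained.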

updated: Fri Apr 22 2022 15:32:46 GMT+0000 (UTC)

published: Fri Apr 22 2022 15:32:46 GMT+0000 (UTC)

arXiv
