Efficient Video Instance Segmentation via Tracklet Query and Proposal

Jialian Wu; Sudhir Yarram; Hui Liang; Tian Lan; Junsong Yuan; Jayan Eledath; Gerard Medioni

Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos. Recent clip-level VIS methods take a short video clip as input at each step and outperform frame-level VIS (tracking-by-segmentation) because they exploit more temporal context from multiple frames. Yet most clip-level methods are neither end-to-end learnable nor real-time. These limitations are addressed by the recent VIS transformer (VisTR), which performs VIS end-to-end within a clip. However, VisTR suffers from long training time due to its frame-wise dense attention. In addition, VisTR is not fully end-to-end learnable across multiple video clips, as it requires hand-crafted data association to link instance tracklets between successive clips. This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference. At its core are the tracklet query and tracklet proposal, which associate and segment regions of interest (RoIs) across space and time through iterative query-video interaction. We further propose a correspondence learning scheme that makes tracklet linking between clips end-to-end learnable. Compared to VisTR, EfficientVIS requires 15x fewer training epochs while achieving state-of-the-art accuracy on the YouTube-VIS benchmark. Meanwhile, our method enables whole-video instance segmentation in a single end-to-end pass without any data association.
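
To make the "hand-crafted data association" mentioned above concrete, the following is a minimal sketch of one common heuristic: greedily linking tracklets of two successive clips by mask overlap on a frame the clips share. It is not taken from the paper and is not VisTR's actual procedure. The function names, the representation of masks as sets of pixel coordinates, the assumption that clips overlap by one frame, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col) of a foreground pixel; hypothetical mask format


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection-over-union of two masks given as sets of pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def link_tracklets(
    prev_masks: List[Set[Pixel]],
    next_masks: List[Set[Pixel]],
    min_iou: float = 0.5,  # assumed threshold, not a value from the paper
) -> Dict[int, int]:
    """Greedily match tracklets of the previous clip to those of the next clip.

    Each list holds one mask per tracklet, taken on a frame that both clips
    contain (an assumption of this sketch). Returns {prev_index: next_index}.
    """
    # Score every (previous, next) pair and visit pairs from best to worst.
    scored = sorted(
        (
            (mask_iou(p, n), i, j)
            for i, p in enumerate(prev_masks)
            for j, n in enumerate(next_masks)
        ),
        reverse=True,
    )
    links: Dict[int, int] = {}
    used_next: Set[int] = set()
    for iou, i, j in scored:
        if iou < min_iou:
            break  # all remaining pairs overlap too little to be linked
        if i in links or j in used_next:
            continue  # each tracklet is linked at most once
        links[i] = j
        used_next.add(j)
    return links


prev = [{(0, 0), (0, 1)}, {(5, 5), (5, 6)}]
nxt = [{(5, 5), (5, 6), (5, 7)}, {(0, 0), (0, 1)}]
print(link_tracklets(prev, nxt))  # {0: 1, 1: 0}
```

The threshold and the greedy matching order in such a heuristic are set by hand rather than learned, which is the kind of non-learnable step the abstract says EfficientVIS avoids. An optimal one-to-one assignment (for example with SciPy's `linear_sum_assignment`) could replace the greedy loop, but it would still be a post-processing step outside the network.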

updated: Thu Mar 03 2022 17:00:11 GMT+0000 (UTC)

published: Thu Mar 03 2022 17:00:11 GMT+0000 (UTC)

arXiv

