DVIS: Decoupled Video Instance Segmentation Framework

Tao Zhang; Xingye Tian; Yu Wu; Shunping Ji; Xuebo Wang; Yuan Zhang; Pengfei Wan

Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex, long real-world videos, primarily due to two factors. First, offline methods are limited by a tightly coupled modeling paradigm that treats all frames equally and disregards the interdependencies between adjacent frames, which introduces excessive noise during long-term temporal alignment. Second, online methods make inadequate use of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment via frame-by-frame association during tracking, and 2) effectively using temporal information, based on those accurate alignment results, during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance in both VIS and VPS, surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, which are the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69% of the segmenter FLOPs), allowing for efficient training and inference on a single GPU with 11 GB of memory. The code is available at https://github.com/zhang-tao-whu/DVIS.
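
The three decoupled stages can be illustrated with a toy sketch. This is not the DVIS implementation: the function names (`segment`, `track`, `refine`, `run_pipeline`) are hypothetical, greedy cosine matching stands in for the referring tracker, and a moving average stands in for the temporal refiner. It also assumes every frame contains the same number of instances, each already represented by an embedding vector.

```python
from typing import List

Vector = List[float]
Frame = List[Vector]  # one embedding per detected instance in a frame


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def segment(video: List[Frame]) -> List[Frame]:
    """Stage 1 (placeholder): a real segmenter would map pixels to per-frame
    instance embeddings. Here the input is assumed to be embeddings already."""
    return [list(frame) for frame in video]


def track(frames: List[Frame]) -> List[Frame]:
    """Stage 2: frame-by-frame association. Each frame is reordered so that
    slot i holds the instance most similar to slot i of the previous aligned
    frame. Greedy matching is an assumption made for brevity."""
    aligned = [frames[0]]
    for frame in frames[1:]:
        remaining = list(range(len(frame)))
        ordered = []
        for ref in aligned[-1]:
            best = max(remaining, key=lambda j: cosine(ref, frame[j]))
            remaining.remove(best)
            ordered.append(frame[best])
        aligned.append(ordered)
    return aligned


def refine(aligned: List[Frame], window: int = 1) -> List[Frame]:
    """Stage 3: use temporal context on the aligned slots. A moving average
    over neighbouring frames is a stand-in for a learned refiner."""
    total = len(aligned)
    refined = []
    for t in range(total):
        lo, hi = max(0, t - window), min(total, t + window + 1)
        frame = []
        for slot in range(len(aligned[t])):
            dim = len(aligned[t][slot])
            frame.append([
                sum(aligned[s][slot][d] for s in range(lo, hi)) / (hi - lo)
                for d in range(dim)
            ])
        refined.append(frame)
    return refined


def run_pipeline(video: List[Frame]) -> List[Frame]:
    """Run the three stages in sequence; each stage only sees the previous output."""
    return refine(track(segment(video)))
```

Calling `run_pipeline` on a list of frames whose instances appear in a different order from frame to frame returns frames in a consistent slot order with temporally smoothed embeddings. The point of the sketch is the separation of concerns: refinement operates only on the tracker's aligned output, so it never has to solve the association problem itself.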

updated: Tue Jun 06 2023 05:24:15 GMT+0000 (UTC)

published: Tue Jun 06 2023 05:24:15 GMT+0000 (UTC)

arXiv
