InstanceFormer: An Online Video Instance Segmentation Framework

Rajat Koner; Tanveer Hannan; Suprosanna Shit; Sahand Sharifzadeh; Matthias Schubert; Thomas Seidl; Volker Tresp

Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity of full spatio-temporal attention limit their use in real-life applications such as processing lengthy videos. In this paper, we propose InstanceFormer, an efficient single-stage, transformer-based online VIS framework that is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependencies and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial for long-range dependency modeling, including challenging scenarios such as occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches on challenging and long datasets such as YouTube-VIS-2021 and OVIS. Code is available at https://github.com/rajatkoner08/InstanceFormer.
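
To make two of the ideas in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a fixed-size sliding window of per-frame instance embeddings read out with scaled dot-product attention, and an InfoNCE-style contrastive loss over cosine similarities. The abstract does not specify either form, and the names `InstanceMemory`, `read`, `temporal_contrastive_loss`, and `temperature` are hypothetical.

```python
import math
from collections import deque


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class InstanceMemory:
    """Hypothetical sliding-window memory of per-frame instance embeddings."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest frame automatically.
        self.frames = deque(maxlen=window)

    def push(self, embeddings):
        """Store the list of instance embeddings produced for one frame."""
        self.frames.append(embeddings)

    def read(self, query):
        """Attend from one query over every embedding still in the window."""
        memory = [e for frame in self.frames for e in frame]
        if not memory:
            # Assumption: with no history, the query passes through unchanged.
            return list(query)
        scale = math.sqrt(len(query))
        weights = softmax([dot(query, e) / scale for e in memory])
        return [
            sum(w * e[i] for w, e in zip(weights, memory))
            for i in range(len(query))
        ]


def temporal_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): the same instance in another frame
    is the positive, other instances are negatives."""

    def cos(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]
    # Negative log-probability assigned to the positive pair.
    return -math.log(softmax(logits)[0])
```

In this sketch the window size bounds how far back a query can look, which is what keeps the cost independent of video length; the loss is lower when an instance's embedding stays close to its own embedding in another frame than to those of other instances.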

updated: Mon Aug 22 2022 18:54:18 GMT+0000 (UTC)

published: Mon Aug 22 2022 18:54:18 GMT+0000 (UTC)

arXiv
