End-to-End Video Instance Segmentation with Transformers

Yuqing Wang; Zhaoliang Xu; Xinlong Wang; Chunhua Shen; Baoshan Cheng; Hao Shen; Huaxia Xia

Video instance segmentation (VIS) is a task that requires simultaneously classifying, segmenting, and tracking object instances of interest in a video. Recent methods typically rely on sophisticated pipelines to tackle this task. Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which treats the VIS task as a direct, end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR directly outputs the sequence of masks for each instance in the video, in frame order. At its core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments each instance at the sequence level as a whole. VisTR frames instance segmentation and tracking from the same perspective of similarity learning, which considerably simplifies the overall pipeline and differs significantly from existing approaches. Without bells and whistles, VisTR achieves the highest speed among all existing VIS models and the best result among methods using a single model on the YouTube-VIS dataset. For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers that achieves competitive accuracy. We hope that VisTR can motivate future research on more video understanding tasks.
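To make the idea of matching at the sequence level concrete, the following is a minimal sketch, not the authors' implementation. It assumes each prediction slot and each ground-truth instance is represented by one box per frame, and it scores a pairing by the summed per-frame L1 box distance; the abstract does not specify the actual cost. The names `sequence_cost` and `match_instance_sequences` are hypothetical, and a brute-force search over permutations stands in for a proper assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations


def sequence_cost(pred_seq, gt_seq):
    """Sum of per-frame L1 distances between two box sequences of equal length.

    Assumed cost for illustration only; the abstract does not define one.
    """
    return sum(
        sum(abs(p - g) for p, g in zip(pred_box, gt_box))
        for pred_box, gt_box in zip(pred_seq, gt_seq)
    )


def match_instance_sequences(pred_seqs, gt_seqs):
    """Match whole prediction sequences to whole ground-truth sequences.

    Returns (assignment, total_cost), where assignment[i] is the index of the
    prediction slot matched to ground-truth instance i. Because one slot is
    matched for the entire clip, the match also fixes the track identity.
    Brute force over permutations: only practical for a handful of instances.
    """
    if len(gt_seqs) > len(pred_seqs):
        raise ValueError("need at least as many prediction slots as instances")
    best_assignment, best_cost = None, float("inf")
    for slots in permutations(range(len(pred_seqs)), len(gt_seqs)):
        cost = sum(
            sequence_cost(pred_seqs[s], gt_seqs[i]) for i, s in enumerate(slots)
        )
        if cost < best_cost:
            best_assignment, best_cost = slots, cost
    return best_assignment, best_cost


# Toy clip with 2 frames, 3 prediction slots, and 2 ground-truth instances.
# Boxes are (x1, y1, x2, y2) in normalized coordinates (made-up values).
pred_seqs = [
    [(0.9, 0.9, 1.0, 1.0), (0.9, 0.9, 1.0, 1.0)],    # slot 0
    [(0.1, 0.1, 0.2, 0.2), (0.15, 0.1, 0.25, 0.2)],  # slot 1
    [(0.5, 0.5, 0.6, 0.6), (0.5, 0.5, 0.6, 0.6)],    # slot 2 (stays unmatched)
]
gt_seqs = [
    [(0.1, 0.1, 0.2, 0.2), (0.2, 0.1, 0.3, 0.2)],    # instance 0
    [(0.9, 0.9, 1.0, 1.0), (0.9, 0.9, 1.0, 1.0)],    # instance 1
]

assignment, total_cost = match_instance_sequences(pred_seqs, gt_seqs)
print(assignment)  # (1, 0): instance 0 -> slot 1, instance 1 -> slot 0
```

The point of the sketch is that the cost is accumulated over all frames before any assignment is made, so each instance is supervised as one sequence rather than being matched frame by frame and linked afterwards.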

updated: Sun Apr 25 2021 09:43:28 GMT+0000 (UTC)

published: Mon Nov 30 2020 02:03:50 GMT+0000 (UTC)

arXiv
