SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

Junfeng Wu; Yi Jiang; Wenqing Zhang; Xiang Bai; Song Bai

In this work, we present SeqFormer, a frustratingly simple model for video instance segmentation. SeqFormer follows the principle of the vision transformer, which models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices to capture a time sequence of instances in a video, whereas attention should be performed on each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is then used to dynamically predict the mask sequence on each frame. Instance tracking is achieved naturally, without tracking branches or post-processing. On the YouTube-VIS dataset, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone, without bells and whistles. These results exceed the previous state-of-the-art performance by 4.6 and 4.4 AP, respectively. In addition, when integrated with the recently proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer can serve as a strong baseline that fosters future research in video instance segmentation and, in the meantime, advances this field with a more robust, accurate, and neat model. The code and the pre-trained models are publicly available at https://github.com/wjf5203/SeqFormer.
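
The core idea in the abstract (one shared instance query, attention run on each frame independently, then temporal aggregation into a video-level representation that predicts per-frame masks) can be illustrated with a toy sketch. This is not the authors' implementation: the plain dot-product attention, the query-based frame scoring, and all function names (`attend`, `aggregate_video_instance`, `predict_mask_logits`) are assumptions made for illustration only. The released code at the URL above is the reference.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Dot-product attention of ONE query over ONE frame's feature vectors.

    Assumption: plain single-head attention stands in for whatever
    attention variant the actual model uses.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(len(query))]


def aggregate_video_instance(query, frames):
    """Build a video-level instance embedding from per-frame attention."""
    # Step 1: the same instance query attends to each frame independently.
    per_frame = [attend(query, feats) for feats in frames]
    # Step 2: weight frames over time (hypothetical scoring: similarity
    # between the query and each frame-level embedding).
    frame_weights = softmax([dot(query, e) for e in per_frame])
    # Step 3: weighted sum over time gives one video-level embedding.
    video_embedding = [sum(w * e[i] for w, e in zip(frame_weights, per_frame))
                       for i in range(len(query))]
    return per_frame, frame_weights, video_embedding


def predict_mask_logits(video_embedding, frames):
    """Use the single video-level embedding as a dynamic filter on every frame."""
    return [[dot(video_embedding, f) for f in feats] for feats in frames]


if __name__ == "__main__":
    # Three frames, each with two 2-dimensional "pixel" features (toy data).
    frames = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.9, 0.1], [0.1, 0.9]],
        [[0.8, 0.2], [0.2, 0.8]],
    ]
    query = [1.0, 0.0]
    _, weights, video_emb = aggregate_video_instance(query, frames)
    print(weights, video_emb)
    print(predict_mask_logits(video_emb, frames))
```

Because the same video-level embedding produces the mask logits on every frame, the per-frame masks belong to one instance by construction, which is the sense in which tracking needs no separate branch or post-processing in this sketch.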

updated: Wed Dec 15 2021 17:09:18 GMT+0000 (UTC)

published: Wed Dec 15 2021 17:09:18 GMT+0000 (UTC)

arXiv
