Hybrid Instance-aware Temporal Fusion for Online Video Instance Segmentation

Xiang Li; Jinglu Wang; Xiao Li; Yan Lu

Recently, transformer-based image segmentation methods have achieved notable success over previous solutions. For video, however, how to effectively model temporal context while attending to object instances across frames remains an open problem. In this paper, we propose an online video instance segmentation (VIS) framework with a novel instance-aware temporal fusion method. We first leverage a representation, i.e., a latent code in the global context (instance code) and CNN feature maps, to represent instance- and pixel-level features. Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames. Specifically, we encode global instance-specific information in the instance code and build inter-frame contextual fusion with hybrid attention between the instance codes and the CNN feature maps. Inter-frame consistency between the instance codes is further enforced with order constraints. By leveraging the learned hybrid temporal consistency, we are able to directly retrieve and maintain instance identities across frames, eliminating the complicated frame-wise instance matching used in prior methods. Extensive experiments have been conducted on popular VIS datasets, i.e., Youtube-VIS-19/21. Our model achieves the best performance among all online VIS methods. Notably, our model also eclipses all offline methods when using the ResNet-50 backbone.
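
To make the idea of hybrid attention between instance codes and CNN feature maps more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each frame yields N instance codes of dimension D and an H x W x D feature map, that the current codes attend jointly to the previous frame's codes and pixel features through plain scaled dot-product attention, and that mask logits come from a dot product between each fused code and the current pixel features. The function names (`cross_attention`, `hybrid_fusion`), the residual update, and the mask head are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention (hypothetical helper).

    queries: (N, D), keys/values: (M, D).
    Returns the attended output (N, D) and the attention weights (N, M).
    """
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values, weights


def hybrid_fusion(codes_t, feats_t, codes_prev, feats_prev):
    """Illustrative temporal fusion step (an assumption, not the paper's exact design).

    codes_t, codes_prev: (N, D) instance codes for the current / previous frame.
    feats_t, feats_prev: (H, W, D) CNN feature maps for the current / previous frame.
    """
    h, w, d = feats_prev.shape
    pixels_prev = feats_prev.reshape(h * w, d)

    # "Hybrid" memory: previous instance codes plus previous pixel features,
    # so no per-instance cropping of the feature map is needed.
    memory = np.concatenate([codes_prev, pixels_prev], axis=0)

    # Current codes attend to the memory; a residual connection keeps the
    # current-frame information.
    fused, _ = cross_attention(codes_t, memory, memory)
    codes_out = codes_t + fused

    # One mask logit map per code: dot product with every current pixel feature.
    mask_logits = np.einsum("nd,hwd->nhw", codes_out, feats_t)
    return codes_out, mask_logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, h, w = 3, 8, 4, 5
    codes_out, masks = hybrid_fusion(
        rng.normal(size=(n, d)), rng.normal(size=(h, w, d)),
        rng.normal(size=(n, d)), rng.normal(size=(h, w, d)),
    )
    print(codes_out.shape, masks.shape)
```

In this sketch, slot i of the code array is assumed to refer to the same object in every frame, which is one simple reading of how a fixed code order could stand in for explicit frame-wise matching; the order constraints mentioned in the abstract are a training-time mechanism and are not modelled here.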

updated: Fri Dec 03 2021 03:37:57 GMT+0000 (UTC)

published: Fri Dec 03 2021 03:37:57 GMT+0000 (UTC)

arXiv

