SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

Brendan Duke; Abdalla Ahmed; Christian Wolf; Parham Aarabi; Graham W. Taylor

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address the compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides a suitable inductive bias for performing the correspondence-like computations necessary for solving motion segmentation. We demonstrate the effectiveness of attention-based networks over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017, with improved scalability and robustness to occlusions compared with the state of the art.
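To make "sparse attention over spatiotemporal features" concrete, the following is a minimal sketch, not the authors' implementation. It assumes one particular sparsity pattern that the abstract does not specify: each pixel of the current frame attends only to pixels inside a small spatial window in every earlier frame. The function name `local_spatiotemporal_attention`, the `radius` parameter, and the use of raw features as queries, keys, and values (no learned projections) are all illustrative assumptions.

```python
import math


def local_spatiotemporal_attention(feats, radius=1):
    """Attend from each pixel of the last frame to a local window in past frames.

    feats[t][y][x] is a list of C floats. The last frame supplies the queries;
    all earlier frames supply keys and values. At least two frames are needed.
    The local-window sparsity pattern is an assumption made for illustration.
    """
    history, query = feats[:-1], feats[-1]
    height, width, channels = len(query), len(query[0]), len(query[0][0])
    out = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            q = query[y][x]
            # Sparse key set: a (2*radius+1)^2 window per past frame,
            # clipped at the image border, instead of every pixel.
            keys = [
                frame[ky][kx]
                for frame in history
                for ky in range(max(0, y - radius), min(height, y + radius + 1))
                for kx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # Scaled dot-product scores followed by a numerically stable softmax.
            scores = [
                sum(a * b for a, b in zip(q, k)) / math.sqrt(channels)
                for k in keys
            ]
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            # Output is the attention-weighted average of the key features.
            out[y][x] = [
                sum(w * k[c] for w, k in zip(weights, keys)) / total
                for c in range(channels)
            ]
    return out
```

Under this assumed pattern, each query pixel compares against at most (T - 1) * (2 * radius + 1)^2 keys rather than all (T - 1) * H * W pixels in the history. That reduction is the kind of saving that lets attention span several frames.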

updated: Thu Jan 21 2021 20:06:12 GMT+0000 (UTC)

published: Thu Jan 21 2021 20:06:12 GMT+0000 (UTC)

arXiv
