Hierarchical Spatiotemporal Transformers for Video Object Segmentation

Jun-Sang Yoo; Hongjae Lee; Seung-Won Jung

This paper presents HST, a novel framework for semi-supervised video object segmentation (VOS). HST extracts image and video features using Swin Transformer and Video Swin Transformer, inheriting their inductive bias toward spatiotemporal locality, which is essential for temporally coherent VOS. To take full advantage of both feature types, HST casts the image features as the query and the video features as the memory. By applying efficient memory read operations at multiple scales, HST produces hierarchical features for the precise reconstruction of object masks. HST is effective and robust in challenging scenarios with occluded and fast-moving objects against cluttered backgrounds. In particular, HST-B outperforms state-of-the-art competitors on multiple popular benchmarks: YouTube-VOS (85.0%), DAVIS 2017 (85.9%), and DAVIS 2016 (94.0%).
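The query/memory formulation in the abstract can be illustrated with a minimal sketch of a multi-scale memory read. This is not the authors' implementation: it uses plain dense scaled dot-product attention in NumPy rather than the efficient read operations the abstract mentions, and the function names (`memory_read`, `hierarchical_read`), the feature shapes, and the two-scale setup are all assumptions chosen for illustration.

```python
import numpy as np


def memory_read(query, mem_keys, mem_values):
    """Dense attention read (illustrative, not HST's efficient variant).

    query:      (Nq, C)  flattened image features of the current frame
    mem_keys:   (Nm, C)  flattened video features of past frames
    mem_values: (Nm, Cv) features stored alongside each memory location
    """
    scale = 1.0 / np.sqrt(query.shape[1])
    affinity = (query @ mem_keys.T) * scale            # (Nq, Nm)
    # Subtract the row-wise maximum for a numerically stable softmax.
    affinity = affinity - affinity.max(axis=1, keepdims=True)
    weights = np.exp(affinity)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each query location receives a weighted sum of memory values.
    return weights @ mem_values, weights


def hierarchical_read(queries, memories):
    """Apply the same read independently at every scale."""
    return [memory_read(q, k, v)[0] for q, (k, v) in zip(queries, memories)]


# Hypothetical shapes per scale: (query locations, memory locations, channels).
rng = np.random.default_rng(0)
scales = [(64, 128, 32), (16, 32, 64)]
queries = [rng.standard_normal((nq, c)) for nq, _, c in scales]
memories = [
    (rng.standard_normal((nm, c)), rng.standard_normal((nm, c)))
    for _, nm, c in scales
]

outputs = hierarchical_read(queries, memories)
print([o.shape for o in outputs])
```

Each scale yields one read-out per query location, so the outputs form a feature pyramid that a decoder could combine into an object mask. The dense affinity matrix grows with the product of query and memory sizes, which is the cost an efficient read operation would aim to reduce.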

updated: Mon Jul 17 2023 06:12:26 GMT+0000 (UTC)

published: Mon Jul 17 2023 06:12:26 GMT+0000 (UTC)

arXiv
