Scalable Video Object Segmentation with Simplified Framework

Qiangqiang Wu; Tianyu Yang; Wei Wu; Antoni Chan

Currently popular video object segmentation (VOS) methods implement feature matching through several hand-crafted modules that perform feature extraction and matching separately. Empirically, these hand-crafted designs lead to insufficient target interaction, which limits dynamic target-aware feature learning in VOS. To address these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework that performs joint feature extraction and matching with a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone that simultaneously extracts features and matches query features against reference features. This design enables SimVOS to learn better target-aware features for accurate mask prediction. More importantly, SimVOS can directly apply well-pretrained ViT backbones (e.g., MAE) to VOS, which bridges the gap between VOS and large-scale self-supervised pre-training. To achieve a better performance-speed trade-off, we further explore within-frame attention and propose a new token refinement module that improves running speed and reduces computational cost. Experimentally, SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks, i.e., DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9% J&F) and YouTube-VOS 2019 (84.2% J&F), without the synthetic video or BL30K pre-training used in previous VOS approaches.
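
The idea of joint feature extraction and matching can be illustrated with a minimal sketch: reference-frame tokens and query-frame tokens are concatenated into one sequence and passed through ordinary self-attention, so that each query token attends to reference tokens inside the same layer that updates its features. This is not the authors' implementation. The function names, the identity Q/K/V projections, the single attention head, and the toy 2-dimensional tokens are all assumptions made for illustration, and the reference tokens are assumed to already carry the target mask information.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Identity projections are a simplification for illustration; a real
    ViT layer uses learned projections, several heads, and an MLP.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


def joint_extract_and_match(ref_tokens, query_tokens):
    """Hypothetical helper: run one attention pass over [reference; query].

    Returns the updated query tokens and, for each query token, its
    attention weights over the reference tokens (the "matching" part).
    """
    tokens = ref_tokens + query_tokens  # concatenate along the sequence axis
    outputs, weights = self_attention(tokens)
    n_ref = len(ref_tokens)
    query_out = outputs[n_ref:]
    query_to_ref = [row[:n_ref] for row in weights[n_ref:]]
    return query_out, query_to_ref


# Toy example: two reference tokens and one query token.
ref = [[1.0, 0.0], [0.0, 1.0]]
query = [[1.0, 0.0]]
query_out, query_to_ref = joint_extract_and_match(ref, query)
print(query_to_ref[0])  # attention of the query token over the reference tokens
```

In this toy setup the query token puts more weight on the reference token it resembles, which is the matching behavior that separate hand-crafted modules would otherwise compute after feature extraction.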

updated: Sat Aug 19 2023 04:30:48 GMT+0000 (UTC)

published: Sat Aug 19 2023 04:30:48 GMT+0000 (UTC)

arXiv
