VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

Xudong Wang; Ishan Misra; Ziyun Zeng; Rohit Girdhar; Trevor Darrell

Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. Our key insight is that using high-quality pseudo masks and a simple video synthesis method for model training is surprisingly sufficient to enable the resulting video model to effectively segment and track multiple instances across video frames. We show the first competitive unsupervised learning results on the challenging YouTubeVIS-2019 benchmark, achieving 50.7% AP^video_50, surpassing the previous state-of-the-art by a large margin. VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS-2019 in terms of AP^video.
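The AP^video_50 figure quoted above counts a predicted mask track as correct when its spatio-temporal IoU with a ground-truth track reaches 0.5. As a rough illustration of that matching criterion only (not the authors' evaluation code), the sketch below computes the IoU of two tracks by summing per-frame intersections and unions over the whole clip. The names `video_iou` and `is_match_at_50` are hypothetical, masks are assumed to be plain nested lists of 0/1, and treating an IoU of exactly 0.5 as a match is an assumption.

```python
from typing import List, Sequence

# One binary mask per frame, stored as rows of 0/1 (an assumed toy format).
Mask = Sequence[Sequence[int]]


def video_iou(pred: List[Mask], gt: List[Mask]) -> float:
    """Spatio-temporal IoU: intersections and unions are summed over all frames."""
    if len(pred) != len(gt):
        raise ValueError("both tracks must cover the same frames")
    inter = 0
    union = 0
    for p_frame, g_frame in zip(pred, gt):
        for p_row, g_row in zip(p_frame, g_frame):
            for p, g in zip(p_row, g_row):
                inter += 1 if (p and g) else 0
                union += 1 if (p or g) else 0
    # Two empty tracks have no union; report 0.0 rather than dividing by zero.
    return inter / union if union else 0.0


def is_match_at_50(pred: List[Mask], gt: List[Mask]) -> bool:
    """Assumed convention: IoU of at least 0.5 counts as a match."""
    return video_iou(pred, gt) >= 0.5


# Two-frame toy clip: the prediction is perfect in frame 1 and misses in frame 2.
pred_track = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gt_track = [[[1, 1], [0, 0]], [[0, 1], [0, 0]]]
iou = video_iou(pred_track, gt_track)
```

Because the sums run over the whole clip, a track that segments an object well in some frames but loses it in others is penalized, which is why the metric rewards consistent tracking as well as per-frame mask quality.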

updated: Mon Aug 28 2023 17:10:12 GMT+0000 (UTC)

published: Mon Aug 28 2023 17:10:12 GMT+0000 (UTC)

arXiv
