Scalable Multi-object Identification for Video Object Segmentation

Zongxin Yang; Jiaxu Miao; Xiaohan Wang; Yunchao Wei; Yi Yang

We present a new semi-supervised video object segmentation framework that can process multiple objects in a single network pass and has a dynamically scalable architecture for speed-accuracy trade-offs. State-of-the-art methods typically match and segment a single positive object at a time, so in multi-object scenarios they have to process objects one by one, consuming several times the computational resources. In addition, previous methods use static network architectures, which are not flexible enough to adapt to different speed-accuracy requirements. To solve these problems, we propose Associating Objects with Scalable Transformers (AOST), an approach that matches and segments multiple objects collaboratively with online network scalability. To match and segment multiple objects as efficiently as a single one, AOST employs an IDentification (ID) mechanism that assigns each object a unique identity and associates all objects in a shared high-dimensional embedding space. In addition, a Scalable Long Short-Term Transformer (S-LSTT) is designed to construct hierarchical multi-object associations and enable online adaptation of accuracy-efficiency trade-offs. By further introducing scalable supervision and layer-wise ID-based attention, AOST is not only more flexible but also more robust than previous methods. We conduct extensive experiments on multi-object and single-object benchmarks to evaluate AOST variants. Compared with state-of-the-art competitors, our methods maintain superior run-time efficiency while achieving better performance. Notably, we achieve new state-of-the-art performance on popular VOS benchmarks, i.e., YouTube-VOS (86.5%), DAVIS 2017 Val/Test (87.0%/84.7%), and DAVIS 2016 (93.0%). Project page: https://github.com/z-x-yang/AOT.
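The ID mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_id_bank`, `assign_identities`, `embed_mask`), the random Gaussian bank, and the choice to map background pixels to a zero vector are all assumptions made for illustration. It only shows the general idea of giving each object a distinct identity vector so that a mask containing several objects is embedded into one shared space in a single pass.

```python
import random

def build_id_bank(max_objects, dim, seed=0):
    """Hypothetical identity bank: one fixed vector per identity slot."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(max_objects)]

def assign_identities(object_labels, max_objects, seed=0):
    """Give every object label a distinct slot index in the bank."""
    if len(object_labels) > max_objects:
        raise ValueError("more objects than identity slots")
    rng = random.Random(seed)
    slots = rng.sample(range(max_objects), len(object_labels))
    return dict(zip(object_labels, slots))

def embed_mask(mask, assignment, bank):
    """Map an H x W label mask to an H x W x dim ID embedding in one pass.

    Pixels whose label has no assigned identity (here: background, label 0)
    get a zero vector. That choice is an assumption of this sketch.
    """
    zero = [0.0] * len(bank[0])
    return [
        [bank[assignment[label]] if label in assignment else zero for label in row]
        for row in mask
    ]

# Toy 2 x 3 mask with background (0) and two objects (1 and 2).
mask = [[0, 1, 1],
        [2, 2, 0]]
bank = build_id_bank(max_objects=10, dim=4)
assignment = assign_identities([1, 2], max_objects=10)
embedding = embed_mask(mask, assignment, bank)
```

In this sketch, all objects share one embedding tensor, so the cost of embedding the mask does not grow with the number of objects, which is the property the abstract attributes to the ID mechanism.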

updated: Tue Oct 18 2022 11:41:50 GMT+0000 (UTC)

published: Tue Mar 22 2022 03:33:27 GMT+0000 (UTC)

arXiv
