Associating Objects with Scalable Transformers for Video Object Segmentation

Zongxin Yang; Jiaxu Miao; Xiaohan Wang; Yunchao Wei; Yi Yang

This paper investigates how to realize better and more efficient embedding learning for semi-supervised video object segmentation under challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and thus have to match and segment each target separately under multi-object scenarios, consuming several times the computational resources. To solve this problem, we propose an Associating Objects with Transformers (AOT) approach that matches and decodes multiple objects jointly and collaboratively. In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space. Thus, we can process the matching and segmentation decoding of multiple objects simultaneously, as efficiently as processing a single object. To sufficiently model multi-object association, a Long Short-Term Transformer (LSTT) is devised to construct hierarchical matching and propagation. Based on AOT, we further propose a more flexible and robust framework, Associating Objects with Scalable Transformers (AOST), in which a scalable version of LSTT is designed to enable run-time adaptation of accuracy-efficiency trade-offs. In addition, AOST introduces a better layer-wise way to couple identification and vision embeddings. We conduct extensive experiments on multi-object and single-object benchmarks to examine the AOT series of frameworks. Compared to state-of-the-art competitors, our methods deliver superior performance while running several times more efficiently. Notably, we achieve new state-of-the-art performance on three popular benchmarks, i.e., YouTube-VOS (86.5%), DAVIS 2017 Val/Test (87.0%/84.7%), and DAVIS 2016 (93.0%). Project page: https://github.com/z-x-yang/AOT.
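The identification mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `embed_object_masks`, the bank size, the embedding width, and the choice to leave background pixels at zero are all assumptions made for illustration. The idea shown is only that each object is assigned its own identity vector, so that masks for several objects collapse into one shared embedding map that can be processed in a single pass.

```python
import numpy as np

def embed_object_masks(masks, id_bank):
    """Map per-object masks into one shared embedding (illustrative only).

    masks:   (N, H, W) boolean or 0/1 array, one mask per object.
    id_bank: (M, C) array of identity vectors, with M >= N (hypothetical bank).
    Returns an (H, W, C) array where each pixel holds the identity vector
    of the object covering it; uncovered pixels stay at zero.
    """
    n_objects = masks.shape[0]
    if n_objects > id_bank.shape[0]:
        raise ValueError("more objects than identity vectors in the bank")
    # Sum over the object axis: every pixel picks up its object's vector.
    return np.einsum("nhw,nc->hwc", masks.astype(id_bank.dtype), id_bank[:n_objects])

# Toy example: a 2x3 frame with two objects (labels 1 and 2) and background 0.
rng = np.random.default_rng(0)
id_bank = rng.standard_normal((10, 4))       # assumed: 10 identities, 4 channels
labels = np.array([[0, 1, 1],
                   [2, 2, 0]])
masks = np.stack([labels == k for k in (1, 2)])
embedding = embed_object_masks(masks, id_bank)
```

Because the output has the same shape regardless of how many objects are present, downstream matching and decoding would run once per frame rather than once per object, which is the efficiency argument the abstract makes.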

updated: Tue Mar 22 2022 03:33:27 GMT+0000 (UTC)

published: Tue Mar 22 2022 03:33:27 GMT+0000 (UTC)

arXiv
