Associating Objects with Transformers for Video Object Segmentation

Zongxin Yang; Yunchao Wei; Yi Yang

This paper investigates how to achieve better and more efficient embedding learning for semi-supervised video object segmentation in challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and thus have to match and segment each target separately in multi-object scenarios, consuming multiple times the computing resources. To solve this problem, we propose the Associating Objects with Transformers (AOT) approach, which matches and decodes multiple objects uniformly. Specifically, AOT employs an identification mechanism to associate multiple targets in the same high-dimensional embedding space. We can thus process the matching and segmentation decoding of multiple objects simultaneously, as efficiently as processing a single object. To model multi-object association sufficiently, we design a Long Short-Term Transformer that constructs hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks of different complexities. In particular, our R50-AOT-L outperforms all state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (84.1% J&F), DAVIS 2017 (84.9%), and DAVIS 2016 (91.1%), while maintaining a more than 3× faster multi-object run-time. Meanwhile, our AOT-T maintains real-time multi-object speed on the above benchmarks. Based on AOT, we ranked 1st in the 3rd Large-scale VOS Challenge.
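
The idea of associating multiple targets in one embedding space can be illustrated with a small sketch. It assumes, as an illustration only, that each target (and the background) is assigned one vector from a shared bank, and that a multi-object label map is encoded by a per-pixel weighted sum over that bank. The names `make_identity_bank` and `id_embedding` are hypothetical, the random vectors stand in for what would be learned parameters, and none of this is taken from the authors' code.

```python
import random


def make_identity_bank(num_ids, dim, seed=0):
    """Hypothetical bank of identity vectors, one per slot.

    Random values are used here for illustration; in a trained model
    these vectors would be learned.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_ids)]


def one_hot(label, num_classes):
    """One-hot row vector for a single pixel label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def id_embedding(label_map, bank):
    """Encode an H x W label map (0 = background, 1.. = targets)
    into an H x W grid of dim-length vectors.

    Each pixel's vector is the one-hot label times the bank, so the
    output size depends on the bank's dimension, not on how many
    targets appear in the label map.
    """
    num_ids, dim = len(bank), len(bank[0])
    out = []
    for row in label_map:
        out_row = []
        for label in row:
            if not 0 <= label < num_ids:
                raise ValueError("label exceeds the identity bank capacity")
            weights = one_hot(label, num_ids)
            vec = [sum(weights[k] * bank[k][c] for k in range(num_ids))
                   for c in range(dim)]
            out_row.append(vec)
        out.append(out_row)
    return out


# Three targets plus background share one 8-dimensional embedding map.
bank = make_identity_bank(num_ids=4, dim=8)
label_map = [[0, 1, 1],
             [2, 2, 3]]
emb = id_embedding(label_map, bank)
print(len(emb), len(emb[0]), len(emb[0][0]))
```

Under these assumptions, adding more targets changes only which bank vectors are selected, not the shape of the embedding, which is the property that lets later matching and decoding run once for all objects rather than once per object.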

updated: Sat Oct 30 2021 20:14:46 GMT+0000 (UTC)

published: Fri Jun 04 2021 17:59:57 GMT+0000 (UTC)

arXiv
