Scalable Video Object Segmentation with Identification Mechanism

Zongxin Yang; Xiaohan Wang; Jiaxu Miao; Yunchao Wei; Wenguan Wang; Yi Yang

This paper addresses the challenge of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, which limits the learning of multi-object representations because each target must be matched and segmented separately in multi-object scenarios. In addition, earlier techniques were tailored to specific application objectives and lacked the flexibility to meet different speed-accuracy requirements. To address these problems, we present two approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). For effective multi-object modeling, AOT introduces the IDentification (ID) mechanism, which assigns each object a unique identity. This enables the network to model the associations among all objects simultaneously, so that all objects can be tracked and segmented in a single network pass. To address inflexible deployment, AOST further integrates scalable long short-term transformers that incorporate layer-wise ID-based attention and scalable supervision. This overcomes the representation limitations of ID embeddings and enables online architecture scalability in VOS for the first time. Given the absence of a VOS benchmark with dense multi-object annotations, we propose a challenging Video Object Segmentation in the Wild (VOSW) benchmark to validate our approaches. We evaluate various AOT and AOST variants in extensive experiments across VOSW and five commonly used VOS benchmarks. Our approaches surpass state-of-the-art competitors and consistently display exceptional efficiency and scalability across all six benchmarks. Moreover, we achieved 1st place in the 3rd Large-scale Video Object Segmentation Challenge.
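The abstract gives no implementation details for the ID mechanism, so the following is only a toy sketch of the general idea under stated assumptions: each object receives a distinct vector from a shared identity bank, and the per-object masks are merged into one embedding map that a network could process in a single pass. The names `assign_identities`, `embed_masks`, and `identity_bank`, the bank size, and the use of random (rather than learned) vectors are all hypothetical and are not taken from the paper.

```python
import numpy as np


def assign_identities(num_objects, bank_size, rng):
    """Pick a distinct identity index for each object (hypothetical helper)."""
    if num_objects > bank_size:
        raise ValueError("more objects than identities in the bank")
    # A random permutation guarantees that no two objects share an index.
    return rng.permutation(bank_size)[:num_objects]


def embed_masks(masks, identity_bank, assignment):
    """Merge per-object masks into a single ID embedding map.

    masks:         (N, H, W) binary masks, one per object
    identity_bank: (M, C) identity vectors (random here; assumed learned in practice)
    assignment:    (N,) identity index chosen for each object
    returns:       (H, W, C) map; each pixel carries the vector of its object,
                   and background pixels stay at zero
    """
    vectors = identity_bank[assignment]  # (N, C)
    return np.einsum("nhw,nc->hwc", masks.astype(vectors.dtype), vectors)


rng = np.random.default_rng(0)
identity_bank = rng.standard_normal((10, 4))  # 10 identities, 4 channels (assumed sizes)

# Two toy objects on a 3x3 frame: object 0 fills the top row, object 1 the bottom row.
masks = np.zeros((2, 3, 3), dtype=np.uint8)
masks[0, 0, :] = 1
masks[1, 2, :] = 1

assignment = assign_identities(2, 10, rng)
id_map = embed_masks(masks, identity_bank, assignment)
```

Because every object is written into the same `id_map`, a downstream model would see all targets at once instead of one positive object per pass, which is the property the abstract attributes to the ID mechanism.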

updated: Mon Jul 03 2023 04:58:30 GMT+0000 (UTC)

published: Tue Mar 22 2022 03:33:27 GMT+0000 (UTC)

arXiv

