Joint Inductive and Transductive Learning for Video Object Segmentation

Yunyao Mao; Ning Wang; Wengang Zhou; Houqiang Li

Semi-supervised video object segmentation is the task of segmenting the target object in a video sequence given only a mask annotation in the first frame. The limited information available makes it extremely challenging. Most previous best-performing methods adopt either matching-based transductive reasoning or online inductive learning. However, they are either less discriminative for similar instances or make insufficient use of spatio-temporal information. In this work, we propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation. The proposed approach consists of two functional branches. The transduction branch adopts a lightweight transformer architecture to aggregate rich spatio-temporal cues, while the induction branch performs online inductive learning to obtain discriminative target information. To bridge these two diverse branches, a two-head label encoder is introduced to learn a suitable target prior for each of them. The generated mask encodings are further forced to be disentangled to better retain their complementarity. Extensive experiments on several prevalent benchmarks show that, without the need for synthetic training data, the proposed approach sets a series of new state-of-the-art records. Code is available at https://github.com/maoyunyao/JOINT.
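
To make the contrast between the two branches concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the repository above for that). It assumes that plain scaled dot-product attention can stand in for the lightweight transformer of the transduction branch, and that closed-form ridge regression can stand in for the online learner of the induction branch. All function names, shapes, and the random "features" are hypothetical, and the two-head label encoder and the disentanglement constraint are omitted.

```python
import numpy as np


def transduce(query_feats, memory_feats, memory_labels):
    """Transduction stand-in: propagate memory labels to query pixels via attention."""
    # Scaled dot-product affinity between every query pixel and every memory pixel.
    logits = query_feats @ memory_feats.T / np.sqrt(query_feats.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over memory
    # Each query pixel receives a convex combination of the memory labels.
    return weights @ memory_labels


def induce(memory_feats, memory_labels, reg=1e-2):
    """Induction stand-in: fit a linear target model on memory (ridge regression)."""
    dim = memory_feats.shape[1]
    gram = memory_feats.T @ memory_feats + reg * np.eye(dim)
    return np.linalg.solve(gram, memory_feats.T @ memory_labels)


# Hypothetical toy data: 50 memory pixels, 20 query pixels, 8-dim features,
# and 2-channel labels standing in for a mask encoding.
rng = np.random.default_rng(0)
memory_feats = rng.normal(size=(50, 8))
true_w = rng.normal(size=(8, 2))
memory_labels = memory_feats @ true_w
query_feats = rng.normal(size=(20, 8))

trans_out = transduce(query_feats, memory_feats, memory_labels)  # no model is fit
target_model = induce(memory_feats, memory_labels)               # explicit model
induc_out = query_feats @ target_model

# A decoder would consume both outputs; concatenation is the simplest fusion.
fused = np.concatenate([trans_out, induc_out], axis=1)
```

The transductive path reasons directly from labelled memory pixels to query pixels without fitting a model, whereas the inductive path first fits an explicit target model and then applies it to the query. The abstract argues that these two behaviours are complementary.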

updated: Sun Aug 08 2021 16:25:48 GMT+0000 (UTC)

published: Sun Aug 08 2021 16:25:48 GMT+0000 (UTC)

arXiv
