Reliability-Hierarchical Memory Network for Scribble-Supervised Video Object Segmentation

Zikun Zhou; Kaige Mao; Wenjie Pei; Hongpeng Wang; Yaowei Wang; Zhenyu He

This paper addresses video object segmentation (VOS) in a scribble-supervised manner: VOS models are trained with sparse scribble annotations and are also initialized with sparse target scribbles at inference. The annotation burden for both training and initialization is thus substantially reduced. Scribble-supervised VOS is difficult in two respects. On the one hand, the model must learn effectively from sparse scribble annotations during training. On the other hand, it must reason well at inference given only a sparse initial target scribble. We propose a Reliability-Hierarchical Memory Network (RHMNet) that predicts the target mask with a step-wise expanding strategy with respect to the memory reliability level. Specifically, RHMNet first uses only the memory at the high-reliability level to locate the high-reliability region of the target, that is, the region highly similar to the initial target scribble. It then expands the located region to the entire target, conditioned on the region itself and on the memories at all reliability levels. We also propose a scribble-supervised learning mechanism that helps the model learn to predict dense results. It mines the pixel-level relation within a single frame and the frame-level relation within the sequence to take full advantage of the scribble annotations in sequence training samples. Favorable performance on two popular benchmarks demonstrates that our method is promising.
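The two-step "locate, then expand" readout described above can be illustrated with a toy NumPy sketch. This is not RHMNet itself: the abstract gives no architectural details, so the cosine similarity, the fixed threshold, the nearest-neighbor labeling, and all function and variable names below are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def _normalize(x):
    # Row-wise L2 normalization so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def locate_reliable_region(query, high_mem_fg, thresh=0.9):
    """Step 1 (assumed form): flag query pixels that closely match the
    high-reliability target memory, e.g. features under the initial scribble."""
    sim = _normalize(query) @ _normalize(high_mem_fg).T  # (num_query, num_mem)
    return sim.max(axis=1) >= thresh

def expand_to_target(query, reliable, mem_feats, mem_is_fg):
    """Step 2 (assumed form): label every query pixel by its nearest entry
    among the memories of all reliability levels plus the located region."""
    feats = np.concatenate([mem_feats, query[reliable]], axis=0)
    is_fg = np.concatenate([mem_is_fg, np.ones(reliable.sum(), dtype=bool)])
    sim = _normalize(query) @ _normalize(feats).T
    return is_fg[sim.argmax(axis=1)]

# Toy features (hypothetical): one high-reliability target entry, one
# low-reliability target entry, and one background entry.
high_mem_fg = np.array([[1.0, 0.0, 0.0]])
mem_feats = np.array([[1.0, 0.0, 0.0],
                      [0.6, 0.8, 0.0],
                      [0.0, 0.0, 1.0]])
mem_is_fg = np.array([True, True, False])

# Three query pixels: scribble-like target, other target part, background.
query = np.array([[0.99, 0.10, 0.00],
                  [0.50, 0.85, 0.10],
                  [0.05, 0.10, 0.99]])

reliable = locate_reliable_region(query, high_mem_fg)
mask = expand_to_target(query, reliable, mem_feats, mem_is_fg)
```

In this toy setup only the first pixel passes the high-reliability test, and the second pixel is recovered as target only in the expansion step, once the lower-reliability memory is consulted.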

updated: Sat Mar 25 2023 07:21:40 GMT+0000 (UTC)

published: Sat Mar 25 2023 07:21:40 GMT+0000 (UTC)

arXiv
