Two-shot Video Object Segmentation

Kun Yan; Xiao Li; Fangyun Wei; Jinglu Wang; Chenbin Zhang; Ping Wang; Yan Lu

Previous video object segmentation (VOS) methods are trained on densely annotated videos. However, acquiring pixel-level annotations is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos: only two labeled frames are required per training video, and performance is sustained. We term this novel training paradigm two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to a majority of existing frameworks. We first pre-train a VOS model on sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. Then, we use the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are subsequently stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data without any restrictions on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. Using 7.3% and 2.9% of the labeled data of the YouTube-VOS and DAVIS benchmarks, respectively, our approach achieves results comparable to those of counterparts trained on the fully labeled sets. Code and models are available at https://github.com/yk-pku/Two-shot-Video-Object-Segmentation.
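
The three phases described above (pre-train, build a pseudo-label bank, retrain) can be sketched as a small driver. This is a minimal illustration and not the authors' implementation: the `Video` container, the `train` and `predict` callables, the `first_frame_must_be_labeled` flag, and the choice of the earliest labeled frame as the reference are all hypothetical names and assumptions introduced here. Frames and masks are left as opaque objects so the sketch runs without any deep-learning library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Key of the pseudo-label bank: (video name, frame index). Hypothetical layout.
BankKey = Tuple[str, int]


@dataclass
class Video:
    """Hypothetical container for one sparsely annotated training video."""
    name: str
    frames: List[Any]        # opaque frame objects (e.g. image arrays)
    labels: Dict[int, Any]   # frame index -> ground-truth mask (two entries in two-shot VOS)


def build_pseudo_label_bank(
    videos: List[Video],
    predict: Callable[[Video, int, int], Any],
) -> Dict[BankKey, Any]:
    """Phase 2: predict a mask for every unlabeled frame and store it in a bank."""
    bank: Dict[BankKey, Any] = {}
    for video in videos:
        # Assumption: the earliest labeled frame is used as the reference frame.
        ref_idx = min(video.labels)
        for idx in range(len(video.frames)):
            if idx not in video.labels:
                bank[(video.name, idx)] = predict(video, ref_idx, idx)
    return bank


def merged_labels(video: Video, bank: Dict[BankKey, Any]) -> Dict[int, Any]:
    """Combine pseudo labels with ground truth; ground truth takes precedence."""
    merged = {idx: mask for (name, idx), mask in bank.items() if name == video.name}
    merged.update(video.labels)
    return merged


def two_shot_training(
    videos: List[Video],
    train: Callable[..., Any],
    predict: Callable[[Any, Video, int, int], Any],
) -> Tuple[Any, Dict[BankKey, Any]]:
    """Run the three phases with user-supplied (hypothetical) train/predict callables."""
    # Phase 1: pre-train on the sparse labels, first frame always a labeled one.
    pretrained = train(
        videos, [v.labels for v in videos], first_frame_must_be_labeled=True
    )
    # Phase 2: fill the pseudo-label bank using the pre-trained model.
    bank = build_pseudo_label_bank(
        videos, lambda v, ref, idx: predict(pretrained, v, ref, idx)
    )
    # Phase 3: retrain on labeled plus pseudo-labeled frames, no first-frame restriction.
    final = train(
        videos,
        [merged_labels(v, bank) for v in videos],
        first_frame_must_be_labeled=False,
    )
    return final, bank
```

In practice `train` and `predict` would wrap an existing VOS framework; the sketch only fixes the order of the phases and the way the bank is keyed.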

updated: Tue Mar 21 2023 17:59:56 GMT+0000 (UTC)

published: Tue Mar 21 2023 17:59:56 GMT+0000 (UTC)

arXiv
