Weakly Supervised Video Salient Object Detection via Point Supervision

Shuyong Gao; Haozhe Xing; Wei Zhang; Yan Wang; Qianyu Guo; Wenqiang Zhang

Video salient object detection models trained on dense pixel-wise annotations have achieved excellent performance, yet obtaining pixel-by-pixel annotated datasets is laborious. Several works use scribble annotations to mitigate this problem, but point supervision, an even more labor-saving annotation method (indeed the most labor-saving among manual annotation methods for dense prediction), has not been explored. In this paper, we propose a strong baseline model based on point supervision. To infer saliency maps with temporal information, we mine inter-frame complementary information from both short-term and long-term perspectives. Specifically, we propose a hybrid token attention module, which mixes optical flow and image information from orthogonal directions, adaptively highlighting critical optical flow information (channel dimension) and critical token information (spatial dimension). To exploit long-term cues, we develop the Long-term Cross-Frame Attention module (LCFA), which assists the current frame in inferring salient objects based on multi-frame tokens. Furthermore, we create two point-supervised datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and DAVSOD datasets. Experiments on six benchmark datasets show that our method outperforms previous state-of-the-art weakly supervised methods and is even comparable with some fully supervised approaches. Source code and datasets are available.
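The idea of letting the current frame attend to tokens from several frames can be illustrated with a minimal sketch. This is not the authors' LCFA implementation: it is plain scaled dot-product cross-attention with no learned projections, and the function names, tensor shapes, and residual connection are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(current_tokens, reference_tokens):
    """Hypothetical cross-frame attention (illustrative, not the paper's LCFA).

    current_tokens:   (N, C) tokens of the current frame, used as queries.
    reference_tokens: (T, N, C) tokens of T other frames, used as keys and values.
    Returns the fused tokens (N, C) and the attention weights (N, T * N).
    """
    n, c = current_tokens.shape
    # Flatten all reference frames into one token memory of shape (T * N, C).
    memory = reference_tokens.reshape(-1, c)
    # Scaled dot-product similarity between each query and every memory token.
    scores = current_tokens @ memory.T / np.sqrt(c)
    weights = softmax(scores, axis=-1)
    # Weighted sum of memory tokens for each query token.
    attended = weights @ memory
    # Assumed residual connection so the current frame's own features are kept.
    return current_tokens + attended, weights


rng = np.random.default_rng(0)
current = rng.standard_normal((16, 8))        # 16 tokens, 8 channels
references = rng.standard_normal((3, 16, 8))  # 3 reference frames
fused, weights = cross_frame_attention(current, references)
```

Each row of `weights` sums to one, so every current-frame token receives a convex combination of tokens drawn from all reference frames. A trained module would typically add learned query, key, and value projections, which this sketch omits.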

updated: Fri Jul 15 2022 03:31:15 GMT+0000 (UTC)

published: Fri Jul 15 2022 03:31:15 GMT+0000 (UTC)

arXiv
