RPM-Net: Robust Pixel-Level Matching Networks for Self-Supervised Video Object Segmentation

Youngeun Kim; Seokeon Choi; Hankyeol Lee; Taekyung Kim; Changick Kim

In this paper, we introduce a self-supervised approach for video object segmentation without human-labeled data. Specifically, we present Robust Pixel-level Matching Networks (RPM-Net), a novel deep architecture that matches pixels between adjacent frames, using only color information from unlabeled videos for training. Technically, RPM-Net can be separated into two main modules. The embedding module first projects input images into a high-dimensional embedding space. Then the matching module with deformable convolution layers matches pixels between reference and target frames based on the embedding features. Unlike previous methods using deformable convolution, our matching module adopts deformable convolution to focus on similar features in spatio-temporally neighboring pixels. Our experiments show that the selective feature sampling improves the robustness to challenging problems in video object segmentation such as camera shake, fast motion, deformation, and occlusion. Also, we carry out comprehensive experiments on three public datasets (i.e., DAVIS-2017, SegTrack-v2, and Youtube-Objects) and achieve state-of-the-art performance on self-supervised video object segmentation. Moreover, we significantly reduce the performance gap between self-supervised and fully-supervised video object segmentation (41.0% vs. 52.5% on DAVIS-2017 validation set).
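
To make the idea of pixel-level matching between a reference and a target frame concrete, here is a minimal sketch in plain Python. It is not RPM-Net's matching module: the abstract does not specify the layers, and the deformable convolutions are replaced here by a fixed local window with softmax-weighted dot-product affinities, which is an assumption made for illustration. The names `propagate`, `radius`, and `temperature` are hypothetical, and the toy embeddings are hand-written rather than produced by an embedding network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def propagate(ref_embed, tgt_embed, ref_values, radius=1, temperature=0.1):
    """Copy per-pixel values from a reference frame to a target frame.

    ref_embed, tgt_embed: H x W grids of embedding vectors (lists of floats).
    ref_values: H x W grid of value vectors (assumed here: colors during
    training, one-hot mask labels at inference).
    Each target pixel attends only to reference pixels within `radius`
    (a fixed window, standing in for learned sampling locations).
    """
    height, width = len(tgt_embed), len(tgt_embed[0])
    channels = len(ref_values[0][0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbors = [
                (ny, nx)
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # Dot-product similarity between the target pixel and each neighbor.
            scores = [
                sum(a * b for a, b in zip(tgt_embed[y][x], ref_embed[ny][nx]))
                / temperature
                for ny, nx in neighbors
            ]
            weights = softmax(scores)
            # Affinity-weighted average of the reference values.
            row.append([
                sum(w * ref_values[ny][nx][c]
                    for w, (ny, nx) in zip(weights, neighbors))
                for c in range(channels)
            ])
        out.append(row)
    return out


# Toy 1 x 3 frames: the object (embedding [1, 0]) moves one pixel to the right.
ref_embed = [[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]]
tgt_embed = [[[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]]
# One-hot labels per pixel: [background, object].
ref_labels = [[[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]]

propagated = propagate(ref_embed, tgt_embed, ref_labels)
mask = [[max(range(2), key=lambda c: pixel[c]) for pixel in row]
        for row in propagated]
print(mask)
```

In this sketch the same function would serve both phases: fed with reference colors it yields a reconstructed target frame that a color-based loss could compare against the real one, and fed with a reference mask it carries the segmentation forward to the next frame.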

updated: Thu Oct 10 2019 12:02:26 GMT+0000 (UTC)

published: Sun Sep 29 2019 10:07:02 GMT+0000 (UTC)

arXiv
