Make One-Shot Video Object Segmentation Efficient Again

Tim Meinhardt; Laura Leal-Taixe

Video object segmentation (VOS) describes the task of segmenting a set of objects in each frame of a video. In the semi-supervised setting, the first mask of each object is provided at test time. Following the one-shot principle, fine-tuning VOS methods train a segmentation model separately on each given object mask. Recently, however, the VOS community has deemed such test-time optimization, and its impact on test runtime, infeasible. To mitigate the inefficiencies of previous fine-tuning approaches, we present efficient One-Shot Video Object Segmentation (e-OSVOS). In contrast to most VOS approaches, e-OSVOS decouples the object detection task and predicts only local segmentation masks by applying a modified version of Mask R-CNN. The one-shot test runtime and performance are optimized without a laborious, handcrafted hyperparameter search. To this end, we meta-learn the model initialization and the learning rates for the test-time optimization. To achieve optimal learning behavior, we predict individual learning rates at the neuron level. Furthermore, we apply an online adaptation to address the common performance degradation throughout a sequence: the model is continuously fine-tuned on previous mask predictions, supported by a frame-to-frame bounding box propagation. Among one-shot fine-tuning methods, e-OSVOS provides state-of-the-art results on DAVIS 2016, DAVIS 2017, and YouTube-VOS while substantially reducing the test runtime. Code is available at https://github.com/dvl-tum/e-osvos.
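
To make the idea of neuron-level learning rates concrete, the following is a minimal sketch of one fine-tuning step on a single linear layer, where each output neuron (one row of the weight matrix) is updated with its own learning rate. It is not the authors' implementation: the function names `squared_error_grads` and `fine_tune_step`, the squared-error loss, and the hand-picked learning rates are all assumptions for illustration. In e-OSVOS the learning rates and the initialization are meta-learned, which this sketch does not show.

```python
from typing import List

Matrix = List[List[float]]


def squared_error_grads(weights: Matrix, x: List[float], target: List[float]) -> Matrix:
    """Gradient of 0.5 * sum_i (y_i - t_i)^2 with y_i = sum_j W_ij * x_j.

    The loss is an illustrative assumption, not the segmentation loss of the paper.
    """
    outputs = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
    return [[(y - t) * xj for xj in x] for y, t in zip(outputs, target)]


def fine_tune_step(weights: Matrix, grads: Matrix, neuron_lrs: List[float]) -> Matrix:
    """Apply one gradient step in which row i (output neuron i) uses neuron_lrs[i]."""
    if not (len(weights) == len(grads) == len(neuron_lrs)):
        raise ValueError("expected one gradient row and one learning rate per neuron")
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row, lr in zip(weights, grads, neuron_lrs)
    ]


# Hypothetical values: neuron 0 adapts, neuron 1 is frozen by a zero learning rate.
weights = [[0.0, 0.0], [0.0, 0.0]]
grads = squared_error_grads(weights, x=[1.0, 2.0], target=[1.0, 1.0])
updated = fine_tune_step(weights, grads, neuron_lrs=[0.1, 0.0])
```

With a single shared learning rate both rows would move by the same amount; per-neuron rates let the optimization adapt some units quickly while leaving others close to the initialization.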

updated: Thu Dec 03 2020 12:21:23 GMT+0000 (UTC)

published: Thu Dec 03 2020 12:21:23 GMT+0000 (UTC)

arXiv
