Self-supervised Amodal Video Object Segmentation

Jian Yao; Yuxin Hong; Chiyu Wang; Tianjun Xiao; Tong He; Francesco Locatello; David Wipf; Yanwei Fu; Zheng Zhang

Amodal perception requires inferring the full shape of an object that is partially occluded. This task is particularly challenging on two levels: (1) it requires more information than is contained in the instantaneous retinal or imaging-sensor input, and (2) it is difficult to obtain enough well-annotated amodal labels for supervision. To this end, this paper develops a new framework for Self-supervised amodal Video object segmentation (SaVos). Our method efficiently leverages the visual information in video temporal sequences to infer the amodal masks of objects. The key intuition is that the occluded part of an object can be explained away if that part is visible in other frames, possibly deformed, as long as the deformation can be reasonably learned. Accordingly, we derive a novel self-supervised learning paradigm that efficiently uses the visible object parts as supervision to guide training on videos. In addition to learning a type prior to complete masks for known types, SaVos also learns a spatiotemporal prior, which is also useful for the amodal task and could generalize to unseen types. The proposed framework achieves state-of-the-art performance on the synthetic amodal segmentation benchmark FISHBOWL and the real-world benchmark KINS-Video-Car. Further, it lends itself well to being transferred to novel distributions using test-time adaptation, outperforming existing models even after the transfer to a new distribution.
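
The key intuition in the abstract (a part occluded in one frame may be visible in another) can be illustrated with a toy example. The sketch below is not the SaVos model: it assumes a rigid 1-D object whose per-frame position is already known, whereas the paper learns the motion and the mask completion. All names (`visible_mask`, `aggregate_in_object_frame`, `amodal_mask`) and the scene constants are hypothetical and chosen only for illustration.

```python
def visible_mask(obj_start, obj_len, occluder, width):
    """Visible pixels of a 1-D object spanning [obj_start, obj_start + obj_len)."""
    return [
        1 if obj_start <= x < obj_start + obj_len and x not in occluder else 0
        for x in range(width)
    ]


def aggregate_in_object_frame(visible_masks, starts, obj_len):
    """Union of visible pixels after undoing each frame's shift (assumed known)."""
    canonical = [0] * obj_len
    for mask, start in zip(visible_masks, starts):
        for i in range(obj_len):
            x = start + i
            if 0 <= x < len(mask) and mask[x]:
                canonical[i] = 1
    return canonical


def amodal_mask(canonical, start, width):
    """Place the aggregated shape back at the object's position in one frame."""
    return [
        1 if 0 <= x - start < len(canonical) and canonical[x - start] else 0
        for x in range(width)
    ]


# Toy scene (all values are assumptions): a 4-pixel object slides right
# behind a static 2-pixel occluder.
WIDTH = 12
OCCLUDER = {5, 6}
STARTS = [2, 3, 4, 5]  # object position per frame, assumed known here
OBJ_LEN = 4

frames = [visible_mask(s, OBJ_LEN, OCCLUDER, WIDTH) for s in STARTS]
canonical = aggregate_in_object_frame(frames, STARTS, OBJ_LEN)
completed = [amodal_mask(canonical, s, WIDTH) for s in STARTS]

print("visible in frame 1:", frames[1])
print("amodal  in frame 1:", completed[1])
```

In frame 1 only two of the four object pixels are visible, yet every pixel is seen in at least one frame, so the union taken in object coordinates recovers the full shape. In this toy setting the visible masks are the only supervision available, which mirrors the self-supervised setup described above in a heavily simplified form.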

updated: Sun Oct 23 2022 14:09:35 GMT+0000 (UTC)

published: Sun Oct 23 2022 14:09:35 GMT+0000 (UTC)

arXiv
