Contextual Guided Segmentation Framework for Semi-supervised Video Instance Segmentation

Trung-Nghia Le; Tam V. Nguyen; Minh-Triet Tran

In this paper, we propose the Contextual Guided Segmentation (CGS) framework, which performs video instance segmentation in three passes. In the first pass, preview segmentation, we propose Instance Re-Identification Flow to estimate the main properties of each instance (human/non-human, rigid/deformable, known/unknown category) by propagating its preview mask to other frames. In the second pass, contextual segmentation, we introduce multiple contextual segmentation schemes. For human instances, we develop skeleton-guided segmentation within a frame, along with object flow to correct and refine the result across frames. For a non-human instance that has a wide variation in appearance and belongs to a known category (which can be inferred from the initial mask), we adopt instance segmentation. If the non-human instance is nearly rigid, we train fully convolutional networks (FCNs) on images synthesized from the first frame of the video sequence. In the final pass, guided segmentation, we develop a novel fine-grained segmentation method on non-rectangular regions of interest (ROIs). The natural-shaped ROI is generated by applying guided attention from the frames neighboring the current one, which reduces ambiguity in the segmentation of different overlapping instances. Forward mask propagation is followed by backward mask propagation to further restore instance fragments that are missing due to re-appearing instances, fast motion, occlusion, or heavy deformation. Finally, instances in each frame are merged based on their depth values, together with human and non-human object interaction and rare instance priority. Experiments conducted on the DAVIS Test-Challenge dataset demonstrate the effectiveness of the proposed framework. We consistently placed 3rd in the DAVIS Challenges 2017-2019, with 75.4%, 72.4%, and 78.4% in terms of global score, region similarity, and contour accuracy, respectively.
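The three reported numbers are related: in the DAVIS benchmark, the global score is the mean of region similarity (J, the Jaccard index between predicted and ground-truth masks) and contour accuracy (F). The sketch below illustrates this relationship only. It is not the authors' evaluation code. The function names and the nested-list mask representation are assumptions made for this example, and contour accuracy, which needs boundary matching, is not implemented.

```python
def region_similarity(pred, gt):
    """Jaccard index (intersection over union) of two binary masks.

    Masks are given as equally sized nested lists of 0/1 values
    (a hypothetical representation chosen for this sketch).
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def global_score(j_mean, f_mean):
    """Global score as the average of region similarity and contour accuracy."""
    return (j_mean + f_mean) / 2.0


# Toy masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)

# Plugging in the J and F percentages quoted in the abstract.
reported = global_score(72.4, 78.4)
```

With the abstract's values, `global_score(72.4, 78.4)` reproduces the stated global score of 75.4 up to floating-point rounding, which confirms that the three figures are internally consistent.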

updated: Mon Apr 11 2022 08:46:16 GMT+0000 (UTC)

published: Mon Jun 07 2021 04:16:50 GMT+0000 (UTC)

arXiv
