Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering

Lin Xi; Weihai Chen; Xingming Wu; Zhong Liu; Zhengguo Li

Online unsupervised video object segmentation (UVOS) takes the previous frames as input and automatically separates the primary object(s) from a streaming video without any further manual annotation. A major challenge is that the model has no access to future frames and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm that takes optical flow as its input is proposed for online UVOS. It exploits the common fate principle: visual elements tend to be perceived as a group if they share the same motion pattern. We build a simple and effective auto-encoder that iteratively summarizes non-learnable prototypical bases for the motion pattern, while the bases in turn help learn the representation of the embedding network. Further, a contrastive learning strategy based on a boundary prior is developed to improve foreground and background feature discrimination in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, dataset) and performed in an online fashion. Experiments on the DAVIS_16, FBMS, and SegTrackV2 datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by margins of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using online deep subspace clustering to tackle motion grouping, our method achieves higher accuracy at 3× faster inference than the SoTA online UVOS method, making a good trade-off between effectiveness and efficiency.
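The idea of iteratively summarizing prototypical bases from optical flow, and then using a boundary prior to decide which group is background, can be illustrated with a toy sketch. This is not the authors' auto-encoder or embedding network: it swaps in a plain soft k-means over raw per-pixel flow vectors, and the function names (`cluster_flow`, `foreground_from_boundary_prior`), the temperature value, and the initialization rule are all assumptions made for illustration.

```python
import numpy as np


def cluster_flow(flow, num_bases=2, num_iters=10, temperature=0.1):
    """Toy soft k-means over per-pixel flow vectors (illustrative only).

    flow: array of shape (H, W, 2) holding (dx, dy) per pixel.
    Returns hard labels of shape (H, W) and bases of shape (num_bases, 2).
    """
    h, w, _ = flow.shape
    feats = flow.reshape(-1, 2).astype(np.float64)  # (N, 2)

    # Assumed deterministic init: pick pixels spread across flow magnitudes.
    order = np.argsort(np.linalg.norm(feats, axis=1))
    picks = np.linspace(0, len(order) - 1, num_bases).astype(int)
    bases = feats[order[picks]].copy()

    for _ in range(num_iters):
        # Squared distance from every pixel to every base: (N, K).
        d2 = ((feats[:, None, :] - bases[None, :, :]) ** 2).sum(axis=-1)
        # Soft assignment via a numerically stable softmax.
        logits = -d2 / temperature
        logits -= logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # Bases are recomputed as weighted means, not learned by gradient.
        bases = (resp.T @ feats) / (resp.sum(axis=0)[:, None] + 1e-8)

    labels = resp.argmax(axis=1).reshape(h, w)
    return labels, bases


def foreground_from_boundary_prior(labels):
    """Treat the cluster that dominates the image border as background."""
    border = np.concatenate(
        [labels[0], labels[-1], labels[:, 0], labels[:, -1]]
    )
    background = np.bincount(border).argmax()
    return labels != background
```

Because the bases are recomputed in closed form from the current assignments, the same routine could be run per frame as each flow field arrives, which loosely mirrors the online setting described above. On a synthetic flow field that is zero everywhere except for a rectangle moving to the right, the returned mask covers exactly that rectangle.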

updated: Wed Jun 21 2023 06:40:31 GMT+0000 (UTC)

published: Wed Jun 21 2023 06:40:31 GMT+0000 (UTC)

arXiv
