In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

Xiao Pan; Peike Li; Zongxin Yang; Huiling Zhou; Chang Zhou; Hongxia Yang; Jingren Zhou; Yi Yang

In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm and optimize at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled feature of a ResNet) learns robust high-level semantics but is sub-optimal because the pixel-level features are optimized only implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of the training data and is not robust to object deformation. To perform these two levels of optimization complementarily in a unified framework, we propose In-aNd-Out (INO) generative learning, which takes a purely generative perspective and builds on the class tokens and patch tokens that are native to the Vision Transformer (ViT). Specifically, for image-level optimization, we enforce out-view imagination from local to global views on class tokens, which helps capture high-level semantics; we name this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we term this in-generative learning. To better discover temporal information, we additionally enforce inter-frame consistency at both the feature level and the affinity-matrix level. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that INO outperforms previous state-of-the-art methods by significant margins.
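The abstract defines visual correspondence as similarity between pixel-level features and refers to an affinity matrix between frames. The sketch below illustrates one common way such an affinity can be computed and used to carry a mask from one frame to another. It is not the authors' implementation: the function names, the cosine-similarity-plus-softmax formulation, the temperature value of 0.07, and the label-propagation step are all assumptions made for illustration.

```python
import numpy as np


def affinity_matrix(feat_ref, feat_tgt, temperature=0.07):
    """Row-stochastic affinity between target and reference pixel features.

    feat_ref: (N_ref, C) features of the reference frame, one row per pixel/patch.
    feat_tgt: (N_tgt, C) features of the target frame.
    Returns an (N_tgt, N_ref) matrix whose rows sum to 1.
    The temperature is a hypothetical choice, not a value from the paper.
    """
    # L2-normalise so that the dot product is cosine similarity.
    ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    logits = tgt @ ref.T / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def propagate_labels(affinity, labels_ref):
    """Transfer one-hot reference labels (N_ref, K) to the target frame.

    Returns soft labels of shape (N_tgt, K); argmax over K gives a mask.
    """
    return affinity @ labels_ref


if __name__ == "__main__":
    # Toy example: 4 "pixels" with orthogonal features; the target frame
    # is the reference frame with its pixels shuffled.
    feat_ref = np.eye(4)
    order = [2, 0, 3, 1]
    feat_tgt = feat_ref[order]
    labels_ref = np.eye(2)[[0, 0, 1, 1]]  # two objects, one-hot per pixel

    aff = affinity_matrix(feat_ref, feat_tgt)
    print(propagate_labels(aff, labels_ref).argmax(axis=1))
```

In this toy setting each target pixel matches exactly one reference pixel, so the propagated labels follow the shuffle. With real features the affinity rows are softer, which is why the quality of the pixel-level features matters for segmentation.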

updated: Mon Apr 11 2022 08:12:07 GMT+0000 (UTC)

published: Tue Mar 29 2022 07:56:21 GMT+0000 (UTC)

arXiv
