Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities

Andrii Zadaianchuk; Maximilian Seitzer; Georg Martius

Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
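
The abstract says the temporal feature similarity loss "encodes temporal correlations between image patches" but gives no formula. The sketch below is one plausible reading and not the authors' implementation: it compares patch features from two frames by cosine similarity and turns each row into a target distribution that a model could be trained to predict. The function names, the softmax with a `temperature` parameter, the cross-entropy objective, and the random arrays standing in for pre-trained self-supervised features are all assumptions made for illustration.

```python
import numpy as np

def temporal_similarity_targets(feats_t, feats_tk, temperature=0.1):
    """Hypothetical helper: patch-to-patch similarity between two frames.

    feats_t, feats_tk: arrays of shape (num_patches, dim) holding patch
    features of an earlier and a later frame (assumed to come from a frozen
    pre-trained encoder; random data is used below).
    Returns a (num_patches, num_patches) row-stochastic matrix.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = feats_t / np.linalg.norm(feats_t, axis=1, keepdims=True)
    b = feats_tk / np.linalg.norm(feats_tk, axis=1, keepdims=True)
    sim = a @ b.T
    # Assumed design choice: a temperature softmax over the later frame's patches.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def cross_entropy(pred, target, eps=1e-8):
    """Mean cross-entropy between predicted and target row distributions."""
    return float(-(target * np.log(pred + eps)).sum(axis=1).mean())

# Stand-in features: 16 patches with 32 dimensions per frame.
rng = np.random.default_rng(0)
feats_t = rng.normal(size=(16, 32))
feats_tk = rng.normal(size=(16, 32))

targets = temporal_similarity_targets(feats_t, feats_tk)
# A model that predicts a uniform distribution serves as a trivial baseline.
uniform = np.full_like(targets, 1.0 / targets.shape[1])
loss = cross_entropy(uniform, targets)

# Sanity case: comparing a frame with itself puts the most mass on the diagonal.
targets_same = temporal_similarity_targets(feats_t, feats_t)
```

Under this reading, a patch that moves between frames has its highest similarity at a different location in the later frame, which is how such a target could carry the motion bias the abstract mentions.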

updated: Wed Jun 07 2023 23:18:14 GMT+0000 (UTC)

published: Wed Jun 07 2023 23:18:14 GMT+0000 (UTC)

arXiv
