Unsupervised Open-Vocabulary Object Localization in Videos

Ke Fan; Zechen Bai; Tianjun Xiao; Dominik Zietlow; Max Horn; Zixu Zhao; Carl-Johann Simon-Gabriel; Mike Zheng Shou; Francesco Locatello; Bernt Schiele; Thomas Brox; Zheng Zhang; Yanwei Fu; Tong He

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The text assignment is achieved by reading localized semantic information from the pre-trained CLIP model in an unsupervised way. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
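The second stage described above, assigning text to slots, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each slot has already been mapped to a feature vector in the same embedding space as a set of candidate text embeddings, and it simply picks the label with the highest cosine similarity. The function names, the toy vectors, and the label list are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_labels_to_slots(slot_features, text_embeddings, labels):
    """For each slot, return the label whose text embedding is most similar.

    Assumption: slot_features and text_embeddings live in a shared
    embedding space (in the paper's setting this role is played by CLIP).
    """
    assigned = []
    for slot in slot_features:
        scores = [cosine_similarity(slot, text) for text in text_embeddings]
        best = max(range(len(labels)), key=lambda i: scores[i])
        assigned.append(labels[best])
    return assigned


# Toy data standing in for real slot features and text embeddings.
labels = ["dog", "car", "tree"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
slot_features = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]

print(assign_labels_to_slots(slot_features, text_embeddings, labels))
```

In a real pipeline the toy vectors would be replaced by features produced by the localization stage and by a text encoder; how the paper extracts localized features from CLIP is not specified in this abstract.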

updated: Wed Jun 26 2024 16:26:08 GMT+0000 (UTC)

published: Mon Sep 18 2023 15:20:13 GMT+0000 (UTC)

arXiv

