OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning

Sheng Liu; Kevin Lin; Lijuan Wang; Junsong Yuan; Zicheng Liu

We introduce the task of open-vocabulary visual instance search (OVIS). Given an arbitrary textual search query, OVIS aims to return, from an image database, a ranked list of visual instances (i.e., image patches) that satisfy the search intent. The term "open vocabulary" means that there are restrictions neither on the visual instances to be searched nor on the words that can be used to compose the textual search query. We propose to address this search challenge via visual-semantic aligned representation learning (ViSA). ViSA leverages massive image-caption pairs as weak image-level (not instance-level) supervision to learn a rich cross-modal semantic space in which the representations of visual instances (not images) and those of textual queries are aligned, which allows us to measure the similarity between any visual instance and an arbitrary textual query. To evaluate the performance of ViSA, we build two datasets, OVIS40 and OVIS1600, and also introduce a pipeline for error analysis. Through extensive experiments on the two datasets, we demonstrate ViSA's ability to search for visual instances in images not available during training, given a wide range of textual queries, including those composed of uncommon words. Experimental results show that ViSA achieves an mAP@50 of 21.9% on OVIS40 under the most challenging setting and an mAP@6 of 14.9% on the OVIS1600 dataset.
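
As a rough illustration of the retrieval step described in the abstract, the sketch below ranks image patches against a text query by cosine similarity in a shared embedding space and then scores the ranking with average precision at k. It is a minimal sketch under stated assumptions, not the authors' implementation: the embedding vectors are toy values standing in for the outputs of ViSA's encoders (which are not shown here), the function names are hypothetical, and the normalisation used in `average_precision_at_k` is one common convention that may differ from the paper's exact definition of mAP@k.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_instances(query_vec, instance_vecs):
    """Return instance ids sorted by descending similarity to the query.

    `instance_vecs` maps a (hypothetical) instance id to its embedding.
    """
    scored = [(cosine(query_vec, vec), iid) for iid, vec in instance_vecs.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [iid for _, iid in scored]


def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k, normalised by min(k, number of relevant items).

    Assumption: this normalisation is one common convention, not
    necessarily the one used in the paper.
    """
    hits = 0
    precision_sum = 0.0
    for rank, iid in enumerate(ranked_ids[:k], start=1):
        if iid in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    denom = min(k, len(relevant_ids))
    return precision_sum / denom if denom else 0.0


# Toy 2-D embeddings; real embeddings would come from the learned encoders.
query = [1.0, 0.0]
instances = {
    "a": [0.9, 0.1],
    "b": [0.0, 1.0],
    "c": [0.7, 0.7],
    "d": [-1.0, 0.0],
}
ranked = rank_instances(query, instances)
ap = average_precision_at_k(ranked, {"a", "b"}, k=3)
```

Averaging `average_precision_at_k` over a set of queries gives a mean value in the spirit of the mAP@50 and mAP@6 figures quoted above, subject to the same assumption about normalisation.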

updated: Sun Aug 08 2021 18:13:53 GMT+0000 (UTC)

published: Sun Aug 08 2021 18:13:53 GMT+0000 (UTC)

arXiv
