Embedding Arithmetic of Multimodal Queries for Image Retrieval

Guillaume Couairon; Matthieu Cord; Matthijs Douze; Holger Schwenk

Latent text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man. Such structured semantic relations have not been demonstrated on image representations. Recent works that aim to bridge this semantic gap embed images and text into a shared multimodal space, which allows text-defined transformations to be transferred to the image modality. We introduce the SIMAT dataset to evaluate the task of image retrieval with multimodal queries. SIMAT contains 6k images and 18k textual transformation queries that aim at either replacing scene elements or changing pairwise relationships between scene elements. The goal is to retrieve an image consistent with the (source image, text transformation) query. We use an image/text matching oracle (OSCAR) to assess whether the image transformation is successful. The SIMAT dataset will be publicly available. We use SIMAT to evaluate the geometric properties of multimodal embedding spaces trained with an image/text matching objective, such as CLIP. We show that vanilla CLIP embeddings are not very well suited to transforming images with delta vectors, but that a simple finetuning on the COCO dataset can bring dramatic improvements. We also study whether it is beneficial to leverage pretrained universal sentence encoders (FastText, LASER and LaBSE).
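The delta-vector idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that images and text already live in one shared embedding space, and that a transformation such as "cat" to "dog" is applied by adding the difference of the two text embeddings to the source image embedding before a cosine-similarity search. The function names, the `scale` parameter, and the 3-dimensional toy vectors are hypothetical and are not taken from the paper or from the SIMAT code.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def transform_query(image_emb, source_text_emb, target_text_emb, scale=1.0):
    """Shift an image embedding by a text-defined delta vector.

    `scale` (an assumed knob) controls how strongly the delta is applied.
    """
    delta = [t - s for t, s in zip(target_text_emb, source_text_emb)]
    return normalize([i + scale * d for i, d in zip(image_emb, delta)])

def retrieve(query, database):
    """Return the key of the database embedding most similar to `query`."""
    def cosine(a, b):
        return sum(x * y for x, y in zip(normalize(a), normalize(b)))
    return max(database, key=lambda name: cosine(query, database[name]))

# Made-up toy embeddings: axes are [scene, "cat", "dog"].
database = {
    "cat_on_sofa": [1.0, 1.0, 0.0],
    "dog_on_sofa": [1.0, 0.0, 1.0],
}
text = {"cat": [0.0, 1.0, 0.0], "dog": [0.0, 0.0, 1.0]}

# Query: take the cat image and apply the transformation "cat" -> "dog".
query = transform_query(database["cat_on_sofa"], text["cat"], text["dog"])
result = retrieve(query, database)
print(result)
```

In this toy setup the shifted query lands on the embedding of the dog image, so that image is retrieved. With real encoders the embeddings would come from a model such as CLIP, and, as the abstract notes, how well this arithmetic works depends on how the embedding space was trained.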

updated: Thu Oct 20 2022 17:18:04 GMT+0000 (UTC)

published: Mon Dec 06 2021 16:51:50 GMT+0000 (UTC)

arXiv
