Image-text Retrieval via Preserving Main Semantics of Vision

Xu Zhang; Xinzheng Niu; Philippe Fournier-Viger; Xudong Dai

Image-text retrieval is one of the major tasks of cross-modal retrieval. Several approaches to this task map images and texts into a common space to create correspondences between the two modalities. However, because an image is rich in content (semantics), redundant secondary information in the image may cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), that helps the model focus on an image's main content. The approach is inspired by how people typically annotate an image: by describing its main content. We therefore leverage the annotated texts corresponding to an image to help the model capture the image's main content and to reduce the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at https://github.com/ZhangXu0963/VSL.
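To make the common-space idea concrete, the following is a minimal sketch of how retrieval in a shared embedding space is commonly scored. It is not the VSL method, and it is not taken from the authors' repository. The function names (`rank_texts`, `recall_at_k`), the use of cosine similarity, and the toy three-dimensional embeddings are all assumptions made for illustration. In a real system the embeddings would come from trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings in the common space."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_texts(image_emb, text_embs):
    """Return text indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    return sorted(range(len(text_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(image_embs, text_embs, k=1):
    """Fraction of images whose matching text (same index) is in the top k."""
    hits = sum(
        1
        for i, img in enumerate(image_embs)
        if i in rank_texts(img, text_embs)[:k]
    )
    return hits / len(image_embs)


# Hypothetical toy embeddings: image i is paired with text i.
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
text_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.9]]

best_text_for_first_image = rank_texts(image_embs[0], text_embs)[0]
toy_recall = recall_at_k(image_embs, text_embs, k=1)
```

In this picture, a false match is a case where a non-paired text outranks the paired one, for example because the image embedding also encodes secondary content that resembles another caption.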

updated: Fri Apr 28 2023 08:09:54 GMT+0000 (UTC)

published: Thu Apr 20 2023 12:23:29 GMT+0000 (UTC)

arXiv
