SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval

Surgan Jandial; Pinkesh Badjatiya; Pranit Chawla; Ayush Chopra; Mausoom Sarkar; Balaji Krishnamurthy

The ability to efficiently search for images is essential for improving the user experience across various products. Incorporating user feedback, via multi-modal inputs, to navigate visual search can help tailor retrieved results to specific user queries. We focus on the task of text-conditioned image retrieval, which uses supporting text feedback alongside a reference image to retrieve images that simultaneously satisfy the constraints imposed by both inputs. The task is challenging because it requires learning composite image-text features by incorporating multiple cross-granular semantic edits from the text feedback and then applying them to the visual features. To address this, we propose a novel framework, SAC, which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change" (Semantic Feature Modification). We systematically show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-the-art techniques. We present extensive quantitative and qualitative analyses, along with ablation studies, to show that SAC outperforms existing techniques by achieving state-of-the-art performance on three benchmark datasets (FashionIQ, Shoes, and Birds-to-Words) while supporting natural language feedback of varying lengths.
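The two steps named in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the abstract does not specify layers or losses, so the dot-product attention, the fixed blending `gate`, the three-dimensional feature vectors, and the function names `attend`, `modify`, and `retrieve` are all assumptions chosen only to show the general shape of text-conditioned retrieval.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, text):
    """'Where to see' (assumed form): weight image regions by similarity to the text."""
    weights = softmax([dot(r, text) for r in regions])
    dim = len(regions[0])
    pooled = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return pooled, weights

def modify(attended, text, gate=0.5):
    """'How to change' (assumed form): blend the attended feature with the text vector."""
    return normalize([(1 - gate) * a + gate * t for a, t in zip(attended, text)])

def retrieve(regions, text, gallery):
    """Rank gallery images by cosine similarity to the composed query feature."""
    pooled, _ = attend(regions, text)
    query = modify(pooled, text)
    scored = [(name, dot(query, normalize(vec))) for name, vec in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical feature axes: (red, blue, long-sleeved).
regions = [[1.0, 0.0, 0.0], [0.0, 0.2, 1.0]]   # two regions of the reference image
text = [0.0, 1.0, 0.0]                         # feedback such as "make it blue"
gallery = {
    "red_short": [1.0, 0.0, 0.0],
    "red_long": [1.0, 0.0, 1.0],
    "blue_long": [0.0, 1.0, 1.0],
}

ranking = retrieve(regions, text, gallery)
_, weights = attend(regions, text)
print(ranking[0][0], weights)
```

In this toy setup the composed query keeps the long-sleeve attribute from the reference image while taking the colour from the text, so the gallery item matching both constraints ranks first. A learned model would replace the hand-written vectors and the fixed gate with trained encoders and parameters.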

updated: Tue Oct 19 2021 19:02:15 GMT+0000 (UTC)

published: Thu Sep 03 2020 06:55:23 GMT+0000 (UTC)

arXiv
