Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval with Partial Query

Guanyu Cai; Jun Zhang; Xinyang Jiang; Yifei Gong; Lianghua He; Fufu Yu; Pai Peng; Xiaowei Guo; Feiyue Huang; Xing Sun

Text-based image retrieval has seen considerable progress in recent years. However, the performance of existing methods degrades in practice, since users are likely to provide an incomplete description of an image, which often leads to results filled with false positives that fit the incomplete description. In this work, we introduce the partial-query problem and extensively analyze its influence on text-based image retrieval. Previous interactive methods tackle the problem by passively receiving users' feedback to supplement the incomplete query iteratively, which is time-consuming and requires heavy user effort. Instead, we propose a novel retrieval framework that conducts the interactive process in an Ask-and-Confirm fashion, where the AI actively searches for discriminative details missing from the current query, and users only need to confirm the AI's proposals. Specifically, we propose an object-based interaction to make interactive retrieval more user-friendly and present a reinforcement-learning-based policy to search for discriminative objects. Furthermore, since fully supervised training is often infeasible due to the difficulty of obtaining human-machine dialog data, we present a weakly supervised training strategy that needs no human-annotated dialogs, only a text-image dataset. Experiments show that our framework significantly improves the performance of text-based image retrieval. Code is available at https://github.com/CuthbertCai/Ask-Confirm.
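
The Ask-and-Confirm interaction described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the gallery, the function names, and the simulated user are all hypothetical, images are reduced to sets of object labels, and the reinforcement-learning policy is replaced by a simple greedy heuristic that asks about the object splitting the remaining candidates most evenly. Treating a "no" answer as a hard exclusion is also an assumption made here for simplicity.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy gallery: image id -> object labels present in the image.
GALLERY: Dict[str, Set[str]] = {
    "img_a": {"dog", "frisbee", "grass"},
    "img_b": {"dog", "sofa"},
    "img_c": {"dog", "frisbee", "beach"},
    "img_d": {"cat", "sofa"},
}


def candidates(gallery: Dict[str, Set[str]],
               confirmed: Set[str],
               rejected: Set[str]) -> List[str]:
    """Images containing every confirmed object and no rejected object."""
    return [img for img, objs in gallery.items()
            if confirmed <= objs and not (rejected & objs)]


def pick_question(gallery: Dict[str, Set[str]],
                  cands: List[str],
                  asked: Set[str]) -> Optional[str]:
    """Greedy stand-in for a learned policy (an assumption, not the paper's
    method): choose the unasked object whose presence splits the current
    candidates closest to half and half."""
    counts: Dict[str, int] = {}
    for img in cands:
        for obj in gallery[img]:
            if obj not in asked:
                counts[obj] = counts.get(obj, 0) + 1
    if not counts:
        return None
    half = len(cands) / 2
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return min(sorted(counts), key=lambda o: abs(counts[o] - half))


def ask_and_confirm(gallery: Dict[str, Set[str]],
                    partial_query: Set[str],
                    target: str,
                    max_rounds: int = 5) -> List[str]:
    """Enrich a partial query by proposing objects; a simulated user
    confirms a proposal only if the target image contains that object."""
    confirmed = set(partial_query)
    rejected: Set[str] = set()
    asked = set(partial_query)
    for _ in range(max_rounds):
        cands = candidates(gallery, confirmed, rejected)
        if len(cands) <= 1:
            break
        obj = pick_question(gallery, cands, asked)
        if obj is None:
            break
        asked.add(obj)
        if obj in gallery[target]:  # simulated "yes"
            confirmed.add(obj)
        else:                       # simulated "no"
            rejected.add(obj)
    return candidates(gallery, confirmed, rejected)
```

In this toy setting, the partial query `{"dog"}` initially matches three images; calling `ask_and_confirm(GALLERY, {"dog"}, "img_c")` narrows the result to the single target image after the simulated user answers the proposed questions.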

updated: Wed Aug 11 2021 07:35:53 GMT+0000 (UTC)

published: Tue Mar 02 2021 11:27:05 GMT+0000 (UTC)

arXiv
