Chatting Makes Perfect -- Chat-based Image Retrieval

Matan Levy; Rami Ben-Ari; Nir Darshan; Dani Lischinski

Chats have emerged as an effective, user-friendly approach to information retrieval and are successfully employed in many domains, such as customer service, healthcare, and finance. However, existing image retrieval approaches typically address the case of a single query-to-image round, and the use of chats for image retrieval has been mostly overlooked. In this work, we introduce ChatIR: a chat-based image retrieval system that engages in a conversation with the user to elicit information, in addition to an initial query, in order to clarify the user's search intent. Motivated by the capabilities of today's foundation models, we leverage Large Language Models to generate follow-up questions to an initial image description. These questions form a dialog with the user in order to retrieve the desired image from a large corpus. In this study, we explore the capabilities of such a system tested on a large dataset and reveal that engaging in a dialog yields significant gains in image retrieval. We start by building an evaluation pipeline from an existing manually generated dataset and explore different modules and training strategies for ChatIR. Our comparison includes strong baselines derived from related applications trained with Reinforcement Learning. Our system is capable of retrieving the target image from a pool of 50K images with a success rate of over 78% after 5 dialogue rounds, compared to 75% when questions are asked by humans and 64% for single-shot text-to-image retrieval. Extensive evaluations reveal the strong capabilities and examine the limitations of ChatIR under different settings.
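The dialog-driven retrieval loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for learned text and image encoders, each image is represented by a short text description, and the follow-up questions are supplied as fixed question-answer pairs rather than generated by a Large Language Model. All function names here are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_of_target(dialog_text, corpus, target_id):
    """Return the 1-based rank of the target image for the current dialog."""
    query = embed(dialog_text)
    scores = {img_id: cosine(query, embed(desc)) for img_id, desc in corpus.items()}
    # Sort by descending score; break ties by image id for determinism.
    ordered = sorted(scores, key=lambda img_id: (-scores[img_id], img_id))
    return ordered.index(target_id) + 1


def chat_retrieval(caption, qa_rounds, corpus, target_id):
    """Append each question-answer round to the dialog and re-rank the corpus.

    Returns the target's rank after round 0 (caption only) and after each
    subsequent dialog round.
    """
    dialog = caption
    ranks = [rank_of_target(dialog, corpus, target_id)]
    for question, answer in qa_rounds:
        dialog = f"{dialog} {question} {answer}"
        ranks.append(rank_of_target(dialog, corpus, target_id))
    return ranks


if __name__ == "__main__":
    # Hypothetical three-image corpus; descriptions stand in for image content.
    corpus = {
        "img_a": "a dog on a beach with a red ball",
        "img_b": "a dog on a sofa indoors",
        "img_c": "a cat on a sofa indoors",
    }
    rounds = [
        ("where is it", "on a beach"),
        ("what is it playing with", "a red ball"),
    ]
    print(chat_retrieval("a dog", rounds, corpus, "img_a"))
```

In this toy setup the ambiguous initial caption does not rank the target first, and the added dialog rounds move it to the top, which mirrors the kind of gain the abstract reports at a much larger scale.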

updated: Wed May 31 2023 17:38:08 GMT+0000 (UTC)

published: Wed May 31 2023 17:38:08 GMT+0000 (UTC)

arXiv
