Learning to Retrieve Videos by Asking Questions

Avinash Madasu; Junier Oliva; Gedas Bertasius


The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be sub-optimal if the initial query is ambiguous, which leads to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), in which the user interacts with an AI agent over multiple rounds of dialog and refines the retrieved results by answering questions that the agent generates. Our novel multimodal question generator learns to ask questions that maximize subsequent video retrieval performance. It uses (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions, so that the generated questions incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to generate maximally informative questions, we propose Information-Guided Supervision (IGS), which guides the question generator to ask questions that boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. We also demonstrate that our proposed approach generalizes to real-world settings that involve interactions with real humans, demonstrating the robustness and generality of our framework.
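The interaction loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words similarity, the keyword-based question picker, and every name below (`retrieve`, `ask_question`, `interactive_retrieval`, `VIDEOS`, `simulated_user`) are hypothetical stand-ins for the learned retrieval model, the multimodal question generator, and a human user. It only shows the control flow: retrieve candidates, ask a question conditioned on those candidates and the dialog history, fold the answer into the query, and retrieve again.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumed stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_terms, videos, k=2):
    """Return the ids of the k captions most similar to the accumulated query."""
    query = embed(" ".join(query_terms))
    ranked = sorted(videos, key=lambda vid: cosine(query, embed(videos[vid])),
                    reverse=True)
    return ranked[:k]


def ask_question(asked, candidates, videos):
    """Hypothetical question generator: pick a word that appears in some,
    but not all, current candidates and has not been asked about yet."""
    counts = Counter(w for vid in candidates
                     for w in set(videos[vid].lower().split()))
    for word, n in sorted(counts.items()):
        if n < len(candidates) and word not in asked:
            return word
    return None  # nothing left that separates the candidates


def interactive_retrieval(initial_query, videos, answer_fn, rounds=3, k=2):
    """Multi-round loop: retrieve, ask, update the query, retrieve again."""
    query_terms = [initial_query]
    dialog = []  # (question word, yes/no answer) pairs
    candidates = retrieve(query_terms, videos, k)
    for _ in range(rounds):
        word = ask_question({q for q, _ in dialog}, candidates, videos)
        if word is None:
            break
        answer = answer_fn(word)
        dialog.append((word, answer))
        if answer:  # this toy only uses positive answers
            query_terms.append(word)
        candidates = retrieve(query_terms, videos, k)
    return candidates, dialog


# Made-up captions standing in for a video collection.
VIDEOS = {
    "v1": "a man cooking in a kitchen",
    "v2": "a man reading in a kitchen",
    "v3": "a dog running outside",
}
TARGET = "v2"


def simulated_user(word):
    """Simulated user who answers truthfully about the target video."""
    return word in VIDEOS[TARGET].split()


final_candidates, dialog = interactive_retrieval(
    "a man in a kitchen", VIDEOS, simulated_user)
```

The initial query matches the first two captions equally well, so it cannot single out the target; the question rounds supply the missing cue. In this sketch a negative answer is recorded but does not change the query, and the question picker is a fixed heuristic, whereas the abstract describes a question generator that is trained (with IGS) to ask questions that improve subsequent retrieval.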

updated: Sat Jul 16 2022 06:06:53 GMT+0000 (UTC)

published: Wed May 11 2022 19:14:39 GMT+0000 (UTC)

arXiv
