Learning to Retrieve Videos by Asking Questions

Avinash Madasu; Junier Oliva; Gedas Bertasius

The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be suboptimal if the initial query is ambiguous, which can lead to many incorrectly retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog. The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance. Our multimodal question generator uses (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions to generate questions that incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to generate maximally informative questions, we propose Information-Guided Supervision (IGS), which guides the question generator to ask questions that would boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. We also demonstrate that our approach generalizes to real-world settings that involve interactions with real humans, which shows the robustness and generality of our framework.
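The interaction loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' method: videos are represented only by short captions, a bag-of-words cosine similarity stands in for the learned retrieval model, and the hypothetical `ask_question` heuristic (ask about a word that distinguishes the current candidates) stands in for the learned multimodal question generator trained with IGS. The sketch only shows how retrieved candidates and dialog history feed each new question, and how answers refine the next retrieval round.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned encoder (assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_text, videos, k=2):
    """Return the ids of the k videos whose captions best match the query text."""
    q = embed(query_text)
    ranked = sorted(videos, key=lambda vid: cosine(q, embed(videos[vid])), reverse=True)
    return ranked[:k]

def ask_question(candidates, asked, videos):
    """Hypothetical question generator: pick a not-yet-asked word that appears
    in some, but not all, of the currently retrieved candidates."""
    bags = [set(embed(videos[v])) for v in candidates]
    common = set.intersection(*bags)
    for bag in bags:
        for word in sorted(bag - common - asked):
            return word
    return None  # nothing left that separates the candidates

def interactive_retrieval(initial_query, videos, answer_fn, rounds=3, k=2):
    """Alternate retrieval and questioning for a fixed number of dialog rounds."""
    history = [initial_query]  # text-based dialog history
    asked = set()
    for _ in range(rounds):
        candidates = retrieve(" ".join(history), videos, k)
        word = ask_question(candidates, asked, videos)
        if word is None:
            break
        asked.add(word)
        if answer_fn(word):        # the user confirms the cue
            history.append(word)   # confirmed cues refine the next query
    return retrieve(" ".join(history), videos, k=1)[0], history

# Made-up example data; the simulated user has video "v2" in mind.
videos = {
    "v1": "a man cooking in a kitchen",
    "v2": "a man reading in a kitchen",
    "v3": "a woman walking a dog outside",
}
target_words = set(videos["v2"].split())
result, history = interactive_retrieval(
    "a man in a kitchen", videos, lambda word: word in target_words
)
print(result, history)
```

In this toy run the initial query matches "v1" and "v2" equally well, so the sketch first asks about "cooking" (answered no), then about "reading" (answered yes), after which "v2" ranks first.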

updated: Fri May 13 2022 16:39:43 GMT+0000 (UTC)

published: Wed May 11 2022 19:14:39 GMT+0000 (UTC)

arXiv
