AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

Stan Weixian Lei; Difei Gao; Yuxuan Wang; Dongxing Mao; Zihan Liang; Lingmin Ran; Mike Zheng Shou

Personal AI assistants on phones and AR glasses that can answer everyday questions such as "how to adjust the date for this watch?" or "how to set its heating duration?" (asked while pointing at an oven) remain a pipe dream. The queries used in conventional tasks (i.e., Video Question Answering, Video Retrieval, Moment Localization) are often factoid and purely text-based. In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR). Each question is an image-box-text query that focuses on the affordances of everyday items and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of TQVSR, we construct a new dataset called AssistSR, designing novel guidelines to create high-quality samples. The dataset contains 3.2k multimodal questions on 1.6k video segments from instructional videos on diverse everyday items. To address TQVSR, we develop a simple yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods while still leaving considerable room for future improvement. Moreover, we present detailed ablation analyses. Code and data are available at https://github.com/StanLei52/TQVSR.
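
To make the retrieval setting concrete, the following is a minimal sketch of how a query embedding might be matched against a corpus of segment embeddings. It is not the authors' DME implementation: the cosine-similarity scoring, the function names (`cosine`, `retrieve`), the segment identifiers, and the toy 3-dimensional vectors are all assumptions made for illustration. In a real system the vectors would come from trained encoders over the image-box-text query and the video-transcript segments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_emb, segment_embs, k=2):
    """Return the ids of the k segments most similar to the query embedding."""
    scored = [(cosine(query_emb, emb), seg_id) for seg_id, emb in segment_embs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg_id for _, seg_id in scored[:k]]


# Hypothetical, hand-made embeddings standing in for encoder outputs.
segments = {
    "watch_set_date": [0.9, 0.1, 0.0],
    "oven_set_timer": [0.1, 0.9, 0.2],
    "kettle_descale": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of "how to set its heating duration?" plus an oven image box.
query = [0.2, 0.8, 0.1]

top = retrieve(query, segments, k=2)
print(top)
```

Under these toy vectors the oven segment ranks first. The sketch only illustrates the ranking step; how the query and the segments are encoded is where the task's difficulty lies.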

updated: Mon Oct 10 2022 05:40:46 GMT+0000 (UTC)

published: Tue Nov 30 2021 01:14:10 GMT+0000 (UTC)

arXiv

