AssistSR: Affordance-centric Question-driven Video Segment Retrieval

Stan Weixian Lei; Yuxuan Wang; Dongxing Mao; Difei Gao; Mike Zheng Shou

It is still a pipe dream that AI assistants on phones and AR glasses can help us in daily life by answering questions such as "how to adjust the date for this watch?" and "how to set its heating duration?" (asked while pointing at an oven). The queries used in conventional tasks (e.g., Video Question Answering, Video Retrieval, Moment Localization) are often factoid and purely textual. In contrast, we present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR). Each of our questions is an image-box-text query that focuses on the affordance of an everyday item and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of the AQVSR task, we construct a new dataset called AssistSR, designing novel guidelines to create high-quality samples. The dataset contains 1.4k multimodal questions on 1k video segments from instructional videos on diverse everyday items. To address AQVSR, we develop a straightforward yet effective model called Dual Multimodal Encoders (DME), which significantly outperforms several baseline methods while still leaving large room for future improvement. Moreover, we present detailed ablation analyses. Our code and data are available at https://github.com/StanLei52/AQVSR.
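To make the task setup more concrete, the following is a minimal sketch of what an image-box-text query and a video-transcript segment corpus might look like, with a toy retrieval step. It is not the DME model: the `Query` and `Segment` classes, the `embed_text` bag-of-words stand-in, and the example data are all hypothetical, and the image and box are carried in the query but ignored by this toy scorer. A real system would replace `embed_text` with learned multimodal encoders on both sides.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Query:
    image_path: str   # frame or photo showing the item (hypothetical field)
    box: tuple        # (x1, y1, x2, y2) region the user points at
    text: str         # the affordance question


@dataclass
class Segment:
    video_id: str
    start: float      # segment start, in seconds
    end: float        # segment end, in seconds
    transcript: str


def embed_text(text):
    """Toy stand-in for a learned encoder: bag-of-words counts."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, top_k=1):
    """Rank segments by similarity to the query text (image and box unused here)."""
    q = embed_text(query.text)
    ranked = sorted(
        corpus,
        key=lambda s: cosine(q, embed_text(s.transcript)),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up corpus and query, for illustration only.
corpus = [
    Segment("oven_manual", 12.0, 30.0, "press the timer knob to set the heating duration"),
    Segment("watch_guide", 5.0, 21.0, "pull the crown out to adjust the date on the watch"),
]
query = Query("watch.jpg", (40, 60, 200, 220), "how to adjust the date for this watch?")
best = retrieve(query, corpus)[0]
print(best.video_id, best.start, best.end)
```

The design point this illustrates is only the retrieval framing: queries and segments are embedded independently and compared by a similarity score, so the segment corpus can be encoded once and reused across queries.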

updated: Tue Nov 30 2021 01:14:10 GMT+0000 (UTC)

published: Tue Nov 30 2021 01:14:10 GMT+0000 (UTC)

arXiv
