Self-Chained Image-Language Model for Video Localization and Question Answering

Shoubin Yu; Jaemin Cho; Prateek Yadav; Mohit Bansal



Recent studies have shown promising results in using pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often miss important visual cues. Although humans often find a video moment to focus on and rewind it to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. The SeViLA framework consists of two modules, Localizer and Answerer, both parameter-efficiently fine-tuned from BLIP-2. We chain these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. SeViLA outperforms several strong baselines and previous works on five video QA and event prediction tasks, and achieves state-of-the-art results in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We present a comprehensive analysis covering, e.g., the impact of the Localizer, comparisons of the Localizer with other temporal localization models, pre-training and self-refinement of the Localizer, and varying the number of keyframes.
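
The forward and reverse chains described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the frames are plain strings standing in for video frames, `toy_localizer_score` and `toy_answerer` are hypothetical stand-ins for the two BLIP-2 based modules, and the pseudo-labeling rule (a frame counts as a keyframe if the Answerer reaches the known answer from that frame alone) is an assumption, since the abstract does not spell out how the pseudo-labels are produced.

```python
from typing import Callable, List, Sequence, Tuple

Frame = str  # hypothetical stand-in for a decoded video frame


def forward_chain(
    frames: Sequence[Frame],
    question: str,
    localizer_score: Callable[[Frame, str], float],
    answerer: Callable[[List[Frame], str], str],
    k: int = 4,
) -> Tuple[str, List[int]]:
    """Localizer selects k language-aware keyframes; Answerer predicts from them."""
    scores = [localizer_score(f, question) for f in frames]
    # Take the k highest-scoring frame indices, then restore temporal order.
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    keyframe_ids = sorted(ranked[:k])
    keyframes = [frames[i] for i in keyframe_ids]
    return answerer(keyframes, question), keyframe_ids


def reverse_chain_pseudo_labels(
    frames: Sequence[Frame],
    question: str,
    known_answer: str,
    answerer: Callable[[List[Frame], str], str],
) -> List[int]:
    """Assumed rule: label a frame 1 if the Answerer gets the known answer
    from that single frame, else 0. These labels could then supervise the
    Localizer without manual moment annotations."""
    return [int(answerer([f], question) == known_answer) for f in frames]


# Toy stand-ins (hypothetical), so the sketch runs without any model.
def toy_localizer_score(frame: Frame, question: str) -> float:
    # Relevance = number of words shared by the frame description and the question.
    return float(len(set(frame.split()) & set(question.split())))


def toy_answerer(frames: List[Frame], question: str) -> str:
    # Answers "yes" if any given frame mentions a dog.
    return "yes" if any("dog" in f.split() for f in frames) else "no"
```

With the toy frames `["sky", "dog runs", "tree", "dog sleeps", "car"]` and the question `"is there a dog"`, the forward chain with `k=2` would pass only the two dog frames to the Answerer, and the reverse chain would mark those same two frames as pseudo-keyframes.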

updated: Thu May 11 2023 17:23:00 GMT+0000 (UTC)

published: Thu May 11 2023 17:23:00 GMT+0000 (UTC)

arXiv

