LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Jingjing Jiang; Ziyi Liu; Yifan Liu; Nanning Zheng

Video Question Answering (VideoQA) aims to correctly answer a given question based on an understanding of multi-modal video content, and it is challenging because that content is so rich. From the perspective of video understanding, a good VideoQA framework needs to understand the video content at different semantic levels and flexibly integrate the diverse video content to distill question-related content. To this end, we propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first uses graph-based Visual and Linguistic Encoders to obtain multi-grained visual and linguistic representations. The obtained representations are then integrated by the devised Diversity-aware Visual-Linguistic Reasoning module (DaVL). DaVL accounts for the differences between the types of representations and can flexibly adjust the importance of each type when generating the question-related joint representation, which makes it an effective and general representation integration method. The proposed LiVLR is lightweight and shows its superiority on two VideoQA benchmarks, MSRVTT-QA and KnowIT VQA. Extensive ablation studies demonstrate the effectiveness of LiVLR's key components.
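The idea of adjusting the importance of different representation types with respect to the question can be illustrated with a small sketch. This is not the DaVL module from the paper, whose internals the abstract does not specify; it is a generic attention-style weighted sum, assuming each representation type and the question have already been encoded as vectors of the same dimension. The names `fuse`, `question`, and `representations` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(question, representations):
    """Weight each representation by its scaled similarity to the question.

    Assumption: a plain scaled dot-product score stands in for whatever
    learned scoring the real model uses.
    """
    scale = math.sqrt(len(question))
    weights = softmax([dot(question, r) / scale for r in representations])
    # Weighted sum of the representations, one output value per dimension.
    joint = [
        sum(w * r[i] for w, r in zip(weights, representations))
        for i in range(len(question))
    ]
    return weights, joint


# Toy inputs: three representation types scored against one question vector.
question = [1.0, 0.0]
representations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, joint = fuse(question, representations)
print(weights, joint)
```

In this toy setup the representation most aligned with the question vector receives the largest weight, so the joint vector leans toward it. A learned model would replace the fixed dot-product score with trainable parameters.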

updated: Mon Nov 29 2021 14:18:47 GMT+0000 (UTC)

published: Mon Nov 29 2021 14:18:47 GMT+0000 (UTC)

arXiv
