Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering

Jiong Wang; Zhou Zhao; Weike Jin

Multi-modal video question answering aims to predict the correct answer and localize the temporal boundary relevant to the question. Temporal annotations of questions improve the QA performance and interpretability of recent works, but they are usually empirical and costly to obtain. To avoid temporal annotations, we devise a weakly supervised question grounding (WSQG) setting, where only QA annotations are used and the relevant temporal boundaries are generated according to the temporal attention scores. To substitute for the temporal annotations, we transform the correspondence between frames and subtitles into Frame-Subtitle (FS) self-supervision, which helps to optimize the temporal attention scores and hence improves video-language understanding in the VideoQA model. Extensive experiments on the TVQA and TVQA+ datasets demonstrate that the proposed WSQG strategy achieves comparable performance on question grounding, and that FS self-supervision helps improve question answering and grounding performance in both the QA-supervision-only and full-supervision settings.
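
As a rough illustration of how a temporal boundary might be derived from temporal attention scores, the sketch below grows a span outward from the highest-scoring frame while neighbouring frames stay above a fraction of the peak score. The abstract does not specify the actual generation rule, so the function name `span_from_attention`, the `ratio` parameter, and the thresholding scheme are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Tuple


def span_from_attention(scores: List[float], ratio: float = 0.5) -> Tuple[int, int]:
    """Return inclusive (start, end) frame indices around the attention peak.

    Hypothetical rule (not taken from the paper): keep the contiguous run of
    frames around the peak whose score is at least `ratio` times the peak.
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    # Index of the frame with the highest attention score (first one on ties).
    peak = max(range(len(scores)), key=scores.__getitem__)
    threshold = ratio * scores[peak]

    # Extend the span to the left while frames stay above the threshold.
    start = peak
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1

    # Extend the span to the right in the same way.
    end = peak
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1

    return start, end


# Made-up attention scores over seven frames, for demonstration only.
attention = [0.02, 0.05, 0.30, 0.40, 0.15, 0.03, 0.05]
print(span_from_attention(attention))  # (2, 3)
```

A lower `ratio` widens the predicted span, so in practice such a value would need to be tuned on held-out data.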

updated: Thu Sep 08 2022 07:20:51 GMT+0000 (UTC)

published: Thu Sep 08 2022 07:20:51 GMT+0000 (UTC)

arXiv
