VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks

Hung Le; Nancy F. Chen; Steven C. H. Hoi

Neural module networks (NMNs) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, NMNs have received very little study in video-grounded language tasks. These tasks extend the complexity of traditional visual tasks with additional temporal variance in the visual input. Motivated by recent NMN approaches to image-grounded tasks, we introduce the Video-grounded Neural Module Network (VGNMN), which models the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components to explicitly resolve entity references and detect the corresponding action-based inputs from the question. The detected entities and actions are then used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VGNMN can achieve promising performance on two video-grounded language tasks: video QA and video-grounded dialogues.

updated: Fri Apr 16 2021 06:47:41 GMT+0000 (UTC)

published: Fri Apr 16 2021 06:47:41 GMT+0000 (UTC)

arXiv
