Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

Heeseung Yun; Youngjae Yu; Wonsuk Yang; Kangil Lee; Gunhee Kim

360° videos convey holistic views of a scene's surroundings. They provide audio-visual cues beyond a pre-determined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos remain limited in their ability to evaluate semantic understanding of audio-visual relationships or of the spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360° video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, and the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to a better semantic understanding of the panoramic surroundings on the dataset.

updated: Mon Oct 11 2021 09:58:05 GMT+0000 (UTC)

published: Mon Oct 11 2021 09:58:05 GMT+0000 (UTC)

arXiv
