Learning Representations from Audio-Visual Spatial Alignment

Pedro Morgado; Yi Li; Nuno Vasconcelos

We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originating from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, their training objectives disregard spatial cues naturally occurring in audio and visual signals. To learn from these spatial cues, we task a network to perform contrastive audio-visual spatial alignment of 360° video and spatial audio. The ability to perform spatial alignment is enhanced by reasoning over the full spatial content of the 360° video using a transformer architecture to combine representations from multiple viewpoints. The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks, including audio-visual correspondence, spatial alignment, action recognition, and video semantic segmentation.
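To make the idea of contrastive spatial alignment concrete, the sketch below scores several viewpoints of a single 360° video against the audio rendered for each viewpoint, using a generic symmetric InfoNCE-style loss. This is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `cross_entropy`, `spatial_alignment_loss`), the temperature value, and the toy embeddings are all hypothetical, and the transformer that combines viewpoints is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-8)

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def spatial_alignment_loss(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over K viewpoints of one video.

    video_emb[i] and audio_emb[i] are assumed to describe the same
    viewing direction (the positive pair); every other viewpoint of the
    same video acts as a negative. The temperature is an assumed value.
    """
    k = len(video_emb)
    sim = [[cosine(v, a) / temperature for a in audio_emb] for v in video_emb]
    # Video-to-audio: each row should peak on its own column.
    v2a = sum(cross_entropy(sim[i], i) for i in range(k))
    # Audio-to-video: each column should peak on its own row.
    a2v = sum(cross_entropy([sim[j][i] for j in range(k)], i) for i in range(k))
    return (v2a + a2v) / (2 * k)

# Toy embeddings for three viewpoints (hypothetical values).
video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_audio = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
# Shifting the audio by one viewpoint mimics a spatial misalignment.
misaligned_audio = aligned_audio[1:] + aligned_audio[:1]

aligned_loss = spatial_alignment_loss(video, aligned_audio)
misaligned_loss = spatial_alignment_loss(video, misaligned_audio)
print(aligned_loss < misaligned_loss)
```

Because the negatives here come from other viewpoints of the same video rather than from other videos or other time steps, the loss can only be reduced by using spatial cues, which is the distinction from AVC and AVTS drawn above.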

updated: Tue Nov 03 2020 16:20:04 GMT+0000 (UTC)

published: Tue Nov 03 2020 16:20:04 GMT+0000 (UTC)

arXiv
