Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

Sagnik Majumder; Ziad Al-Halah; Kristen Grauman

We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. In particular, our method leverages a masked auto-encoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. We show through extensive experiments that our features are generic enough to improve over multiple state-of-the-art baselines on two challenging public egocentric video datasets, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.
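As a rough illustration of the masking step in a masked auto-encoding setup for binaural audio, the sketch below splits a two-channel spectrogram into non-overlapping patches and randomly hides a fraction of them; a model would then be trained to reconstruct the hidden patches from the visible ones together with the video frames. This is not the authors' implementation. The patch size, the 75% mask ratio, the spectrogram dimensions, and the function names `patch_grid` and `split_masked` are all assumptions made for the example.

```python
import random

def patch_grid(n_freq, n_time, patch):
    """List (channel, f0, t0) corners of non-overlapping square patches
    for a two-channel (left/right) spectrogram. Hypothetical helper."""
    return [(c, f, t)
            for c in range(2)                              # 0 = left, 1 = right
            for f in range(0, n_freq - patch + 1, patch)   # frequency offsets
            for t in range(0, n_time - patch + 1, patch)]  # time offsets

def split_masked(patches, mask_ratio, seed=0):
    """Randomly split patches into (visible, masked) lists.
    mask_ratio is an assumed hyperparameter, not a value from the paper."""
    rng = random.Random(seed)
    shuffled = patches[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_masked = int(round(mask_ratio * len(shuffled)))
    return shuffled[n_masked:], shuffled[:n_masked]

# Example: a 32 x 64 (frequency x time) spectrogram per channel, 16 x 16 patches.
patches = patch_grid(n_freq=32, n_time=64, patch=16)
visible, masked = split_masked(patches, mask_ratio=0.75)
```

Because masked patches can fall in either channel, recovering them requires reasoning about how the left and right signals differ, which is where the visual input can help.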

updated: Mon Jul 10 2023 17:58:17 GMT+0000 (UTC)

published: Mon Jul 10 2023 17:58:17 GMT+0000 (UTC)

arXiv
