Egocentric Audio-Visual Object Localization

Chao Huang; Yapeng Tian; Anurag Kumar; Chenliang Xu


Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines can move closer to human intelligence by learning from multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can be created when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles egomotion explicitly: the effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. To tackle the second problem, we propose a cascaded feature enhancement module, which improves cross-modal localization robustness by disentangling visually-indicated audio representations. During training, we use the naturally available audio-visual temporal synchronization as "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes.
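The abstract does not give implementation details, so the following is only a generic sketch of how audio-visual localization with synchronization-based self-supervision is commonly set up, not the authors' modules. It assumes a visual feature map and an audio embedding that share a channel dimension; the function names `localization_map` and `sync_contrastive_loss`, the max-pooled matching score, and the temperature value are all hypothetical choices made for illustration. The geometry-aware aggregation and cascaded feature enhancement steps are not modeled here.

```python
import numpy as np

def localization_map(visual_feats, audio_emb, eps=1e-8):
    """Cosine similarity between one audio embedding and every spatial location.

    visual_feats: (H, W, C) feature map; audio_emb: (C,) vector.
    Returns an (H, W) map; high values mark likely sounding regions.
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    return v @ a

def sync_contrastive_loss(visual_feats, audio_embs, temperature=0.07):
    """Contrastive loss that treats temporally synchronized pairs as positives.

    visual_feats: (B, H, W, C); audio_embs: (B, C).
    The score of pair (i, j) is the peak of the localization map of frame i
    against audio j (an assumed pooling choice). Pairs with i == j come from
    the same clip and moment, so they need no manual labels.
    """
    batch = visual_feats.shape[0]
    scores = np.empty((batch, batch))
    for i in range(batch):
        for j in range(batch):
            scores[i, j] = localization_map(visual_feats[i], audio_embs[j]).max()
    logits = scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In a setup like this, the loss is low when each frame's best-matching region agrees with its own audio and high when audio tracks are swapped between clips, which is what lets synchronization stand in for annotation.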

updated: Thu Mar 23 2023 17:43:11 GMT+0000 (UTC)

published: Thu Mar 23 2023 17:43:11 GMT+0000 (UTC)

arXiv
