Class-aware Sounding Objects Localization via Audiovisual Correspondence

Di Hu; Yake Wei; Rui Qian; Weiyao Lin; Ruihua Song; Ji-Rong Wen



Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects, but quite challenging for machines to achieve class-aware sounding objects localization without category annotations, i.e., localizing the sounding object and recognizing its category. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding area via coarse-grained audiovisual correspondence in single-source cases. Then, visual features in the sounding area are leveraged as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audiovisual consistency as the supervision to achieve fine-grained alignment between the audio and sounding object distributions. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones. We also transfer the learned audiovisual network to the unsupervised object detection task, obtaining reasonable performance.
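
As a rough illustration of the first stage described in the abstract (finding the sounding area from coarse audiovisual correspondence in a single-source clip), the sketch below scores each spatial cell of a visual feature map by its cosine similarity to one audio embedding and thresholds the result. This is not the authors' implementation: the function names `localization_map` and `sounding_mask`, the feature shapes, the use of cosine similarity, and the threshold value are all assumptions made for illustration, and the features here are random toy data rather than network outputs.

```python
import numpy as np


def localization_map(visual_feats, audio_emb, eps=1e-8):
    """Cosine similarity between one audio embedding and every spatial cell.

    visual_feats: array of shape (H, W, C), one C-dim feature per location
                  (hypothetical output of a visual backbone).
    audio_emb:    array of shape (C,), a clip-level audio embedding
                  (hypothetical output of an audio backbone).
    Returns an (H, W) map with values in [-1, 1].
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    return v @ a


def sounding_mask(loc_map, threshold=0.5):
    """Binary mask of cells whose similarity exceeds an assumed threshold."""
    return loc_map > threshold


# Toy single-source example: one cell is made to match the audio embedding.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8)) * 0.1
audio = rng.normal(size=8)
feats[1, 2] = audio  # pretend the sounding object sits at row 1, column 2

loc = localization_map(feats, audio)
mask = sounding_mask(loc)
peak = np.unravel_index(loc.argmax(), loc.shape)
print("peak cell:", peak)
```

In the framework the abstract describes, features from the cells selected this way would then serve as candidate object representations for building the object dictionary; that later stage is not sketched here.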

updated: Wed Dec 22 2021 09:34:33 GMT+0000 (UTC)

published: Wed Dec 22 2021 09:34:33 GMT+0000 (UTC)

arXiv

