Improving Audio-Visual Segmentation with Bidirectional Generation

Dawei Hao; Yuxin Mao; Bowen He; Xiaodong Han; Yuchao Dai; Yiran Zhong

Audio-visual segmentation (AVS) aims to segment the sounding objects in a video at the pixel level. Traditional approaches often tackle this task by fusing information from the different modalities, modeling the contribution of each modality either implicitly or explicitly. However, the interconnections between modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate both the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby improving AVS performance. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and is trained to minimize the reconstruction error. Moreover, recognizing that many sounds are linked to object movement, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be difficult to capture with conventional optical flow methods. To demonstrate the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely used AVSBench benchmark. Our method establishes a new state of the art on this benchmark, excelling in particular on the challenging MS3 subset, which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.
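
The visual-to-audio projection idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the masked average pooling, the single linear projection, the mean-squared-error loss, and all names (`masked_average_pool`, `project_to_audio`, `reconstruction_loss`) and tensor shapes here are assumptions chosen only to show how a predicted mask could be tied to an audio feature through a reconstruction objective.

```python
import numpy as np


def masked_average_pool(visual_feats, mask):
    """Pool visual features (C, H, W) inside a soft mask (H, W) into a (C,) vector.

    Assumption: the mask selects the sounding object's region.
    """
    weights = mask / (mask.sum() + 1e-8)  # normalize so weights sum to ~1
    return (visual_feats * weights[None, :, :]).sum(axis=(1, 2))


def project_to_audio(pooled, weight, bias):
    """Hypothetical linear projection from visual space (C,) to audio space (D,)."""
    return weight @ pooled + bias


def reconstruction_loss(predicted_audio, target_audio):
    """Mean squared error between reconstructed and reference audio features."""
    return float(np.mean((predicted_audio - target_audio) ** 2))


# Toy data: 8 visual channels on a 4x4 grid, 16-dimensional audio feature.
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(8, 4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0  # pretend the object occupies the central 2x2 region
weight = rng.normal(size=(16, 8)) * 0.1
bias = np.zeros(16)
target_audio = rng.normal(size=16)

pooled = masked_average_pool(visual_feats, mask)
recon = project_to_audio(pooled, weight, bias)
loss = reconstruction_loss(recon, target_audio)
```

In a trainable version of this sketch, the loss would be minimized jointly with the segmentation objective, so that a mask covering the wrong region yields pooled features that reconstruct the audio poorly.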

updated: Wed Aug 16 2023 11:20:23 GMT+0000 (UTC)

published: Wed Aug 16 2023 11:20:23 GMT+0000 (UTC)

arXiv
