Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

Jinxiang Liu; Chen Ju; Chaofan Ma; Yanfeng Wang; Yu Wang; Ya Zhang

The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in video frames using audio cues. However, current fusion-based methods are limited in performance by the small receptive field of convolution and by inadequate fusion of audio-visual features. To overcome these issues, we propose a novel Audio-aware query-enhanced TRansformer (AuTR) for this task. Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on segmenting the sounding objects pinpointed by the audio signals, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and generalizes better in multi-sound and open-set scenarios.

updated: Tue Jul 25 2023 03:59:04 GMT+0000 (UTC)

published: Tue Jul 25 2023 03:59:04 GMT+0000 (UTC)

arXiv
