Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Tiantian Geng; Teng Wang; Jinming Duan; Runmin Cong; Feng Zheng

Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task.

updated: Wed Mar 22 2023 22:00:17 GMT+0000 (UTC)

published: Wed Mar 22 2023 22:00:17 GMT+0000 (UTC)

arXiv
