Play It Back: Iterative Attention for Audio Recognition

Alexandros Stergiou; Dima Damen

A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. Humans attempting to discriminate between fine-grained audio categories often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that, through selective repetition, attends over the most discriminative sounds across the audio sequence. Our model initially uses the full audio sequence and iteratively refines the temporal segments replayed based on slot attention. At each playback, the selected segments are replayed using a smaller hop length, which represents higher-resolution features within these segments. We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
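
To make the hop-length remark concrete, the following is a minimal sketch of how a smaller hop length yields denser frames over a replayed segment. It is not the authors' implementation: the sample rate, window length, hop values, and segment duration are all assumed for illustration, the helper names are hypothetical, and the slot-attention step that selects the segment is not modelled.

```python
def num_frames(n_samples: int, win_length: int, hop_length: int) -> int:
    """Count the full analysis windows that fit in a signal (no padding)."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length


def frames_per_second(sample_rate: int, hop_length: int) -> float:
    """Temporal resolution of the frame sequence, in frames per second."""
    return sample_rate / hop_length


# All values below are assumptions chosen for illustration only.
SAMPLE_RATE = 16_000   # samples per second
WIN_LENGTH = 400       # 25 ms analysis window
COARSE_HOP = 160       # hop used for the first pass over the full clip
FINE_HOP = 40          # smaller hop used when a segment is replayed

full_clip = 10 * SAMPLE_RATE   # a 10-second clip
segment = 2 * SAMPLE_RATE      # a hypothetical 2-second selected segment

coarse_frames = num_frames(full_clip, WIN_LENGTH, COARSE_HOP)
fine_frames = num_frames(segment, WIN_LENGTH, FINE_HOP)

print("full clip:", coarse_frames, "frames at",
      frames_per_second(SAMPLE_RATE, COARSE_HOP), "frames/s")
print("replayed segment:", fine_frames, "frames at",
      frames_per_second(SAMPLE_RATE, FINE_HOP), "frames/s")
```

With these assumed values, the replayed segment is sampled at four times the frame rate of the first pass, which is the sense in which a smaller hop length gives higher-resolution features within the selected segment.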

updated: Sun Mar 12 2023 12:03:04 GMT+0000 (UTC)

published: Thu Oct 20 2022 15:03:22 GMT+0000 (UTC)

arXiv
