ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation

Sara Atito; Muhammad Awais; Wenwu Wang; Mark D Plumbley; Josef Kittler

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and by limited labelled data, most transformer-based models for audio tasks are finetuned from ImageNet-pretrained models, despite the huge gap between the natural image domain and the audio domain. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labelled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose ASiT, a novel self-supervised transformer for general audio representations that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts performance on all tasks and sets a new state of the art on five audio and speech classification tasks, outperforming recent methods, including approaches that use additional datasets for pretraining. The code and pretrained weights will be made publicly available for the scientific community.
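
The abstract names group masked model learning without giving details. As a rough illustration only, the sketch below masks contiguous rectangular groups of patches on a spectrogram patch grid (frequency by time) until a target fraction is hidden. The function name `group_mask`, its parameters (`mask_ratio`, `max_block`, `seed`), and the rectangular-block sampling scheme are all assumptions made for this example. They are not taken from the ASiT paper or its code.

```python
import random


def group_mask(n_freq, n_time, mask_ratio=0.5, max_block=4, seed=None):
    """Return an n_freq x n_time grid of booleans; True marks a masked patch.

    Hypothetical helper: masks random rectangular groups of neighbouring
    patches until at least mask_ratio of the grid is covered.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    mask = [[False] * n_time for _ in range(n_freq)]
    target = int(mask_ratio * n_freq * n_time)
    masked = 0
    while masked < target:
        # Sample the size and position of one block of connected patches.
        h = rng.randint(1, min(max_block, n_freq))
        w = rng.randint(1, min(max_block, n_time))
        top = rng.randint(0, n_freq - h)
        left = rng.randint(0, n_time - w)
        for i in range(top, top + h):
            for j in range(left, left + w):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask


# Example: an 8 x 16 patch grid with about half of the patches masked.
mask = group_mask(8, 16, mask_ratio=0.5, seed=0)
n_masked = sum(cell for row in mask for cell in row)
```

In a masked-modelling setup of this kind, the patches marked True would be hidden or corrupted at the input, and the model would be trained to reconstruct them from the visible context. Because the last block can overshoot the target, the realised mask ratio may be slightly higher than requested.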

updated: Wed Nov 23 2022 18:21:09 GMT+0000 (UTC)

published: Wed Nov 23 2022 18:21:09 GMT+0000 (UTC)

arXiv
