AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

Paul Hongsuck Seo; Arsha Nagrani; Cordelia Schmid

Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets in each downstream domain of interest. We present AVFormer, a simple method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation. Specifically, (i) we inject visual embeddings into a frozen ASR model using lightweight trainable adaptors, and show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters; (ii) we introduce a simple curriculum scheme during training, which we show is crucial for enabling the model to jointly process audio and visual information effectively; and (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech). Qualitative results show that our model effectively leverages visual information for robust speech recognition.
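
The idea of injecting visual embeddings into a frozen ASR model through lightweight trainable adaptors can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the linear visual projection, the bottleneck adaptor with a residual connection, the zero initialisation, the prepending of visual tokens, and all names and sizes (`W_proj`, `W_down`, `W_up`, `D_AUDIO`, and so on) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_AUDIO, D_VISUAL, D_BOTTLENECK = 16, 32, 4  # hypothetical sizes

# Stand-in for one frozen ASR encoder layer: these weights are never updated.
W_frozen = rng.normal(size=(D_AUDIO, D_AUDIO)) / np.sqrt(D_AUDIO)

# Trainable parameters (assumed design): a visual projection and a bottleneck adaptor.
W_proj = rng.normal(size=(D_VISUAL, D_AUDIO)) * 0.02   # visual features -> audio token space
W_down = rng.normal(size=(D_AUDIO, D_BOTTLENECK)) * 0.02
W_up = np.zeros((D_BOTTLENECK, D_AUDIO))               # zero init: adaptor starts as identity


def frozen_layer(tokens):
    """Apply the frozen layer to a (num_tokens, D_AUDIO) array."""
    return np.tanh(tokens @ W_frozen)


def adapter(tokens):
    """Bottleneck adaptor: down-project, ReLU, up-project, add residual."""
    hidden = np.maximum(tokens @ W_down, 0.0)
    return tokens + hidden @ W_up


def encode(audio_tokens, visual_features):
    """Project visual features to tokens, prepend them to the audio tokens, then encode."""
    visual_tokens = visual_features @ W_proj
    tokens = np.concatenate([visual_tokens, audio_tokens], axis=0)
    return adapter(frozen_layer(tokens))


audio = rng.normal(size=(10, D_AUDIO))    # 10 audio tokens
visual = rng.normal(size=(2, D_VISUAL))   # 2 visual feature vectors
out = encode(audio, visual)               # shape: (2 + 10, D_AUDIO)
```

In this sketch only `W_proj`, `W_down`, and `W_up` would receive gradient updates. Because `W_up` starts at zero, the adaptor initially leaves the frozen layer's output unchanged, which is one common way to avoid disturbing a pretrained model at the start of training.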

updated: Wed Mar 29 2023 07:24:28 GMT+0000 (UTC)

published: Wed Mar 29 2023 07:24:28 GMT+0000 (UTC)

arXiv
