VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Qiushi Zhu; Long Zhou; Ziqiang Zhang; Shujie Liu; Binxing Jiao; Jie Zhang; Lirong Dai; Daxin Jiang; Jinyu Li; Furu Wei

Although speech is a simple and effective way for humans to communicate with the outside world, realistic speech interaction also carries multimodal information such as vision and text. How to design a unified framework that integrates information from different modalities and leverages different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model modality-independent information and uses three simple modality-dependent modules to preprocess visual, speech, and text inputs. To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens, which are produced by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR). Results show that the proposed VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
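
The structure described in the abstract (modality-dependent preprocessing modules, a shared backbone, and masked prediction of unified tokens) can be illustrated with a toy sketch. This is not the released VATLM code: the names ToyModel, DIM, VOCAB, and masked_prediction_loss, the random weights, the mean-based context mixing, and the zero mask vector are all assumptions chosen only to keep the example small and dependency-free.

```python
import math
import random

DIM = 4    # hypothetical shared feature size
VOCAB = 5  # hypothetical number of unified token ids


def make_linear(n_in, n_out, rng):
    """Random matrix standing in for a learned projection (n_out rows)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(w, x):
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class ToyModel:
    """Assumed toy layout: one front end per modality, one shared backbone."""

    def __init__(self, input_dims, seed=0):
        rng = random.Random(seed)
        # Modality-dependent modules: each maps its input size to DIM.
        self.front = {name: make_linear(d, DIM, rng)
                      for name, d in input_dims.items()}
        # Modality-independent parts, shared by every input type.
        self.backbone = make_linear(DIM, DIM, rng)
        self.head = make_linear(DIM, VOCAB, rng)
        self.mask_vec = [0.0] * DIM  # stand-in for a learned mask embedding

    def masked_prediction_loss(self, modality, frames, targets, mask):
        """Mean cross-entropy over masked positions only."""
        # Replace masked frames with the mask vector, project the rest.
        feats = [self.mask_vec if m else apply_linear(self.front[modality], x)
                 for x, m in zip(frames, mask)]
        # Crude context mixing: add the sequence mean to every position.
        # A real backbone would use a far richer network here.
        context = [sum(col) / len(feats) for col in zip(*feats)]
        total, count = 0.0, 0
        for h, t, m in zip(feats, targets, mask):
            if not m:
                continue  # unmasked positions do not contribute to the loss
            mixed = [a + b for a, b in zip(h, context)]
            hidden = [math.tanh(v) for v in apply_linear(self.backbone, mixed)]
            probs = softmax(apply_linear(self.head, hidden))
            total -= math.log(probs[t])
            count += 1
        return total / count if count else 0.0


# Usage: the same unified targets supervise all three input types.
model = ToyModel({"visual": 6, "audio": 3, "text": 2})
rng = random.Random(1)
targets = [0, 3, 1, 4]
mask = [False, True, True, False]
for name, dim in [("visual", 6), ("audio", 3), ("text", 2)]:
    frames = [[rng.random() for _ in range(dim)] for _ in targets]
    print(name, model.masked_prediction_loss(name, frames, targets, mask))
```

The point of the sketch is only the data flow: every modality is scored against the same unified token ids through the same backbone and head, which is what pushes the three input types toward one shared space.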

updated: Mon Nov 21 2022 09:10:10 GMT+0000 (UTC)

published: Mon Nov 21 2022 09:10:10 GMT+0000 (UTC)

arXiv
