Accommodating Audio Modality in CLIP for Multimodal Processing

Ludan Ruan; Anwen Hu; Yuqing Song; Liang Zhang; Sipeng Zheng; Qin Jin

Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities as well as the inner characteristics of the audio modality. Moreover, we design an audio type token to dynamically learn different audio information types for different scenarios, as general audio conveys both verbal and nonverbal heterogeneous information. Our proposed CLIP4VLA model is validated on different downstream tasks, including video retrieval and video captioning, and achieves state-of-the-art performance on the benchmark datasets MSR-VTT, VATEX, and AudioCaps.
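The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of a generic symmetric InfoNCE-style loss of the kind used in CLIP-style training, not the authors' implementation. The function names, the temperature value of 0.07, and the toy embeddings are assumptions made for illustration. Under this sketch, the inter-modal case would pair audio embeddings with text or video embeddings of the same clip, and the intra-modal case would pair two views of the same audio.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Generic symmetric contrastive loss over a batch of paired embeddings.

    emb_a[i] and emb_b[i] are assumed to come from the same sample (positive
    pair); every other combination in the batch is treated as a negative.
    The temperature is an assumed value, not one taken from the paper.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    logits = [
        [sum(x * y for x, y in zip(u, v)) / temperature for v in b]
        for u in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # b-to-a direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy embeddings (hypothetical): two clips, two-dimensional features.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(audio, text_aligned)
loss_swapped = symmetric_contrastive_loss(audio, text_swapped)
```

In this toy setup the correctly paired batch yields a much smaller loss than the batch with swapped captions, which is the behavior a contrastive objective is meant to reward.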

updated: Sun Mar 12 2023 06:57:01 GMT+0000 (UTC)

published: Sun Mar 12 2023 06:57:01 GMT+0000 (UTC)

arXiv
