One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code

Yong Dai; Duyu Tang; Liangxin Liu; Minghuan Tan; Cong Zhou; Jingquan Wang; Zhangyin Feng; Fan Zhang; Xueyu Hu; Shuming Shi

People perceive the world with multiple senses (e.g., through hearing sounds, reading words and seeing objects). However, most existing AI systems process only a single modality. This paper presents an approach that excels at handling multiple modalities of information with a single model. In our "SkillNet" model, different parts of the parameters are specialized for processing different modalities. Unlike traditional dense models that always activate all the model parameters, our model sparsely activates only the parts of the parameters whose skills are relevant to the task. This design enables SkillNet to learn skills in a more interpretable way. We develop our model for five modalities: text, image, sound, video and code. Results show that SkillNet performs comparably to five modality-specific fine-tuned models. Moreover, our model supports self-supervised pretraining in the same sparsely activated manner, resulting in better-initialized parameters for the different modalities. We find that pretraining significantly improves the performance of SkillNet on all five modalities, putting it on par with or even ahead of baselines with modality-specific pretraining. On the task of Chinese text-to-image retrieval, our final system achieves higher accuracy than existing leading systems, including Wukong ViT-B and Wenlan 2.0, while using fewer activated parameters.
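
The sparse activation idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the class `SparseSkillLayer`, the one-block-per-modality layout, the averaging of block outputs, and the choice of which skills a task activates are all assumptions made for illustration only.

```python
import random

# Hypothetical skill names, one per modality listed in the abstract.
SKILLS = ["text", "image", "sound", "video", "code"]


def make_linear(dim, rng):
    """Return a dim x dim weight matrix as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_linear(weights, x):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class SparseSkillLayer:
    """Toy layer that holds one parameter block per skill (an assumption)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.blocks = {name: make_linear(dim, rng) for name in SKILLS}

    def _validate(self, active_skills):
        # Drop duplicates while keeping the caller's order.
        active = list(dict.fromkeys(active_skills))
        unknown = [s for s in active if s not in self.blocks]
        if not active or unknown:
            raise ValueError(f"invalid skill selection: {active_skills!r}")
        return active

    def forward(self, x, active_skills):
        """Run only the selected blocks and average their outputs."""
        active = self._validate(active_skills)
        outputs = [apply_linear(self.blocks[s], x) for s in active]
        return [sum(col) / len(outputs) for col in zip(*outputs)]

    def activated_parameter_count(self, active_skills):
        """Number of parameters touched for a given skill selection."""
        return len(self._validate(active_skills)) * self.dim * self.dim

    def total_parameter_count(self):
        """Number of parameters a dense model would touch every time."""
        return len(self.blocks) * self.dim * self.dim


if __name__ == "__main__":
    layer = SparseSkillLayer(dim=4)
    # Assumption: a text-to-image retrieval task activates two skills.
    skills = ["text", "image"]
    features = layer.forward([1.0, 0.0, 0.0, 0.0], skills)
    print(len(features), "output features")
    print(layer.activated_parameter_count(skills), "of",
          layer.total_parameter_count(), "parameters activated")
```

In this sketch a dense layer would touch all five blocks for every input, whereas the sparse variant touches only the blocks named in `active_skills`, which is the contrast the abstract draws between dense models and SkillNet.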

updated: Thu May 12 2022 14:39:21 GMT+0000 (UTC)

published: Thu May 12 2022 14:39:21 GMT+0000 (UTC)

arXiv
