An Empirical Study of Multimodal Model Merging

Yi-Lin Sung; Linjie Li; Kevin Lin; Zhe Gan; Mohit Bansal; Lijuan Wang

Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks to generate a multi-task solution. The technique has proven successful in previous studies in which the models are trained on similar tasks and from the same initialization. In this paper, we extend this concept to a multimodal setup by merging transformers trained on different modalities. Furthermore, we conduct our study with a novel goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture to create a parameter-efficient modality-agnostic architecture. Through comprehensive experiments, we systematically investigate the key factors that affect model performance after merging, including initialization, merging mechanisms, and model architectures. Our analysis leads to an effective training recipe for matching the performance of the modality-agnostic baseline (i.e., pre-trained from scratch) via model merging. Our code is available at: https://github.com/ylsung/vl-merging
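
The two merging mechanisms named in the abstract, interpolation and task arithmetic, can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes each model is a plain dictionary mapping parameter names to lists of floats, and the names `Params`, `interpolate`, `task_arithmetic`, `alpha`, and `lam` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters (name -> flat list of weights).
Params = Dict[str, List[float]]


def interpolate(a: Params, b: Params, alpha: float = 0.5) -> Params:
    """Element-wise linear interpolation: alpha * a + (1 - alpha) * b."""
    return {
        name: [alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def task_arithmetic(init: Params, models: List[Params], lam: float = 1.0) -> Params:
    """Add the scaled sum of task vectors (model - init) back onto init."""
    merged: Params = {}
    for name, base in init.items():
        merged[name] = [
            w0 + lam * sum(m[name][i] - w0 for m in models)
            for i, w0 in enumerate(base)
        ]
    return merged


# Toy weights (assumed values): a shared initialization and two fine-tuned copies.
init = {"layer.weight": [0.0, 1.0]}
vision = {"layer.weight": [1.0, 1.0]}
language = {"layer.weight": [0.0, 3.0]}

avg = interpolate(vision, language, alpha=0.5)
arith = task_arithmetic(init, [vision, language], lam=0.5)
```

With two models and `lam=0.5`, the task-arithmetic result coincides with the equal-weight average, since the initialization terms cancel. The two mechanisms differ once `lam` takes another value or more models are merged.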

updated: Fri Apr 28 2023 15:43:21 GMT+0000 (UTC)

published: Fri Apr 28 2023 15:43:21 GMT+0000 (UTC)

arXiv
