Unified Model for Image, Video, Audio and Language Tasks

Mustafa Shukor; Corentin Dancette; Alexandre Rame; Matthieu Cord

Large Language Models (LLMs) have made the ambitious quest for generalist agents far less of a fantasy. A key hurdle in building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, which allows a myriad of tasks and modalities to be supported within one unified framework. While a few large models trained on massive datasets (e.g., Flamingo (Alayrac et al., 2022)) can support more than two modalities, current small to mid-scale unified models are still limited to two modalities, usually image-text or video-text. The question we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a further step towards this ambitious goal. Without relying on very large datasets or models with billions of parameters, the ~0.25B-parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio in a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows performance competitive with existing state-of-the-art approaches across image-text and video-text tasks. The feature representations learned from the image-text and video-text modalities allow the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study of multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing its benefits in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are released here: https://github.com/mshukor/UnIVAL.
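
As a rough illustration of what merging models by weight interpolation can look like, the sketch below linearly interpolates two sets of parameters that share the same architecture. The abstract does not give the exact scheme, so the linear form, the function name `interpolate_weights`, the coefficient `lam`, and the toy parameter dictionaries are all assumptions made for this example and are not taken from the UnIVAL code base.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(theta_a: Weights, theta_b: Weights, lam: float) -> Weights:
    """Return lam * theta_a + (1 - lam) * theta_b, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if theta_a.keys() != theta_b.keys():
        raise ValueError("both models must share the same parameter names")

    merged: Weights = {}
    for name in theta_a:
        a, b = theta_a[name], theta_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two checkpoints.
        merged[name] = [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]
    return merged


# Toy checkpoints (assumed values), e.g. one finetuned per multimodal task.
model_task_a: Weights = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_task_b: Weights = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = interpolate_weights(model_task_a, model_task_b, lam=0.5)
print(merged)
```

With `lam=0.5` the merged parameters are the plain average of the two checkpoints; values closer to 0 or 1 move the merged model towards one of the two task-specific models.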

updated: Sun Jul 30 2023 09:48:36 GMT+0000 (UTC)

published: Sun Jul 30 2023 09:48:36 GMT+0000 (UTC)

arXiv
