Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

Alexandre Rame; Guillaume Couairon; Mustafa Shukor; Corentin Dancette; Jean-Baptiste Gaya; Laure Soulier; Matthieu Cord

Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet imperfections in the proxy reward may hinder training and lead to suboptimal results, and the diversity of objectives in real-world tasks and of human opinions exacerbates the issue. This paper proposes embracing the heterogeneity of diverse rewards by following a multi-policy strategy. Rather than focusing on a single a priori reward, we aim for Pareto-optimal generalization across the entire space of preferences. To this end, we propose rewarded soup, which first specializes multiple networks independently (one for each proxy reward) and then interpolates their weights linearly. This succeeds empirically because, as we show, the weights remain linearly connected when fine-tuned on diverse rewards from a shared pre-trained initialization. We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks. We hope to enhance the alignment of deep models and how they interact with the world in all its diversity.
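
The weight interpolation step described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes each fine-tuned network is represented as a plain dictionary mapping parameter names to flat lists of floats, and the names `interpolate_weights`, `theta_1`, `theta_2`, and `lambdas` are hypothetical, chosen only for illustration. It also assumes the interpolation coefficients are non-negative and sum to 1, which is one natural way to express a preference over the rewards.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(models: Sequence[Weights], lambdas: Sequence[float]) -> Weights:
    """Return the coefficient-weighted average of several sets of weights.

    All models are assumed to share one architecture (same parameter names
    and sizes), e.g. because they were fine-tuned from the same
    pre-trained initialization.
    """
    if not models:
        raise ValueError("need at least one model")
    if len(models) != len(lambdas):
        raise ValueError("need exactly one coefficient per model")
    if any(lam < 0 for lam in lambdas) or abs(sum(lambdas) - 1.0) > 1e-8:
        raise ValueError("coefficients must be non-negative and sum to 1")

    reference = models[0]
    for m in models[1:]:
        if m.keys() != reference.keys():
            raise ValueError("models must have identical parameter names")
        if any(len(m[name]) != len(reference[name]) for name in reference):
            raise ValueError("models must have identical parameter sizes")

    # Element-wise weighted sum of each parameter across the models.
    return {
        name: [
            sum(lam * m[name][i] for lam, m in zip(lambdas, models))
            for i in range(len(reference[name]))
        ]
        for name in reference
    }


# Two toy "networks", standing in for models fine-tuned on two different rewards.
theta_1 = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
theta_2 = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

# A preference placing 25% weight on the first reward and 75% on the second.
soup = interpolate_weights([theta_1, theta_2], lambdas=[0.25, 0.75])
print(soup)
```

Sweeping the coefficients from one extreme to the other gives a family of interpolated networks, one per preference, without any additional training. Whether those networks actually perform well depends on the linear connectivity property the abstract reports.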

updated: Wed Jun 07 2023 14:58:15 GMT+0000 (UTC)

published: Wed Jun 07 2023 14:58:15 GMT+0000 (UTC)

arXiv
