FedTune: A Deep Dive into Efficient Federated Fine-Tuning with Pre-trained Transformers

Jinyu Chen; Wenchao Xu; Song Guo; Junxiao Wang; Jie Zhang; Haozhao Wang

Federated Learning (FL) is an emerging paradigm that enables distributed users to collaboratively and iteratively train machine learning models without sharing their private data. Motivated by the effectiveness and robustness of self-attention-based architectures, researchers are turning to pre-trained Transformers (i.e., foundation models) in place of traditional convolutional neural networks in FL to leverage their excellent transfer learning capabilities. Despite recent progress, the role of pre-trained Transformer models in FL remains unclear: how to fine-tune these models efficiently in FL, and how FL users can benefit from this new paradigm. In this paper, we explore this issue and demonstrate that fine-tuned Transformers achieve extraordinary performance in FL, and that lightweight fine-tuning methods facilitate a fast convergence rate and low communication costs. Concretely, we conduct a rigorous empirical study of three tuning methods (i.e., modifying the input, adding extra modules, and adjusting the backbone) using two types of pre-trained models (i.e., vision-language models and vision models) for FL. Our experiments show that 1) fine-tuning the bias terms of the backbone performs best when relying on a strong pre-trained model; 2) the vision-language model (e.g., CLIP) outperforms the pure vision model (e.g., ViT) and is more robust in few-shot settings; 3) compared to pure local training, FL with pre-trained models achieves higher accuracy because it alleviates over-fitting. We will release our code and encourage further exploration of pre-trained Transformers and FL.
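
To make the communication argument concrete, the following is a minimal sketch of one server-side aggregation round in which clients upload only the bias terms of a frozen backbone. It is not the authors' released code. The abstract does not name an aggregation rule, so the FedAvg-style weighting by local dataset size is an assumption, as are the function names and the convention that bias parameters have names ending in "bias".

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def trainable_bias_names(params: Params) -> List[str]:
    """Names of parameters treated as bias terms (assumed: name ends with 'bias')."""
    return [name for name in params if name.endswith("bias")]


def fedavg_bias(global_params: Params,
                client_updates: List[Params],
                client_sizes: List[int]) -> Params:
    """Average only the bias terms across clients, weighted by local dataset size.

    All other (frozen backbone) parameters are copied unchanged from the
    global model, so clients never need to upload them.
    """
    total = sum(client_sizes)
    new_params = {name: list(values) for name, values in global_params.items()}
    for name in trainable_bias_names(global_params):
        averaged = [0.0] * len(global_params[name])
        for update, size in zip(client_updates, client_sizes):
            for i, value in enumerate(update[name]):
                averaged[i] += value * size / total
        new_params[name] = averaged
    return new_params


def communicated_fraction(params: Params) -> float:
    """Fraction of all scalar parameters that a client uploads per round."""
    sent = sum(len(params[name]) for name in trainable_bias_names(params))
    return sent / sum(len(values) for values in params.values())


# Toy example: one layer with four weights and two biases, two clients.
global_params = {"layer.weight": [0.5, -0.5, 0.25, 1.0], "layer.bias": [0.0, 0.0]}
client_updates = [{"layer.bias": [1.0, 2.0]}, {"layer.bias": [3.0, 4.0]}]
new_params = fedavg_bias(global_params, client_updates, client_sizes=[1, 3])
print(new_params["layer.bias"])
print(communicated_fraction(global_params))
```

In this toy model, a third of the scalars are biases. In a real Transformer the bias terms are a far smaller share of the parameters, which is the intuition behind the low communication cost the abstract reports for lightweight tuning.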

updated: Tue Nov 15 2022 10:16:13 GMT+0000 (UTC)

published: Tue Nov 15 2022 10:16:13 GMT+0000 (UTC)

arXiv
