Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

Weijie Su; Xizhou Zhu; Chenxin Tao; Lewei Lu; Bin Li; Gao Huang; Yu Qiao; Xiaogang Wang; Jie Zhou; Jifeng Dai


To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources have been proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, existing work adopts a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of the pre-training. It is thus desirable to integrate these strategies in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all existing approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation. Notably, we successfully pre-train an image backbone at the billion-parameter level and achieve state-of-the-art performance on various benchmarks. Code will be released at https://github.com/OpenGVLab/M3I-Pretraining.
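The abstract does not spell out the mutual information formula itself. As a rough illustration of what maximizing mutual information between two views or modalities can look like in practice, the sketch below computes the widely used InfoNCE loss and the lower bound on mutual information it implies, I(X; Y) >= log N - L_InfoNCE. This is an assumption chosen for illustration, not the M3I objective: the function names, the `scores` matrix (similarity between sample i in one modality and sample j in the other), and the `temperature` parameter are all hypothetical.

```python
import math


def info_nce_loss(scores, temperature=1.0):
    """Mean InfoNCE loss for a square similarity matrix (illustrative only).

    scores[i][j] is an assumed similarity between sample i of one modality
    and sample j of the other; matching pairs sit on the diagonal.
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        # Negative log-probability of the matching (diagonal) pair.
        total += log_z - logits[i]
    return total / n


def mi_lower_bound(scores, temperature=1.0):
    """InfoNCE lower bound on mutual information: log N minus the loss."""
    return math.log(len(scores)) - info_nce_loss(scores, temperature)
```

With uninformative scores (all equal) the bound is zero, and as matching pairs become much more similar than mismatched ones it approaches its ceiling of log N. Minimizing the loss therefore tightens a lower bound on the mutual information between the two inputs.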

updated: Mon Nov 21 2022 17:46:53 GMT+0000 (UTC)

published: Thu Nov 17 2022 18:59:49 GMT+0000 (UTC)

arXiv
