M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining

Xiao Dong; Xunlin Zhan; Yangxin Wu; Yunchao Wei; Michael C. Kampffmeyer; Xiaoyong Wei; Minlong Lu; Yaowei Wang; Xiaodan Liang

Despite the potential of multi-modal pre-training to learn highly discriminative feature representations from complementary data modalities, current progress is being slowed by the lack of large-scale modality-diverse datasets. By leveraging the natural suitability of E-commerce, where different modalities capture complementary semantic information, we contribute a large-scale multi-modal pre-training dataset, M5Product. The dataset comprises 5 modalities (image, text, table, video, and audio), covers over 6,000 categories and 5,000 attributes, and is 500 times larger than the largest publicly available dataset with a similar number of modalities. Furthermore, M5Product contains incomplete modality pairs and noise and has a long-tailed distribution, resembling most real-world problems. We further propose Self-harmonized ContrAstive LEarning (SCALE), a novel pre-training framework that integrates the different modalities into a unified model through an adaptive feature fusion mechanism. The importance of each modality is learned directly from the modality embeddings and influences both the inter-modality contrastive learning and the masked tasks within a multi-modal transformer model. We evaluate the current state-of-the-art multi-modal pre-training approaches and benchmark their ability to learn from unlabeled data when faced with the large number of modalities in the M5Product dataset. We conduct extensive experiments on four downstream tasks and demonstrate the superiority of our SCALE model, providing insights into the importance of dataset scale and diversity.
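
As a rough illustration of the idea that per-modality importance can be learned from the embeddings and then used to weight inter-modality contrastive learning, the following is a minimal sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the `scorer` vector standing in for learned parameters, the softmax over mean-embedding scores, the product of the two modality weights as the pair weight, the InfoNCE loss as the contrastive objective, and the temperature value. The abstract specifies none of these details.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] in the batch."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        loss -= math.log(softmax(logits)[i])
    return loss / len(a)


def modality_weights(embeddings, scorer):
    """Hypothetical importance scores: softmax over (mean embedding . scorer)."""
    names = sorted(embeddings)
    scores = []
    for name in names:
        batch = embeddings[name]
        mean_vec = [sum(col) / len(batch) for col in zip(*batch)]
        scores.append(dot(mean_vec, scorer))
    return dict(zip(names, softmax(scores)))


def weighted_contrastive_loss(embeddings, scorer, temperature=0.1):
    """Sum pairwise InfoNCE losses, each scaled by the two modalities' weights."""
    weights = modality_weights(embeddings, scorer)
    names = sorted(embeddings)
    total = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pair_weight = weights[a] * weights[b]  # assumed weighting scheme
            total += pair_weight * info_nce(
                embeddings[a], embeddings[b], temperature
            )
    return total, weights


# Toy batch of two products with three of the five modalities (made-up values).
toy = {
    "image": [[1.0, 0.0], [0.0, 1.0]],
    "text": [[0.9, 0.1], [0.1, 0.9]],
    "table": [[1.0, 0.2], [0.2, 1.0]],
}
toy_loss, toy_weights = weighted_contrastive_loss(toy, scorer=[0.5, 0.5])
```

In this sketch, a modality with a higher weight contributes more to every pair it takes part in, which is one simple way to let informative modalities dominate the contrastive signal. A real model would learn the scoring parameters jointly with the encoders and would also need to handle products with missing modalities, which the toy batch ignores.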

updated: Thu Mar 03 2022 12:42:49 GMT+0000 (UTC)

published: Thu Sep 09 2021 13:50:22 GMT+0000 (UTC)

arXiv
