M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks

Xiao Dong; Xunlin Zhan; Yangxin Wu; Yunchao Wei; Xiaoyong Wei; Minlong Lu; Xiaodan Liang

In this paper, we aim to advance research on multi-modal pre-training for E-commerce and contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs covering more than 6,000 categories and 5,000 attributes. Existing multi-modal datasets are generally limited in either scale or modality diversity. In contrast, M5Product is distinguished in the following respects. First, the M5Product dataset is 500 times larger than the public multimodal dataset with the same number of modalities and nearly twice as large as the largest available text-image cross-modal dataset. Second, the dataset contains rich information from multiple modalities, including image, text, table, video and audio, in which each modality captures a different view of semantic information (e.g. category, attributes, affordance, brand, preference) and complements the others. Third, to better reflect real-world problems, a small portion of M5Product contains incomplete modality pairs and noise while exhibiting a long-tailed distribution, which aligns well with real-world scenarios. Finally, we provide a baseline model, M5-MMT, which makes the first attempt to integrate different modality configurations into a unified model for feature fusion, addressing the major challenge of semantic alignment. We also evaluate various state-of-the-art multi-modal pre-training methods to benchmark their ability to learn from unlabeled data under different numbers of modalities on the M5Product dataset. We conduct extensive experiments on four downstream tasks and report some interesting findings on these modalities. Our dataset and related code are available at https://xiaodongsuper.github.io/M5Product_dataset.

updated: Thu Sep 09 2021 13:50:22 GMT+0000 (UTC)

published: Thu Sep 09 2021 13:50:22 GMT+0000 (UTC)

arXiv
