e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce

Wonyoung Shin; Jonghun Park; Taekang Woo; Yongwoo Cho; Kwangjin Oh; Hwanjun Song

Understanding vision and language representations of product content is vital for search and recommendation applications in e-commerce. Inspired by the recent success of representation learning research, we propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images, intended to serve as a backbone for online shopping platforms. We present the techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges. We study the performance of our pre-trained model as a backbone for diverse downstream tasks, including category classification, attribute extraction, product matching, product clustering, and adult product recognition. Experimental results show that the proposed method outperforms the baseline in each downstream task, in both single-modality and multi-modality settings.
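
To make the idea of contrastively aligning image and text representations concrete, the following is a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) objective. It is not the authors' implementation: the function names, the toy embeddings, and the temperature value of 0.07 are assumptions chosen for illustration, and the abstract does not specify the exact loss or hyperparameters used. Embeddings are assumed to be non-zero vectors, with row i of the image batch paired with row i of the text batch.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched (image, text) pairs.

    The temperature is an illustrative assumption, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: image i should rank its own text i highest
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: text j should rank its own image j highest
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch: matched pairs point in the same direction, mismatched pairs do not
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy batch the loss is lower when each image embedding matches its own text embedding than when the pairs are swapped, which is the behavior a contrastive alignment objective of this kind is meant to reward.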

updated: Mon Aug 22 2022 14:25:14 GMT+0000 (UTC)

published: Fri Jul 01 2022 05:16:47 GMT+0000 (UTC)

arXiv
