Efficient Large-Scale Visual Representation Learning And Evaluation

Eden Dolev; Alaa Awad; Denisa Roberts; Zahra Ebrahimzadeh; Marcin Mejran; Vaibhav Malpani; Mahir Yavuz

Efficiently learning visual representations of items is vital for large-scale recommendations. In this article, we compare several pretrained, efficient backbone architectures from both the convolutional neural network (CNN) and the vision transformer (ViT) families. We describe the challenges of e-commerce vision applications at scale and highlight methods to efficiently train, evaluate, and serve visual representations. We present ablation studies that evaluate visual representations on several downstream tasks. To this end, we propose a novel multilingual text-to-image generative offline evaluation method for visually similar recommendation systems. Finally, we include online results from machine learning systems deployed in production on a large-scale e-commerce platform.
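As a rough illustration of what offline evaluation of a visually similar recommendation system can look like, the following minimal sketch ranks catalog items by cosine similarity to a query embedding and reports recall@k. It is not the evaluation method proposed in the paper: the toy embeddings, the item names, and the helper functions `cosine`, `top_k`, and `recall_at_k` are all hypothetical and serve only to show the general shape of an embedding-retrieval metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, catalog, k):
    """Return the ids of the k catalog items most similar to the query embedding."""
    ranked = sorted(
        catalog,
        key=lambda item_id: cosine(query, catalog[item_id]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(queries, catalog, k):
    """Fraction of queries whose single relevant item appears in the top-k results."""
    hits = sum(
        1 for query_vec, relevant_id in queries
        if relevant_id in top_k(query_vec, catalog, k)
    )
    return hits / len(queries)


# Hypothetical toy data: three catalog items with 3-dimensional embeddings.
catalog = {
    "mug": [1.0, 0.0, 0.0],
    "ring": [0.0, 1.0, 0.0],
    "scarf": [0.0, 0.0, 1.0],
}

# Each query pairs an embedding with the id of the item it should retrieve.
queries = [
    ([0.9, 0.1, 0.0], "mug"),
    ([0.1, 0.8, 0.1], "ring"),
    ([0.6, 0.0, 0.5], "scarf"),
]

for k in (1, 2):
    print(f"recall@{k} = {recall_at_k(queries, catalog, k):.3f}")
```

In a real system the embeddings would come from a trained backbone and the exhaustive sort would typically be replaced by an approximate nearest-neighbor index; the brute-force ranking here is chosen only to keep the sketch self-contained.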

updated: Tue Aug 01 2023 21:01:04 GMT+0000 (UTC)

published: Mon May 22 2023 18:25:03 GMT+0000 (UTC)

arXiv
