Attr2Style: A Transfer Learning Approach for Inferring Fashion Styles via Apparel Attributes

Rajdeep Hazra Banerjee; Abhinav Ravi; Ujjal Kr Dutta

Popular fashion e-commerce platforms mostly provide details about the low-level attributes of apparel (e.g., neck type, dress length, collar type) on their product detail pages. However, customers usually prefer to buy apparel based on style information or, simply put, occasion (e.g., party, sports, or casual wear). Applying a supervised image-captioning model to generate style-based image captions is limited because ground-truth annotations in the form of style-based captions are difficult to obtain: annotating style-based captions requires a certain amount of fashion-domain expertise and adds cost and manual effort. In contrast, low-level attribute-based annotations are much more readily available. To address this issue, we propose a transfer-learning-based image-captioning model that is trained on a source dataset with sufficient attribute-based ground-truth captions and then used to predict style-based captions on a target dataset. The target dataset has only a limited number of images with style-based ground-truth captions. The main motivation for our approach is that the low-level attributes of an apparel item are most often correlated with its higher-level style. We leverage this fact and train our model in an encoder-decoder framework with an attention mechanism. In particular, the encoder of the model is first trained on the source dataset to obtain latent representations that capture the low-level attributes. The trained model is then fine-tuned to generate style-based captions for the target dataset. To highlight the effectiveness of our method, we demonstrate qualitatively and quantitatively that the captions generated by our approach are close to the actual style information of the evaluated apparel. A proof of concept for our model is under pilot at Myntra, where it is exposed to some internal users for feedback.
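
The abstract names an attention mechanism inside an encoder-decoder framework but does not specify its form. As a rough illustration only, the sketch below shows one common variant, dot-product soft attention, in which a decoder state weights a set of encoder image-region features to form a context vector. The function names `softmax` and `attend`, the dot-product scoring, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, decoder_state):
    """Hypothetical dot-product soft attention (an assumed scoring choice).

    region_features: list of encoder feature vectors, one per image region.
    decoder_state:   current decoder hidden state, same dimensionality.
    Returns the attention weights and the weighted-sum context vector.
    """
    # Score each region by its dot product with the decoder state.
    scores = [
        sum(f * h for f, h in zip(region, decoder_state))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    # Context vector: attention-weighted average of the region features.
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy data: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(regions, state)
```

In a transfer-learning setup like the one described, a module of this kind would sit between the encoder pretrained on attribute captions and the decoder fine-tuned on style captions; how the paper actually wires these together is not stated in the abstract.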

updated: Fri Dec 11 2020 12:03:48 GMT+0000 (UTC)

published: Wed Aug 26 2020 16:42:21 GMT+0000 (UTC)

arXiv
