Masked Vision-Language Transformer in Fashion

Ge-Peng Ji; Mingcheng Zhuge; Dehong Gao; Deng-Ping Fan; Christos Sakaridis; Luc Van Gool

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that accepts raw multi-modal inputs without extra pre-processing models (e.g., ResNet) and implicitly models the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements over the Fashion-Gen 2018 winner, Kaleido-BERT, in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks. Code is available at https://github.com/GewelsJI/MVLT.
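For readers unfamiliar with the rank@5 figure quoted in the abstract, the following is a minimal sketch of how a rank@K retrieval score is commonly computed: the fraction of queries whose matching candidate lands in the top K by similarity score. The function name `rank_at_k`, the score layout, and the toy numbers are illustrative assumptions, not taken from the MVLT code base, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def rank_at_k(
    scores: Sequence[Sequence[float]],
    positives: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose positive candidate is ranked in the top k.

    scores[i][j] is the (hypothetical) similarity between query i and
    candidate j; positives[i] is the index of the matching candidate.
    """
    if len(scores) != len(positives):
        raise ValueError("scores and positives must have the same length")
    if not scores:
        raise ValueError("at least one query is required")

    hits = 0
    for row, pos in zip(scores, positives):
        # Rank of the positive = 1 + number of candidates scoring strictly higher.
        rank = 1 + sum(1 for s in row if s > row[pos])
        if rank <= k:
            hits += 1
    return hits / len(scores)


# Toy example: three queries, three candidates each (made-up scores).
toy_scores = [
    [0.9, 0.1, 0.3],  # positive is candidate 0, ranked 1st
    [0.2, 0.8, 0.5],  # positive is candidate 0, ranked 3rd
    [0.4, 0.6, 0.7],  # positive is candidate 2, ranked 1st
]
toy_positives = [0, 0, 2]
```

Calling `rank_at_k(toy_scores, toy_positives, k=1)` counts two of the three queries as hits, while `k=3` counts all of them.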

updated: Thu Oct 27 2022 01:44:08 GMT+0000 (UTC)

published: Thu Oct 27 2022 01:44:08 GMT+0000 (UTC)

arXiv
