FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Xiao Han; Xiatian Zhu; Licheng Yu; Li Zhang; Yi-Zhe Song; Tao Xiang

The fashion domain involves a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. These tasks differ drastically in their input/output formats and dataset sizes. The common practice is to design a task-specific model for each task and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This is parameter-inefficient and cannot exploit inter-task relatedness. To address these issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL). Unlike existing approaches, FAME-ViL applies a single model to multiple heterogeneous fashion tasks and is therefore much more parameter-efficient. It is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that FAME-ViL saves 61.5% of parameters over alternatives while significantly outperforming conventional independently trained single-task models. Code is available at https://github.com/BrandonHanx/FAME-ViL.
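
To make the parameter-efficiency argument concrete, the following is a rough sketch of how the parameter count of one shared backbone with small per-task adapters compares against a separately fine-tuned model per task. It assumes a generic bottleneck adapter (a down-projection followed by an up-projection), which the abstract does not specify. All function names and sizes are hypothetical, chosen only for illustration. They are not taken from the paper and are not expected to reproduce the reported 61.5% figure.

```python
# Illustrative sketch only: every name and number here is an assumption,
# not the FAME-ViL implementation or its reported configuration.

def adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameters in one generic bottleneck adapter (weights plus biases)."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # down-projection
    up = bottleneck_dim * hidden_dim + hidden_dim        # up-projection
    return down + up


def independent_total(backbone_params: int, num_tasks: int) -> int:
    """One fully fine-tuned copy of the backbone per task."""
    return backbone_params * num_tasks


def shared_total(backbone_params: int, num_tasks: int, num_layers: int,
                 hidden_dim: int, bottleneck_dim: int) -> int:
    """One shared backbone plus one adapter per layer for each task."""
    per_task = num_layers * adapter_params(hidden_dim, bottleneck_dim)
    return backbone_params + num_tasks * per_task


def saving_fraction(independent: int, shared: int) -> float:
    """Fraction of parameters saved by sharing the backbone."""
    return 1.0 - shared / independent


if __name__ == "__main__":
    # Hypothetical sizes, not values from the paper.
    backbone, tasks, layers, hidden, bottleneck = 100_000_000, 4, 12, 512, 64
    ind = independent_total(backbone, tasks)
    shr = shared_total(backbone, tasks, layers, hidden, bottleneck)
    print(f"independent: {ind:,}  shared: {shr:,}  "
          f"saved: {saving_fraction(ind, shr):.1%}")
```

In this simplified accounting the saving grows with the number of tasks, because the backbone cost is paid once while each additional task adds only its adapters. The actual model also includes cross-attention adapters and other components that this sketch ignores.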

updated: Sat Mar 04 2023 19:07:48 GMT+0000 (UTC)

published: Sat Mar 04 2023 19:07:48 GMT+0000 (UTC)

arXiv
