CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification

Huidong Liu; Shaoyuan Xu; Jinmiao Fu; Yang Liu; Ning Xie; Chien-Chih Wang; Bryan Wang; Yi Sun

Modern Web systems such as social media and e-commerce contain rich content expressed in images and text. Leveraging information from multiple modalities can improve the performance of machine learning tasks such as classification and recommendation. In this paper, we propose Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new framework that unifies two types of cross-modality attention, sequence-wise attention and modality-wise attention, to effectively fuse information from image and text pairs. The sequence-wise attention enables the framework to capture fine-grained relationships between image patches and text tokens, while the modality-wise attention weighs each modality by its relevance to the downstream tasks. In addition, by adding task-specific modality-wise attention modules and multilayer perceptrons, the proposed framework can perform multi-task classification with multiple modalities. We conduct experiments on a Major Retail Website Product Attribute (MRWPA) dataset and two public datasets, Food101 and Fashion-Gen. The results show that CMA-CLIP outperforms the pre-trained and fine-tuned CLIP by an average of 11.9% in recall at the same level of precision on the MRWPA dataset for multi-task classification. It also surpasses the state-of-the-art method on the Fashion-Gen dataset by 5.5% in accuracy and achieves competitive performance on the Food101 dataset. Through detailed ablation studies, we further demonstrate the effectiveness of both cross-modality attention modules and the robustness of our method against noise in image and text inputs, a common challenge in practice.
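The abstract does not spell out how the modality-wise attention is parameterized, so the following is only a minimal sketch of the general idea: score each modality embedding against a task-specific query, turn the scores into softmax weights, and fuse the embeddings by weighted sum. The function `modality_wise_attention`, the `task_queries` vectors, the task names, and the scaled dot-product scoring are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def modality_wise_attention(image_emb, text_emb, task_query):
    """Fuse an image and a text embedding using softmax modality weights.

    task_query is a hypothetical learned vector (one per task). Scoring by
    scaled dot product is an assumption of this sketch.
    """
    scale = math.sqrt(len(task_query))
    scores = [
        dot(task_query, image_emb) / scale,
        dot(task_query, text_emb) / scale,
    ]
    w_img, w_txt = softmax(scores)
    # Weighted sum of the two modality embeddings.
    fused = [w_img * i + w_txt * t for i, t in zip(image_emb, text_emb)]
    return fused, (w_img, w_txt)


# Toy embeddings and hypothetical per-task queries (illustrative values only).
image_emb = [1.0, 0.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0, 0.0]
task_queries = {
    "task_a": [4.0, 0.0, 0.0, 0.0],  # aligned with the image embedding
    "task_b": [0.0, 4.0, 0.0, 0.0],  # aligned with the text embedding
}

for task, query in task_queries.items():
    fused, (w_img, w_txt) = modality_wise_attention(image_emb, text_emb, query)
    print(f"{task}: image weight={w_img:.3f}, text weight={w_txt:.3f}")
```

In this toy setup each task has its own query, so the same image-text pair can be fused with different modality weights per task, which mirrors the abstract's description of task-specific modality-wise attention feeding per-task multilayer perceptrons. In a trained model the queries would be learned parameters and the fused vector would be passed to a classifier head.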

updated: Thu Dec 09 2021 06:57:24 GMT+0000 (UTC)

published: Tue Dec 07 2021 08:23:42 GMT+0000 (UTC)

arXiv
