UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks

Yanan Sun; Zihan Zhong; Qi Fan; Chi-Keung Tang; Yu-Wing Tai

Large-scale joint training of multimodal models, e.g., CLIP, has demonstrated strong performance on many vision-language tasks. However, the image-text pairs used for pre-training are restricted to the intersection of images and texts, which limits their ability to cover the broad distribution of real-world data, and pre-processing can introduce noise in the form of misaligned pairs. Conversely, unimodal models trained on text or image data alone through unsupervised techniques can achieve broader coverage of diverse real-world data and are not constrained by the requirement that image and text be present simultaneously. In this paper, we demonstrate that using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models. Our thorough studies validate that models pre-trained in this way can learn rich representations of both modalities, improving their ability to understand how images and text relate to each other. Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models on zero-shot semantic segmentation by 6.5 percentage points (52.3% → 58.8%) on PASCAL-5^i and by 6.2 percentage points (27.2% → 33.4%) on COCO-20^i. By learning representations of both modalities, unimodal pre-training offers broader coverage, fewer misalignment errors, and the ability to capture more complex features and patterns in real-world data, resulting in better performance, especially on zero-shot vision-language tasks.
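The reported gains are differences between two scores, so they read most naturally as absolute percentage points rather than relative improvements. The short sketch below recomputes both views from the four numbers quoted in the abstract. The function name `gains` and the rounding to one decimal place are illustrative assumptions, not something taken from the paper.

```python
def gains(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent).

    Both inputs are scores expressed in percent, e.g. 52.3 for 52.3%.
    Rounding to one decimal is an assumption made for readability.
    """
    absolute = round(improved - baseline, 1)
    relative = round(100.0 * (improved - baseline) / baseline, 1)
    return absolute, relative


# Scores quoted in the abstract (CLIP-based baseline -> unimodal pre-training).
reported = {
    "PASCAL-5^i": (52.3, 58.8),
    "COCO-20^i": (27.2, 33.4),
}

for dataset, (baseline, improved) in reported.items():
    absolute, relative = gains(baseline, improved)
    print(f"{dataset}: +{absolute} points absolute, +{relative}% relative")
```

Under these assumptions the absolute gains match the 6.5 and 6.2 figures in the abstract, while the relative gains are larger because the baselines are well below 100%.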

updated: Wed Jun 07 2023 18:26:22 GMT+0000 (UTC)

published: Wed Jun 07 2023 18:26:22 GMT+0000 (UTC)

arXiv
