Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

David Junhao Zhang; Mutian Xu; Chuhui Xue; Wenqing Zhang; Xiaoguang Han; Song Bai; Mike Zheng Shou

Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets, which demands costly data collection and poses additional challenges due to data privacy concerns. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, unsupervised learning on diffusion-generated images has not been adequately explored. To address this, we start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with the corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques (i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions that fully exploit these free attention masks. Our approach is validated through extensive experiments that show consistent improvements over baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. With our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.
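
As a rough illustration of how a cross-attention map could be turned into a per-word mask, the sketch below averages the attention over heads for one text token, min-max normalizes it, and thresholds it. The abstract does not specify the extraction procedure, so the array layout `(heads, H, W, tokens)`, the head averaging, the threshold value, and the function name `attention_mask_for_token` are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def attention_mask_for_token(cross_attn, token_index, threshold=0.5):
    """Turn cross-attention weights into a binary spatial mask for one token.

    cross_attn: array of shape (heads, H, W, tokens); this layout is an
    assumption for the sketch, not something stated in the paper.
    """
    # Average over heads to get a single H x W map for the chosen token.
    token_map = cross_attn[..., token_index].mean(axis=0)

    # Min-max normalize to [0, 1]; a flat map yields an empty mask.
    low, high = token_map.min(), token_map.max()
    if high - low < 1e-12:
        return np.zeros(token_map.shape, dtype=bool)
    normalized = (token_map - low) / (high - low)

    # Keep the locations that attend strongly to the token.
    return normalized >= threshold
```

In practice the attention maps would be read out of the diffusion model during generation, possibly aggregated over layers and denoising steps; that part is omitted here because it depends on the specific model and library.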

updated: Sun Aug 13 2023 10:07:46 GMT+0000 (UTC)

published: Sun Aug 13 2023 10:07:46 GMT+0000 (UTC)

arXiv
