Will Large-scale Generative Models Corrupt Future Datasets?

Ryuichiro Hataya; Han Bao; Hiromi Arai

Recently proposed large-scale text-to-image generative models such as DALL·E 2, Midjourney, and Stable Diffusion can generate high-quality, realistic images from users' prompts. These models are popular not only in the research community but also among ordinary Internet users, and consequently a tremendous number of generated images have been shared on the Internet. Meanwhile, today's success of deep learning in computer vision owes a great deal to images collected from the Internet. These trends lead us to a research question: "Will such generated images affect the quality of future datasets and the performance of computer vision models positively or negatively?" This paper answers the question empirically by simulating contamination. Namely, we generate ImageNet-scale and COCO-scale datasets using a state-of-the-art generative model and evaluate models trained on the "contaminated" datasets on various tasks, including image classification and image generation. Through these experiments, we conclude that generated images negatively affect downstream performance, while the size of the effect depends on the task and on the amount of generated images. The generated datasets and the source code for the experiments are available at https://github.com/moskomule/dataset-contamination.
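As a rough illustration of what "simulating contamination" could look like in practice, the sketch below mixes generated samples into a real dataset at a chosen ratio while keeping the dataset size fixed. The helper name `contaminate`, the replace-rather-than-append policy, and the file names are assumptions made for illustration only; they are not taken from the paper or its repository.

```python
import random


def contaminate(real, generated, ratio, seed=0):
    """Hypothetical helper: replace a `ratio` fraction of `real` with
    items drawn from `generated`, keeping the total size unchanged."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    n_generated = round(len(real) * ratio)
    if n_generated > len(generated):
        raise ValueError("not enough generated samples for this ratio")

    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    kept_real = rng.sample(real, len(real) - n_generated)
    chosen_generated = rng.sample(generated, n_generated)

    # Tag each item with its origin so later analysis can tell them apart.
    mixed = [(item, "real") for item in kept_real]
    mixed += [(item, "generated") for item in chosen_generated]
    rng.shuffle(mixed)
    return mixed


# Placeholder file names standing in for real and generated images.
real = [f"real_{i}.jpg" for i in range(10)]
generated = [f"gen_{i}.png" for i in range(10)]
mixed = contaminate(real, generated, ratio=0.2)
```

Sweeping `ratio` over several values and training the same model on each resulting mix would give one way to observe how downstream performance changes with the amount of generated data.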

updated: Thu Aug 10 2023 00:22:27 GMT+0000 (UTC)

published: Tue Nov 15 2022 12:25:33 GMT+0000 (UTC)

arXiv
