CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Zixian Ma; Jerry Hong; Mustafa Omer Gul; Mona Gandhi; Irena Gao; Ranjay Krishna

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large-scale vision and language pretraining, we find that models spanning 6 architectures, trained with 4 algorithms on massive datasets, exhibit little compositionality. To arrive at this conclusion, we introduce CREPE, a new compositionality evaluation benchmark that measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE provides three test sets, each designed for models trained on one of three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. These test sets contain 385K, 385K, and 373K image-text pairs and 237K, 210K, and 178K hard negative captions, respectively. To test productivity, CREPE contains 17K image-text pairs at nine levels of complexity, plus 246K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 8%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
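To make the reported metric concrete, the sketch below shows one way Recall@1 against hard negatives could be computed, together with the random-chance baseline that retrieval success is compared to. It is an illustration only: the abstract does not describe the evaluation protocol, so the layout (each query scored against its ground-truth caption at index 0 followed by its hard negatives), the function names `recall_at_k` and `random_chance`, and the example scores are all assumptions.

```python
from typing import Sequence


def recall_at_k(score_rows: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth caption ranks in the top k.

    Assumed layout (hypothetical, not from the paper): each row holds the
    similarity scores for one image, with the ground-truth caption at
    index 0 and the hard negative captions after it.
    """
    hits = 0
    for scores in score_rows:
        # Rank of the ground truth = number of negatives scoring strictly
        # higher. Ties are resolved in favor of the ground truth here.
        rank = sum(1 for s in scores[1:] if s > scores[0])
        if rank < k:
            hits += 1
    return hits / len(score_rows)


def random_chance(num_candidates: int, k: int = 1) -> float:
    """Expected Recall@k when candidates are ranked uniformly at random."""
    return min(k, num_candidates) / num_candidates


# Made-up scores: 4 images, each with 1 ground truth + 2 hard negatives.
example_scores = [
    [0.9, 0.1, 0.2],   # ground truth ranked first
    [0.3, 0.8, 0.1],   # one negative outranks the ground truth
    [0.5, 0.4, 0.45],  # ground truth ranked first
    [0.2, 0.6, 0.7],   # both negatives outrank the ground truth
]
r1 = recall_at_k(example_scores, k=1)
r2 = recall_at_k(example_scores, k=2)
baseline = random_chance(num_candidates=3, k=1)
```

Under this setup, a model whose Recall@1 falls toward `random_chance(n)` as caption complexity grows is no longer separating the true caption from its foils, which is the pattern the productivity result describes.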

updated: Sat Jan 07 2023 07:57:12 GMT+0000 (UTC)

published: Tue Dec 13 2022 19:17:36 GMT+0000 (UTC)

arXiv
