Partial success in closing the gap between human and machine vision

Robert Geirhos; Kantharaju Narayanappa; Benjamin Mitzkus; Tizian Thieringer; Matthias Bethge; Felix A. Wichmann; Wieland Brendel

A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding distortion robustness gap between humans and CNNs is closing, with the best models now exceeding human feedforward performance on most of the investigated OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data and evaluation code are provided as a toolbox and benchmark at: https://github.com/bethgelab/model-vs-human/
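
Finding (2.) depends on comparing which individual images two observers get wrong, not only how often they are wrong. The abstract does not name the metric, so the following is a minimal sketch under the assumption that image-level consistency is scored with Cohen's kappa on trial-by-trial correctness. The function name `error_consistency` and the toy inputs are hypothetical and are not taken from the model-vs-human toolbox.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa on per-image correctness of two observers.

    Assumption: each input is a sequence of booleans, one per image,
    in the same image order (True = classified correctly).
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(correct_a)
    # Observed agreement: fraction of images where both observers are
    # right or both are wrong.
    c_obs = sum(bool(a) == bool(b) for a, b in zip(correct_a, correct_b)) / n

    # Agreement expected by chance for two independent observers with
    # these accuracies.
    p_a = sum(bool(a) for a in correct_a) / n
    p_b = sum(bool(b) for b in correct_b) / n
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)

    if c_exp == 1.0:
        # Both observers are always right or always wrong: kappa is undefined.
        return float("nan")
    return (c_obs - c_exp) / (1 - c_exp)


# Toy data: both observers are 50% accurate in every case.
human = [True, True, False, False]
same_errors = [True, True, False, False]
unrelated_errors = [True, False, True, False]
opposite_errors = [False, False, True, True]

kappa_same = error_consistency(human, same_errors)
kappa_unrelated = error_consistency(human, unrelated_errors)
kappa_opposite = error_consistency(human, opposite_errors)
```

With equal accuracies in all three toy comparisons, only the overlap of errors changes the score. That is the distinction the abstract draws between matching human accuracy and matching human errors.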

updated: Mon Oct 25 2021 09:44:25 GMT+0000 (UTC)

published: Mon Jun 14 2021 13:23:35 GMT+0000 (UTC)

arXiv
