Does Progress On Object Recognition Benchmarks Improve Real-World Generalization?

Megan Richards; Polina Kirichenko; Diane Bouchacourt; Mark Ibrahim


For more than a decade, researchers have measured progress in object recognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C, and -R. Recent foundation models, trained on orders of magnitude more data, have begun to saturate these standard benchmarks but remain brittle in practice. This suggests that standard benchmarks, which tend to focus on predefined or synthetic changes, may not be sufficient for measuring real-world generalization. Consequently, we propose studying generalization across geography as a more realistic measure of progress, using two datasets of objects from households across the globe. We conduct an extensive empirical evaluation of progress across nearly 100 vision models, up to the most recent foundation models. We first identify a progress gap between standard benchmarks and real-world geographical shifts: progress on ImageNet results in up to 2.5x more progress on standard generalization benchmarks than on real-world distribution shifts. Second, we study model generalization across geographies by measuring the disparities in performance across regions, a more fine-grained measure of real-world generalization. We observe that all models have large geographic disparities, even foundation CLIP models, with differences of 7-20% in accuracy between regions. Counter to modern intuition, we discover that progress on standard benchmarks fails to improve geographic disparities and often exacerbates them: geographic disparities between the least performant models and today's best models have more than tripled. Our results suggest that scaling alone is insufficient for consistent robustness to real-world distribution shifts. Finally, we highlight in early experiments how simple last-layer retraining on more representative, curated data can complement scaling as a promising direction for future work, reducing geographic disparity on both benchmarks by over two-thirds.
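To make "disparities in performance across regions" concrete, the following is a minimal sketch of one way to compute such a measure. The abstract does not give the exact formula, so this assumes disparity is the gap between the best and worst per-region accuracy. The function names, region labels, and toy data are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Compute per-region accuracy.

    records: iterable of (region, is_correct) pairs, one per evaluated image.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, is_correct in records:
        total[region] += 1
        correct[region] += int(is_correct)
    return {region: correct[region] / total[region] for region in total}


def geographic_disparity(region_accuracy):
    """Assumed definition: best-region accuracy minus worst-region accuracy."""
    values = list(region_accuracy.values())
    return max(values) - min(values)


# Toy predictions for two hypothetical regions (illustrative only).
toy_records = [
    ("region_a", True), ("region_a", True), ("region_a", True), ("region_a", False),
    ("region_b", True), ("region_b", False), ("region_b", False), ("region_b", False),
]

toy_accuracy = accuracy_by_region(toy_records)
toy_disparity = geographic_disparity(toy_accuracy)
print(toy_accuracy, toy_disparity)
```

Under this assumed definition, a model can raise its average accuracy while the gap between regions stays the same or grows, which is the pattern the abstract reports.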

updated: Mon Jul 24 2023 21:29:48 GMT+0000 (UTC)

published: Mon Jul 24 2023 21:29:48 GMT+0000 (UTC)

arXiv
