Progress and limitations of deep networks to recognize objects in unusual poses

Amro Abbas; Stéphane Deny

Deep networks should be robust to rare events if they are to be successfully deployed in high-stakes real-world applications (e.g., self-driving cars). Here we study the capability of deep networks to recognize objects in unusual poses. We create a synthetic dataset of images of objects in unusual orientations, and evaluate the robustness of a collection of 38 recent and competitive deep networks for image classification. We show that classifying these images is still a challenge for all networks tested, with an average accuracy drop of 29.5% compared to when the objects are presented upright. This brittleness is largely unaffected by various network design choices, such as training losses (e.g., supervised vs. self-supervised), architectures (e.g., convolutional networks vs. transformers), dataset modalities (e.g., images vs. image-text pairs), and data-augmentation schemes. However, networks trained on very large datasets substantially outperform others, with the best network tested (Noisy Student EfficientNet-L2 trained on JFT-300M) showing a relatively small accuracy drop of only 14.5% on unusual poses. Nevertheless, a visual inspection of the failures of Noisy Student reveals a remaining gap in robustness relative to the human visual system. Furthermore, combining multiple object transformations (3D rotations and scaling) further degrades the performance of all networks. Altogether, our results provide another measurement of the robustness of deep networks that is important to consider when using them in the real world. Code and datasets are available at https://github.com/amro-kamal/ObjectPose.
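As a rough illustration of the accuracy-drop figures quoted above, the following minimal sketch compares top-1 accuracy on upright images with top-1 accuracy on the same objects in unusual poses. It assumes the drop is an absolute difference in percentage points, which the abstract does not state explicitly. The function names and the toy predictions are hypothetical and are not taken from the authors' repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return top-1 accuracy in percent for predicted vs. true class ids."""
    if len(labels) == 0 or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def accuracy_drop(
    upright_preds: Sequence[int],
    unusual_preds: Sequence[int],
    labels: Sequence[int],
) -> float:
    """Accuracy on upright images minus accuracy on unusual poses.

    Assumption: the drop is measured in absolute percentage points.
    """
    return accuracy(upright_preds, labels) - accuracy(unusual_preds, labels)


# Toy, made-up data: four objects, each shown upright and in an unusual pose.
labels = [0, 1, 2, 3]
upright_preds = [0, 1, 2, 3]   # all four upright images classified correctly
unusual_preds = [0, 1, 0, 0]   # two of the four rotated images misclassified

drop = accuracy_drop(upright_preds, unusual_preds, labels)
print(f"accuracy drop: {drop:.1f} percentage points")
```

Averaging this quantity over a set of models would give a per-collection figure of the kind reported in the abstract, under the same assumption about how the drop is defined.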

updated: Sat Jul 16 2022 23:03:35 GMT+0000 (UTC)

published: Sat Jul 16 2022 23:03:35 GMT+0000 (UTC)

arXiv
