An Impartial Take to the CNN vs Transformer Robustness Contest

Francesco Pinto; Philip H. S. Torr; Puneet K. Dokania

Following the surge in popularity of Transformers in Computer Vision, several studies have attempted to determine whether they are more robust to distribution shifts and provide better uncertainty estimates than Convolutional Neural Networks (CNNs). The almost unanimous conclusion is that they are, and it is often conjectured, more or less explicitly, that this supposed superiority is due to the self-attention mechanism. In this paper, we perform extensive empirical analyses showing that recent state-of-the-art CNNs (particularly ConvNeXt) can be as robust and reliable as the current state-of-the-art Transformers, and sometimes even more so. However, there is no clear winner. Therefore, although it is tempting to declare one family of architectures definitively superior to the other, both seem to achieve similarly extraordinary performance on a variety of tasks while also suffering from similar vulnerabilities, such as texture, background, and simplicity biases.

updated: Fri Jul 22 2022 21:34:37 GMT+0000 (UTC)

published: Fri Jul 22 2022 21:34:37 GMT+0000 (UTC)

arXiv
