Vision Transformers for Dense Prediction

René Ranftl; Alexey Bochkovskiy; Vladlen Koltun

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.

updated: Wed Mar 24 2021 18:01:17 GMT+0000 (UTC)

published: Wed Mar 24 2021 18:01:17 GMT+0000 (UTC)

arXiv
