HRFormer: High-Resolution Transformer for Dense Prediction

Yuhui Yuan; Rao Fu; Lang Huang; Weihong Lin; Chao Zhang; Xilin Chen; Jingdong Wang

We present the High-Resolution Transformer (HRFormer), which learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations and incurs high memory and computational cost. We build on the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet) and use local-window self-attention, which performs self-attention over small non-overlapping image windows, to improve memory and computational efficiency. In addition, we introduce a convolution into the feed-forward network (FFN) to exchange information across the disconnected image windows. We demonstrate the effectiveness of HRFormer on both human pose estimation and semantic segmentation; for example, HRFormer outperforms Swin Transformer by 1.3 AP on COCO pose estimation with 50% fewer parameters and 30% fewer FLOPs. Code is available at https://github.com/HRNet/HRFormer.
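
To make the efficiency argument concrete, the following is a minimal sketch in plain Python of partitioning a feature map into non-overlapping windows and counting the query-key pairs that attention must score. It is not taken from the HRFormer code base: the function names `window_partition` and `attention_pairs`, the window size `k`, and the 64 x 48 feature-map size are all assumptions chosen for illustration.

```python
def window_partition(feature_map, k):
    """Split an H x W grid (a list of rows) into non-overlapping k x k windows.

    Returns the windows in row-major order; each window is a flat list of the
    k*k entries it covers. Assumes H and W are divisible by k (hypothetical
    simplification; a real model would pad instead).
    """
    h, w = len(feature_map), len(feature_map[0])
    if h % k or w % k:
        raise ValueError("height and width must be divisible by the window size")
    windows = []
    for top in range(0, h, k):
        for left in range(0, w, k):
            windows.append([feature_map[top + i][left + j]
                            for i in range(k) for j in range(k)])
    return windows


def attention_pairs(h, w, k=None):
    """Count query-key pairs scored by self-attention on an h x w token grid.

    k=None means global attention (every token attends to every token);
    otherwise attention is restricted to each k x k window.
    """
    n = h * w
    if k is None:
        return n * n
    num_windows = (h // k) * (w // k)
    return num_windows * (k * k) ** 2


# A 4 x 4 grid whose entries are their own row-major indices.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(grid, 2)
print(windows[0])  # [0, 1, 4, 5]

# Illustrative sizes only: a 64 x 48 token grid with 8 x 8 windows.
print(attention_pairs(64, 48))       # global attention
print(attention_pairs(64, 48, k=8))  # local-window attention
```

Under these assumed sizes, the pair count for global attention grows with the square of the number of tokens, while the windowed count grows linearly in the number of tokens for a fixed window size. Because tokens in different windows never attend to each other in this scheme, some other operation is needed to pass information between windows, which is the role the abstract assigns to the convolution in the FFN.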

updated: Sun Nov 07 2021 14:39:41 GMT+0000 (UTC)

published: Mon Oct 18 2021 15:37:58 GMT+0000 (UTC)

arXiv
