UniFormer: Unifying Convolution and Self-attention for Visual Recognition

Kunchang Li; Yali Wang; Junhao Zhang; Peng Gao; Guanglu Song; Yu Liu; Hongsheng Li; Yu Qiao

It is a challenging task to learn discriminative representations from images and videos, due to the large local redundancy and complex global dependency in these visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, the limited receptive field makes it hard to capture global dependency. Conversely, ViTs can effectively capture long-range dependency via self-attention, but blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Different from typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity in shallow and deep layers respectively, allowing the model to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a new powerful backbone, and adopt it for various vision tasks from the image to the video domain, and from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it achieves state-of-the-art performance in a broad range of downstream tasks, e.g., it obtains 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on Sth-Sth V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. We further build an efficient UniFormer with 2-4x higher throughput. Code is available at https://github.com/Sense-X/UniFormer.
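The contrast between local and global token affinity described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a toy example on a 1-D token sequence, and every name in it (`local_affinity`, `global_affinity`, `relation_aggregator`, the weight matrices) is hypothetical. It assumes that a local affinity depends only on the relative offset between neighboring tokens, much like a convolution kernel, and that a global affinity is the usual softmax over query-key similarities.

```python
import numpy as np

def local_affinity(num_tokens, offset_weights):
    """N x N affinity that is nonzero only inside a small neighborhood.

    offset_weights has odd length 2r+1 and holds one weight per relative
    offset (assumption: the weight depends only on j - i, like a conv kernel).
    """
    radius = len(offset_weights) // 2
    A = np.zeros((num_tokens, num_tokens))
    for i in range(num_tokens):
        for j in range(max(0, i - radius), min(num_tokens, i + radius + 1)):
            A[i, j] = offset_weights[j - i + radius]
    return A

def global_affinity(Q, K):
    """N x N affinity from softmax-normalized query-key similarity."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def relation_aggregator(X, Wv, affinity):
    """Aggregate value-projected tokens with the given affinity matrix."""
    return affinity @ (X @ Wv)

# Toy data: 6 tokens with 4 channels each; all weights are random stand-ins.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))

# "Shallow layer" style: each token only mixes with its immediate neighbors.
A_local = local_affinity(6, rng.standard_normal(3))
shallow_out = relation_aggregator(X, Wv, A_local)

# "Deep layer" style: each token compares itself with every other token.
A_global = global_affinity(X @ Wq, X @ Wk)
deep_out = relation_aggregator(X, Wv, A_global)
```

In this sketch the local affinity is a banded matrix with a fixed number of nonzeros per row, while the global affinity is dense, which mirrors the trade-off the abstract draws between reducing local redundancy cheaply and capturing long-range dependency.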

updated: Wed May 31 2023 09:19:23 GMT+0000 (UTC)

published: Mon Jan 24 2022 04:39:39 GMT+0000 (UTC)

arXiv
